MySQL Bugs: #1818: Replication failed between 4.0.16-4.0.16 on Linux. Reproductible.

Bug #1818	Replication failed between 4.0.16-4.0.16 on Linux. Reproductible.
Submitted:	12 Nov 2003 10:05	Modified:	21 Jun 2004 11:11
Reporter:	Renato Weiner
Status:	Not a Bug
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	4.0.19	OS:	Linux (RedHat 7.2 or 7.3)
Assigned to:	Guilhem Bichot	CPU Architecture:	Any

Description:
Set up two servers, both running 4.0.16, for replication. Replication fails after a while with the following messages in the log:

031105 12:08:23  Slave I/O thread: connected to master 'xxxxxx',  replication started in log 'ib_logbin1.001' at position 45827132
031105 12:08:23  Error reading packet from server: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master (server_errno=1236)
031105 12:08:23  Got fatal error 1236: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master' from master when reading data from binary log

How to repeat:
On the master, set the following parameters in /etc/my.cnf:

set-variable    = max_allowed_packet=512M ( not necessary, but I set it because the error message suggests it )
set-variable    = max_binlog_size=512M

Execute lots of simple inserts/updates/deletes. When ib_logbin1.001 reaches approximately 40 MB, you will see the messages above.
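
One way to produce such a workload is sketched below. It is only an illustration: the function name gen_statements, the table t(id, v) and the database name are assumptions, not details from the report.

```shell
#!/bin/sh
# Hypothetical load generator: prints n rounds of simple statements against
# an assumed table t(id INT PRIMARY KEY, v INT). The table and column names
# are illustrative only; they do not come from the report.
gen_statements() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        echo "INSERT INTO t (id, v) VALUES ($i, 0);"
        echo "UPDATE t SET v = v + 1 WHERE id = $i;"
        echo "DELETE FROM t WHERE id = $i;"
        i=$((i + 1))
    done
}
```

The output can be piped into the command-line client on the master, for example gen_statements 100000 | mysql test, where "test" stands for whatever database holds the assumed table.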

Suggested fix:
Right now, I'm using the following workaround which seems to work well so far:

set-variable    = max_binlog_size=40M

But with this, I have lots of annoying 40 MB size ib_logbin1.xxx files.

It looks like my 'solution' of splitting up the binlog didn't work either. Today I had another failure. It took a bit longer to appear, but replication still doesn't work reliably. Messages:

031113  0:35:26  Error reading packet from server: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master (server_errno=1236)
031113  0:35:26  Got fatal error 1236: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master' from master when reading data from binary log
031113  0:35:26  Slave I/O thread exiting, read up to log 'ib_logbin1.004', position 15978017

Hi!

I'm looking forward to knowing whether using our official binaries solved the problem.

Regards,
Guilhem

I tested with the binaries provided on the website and it still didn't work.

I tried replication with versions 4.0.18 and 4.1.1-alpha and got exactly the same error.

I have a debug build and I'm wondering which functions I should include in the trace. Maybe something like:

-#d:f,mysql_binlog_send:F:L:t,20

Please advise me, so I can provide more feedback.

Doing some more tests with Mr. Renato Weiner.

Continuing tests with Mr. Weiner

User is testing on different hardware/OS.

Hi Guilhem,

As you recommended, I completely switched my OS and now everything is working.

In case anybody has this problem:
I was using RedHat AS 3.0 with the aacraid module. It randomly truncates the master binary logs, causing the error described in this bug. With the aic7xxx module, replication is now working fine.

I recommend checking your OS if you get this error.

Thanks, Guilhem, for all the patience and help!

Glad that your system is now working fine, and that it was not a MySQL problem!

Hi Guilhem,

Could you tell us what suggestion you gave to Renato? And what did you modify for the testing?

Thanks

Hello Shengyong,
I don't think Renato and I ever fully understood what was wrong: the problem appeared on Redhat 7.3 while there were no problems with Redhat AS 3.0. So it may have been a kernel/glibc issue.
We ruled out a MySQL bug by demonstrating that the binlog was shrinking (which MySQL cannot be responsible for, as it never calls ftruncate() on such files): some statements disappeared from the binlog although they had been there a second earlier. To show this, Renato set up a script that prints the size of the last binlog to a file every second. Something like:
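while true
do
    # -l records the size along with the name; your_binlogs is a placeholder
    # that should match only the binlog files
    ls -l your_binlogs | tail -n1 >> list.txt
    sleep 1
done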
Then, when the error occurred on the slave, he inspected list.txt and found that at some moment the size of the binlog had decreased.
So we supposed that it was an issue with a hard drive driver in the OS, glibc...
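
The inspection of list.txt can also be automated. The following is a minimal sketch, not the script Renato used: the function name find_shrinks is made up, and it assumes every line of list.txt is "ls -l" output with the size in column 5 and the file name in the last column.

```shell
#!/bin/sh
# Hypothetical helper: report every sample where the recorded size of a file
# is smaller than in the previous sample of the same file.
# Assumes "ls -l" style input: size in column 5, file name in the last column.
find_shrinks() {
    awk '
        # Compare only consecutive samples of the same file, so that a
        # rotation to a new, smaller binlog is not reported as a shrink.
        NR > 1 && $NF == prev_name && ($5 + 0) < prev_size {
            printf "line %d: %s shrank from %d to %d bytes\n", NR, $NF, prev_size, $5
        }
        { prev_size = $5 + 0; prev_name = $NF }
    '
}
```

Run it as find_shrinks < list.txt; any output means the binlog lost bytes between two consecutive samples.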
Good luck!

The reporter has not said what kind of query was stuck in his binlog, but I
can guess:
the case resembles bug#9822 and bug#19402, where there are queries the size of max_allowed_packet
in the binlog.