MySQL Bugs: #29232: Replication not usable because of corrupted slave relay logs

Bug #29232	Replication not usable because of corrupted slave relay logs
Submitted:	20 Jun 2007 8:49	Modified:	27 Mar 2008 17:29
Reporter:	Stef Cant	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	Server version: 5.0.42 Gentoo Linux	OS:	Linux (Gentoo Linux)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	corrupt relay, replication

Description:
I know there already are several bugreports about replication problems with corrupted relay log files, but they don't seem to get fixed. I hope I can help looking for the cause of the problems...

Situation: 1 mysql master, 3 mysql slaves. 1 of those slaves is used for backup purposes, the 2 others we want to use as read-only slaves for load-balanced webservers. The master handles 350 queries per second (average, sometimes there are more than 750 qps). All servers are located in the same rack, connected through a 1000Mbps switch, so the network should be reliable.

Having to restore the replication every 16 hours is not workable, so at this moment, because of these problems, we can't trust nor use replication anymore.

How to repeat:
We can't trigger the corruption ourselves, but at least once every 24 hours one or more of the slaves fail because of a bad query, caused by a corrupted relay log. Usually, at least 1 of the slaves has no problem, so the cause is not located in the master bin log.

I'll try to add some files of the last time we had a replication problem. The slave query at position 1281262 is corrupted.

Suggested fix:
There are 2 issues here:
1) what causes the corruption
2) how can the failure of the slaves be solved

1) the cause
in other tickets, people mention slave_net_timeout, logfile rotation, bad quality networks, ... But we can't really identify one of those suggestions as the cause of our problems. So we are not sure about the cause

2) workaround
since a query is only replicated when it does not cause an error on the master, the slave can be sure that when there is a syntax error in a query it wants to execute, there must have been a failure during the read-from-master procedure. In stead of stopping replication all together, perhaps the slave can retry to retreive queries starting from the position of the faulty query?

slave status after failure

Attachment: show_slave_status.txt (text/plain), 2.29 KiB.

part of the master bin log file

Attachment: mysqlbinlog_master_firefoot-bin.000010.txt (text/plain), 25.09 KiB.

part of the slave relay log file

Attachment: mysqlbinlog_slave_relay-strider.000022.txt (text/plain), 6.48 KiB.

my thoughts about what happened

Attachment: investigation_faulty_query.txt (text/plain), 2.85 KiB.

Hi Stef.

There was Bug #27583 which might relate and it appeared to be fixed only in 5.1 version. Before we merge that fix you can check the patch I putting here.

*** /tmp/geta22959	2007-06-20 12:30:49.000000000 +0300
--- log_event.cc	2007-06-20 12:30:45.000000000 +0300
*************** int Rotate_log_event::exec_event(struct 
*** 3397,3402 ****
--- 3397,3404 ----
      memcpy(rli->group_master_log_name, new_log_ident, ident_len+1);
      rli->notify_group_master_log_name_update();
      rli->group_master_log_pos= pos;
+     strmake(rli->group_relay_log_name, rli->event_relay_log_name,
+             sizeof(rli->group_relay_log_name) - 1);
      rli->group_relay_log_pos= rli->event_relay_log_pos;
      DBUG_PRINT("info", ("group_master_log_name: '%s' group_master_log_pos:\
  %lu",

Stef,

please apply patch as Andrei Elkin suggested, try with it and inform us if patch is solved (or doesn't) the problem.

Thanks in advance.

Ok, I have patched the master and the 3 slaves. I'll update this ticket when replication fails again, or in a few days when everything turns out to be ok.

Stef, thank you for the feedback.

Will wait results of your tests and set status of the report to "Need feedback".

Bad news, 2 slaves failed again, and the cause is almost identical. I'll add some files concerning this new failure, but you will see the scenario is almost a perfect copy of the first set of files.

output of 'show slave status'

Attachment: 2_show_slave_status.txt (text/plain), 2.29 KiB.

part of mysqlbinlog (master )

Attachment: 2_mysqlbinlog_master_firefoot-bin.000008.txt (text/plain), 48.79 KiB.

part of mysqlbinlog (slave)

Attachment: 2_mysqlbinlog_slave_relay-strider.000018.txt (text/plain), 22.68 KiB.

3 relevant queries

Attachment: 2_investigation_faulty_query.txt (text/plain), 2.46 KiB.

It's been a week now. Each day at least one slave failed because of a bad query. So that patch didn't help at all ... any other ideas or code/patches I can try?

Still no answers ... I would like to mention one more thing: the patch you suggested concerns an incoming rotation event from the master, but I'm not sure there was such an event, because there is a gap of at least 90 seconds between the failing query and the slave rotation.

Stef,

thank you for the feedback. We currently thinking about ways to reproduce this bug and which additional information can be needed if any. This is why we don't answer.

Hi,
We avec strictly the same problem.
Here is our architecture : 
2 masters with cross replication
18 slaves connected to of one of the two masters.

Approximatively once a day, the replication crash on one of the 18 slave with an error like :

070919  2:39:57 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'ó' at line 1' on query. Default database: 'monitoring'. Query: 'insert low_priority into hotdownload(sere,datemaj) values ('ftpclubic37.clubic.com','firmware_ts02_3517.exe','1213241',1190160423)ó', Error_code: 1064

We can see that the query is corrupted, the good one was : 
insert low_priority into hotdownload(serveur,fichier, taille,datemaj) values ('ftpclubic37.clubic.com','firmware_ts02_3517.exe','1213241',1190160423)

Some characters are missing and others are added ("ó")

We use the latest available production release ie : 5.0.45-community

We are available to test thinks as soon as it is very difficule for us to work like that.
Few weeks ago, we used 4.1.22 relase for month with no problem of this kind.

Thank you for your help.

Has there been any progress on this?  Or is it because no one can reproduce the problem?

looks related to bug #26489

Thank you for the report.

Your case looks like similar to bug #26489 which has been fixed in version 5.0.56. Please upgrade to this version (or wait next Community release), try with it and say inform us if problem still exists.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

If you get a replication error beginning with "You have an error in your SQL syntax;" please try getting a version that has the fix for bug #26489 in it, 5.0.56, 5.1.24, and 6.0.5 and later. We think that this will fix almost all problems of this type.

If you do not have that fix yet, you can do this as a temporary workaround each time it happens:

SHOW SLAVE STATUS
STOP SLAVE;
CHANGE MASTER TO master_log_file='insert the value from Relay_Master_Log_File',
  master_log_pos=insert the value from Exec_master_log_pos;
START SLAVE;

If you have a version with the fix for bug #26489 in it and still regularly see this problem please open a new bug report. It's expected that you'll see it occasionally due to corruption in transit but this shouldn't be many times a day or even many times a week.

James Day, Support Engineer, MySQL/Sun