Bug #54610 Query gets corrupted by one character in replication
Submitted: 18 Jun 2010 9:51 Modified: 27 Aug 2010 8:47
Reporter: Charlton Kao Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.90 OS:Linux (Gentoo 2.6.31-hardened-r10)
Assigned to: CPU Architecture:Any
Tags: replication query corruption transpose character

[18 Jun 2010 9:51] Charlton Kao
Description:
Replication will fail because the query that the slave tries to run will be different from the query that the master ran by exactly one (1) character, resulting in an invalid query.  The most recent error was "s" being transposed to " ".  I've had "3" transposed to "x" in the past.  Doesn't seem like there's a pattern.

We have mysql running on 4 servers, and this error happens on all 4.  Our mysql configuration is as follows:
  Database 1 and Database 2 are in a master-master configuration
  Database 3 is a slave of Database 1
  Database 4 is a slave of Database 2

We were running essentially an identical environment with mysql 5.0 without any issues.  We potentially have tons of other replication data corruption that don't result in bad queries.

How to repeat:
No idea how to recreate the issue.  After skipping the bad query (following steps listed in suggested fix), replication will catch up and then run smoothly for a few days.  However, the error will come back again after 2-5 days.  

Suggested fix:
When this happens, we currently perform the following workaround:

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
[18 Jun 2010 9:57] Andrei Elkin
There is no repeating pattern and you're saying restart helps. That sounds like
reading from the disk issue.
May I suggest to run an appropriate hd checking utility?
[18 Jun 2010 10:16] Charlton Kao
The reason I don't think it's a hardware issue is because the same thing's happening on all 4 servers.  These servers are also less than 1 month old.  

I don't know if this matters, but I'm running hardware RAID-1 for my OS (Dell R510 w/ PERC6i card).  My database is using a storage array with RAID-10.
[18 Jun 2010 19:11] Sveta Smirnova
Thank you for the feedback.

This also looks similar to bug #54380. When this error happen next time please send us output of SHOW SLAVE STATUS\G, problem binary log and configuration files for both master and slave.
[18 Jul 2010 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[19 Jul 2010 17:21] Charlton Kao
We received the error again over the weekend.  A "2" has been transposed to an "e", corrupting one of our int fields and causing the query to fail on the slave.  In reference to this bug being similar to confirmed bug #54380, the default character set of both databases is utf8.  All the server settings are utf8 as well, so I think it's a separate issue.  We downgraded from 5.1.46 to 5.0.90, and the issue is still occurring.  

Still unable to replicate the issue, but it happens whenever we're loading LARGE amounts of data.
[19 Jul 2010 19:44] Charlton Kao
Oh, by the way, I have run HD and RAM utilities on all the servers without any errors.  I've run Dell's DSET utility and they can't find any hw problems either.
[20 Jul 2010 5:47] Sveta Smirnova
Thank you for the feedback.

Please send us output of SHOW SLAVE STATUS\G, problem binary log and configuration files for both master and slave.
[20 Jul 2010 18:17] Charlton Kao
The entire bin log is 167 MB compressed.  Let me know if you need me to upload the full bin logs.  I've extracted the problem query from the master bin log and slave relay log.  I'll attach both to the forum.  

I didn't save the show slave status output at the time of the error, but I'll also attach the last few entries of mysqld.err.  The log has the error message, the bin log name, and the position of failure.  Look for this line: 
  "100717 22:51:42 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'la-dbb1-bin.000867' position 983282945".  
Right above that, it states "Illegal double '9376e63812' value".  The "e" was a "2" in the master bin log.  But somehow it got transposed to an "e" in the slave relay log.
[21 Jul 2010 5:38] Sveta Smirnova
Thank you for the feedback.

Please send us more output from master binary log: we need information about character set and variables which were set before problem query was executed. Please also send us output of SHOW CREATE TABLE problem_table.
[23 Jul 2010 8:35] Sveta Smirnova
Thank you for the feedback.

Yes, there is corrupted character in the slave relay log, but to repeat the problem we need information about character sets used on master and slave. This is why I asked for binary log from master. Please send us configuration files from both master and slave, so we can try to repeat the problem. Please send master error log file also: probably it contains information about communication problems.
[23 Jul 2010 17:20] Charlton Kao
Yeah, I understand.  I'll post the full binlogs if the error happens again before it gets flushed out.

In the meantime, I've posted both my.cnf files as well as the master mysqld.err file.  We're using the utf-8 character set for everything.
[27 Jul 2010 8:47] Sveta Smirnova
Thank you for the feedback.

I still can not repeat described behavior. So it looks like binary log is really needed.

Please also check your network interface for possible errors.
[27 Aug 2010 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".