MySQL Bugs: #26123: Relay logs occasionally get corrupted

Bug #26123	Relay logs occasionally get corrupted
Submitted:	6 Feb 2007 19:26	Modified:	17 May 2007 8:28
Reporter:	Jeremy Cole (Basic Quality Contributor) (OCA)
Status:	Duplicate
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any
Tags:	qc

Description:
With a few different systems, I'm occasionally seeing replication on individual slaves stop with corrupted relay logs.

I've been trying to produce a test case, but so far to no avail. I'm opening this bug as a feeler to try and see if others have seen the same thing.

The error that shows is:

Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

Examining the slave's relay logs with mysqlbinlog yields:

ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 27368, event_type: 2
Could not read entry at offset 104857049:Error in log format or read error
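
When several relay logs have to be checked, the mysqlbinlog run above can be scripted. The following is only a minimal sketch: the function names are made up for illustration, and it assumes that mysqlbinlog is on PATH and writes the failure message to stderr in the format quoted above.

```python
import re
import subprocess

# Matches the failure line quoted above, e.g.
# "Could not read entry at offset 104857049:Error in log format or read error"
OFFSET_RE = re.compile(r"Could not read entry at offset (\d+)")


def first_bad_offset(mysqlbinlog_stderr: str):
    """Return the offset of the first unreadable event, or None if none is reported."""
    match = OFFSET_RE.search(mysqlbinlog_stderr)
    return int(match.group(1)) if match else None


def check_relay_log(path: str):
    """Run mysqlbinlog on one relay log and return the first bad offset, if any.

    Assumption: mysqlbinlog is on PATH and reports read errors on stderr.
    stdout (the decoded events) is discarded to keep memory use small.
    """
    result = subprocess.run(
        ["mysqlbinlog", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return first_bad_offset(result.stderr)
```

A return value of None only means mysqlbinlog could parse every event. As a later comment in this report shows, a relay log can parse cleanly and still contain corrupted SQL.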

Examining the master's binary logs shows no corruption.

Restarting replication from the Exec_Master_Log_Pos works perfectly, and sails right through the same parts of the logs that were previously being corrupted. In a system with multiple slaves, this may occur on only one slave at a time (the other slaves continue without a problem) for any given log position.
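
For anyone who wants to script that restart, here is a rough sketch of the statements involved. The helper name is hypothetical. It takes one row of SHOW SLAVE STATUS as a dict and pairs Exec_Master_Log_Pos with Relay_Master_Log_File, which is the master binary log that position refers to. Note that CHANGE MASTER TO with a new file and position discards the existing relay logs by default, so the slave fetches the events from the master again.

```python
def restart_statements(status: dict) -> list:
    """Build statements that re-point a slave at the last executed master position.

    `status` is one row of SHOW SLAVE STATUS as a dict (hypothetical input shape).
    Relay_Master_Log_File is the master binlog that Exec_Master_Log_Pos refers to.
    """
    log_file = status["Relay_Master_Log_File"]
    log_pos = int(status["Exec_Master_Log_Pos"])
    # Refuse file names that would break out of the quoted SQL string.
    if "'" in log_file or "\\" in log_file:
        raise ValueError("unexpected character in log file name")
    return [
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos}",
        "START SLAVE",
    ]
```

The statements are returned as strings instead of being executed, so they can be reviewed before they are run against the affected slave.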

How to repeat:
Unknown at this time.

It appears that this occurs only on systems using BLOB or TEXT, but that is not confirmed completely.

We have seen similar behavior.  This is exactly why I opened bug #25737, because I'm worried that binlog corruption gets through more often than it's caught (if the data is just corrupted but not in a way that causes a SQL error).  Is anyone else interested in putting a checksum on each binlog entry?  Please comment on that bug, too.
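
To make the checksum idea concrete, here is a minimal sketch of what a per-event checksum could look like. It is not MySQL's binlog format: the framing (a CRC32 stored as a 4-byte little-endian trailer) and the function names are assumptions chosen only for illustration.

```python
import struct
import zlib


def wrap_event(payload: bytes) -> bytes:
    """Append a CRC32 of the payload as a 4-byte little-endian trailer."""
    return payload + struct.pack("<I", zlib.crc32(payload) & 0xFFFFFFFF)


def unwrap_event(framed: bytes) -> bytes:
    """Verify the trailer and return the payload; raise ValueError on mismatch."""
    if len(framed) < 4:
        raise ValueError("event too short to carry a checksum")
    payload, trailer = framed[:-4], framed[-4:]
    (expected,) = struct.unpack("<I", trailer)
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: event is corrupted")
    return payload
```

With a check like this, the slave could refuse a damaged event and fetch it again, instead of executing whatever bytes happen to arrive.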

Thank you for the report.

Please indicate the exact versions of the master and slave you are using.

Please also provide the master binary log file containing the corrupted query, and the corrupted relay log.

Bug #22889 may be related to this one.

A client encountered this problem when replicating approximately 6GB per day across the Atlantic Ocean. "Back of an envelope" calculations indicate that the corruption may be packet corruption. Given the sheer volume of data, it is possible that corrupt logs occur when a packet is corrupted but its checksum is valid.

If this is the case then it is a network problem not an application problem. Therefore, the problem should be fixed in the network configuration and not in applications.

Try using a VPN or a tunnel to add an additional level of checksums.
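
The "back of an envelope" reasoning can be written down as a short calculation. This is only a sketch under stated assumptions: the segment size and the fraction of segments that arrive corrupted are hypothetical inputs, not measurements from this report, and the 16-bit TCP checksum is assumed to accept a randomly corrupted segment about once in 2^16 times.

```python
def undetected_per_day(bytes_per_day: float, segment_bytes: int, p_corrupt: float) -> float:
    """Estimate how many corrupted segments pass a 16-bit checksum per day.

    bytes_per_day: replication traffic volume (about 6e9 in the case above).
    segment_bytes: payload bytes per TCP segment (hypothetical, e.g. 1460).
    p_corrupt:     fraction of segments corrupted in transit (hypothetical).
    """
    segments = bytes_per_day / segment_bytes
    # A 16-bit checksum lets roughly 1 in 65536 random corruptions through.
    return segments * p_corrupt * 2.0 ** -16


# Example inputs; only the 6 GB per day figure comes from the comment above.
estimate = undetected_per_day(6e9, 1460, 1e-4)
print(f"expected undetected corruptions per day: {estimate:.4f}")
```

The result scales linearly with both traffic volume and the corruption rate, so the conclusion depends entirely on how noisy the link really is.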

I didn't say anything about oceans in my original comment.  The corruption I've seen is within a single building.

Nonetheless, for something as critical as data replication, don't you think it's actually the application's job to ensure that what it's about to do is legit?  It's not that hard to do, but makes the application much safer.  How would you even know if you're having network problems and experiencing silent corruption?

Jeremy, 

> How would you
> even know if you're having network problems and experiencing silent corruption?

as you already know, there is a verified feature request (Bug #25737) about adding a checksum to binlog events. It could be helpful in cases of corruption caused by network problems.

But if your case is not caused by network problems, we need to know what data was in your master and slave log files. So please indicate the exact versions of the master and slave you are using, and provide the master binary log file containing the corrupted query and the corrupted relay log. (Or the part of the log files that contains the corrupted query, or even just the mysqlbinlog output.)

See also comment "[16 Oct 2006 10:59] Mats Kindahl" to the Bug #22889.

I believe I have seen the same behavior.  I have a master running 5.0.33-log replicating to a slave also running 5.0.33-log.  The master's replication log file looks correct, but the slave's relay log file does not.  Although mysqlbinlog does not error out, the SQL in the relay log is corrupt.  The SQL in the master log file looks correct.

Here is an excerpt of the error log file from the slave at the time of corruption:
070215 10:47:43 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
070215 10:47:43 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'dbmaster-bin.000021' position 784591003
070215 10:47:43 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '_Number = 1976"^A' at line 1' on query. Default database: 'MiraServf_Number = 1976"^A'. Query: '_Number = 1976"^A', Error_code: 1064

Please note that the connection from master to slave is over a VPN.

There is a similar bug: Bug #26489

Ver 4.0.24 for portbld-freebsd5.4 on i386 (FreeBSD port: mysql-server-4.0.24)

I noticed this the other day. We did a clean reboot of the master and all 3 slaves in preparation for daylight saving time. 2 slaves started replication without a hitch; the 3rd experienced the exact error described in this report. Neither the master binary logs nor the relay logs reported any errors. Running CHANGE MASTER TO .. on the slave, with exactly the Master_Log_File and Exec_Master_Log_Pos reported in SHOW SLAVE STATUS, worked fine.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Jeremy,

have you had a chance to find out whether your problem is caused by a network problem? Please confirm or rule it out.

This continues to be an issue with 5.0.38-Ubuntu_0ubuntu1-log (2 servers, same version, twin-master, one in the office, one in a colo)

Typical error:

Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '' at line 1' on query. Default database: 'mango'. Query: 'update mailManager_jobQueue set lastJobQueueID='26195''

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

We've apparently stabilised this situation (I'm tempting fate there) in part by wrapping server replication communications in an open VPN tunnel. This appears to have stopped the corruption, so I assume that the extra layer of packet verification provided by the tunnel's encryption is making transmission cleaner.

However, even if the problem is due to network errors, surely MySQL should be able to cope with this real-world situation (see #25737)?

Richard,

thank you for the feedback.

There is a verified bug, #26489, about a connection problem, with a hidden "How to repeat" instruction. So I am marking this one as a duplicate of bug #26489.

Jeremy, if you don't agree, feel free to reopen the report.

WRT Richard's post,

[quote]

We've apparently stabilised this situation (I'm tempting fate there) in part by
wrapping server replication communications in an open VPN tunnel. This appears
to have stopped the corruption, so I assume that the extra layer of packet
verification provided by the tunnel's encryption is making transmission
cleaner.

[/quote]

I have problems with log corruption (as per Bug #26489)...and I've been running over OpenVPN for ages and it still happens....that it works for you over OVPN is coincidental.

D.

I've noticed the same behavior when we lose connections to the master.  

In this case the box was physically rebooted and MySQL failed to come back online.

We're trying to keep these machines reliable so that when they crash they can come online without any problems.  sync-bin is enabled as well so this is becoming a problem.

I'm pretty sure we could duplicate this as it's been happening left and right.

This is on mysql 4.0.22