MySQL Bugs: #26489: Corruption in relay logs

Bug #26489	Corruption in relay logs
Submitted:	20 Feb 2007 6:14	Modified:	6 Feb 2008 16:28
Reporter:	Daniel Fiske	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.27 & 5.0.37, 5.1.22	OS:	Any (Linux, Win32)
Assigned to:	Zhenxing He	CPU Architecture:	Any
Tags:	bfsm_2007_10_18, bfsm_2007_11_01, bfsm_2007_12_06

Description:
I currently have replication set up between 2 innodb servers (the remote one being very remote) over a VPN on a high latency connection. The relay logs are OFTEN corrupted with odd characters very similar as described by Bug #14471. 

How to repeat:
It is hard to reproduce.....but I'll suggest a couple of ways.

Try and get a replication setup in which the slave has a fair amount of data to read on the IO thread and where the connection speed is poor. 

Once the read connection is made from the slave, try the following

1. Start + stop the slave process a couple of times (stop slave/start slave)
2. Start + stop the slave mysql process (using service manager) a couple of times.
3. Kill the network connection.

...you should get some corruption.

Suggested fix:
I am relatively surprised that there can be corruption in the relay log, but I don't know the replication internals very well, I'm guessing the replication IO protocol/binary log needs some sort of error detection (CRC, checksum, etc, etc).

Some additional information.

It seems to be down to the connection. Overnight the relay log cycled several times...and there was OFTEN corruption during the last entry of the relay log.

I have seen this problem regularly with Linux servers running 5.0.26.  I have just upgraded some to 5.0.27 but don't have much faith that it's been solved if you're seeing the same problem in 5.0.27 on Windows.  It happens frequently when the slave server loses connection to the master.  The slave logs:

[ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)

And it looks like the network buffer gets written to the relay log as-is.  I've seen the end of the SQL statement being replicated at that moment get replaced with fragments of previous statements and various random garbage.  When the SQL slave thread tries to execute the statement, it usually stops replication due to a SQL syntax error, but yesterday the random garbage in the relay log was bad enough to cause a mysqld segmentation fault.  This put the server into a crash loop since the SQL thread would read the same garbage from the relay log every time it started.  I would not call this an S3 -- it is a pretty severe problem!

Thats pretty much what I'm getting (minus the crash!). We'll have to see when it gets assigned what the developers say. I'm looking a the 5.0.33 source code, to see where things might go wrong.

D.

I'm having the same problem (including the crash). Using MySQL 5.027 on windows, over an unreliable link.

070226 15:12:06 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
070226 15:12:06 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log '<hostname>-bin.000033' position 564503495
070226 15:12:06 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near <query> , Error_code: 1064.

Also getting invalid table names, an error saying 'Could not parse relay log' due to corruption in different parts of the packets.

Tried it with a self-compiled 5.0.33 and the problem is still there.

Everyone at MySQL on holiday? 8 days....and it hasn't even been assigned! Pretty poor.

Setting my slave_net_timeout to a much smaller value (currently 60 seconds) has let my slave run for over 6 hours now, whereas a copy with it set to default(3600) stops at least once an hour (running at same time).

Thank you for the report.

Are you sure it is network problem? Only relay logs are corrupted? Don't master binary log files?

We also have verified feature request about checksum for bilog events: Bug #25737

program to repeat network error

Attachment: bug26489.c (text/plain), 2.37 KiB.

Network error is easy to repeat with attached simple proxy. Just compile it, run as `./bug26489 host master_port proxy_port` and setup replication via proxy.

Results are unpredictable.

For example.

master:
mysql> create table t1(f1 varchar(100));
Query OK, 0 rows affected (0.26 sec)

mysql> insert into t1 values('foo');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('bar');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('bug26489');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('baz');
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1;
+----------+
| f1       |
+----------+
| foo      | 
| bar      | 
| bug26489 | 
| baz      | 
+----------+
4 rows in set (0.26 sec)

slave:
mysql> show slave status\G
*************************** 1. row ***************************
...
              Relay_Log_Pos: 245
      Relay_Master_Log_File: mysqld501-apple-bin.000001
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
...
1 row in set (0.01 sec)

mysql> select * from t1;
+------+
| f1   |
+------+
| foo  | 
| bar  | 
+------+
2 rows in set (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Reconnecting after a failed master event read
                Master_Host: localhost
                Master_User: repl
                Master_Port: 33050
              Connect_Retry: 60
            Master_Log_File: mysqld501-apple-bin.000001
        Read_Master_Log_Pos: 471
             Relay_Log_File: mysqld502-apple-relay-bin.000003
              Relay_Log_Pos: 245
      Relay_Master_Log_File: mysqld501-apple-bin.000001
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB: 
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 1105
                 Last_Error: Error 'MyISAM table 't1' is in use (most likely by a MERGE table). Try FLUSH TABLES.' on query. Default database: 'test'. Query: 'create table t1(f1 varchar(100))'
...
1 row in set (0.00 sec)
mysql> set global SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----------+
| f1       |
+----------+
| foo      | 
| bar      | 
| foo      | 
| bar      | 
| bug26489 | 
| baz      | 
+----------+
6 rows in set (0.00 sec)

I'm pretty sure nothings actually getting corrupted on the network - the link is unreliable, but nothing else is getting corrupted, and changing the slave_net_timeout has completly fixed it (over a month without stopping now - setting to default it stops every hour or so). It's certainly not a server problem  (slave_net_timeout fixes it, if we do 'change master to  MASTER_LOG_POS=' with the current exec_log_pos on the slave to reset the slave relay logs it also fixes it till the next time the connection dies). The error is always the last query before  the connection drop is reported in the error log.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I still see this problem all the time in 5.0.37.  If the connection drops (due to problems with our internet provider), the replication slave can write junk into its relay log file.  I have some samples of the junk if it would be helpful, but it looks like it just writes its buffer to the relay log file as-is, which frequently results in a malformed SQL statement that requires me to fix it manually.

Kevin, thank you for the feedback.

Verified as described in the initial post and last coment.

Instructions how to repeat see in the comment "[6 Apr 14:23] Sveta Smirnova"

It seems that waiting for #25737 to be implemented could take a LONG time, it could be "fixed" simply, that when a disconnect is detected (When it would normally cycle the relay-log) the slave relay log could be rolled back 1 transaction. i.e the last transaction is not written *or* is actually deleted and the previous transaction pointer used.

D.

Bug #26123 was marked as duplicate of this one.

Upgrading to an S2

7/5 NoFb->Ver
24/5 Ver->Prg
29/5 Prg->Ver
31/5 Ver->Prg
6/6 Prg->Ver

What's going on?

D.

Daniel:

This happens because there is some sort of internal debate going on over at MySQL.  Their notification script isn't smart enough to tell you that there *was* a comment left, but you just can't see it.  So, instead, you get notifications of just the status change with no explanation.

I have tried the proxy program but it doesn't demonstrate the error. So I'm returning this to support for further analysis.

Workarounds:

1. Use SSL connections to get the extra level of connection checking.

2. If it does happen:

SHOW SLAVE STATUS
STOP SLAVE;
CHANGE MASTER TO master_log_file='insert the value from Relay_Master_Log_File',
  master_log_pos=insert the value from Exec_master_log_pos;
START SLAVE;

opened bug #31374 while i was trying to repeat this bug.

All reporters!

Please, indicate which switch do you use. (Exact model of hardware.)

For the general issue of corruption of data in transit see bug #25737. We know that TCP's CRC checks aren't good enough to protect the integrity of the data being transferred over replication links and that corruption can happen and produce changed characters even with a valid CRC.

This specific problem may not be due to that, unless the ending of the logs coincides with a broken connection, which may have a period of bad connectivity and errors preceding it.

We worked around this bug by installing a point-to-point circuit between the master and slave.  That has been rock solid, the replication TCP connection hasn't been interrupted, and we haven't had any corrupted since.  The PowerConnect 2716 switch doesn't seem to be involved.  I hope this bug is eventually solved because I'd like to be able to go back to replicating across unstable VPNs in the future.

Kevin, thanks for confirming that improving network reliability solved the problem for you; that helps us to confirm that TCPs 32 bit CRC isn't sufficiently good for our needs. The work will be done, just the when part that's undecided at present.

Set to "Verified" again according to last comments. See also bug #25737 as well.

As the original poster of this bug I thought I'd pipe in. I've been reading through the posts again and I'm somewhat astounded. Here are some of my thoughts/comments/rants

1. Reported 20 Feb 2007....its been 8 months and we are still debating the cause. How about someone actually DO some work on this?
2. I'm not sure who James Day is, but he seems to be involved on some level with MySQL development (and James I apologise if you feel this is a harsh personal attack...please believe me that I'm trying to be constructive). James' comments are alarming!!!

"thanks for confirming that improving network reliability solved the problem for
you; that helps us to confirm that TCPs 32 bit CRC isn't sufficiently good for our needs."

I'm sure James is experienced and I don't want to get into a debate about who has better credentials, so I'll just say what I think is wrong with that statement and anyone is free to comment on where they think I'm wrong.

This statement is questionable because I don't see how this confirms anything. This bug is DIRECTLY related to network stability. Of course if you increase stability of the network its not going to show up as often. Here a nice little analogy to drive my point home.

You got a boat design. People complain it capsizes regularly in rough seas and perhaps its an unstable design. So you say "sail it in good weather". So people do and they say "it works perfectly when I sail on calm days". You say "See. That proves there is nothing wrong with our design. Its the seas fault!".

Now I'm sure you are qualified and have experience, but I'm worried if you are taking part in debugging a system if your logic is like that.

Secondly, lets look at our 2 likely points of failure. TCP/IP standard OR MySQL software. Seriously? TCP/IP's error correction is FAILING....and MySQL's replication code is bomb proof? VPN (with its own error detection and correction) over TCP/IP is FAILING....and MySQL's replication code is working. Let's assume it is corruption on the line and the corruption produces a valid CRC (anyone know the odds of producing a CRC32 collision?). The odds of that happening and being repeated by multiple MySQL installations? I should go play the lottery....I'm obviously very lucky.

...

Frankly people, we are barking up the wrong tree.

Could TCP's CRC cause it to fail? Yes.
Is there still a HUGE probability that there is a bug in MySQL code? Yes.

See bug #29813 "replication errors on a unstable network" for another report of corruption on unstable VPN connections.

The publicly visible worklog item for checking the integrity of replication events is at http://forge.mysql.com/worklog/task.php?id=2540 , which was created in April 2005.

Daniel, I'm one of the support engineers at MySQL. Proceeding with your analogy the boat has a pump that's supposed to remove water that enters the boat in rough weather but there are still reports that the boat takes on water and becomes unstable. It's possible to correct the boat design so it's stable with lots of water but making sure that the pump is working is also worth doing.

We've seen more than 200 relay log corruption incidents over the last few years. Yet even with the bug fixes from those investigations it's still happening and it's now the most common replication problem we see (because we've fixed so many of the others). One of the things we quite often see in setups with several slaves is that a single slave has corruption while the others are fine. Lots of possible causes for that but one of them is what some people have proved does happen: corruption in transit. At the moment a high proportion of the issues we're seeing where this is reported are for unreliable networks. When people use SSL connections that tends to solve the problem for them. Since one of the changes in SSL connections is improved error checking, that tends to support the view that corruption in transit is one of the causes.

That doesn't rule out other bugs, though. Take a look at Bug #4053 that was fixed in 5.0.30 and Bug #24192 fixed in 5.0.50.

Since there will also be some bugs in the MySQL code the plan is to record the binary log around any broken events as well as retry. That way the replication setup doesn't break and we can still look for other bugs.

For your case, if you've changed your mind about "I'm guessing the replication IO protocol/binary log needs some sort of error detection (CRC, checksum, etc, etc)" please collect and upload a mysqldump of the relay log events just before and around the corruption and ideally also the corresponding master event so we can see the nature of the corruption and whether it's bad on the master as well as the slave. That'll let us check for errors in the log but not bugs in MySQL's network layer handling.

Hi James,

Thanks for the response.

1. Bug #29813, sounds identical to this bug. However I contest your assumption that because its happening on an unstable VPN (with more ED/EC) that it is still a network issue. If anything the data provided from a VPN endpoint is less likely to be corrupted, which points to a problem in MySQL code.
2. I still believe that binlog/replication needs ED/EC. But perhaps thats more of a feature request. My suspicion is that the fundamental cause of this bug will remain in the code if ED is implemented. Any errors picked up by ED algorithms will in the future be passed off as transmission errors and the actual problem will never be addressed. Basically I'm saying I think there is 1 bug and its not transmission. It should be solved AND ED/EC should be implemented. I think this would fit in line with your extended boat analogy. I think we are blaming the wrong thing here....which brings me to my next point.
3. I have a theory on what is causing this error and I've spent a fair amount of time looking through MySQL code (on other bugs and general interest). However, I'll NEVER presume to know more about that code than ANYONE who actually works at MySQL, so I've been very hesitant to spew my theories to you guys until I can spent time in the code really looking at this bug. The reality is I don't have the time. So I'm very cautiously going to lay out what I think is going wrong and why I don't think its an error on the line.

Have a look at bug #29813. It is a perfect example. The slave log is corrupted as such.

#070711 11:02:28 server id 2  end_log_pos 350   Query   thread_id=137801       
exec_time=0     error_code=0
SET TIMESTAMP=1184122948;
INSERT,^@^@^A^@^@^@^@^@^D^A^@^@^@+^@^@^@^@^@^@^@^@^@?ù^H^@^@^@^@^@mysql-bin.000075_^@^@^B
^@^Q?<93>F^O^A^@^@^@^^@^@^@^@^@^@^@^@^@^D^5.0.24a-standard-log^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^S8^M^@^H^@^R^@^D^D^D^D^R^@^@K^@^D^ZA^@^@^C^@<95>E
<94>F^B^B^@^@^@@^@^@^@?ù^H^@^@^@<8e>^Z^B^@^@^@^@^@^@^@^@^Z^@^@^@@^@^@^A^@^@^@^@^@^@^@^@^C

^@^B^@^D!^@!^@!^@^@BEGIN^]^@^@^D^@hE<94>F^E^B^@^@^@^\^@^@^@^\^@^@^@;
# at 14351

Now if this had been a line error the garbage i.e. everything after the INSERT, would have to produce the exact same CRC as what should have been there, for it to be passed on to MySQL. I don't buy that the CRC for INSERT INTO....would happen to be the same as the CRC for INSERT,^@^@^A^@^...

My experience of this bug is that it occurs on a disconnect/timeout of the master-slave connection, whether or not its protected by a VPN or SSL. That would suggest it may be the way that MySQL is handling the last "received" packet from the master after a disconnect. Looking at the "rubbish" part of that transaction I further can't believe that corruption on the line would miraculously spell out "mysql-bin.000075"...unless ethernet has become sentient and is have a little laugh at us. It looks strongly like the end of a memory buffer (probably on the slave end), that was not written to....i.e. not all the data was received (due to disconnect) but the packet/transaction was *still* written to the binlog on the slave.

...comments/criticism/flaws/help...will be appreciated.

D.

Daniel, yes, I also think that there's a bug in that area. Various people here have tried to reproduce it but haven't succeeded so far. The MySQL developers aren't perfect so if you do think that you see a bug in the source code please do let us know.

We need a step-by-step procedure to repeat the problem reported in order to be able to rectify it. Thanks in advance for your efforts.

Georgi, there are two different issues here:

1. Whether there's a specific corruption and test case for the original report. That may lead to a solution for one of the causes of corruption.

2. The general experience of corruption that can be resolved by bug #25737, which is currently misclassified as a feature request rather than a meta-bug report requesting the solution to a common problem that can't be reproduced with practical test cases.

Regardless of what happens to this specific report, the general solution of a check for integrity and automatic retry is required and is one of the highest support priorities for the replication team because of the volume of issues we receive.

I'm attaching a patch that Support would like to propose for this bug, which implements a new variable called sync_relay_log - which acts just as syn_binlog does - by flushing and performing an fsync() on the relay log every nth event (where n is set by the sync_relay_log variable. 

Example:

mysql> show variables like 'sync_relay_log';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| sync_relay_log | 1     | 
+----------------+-------+
1 row in set (0.00 sec)

mysql> set global sync_relay_log = 2;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'sync_relay_log';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| sync_relay_log | 2     | 
+----------------+-------+
1 row in set (0.00 sec)

mysql> set session sync_relay_log = 3;
ERROR 1229 (HY000): Variable 'sync_relay_log' is a GLOBAL variable and should be set with SET GLOBAL

Along with the patch, I'll also attach a kdump (OSX strace) to show that the fsync() does happen every 2 events with the above setting (as it does with setting it to 1 and an fsync() for every event).

Patch to add sync_relay_log variable

Attachment: sync_relay_log.patch (application/octet-stream, text), 5.46 KiB.

ktrace/kdump output verifying the fsync

Attachment: sync_relay_log.dump (application/octet-stream, text), 93.08 KiB.

Mark, I believe that it flushes and fsyncs but how does doing that fix the problem this bug is about?

The corruptions occur when a timeout is encountered.  So far, I found the most effective way to reproduce the timeout is to change the master's date/time backwards and forwards a few times.  Then the error is reproducible within a few minutes.

The patch was to handle the possible case where the cause may be OS or MySQL corruption of an event, based on a case where previously seen replication events were encountered in the buffer. That's a fair theory for at least some of the cases.

However, the tests by Shane from the MySQL Support team have now reproduced this in a way that suggests that buffer corruption or incomplete data transfer due to timeout produces this symptom as well, so his investigations continue and seem likely to produce a neater solution based on locating and correcting that problem.

Shane's initial test successfully reproduced the report in bug #21923.

Bug #21923 was marked as a duplicate of this one.

According to Shane's last findings see also bug #22047

I'm being bitten by this bug as well.  My analysis so far is in agreement with the comments earlier (re writing a garbage memory buffer to disk on a short read/error when it shouldn't) and Shane's linking it to #21923.

One of the more interesting instances I've seen was a case where the slave lost the connection to the master (I believe I induced this by having huge network buffers and spamming events at it so fast that it couldn't possibly keep up, causing large send queue backlogs on the master), but immediately succeeded on its first attempt to reconnect.  The I/O thread continued on after reconnect as if nothing was wrong, but the SQL thread noticed the corrupted event in the eventlog (last event in that particular file, I'm assuming because a new one is opened on reconnect?).

These are the syslog entries from that event:

Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'x_binlog.000005' position 57280419
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '^^' at line 1' on query. Default database: 'foo'. Query: 'insert into f values (49418^^', Error_code: 1064
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'x_binlog.000005' position 57280253
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [Note] Slave: connected to master 'dbmaster:3306',replication resumed in log 'x_binlog.000005' at position 57280419

I'm guessing somewhere in sql/slave.cc:handle_slave_io, there's some nefarious logic errors what with all of the conditionals and gotos.  It seems that when we got a short read of a master event due to disconnect, we wrote the partial event plus random leftover gargbage from the buffer into the relaylog (and then (or afterwards?) reconnected to the master and kept on grabbing new events).

I haven't been able to nail it down precisely yet, the code is rough to follow mentally, but I'm sure it's in there somewhere (or in the underlying net_ and vio_ funcs).

(by the way, I'm using a debian etch build of 5.0.33 on x86_64, and it doesn't look like they've patched this stuff in any substantial way).

I suspected the same thing, so I made a simple test.

The error message given earlier in the bug report ("Lost connection to MySQL server during query ( server_errno=2013)") is printed when an error is returned from cli_safe_read() in slave.cc and that error is CR_SERVER_LOST (which is 2013). So, to test it, I added error injection code just after the call to cli_safe_read() that corrupts every 10th event by overwriting the tail of the event with "GARBAGE/TRASH" and changes the returned error to CR_SERVER_LOST. The result is that the error is generated (identical to the error mentioned earlier in the report), but no corrupt event is written to the relay log.

In other word, a lost server where the end of the package is corrupt is handled correctly, and a reconnection is made and the package re-fetched.

Since there is only a single thread working with the buffer where the event data is stored, there is no multi-tasking issues involved. The only multi-tasking available is between the I/O and SQL thread via the IO_CACHE that is used for the relay-log, so I am now suspecting an error or a misuse of the IO_CACHE, causing the event to be written short. There are however mutexes in place for serializing access to the relay-log IO_CACHE, so I am leaning more towards a bug in the IO_CACHE.

We are also having this issue here. Identical behavior. We also get instances where the relay log contains incomplete queries, like "UPDATE `table name` WHERE id=". This happens only on a single slave at a time, and appears to be related to the relay log corruptions issues, which are definitely occurring. We run Enterprise 5.0.45 or thereabouts on most of our servers. We have upwards of 40-60 slaves replicating from the master at the moment.

see also bug #29232

We have seen this problem as well sporadically. We have several hundred masters in production, each of them with around 5 to 40 slaves. They are connected with a local area network. Up to now we suspected the users to just flick the power switch - even though they usually deny that. Maybe we were wrong to not believe them.
However it is not happening too often - but this might be credit to our very stable networks.

I get this error too after upgrading masters from 4.0.26 to 5.0.37. The slaves were running 5.0.37 to the 4.0.26 master without problems. Since the upgrade, slaves connecting to the master over a flaky network (high rate of packet loss) get this error.

On every occurrence of the error, the slave IO thread gets an error that makes it disconnect/reconnect.

The corrupt SQL frequently includes text from an earlier binlog event.

When the error occurs, the master *.err log includes one of the following. There are a few places that may print 'Failed on my_net_write()'.
  080119 10:04:51 [ERROR] Failed on my_net_write()
  080119 10:08:43 [ERROR] failed on net_flush()

Now my editorial, as I have earned it after a few days reading MySQL replication and networking code. net_real_write() and my_real_read() really, really need to be refactored. I don't know if they contribute to the problem, but they are not easy to follow and using a read or write with a timeout is much simpler than the current approach of first trying a non-blocking call (after changing socket options) and then setting an alarm and doing blocking call (after again changing socket options).

This can't be from a TCP checksum error. But I did have a problem once caused by bad network switches where the SQL text sent to a server was modified by the switch and the checksum did not catch the error. One character in a query was changed.

update: in my tests, the slave breaks when the following error is found in net_serv.cc: Error: Packets out of order (Found: 105, expected 182)
More details later.

Can we change the code to log the value when errors like this occur? I assume this error is found on the slave side.

FYI, I've also been running into this problem when the slave mysqld crashes.  If it dies due to SIGSEGV or is killed by an Out of Memory, the slave can write an incomplete statement to the relay log.  After mysqld restarts, the SQL thread errors due to the syntax error in the relay log and replication is down until someone fixes it manually.  It should be smart enough to just grab the broken statement from the master again..

Patch that is being investigated:

--- net_serv.cc.orig    2008-01-25 15:07:49.000000000 +0100
+++ net_serv.cc 2008-01-25 15:20:08.000000000 +0100
@@ -534,13 +534,13 @@
       vio_trace_print(net->vio, "flushing buffer left_length=%lu bytes",
                       left_length);
       memcpy((char*) net->write_pos,packet,left_length);
-      if (net_real_write(net, net->buff,
-                        (size_t) (net->write_pos - net->buff) + left_length)) {
+      size_t const bytes_to_write = (size_t) (net->write_pos - net->buff) + left_length
+      net->write_pos= net->buff;
+      if (net_real_write(net, net->buff, bytes_to_write) {
         vio_trace_print(net->vio,
                         "leave net_write_buff since net_real_write()!=0");
        return 1;
       }
-      net->write_pos= net->buff;
       packet+= left_length;
       len-= left_length;
     }

I will get this running no test servers to see if it fixes the problem. But I would like to know why this fixes a problem. The call chain is:
 my_net_write() -> net_write_buff() -> net_real_write()
When net_real_write() returns an error, it is passed back to the my_net_write() caller without retrying.

When my_net_write() returns an error to the caller in mysql_binlog_send(), the code branches to err: and the connection to the slave is closed.

What am I missing?

I am inclined to agree with Mark that the patch is probably wrong.

My reasoning for this is the following:

If net_real_write() fails with a timeout, it's setting net->error to 2.
After that, there will be no more writes on the socket, which can be verified
by the following code at start of net_real_write():

  if (net->error == 2)
    DBUG_RETURN(-1);				/* socket can't be used */

The idea is that if there has been a write error or timeout on the connection,
we will not send the client any error messages or other information.

In this context the slave should get a read error on the socket, skip the
last even for the readlog and reopen the connection.  In other words, the
master should not send any wrong information to the client.

Setting net->write_pos, as the patch does, has no effect in the
context of the above logic.

From this we can conclude that the bug is in likely to be in one of
the following areas:

- The master does not set net->error == 2 in the case of a net write error
  or timeout.
  Fix:  Set net->error to 2

- The master sets net->error to 2, but still it does somehow write to the
  connection.
  Fix: Ensure we don't write to socket if net->error is 2.

- The slave client doesn't properly ignore the last 'wrong' event in case
  of dropped connection and writes both the read part and the not yet
  read part to the relay log. (Note that a packet starts with packet length
  this is a likely bug if the error condition is not properly checked).
  Fix: Ignore the wrong packet

This analyse is somewhat conflicting with Lars analyse of the likely
source of the problem; Lars, can you use some of the test cases that
has been used to verify the bug to check the usage of net->error and see
why the code doesn't work as intended ?

Note also that the patch as such is sub optimal in any case:
- Not portable (as net_serv.cc needs to be compiled with a C compiler).
- Doesn't take into account other calls to net_real_write()
- Longer than needs to be; A better method is to save the value of
  net_real_write() in a value, set net->write_pos and then test
  the return value if there was an error.

I get the same error message as Shane on the slave. This is immediately before the errror occurs:
Packets out of order (Found: 46, expected 74)

I did not have this problem when using masters running 4.0.26 and slaves running 5.0.37. Once the masters were upgraded to 5.0.37, the error happens frequently. I have diff'd the code for writing to the net (my_net_write, net_write_buff, net_real_write, vio_write) between these versions. The diffs are minor:
 1) my_net_write return type changed to my_bool
 2) net_real_write in MySQL5 doesn't have an 'if' before query_cache_insert call
 3) net_real_write in MySQL5 sets net->report_error on error
 4) net_real_write calls vio_timeout() in MySQL4. MySQL5 has a comment that this is done in net_set_write_timeout

For the write path, not using vio_timeout() is the only significant change I see for MySQL5. I did not look at all of the vio and thread alarm functions that were called by the functions above.

For the read path (read_event -> *_safe-read -> my_net_read -> my_real_read)
 1) read_event in MySQL5 checks for eof packet, mysql4 just checks len == 1
 2) there are many differences between cli_safe_read in MySQL5 and mc_net_safe_read in MySQL4
 3) my_net_read for MySQL5 sets net->report_error when my_uncompress fails
 4) my_real_read calls vio_timeout for MySQL4
 5) my_real_read sets net->report_error on error for MySQL5
 6) my_real_read uses vio_should_retry where MySQl5 uses vio_errno() == SOCKET_EINTR to determine whether to retry on interrupt

I think the problem is as Lars described, but I don't like the patch, I'd prefer to fix it by this:

--- sql_parse.cc.orig   2008-01-28 14:54:29.000000000 +0800
+++ sql_parse.cc        2008-01-28 14:54:40.000000000 +0800
@@ -1166,7 +1166,6 @@
       unregister_slave(thd,1,1);
       /*  fake COM_QUIT -- if we get here, the thread needs to terminate */
       error = TRUE;
-      net->error = 0;
       break;
     }
 #endif

Since we have a network problem, we should not try to send the error message, just close the network .

Just for your information, here is the scenario that causes the failure.

1. The to-be corrupt log event (let's call it X), is split into two
   packets B and C on the network level (net_write_buff()). The parts
   are X = (x',x''). The part x' ends up in packet B and part x''
   ends up in packet C. Prior to the corrupt event X, the event Y has
   been written successfully, but has been split into two packets as
   well, which we call (y',y'').
2. The master sends packet A = (y'',x') to the slave, increases the
   packet sequence number, the slave receives the packet, but fails
   to reply before the master gets a timeout.
3. Since the master got a timeout, it reports failure, and aborts
   sending the binary log by exiting mysql_binlog_send(). However, it
   leaves the buffer intact, still holding y'' (but not x', since the
   write_pos is not increased).
4. After exiting mysql_binlog_send(), the master does a
   disconnection of the client thread, which involves sending an
   error message e to the client (i.e., the slave).
5. In this case, net_write_buff() is used again, but this time the
   old contents of the packet is used so that the new packet is
   D = (y'',e). Note that this will use a new packet sequence number,
   since the packet number was increased in step 2.
6. The slave receives the tail y'' of the Y log event, concatenates
   this with x' (which it already received), and writes the event
   (x',y'') it to the relay log since it hasn't noticed anything is
   amiss.
7. It then tries to read more bytes, which is either e (if the length
   given for X just happened to match the length given for Y, or just
   plain garbage because the slave is out of sync with what is
   actually sent.
8. After a while, the SQL thread tries to execute the event (x',y''),
   which is very likely to be just nonsense.

New patch being investigated:

--- sql_parse.cc.orig   2008-01-28 14:54:29.000000000 +0800
+++ sql_parse.cc        2008-01-28 14:54:40.000000000 +0800
@@ -1166,7 +1166,6 @@
       unregister_slave(thd,1,1);
       /*  fake COM_QUIT -- if we get here, the thread needs to terminate */
       error = TRUE;
-      net->error = 0;
       break;
     }
 #endif

Patch looks ok. Lets go with this one!

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41342

ChangeSet@1.2565, 2008-01-29 11:56:48+08:00, hezx@mail.hezx.com +1 -0
  BUG#26489 Corruption in relay logs
  
  Here is the scenario that causes the failure.(by Mats)
  
  1. The to-be corrupt log event (let's call it X), is split into two
     packets B and C on the network level (net_write_buff()). The parts
     are X = (x',x''). The part x' ends up in packet B and part x''
     ends up in packet C. Prior to the corrupt event X, the event Y has
     been written successfully, but has been split into two packets as
     well, which we call (y',y'').
  2. The master sends packet A = (y'',x') to the slave, increases the
     packet sequence number, the slave receives the packet, but fails
     to reply before the master gets a timeout.
  3. Since the master got a timeout, it reports failure, and aborts
     sending the binary log by exiting mysql_binlog_send(). However, it
     leaves the buffer intact, still holding y'' (but not x', since the
     write_pos is not increased).
  4. After exiting mysql_binlog_send(), the master does a
     disconnection of the client thread, which involves sending an
     error message e to the client (i.e., the slave).
  5. In this case, net_write_buff() is used again, but this time the
     old contents of the packet is used so that the new packet is
     D = (y'',e). Note that this will use a new packet sequence number,
     since the packet number was increased in step 2.
  6. The slave receives the tail y'' of the Y log event, concatenates
     this with x' (which it already received), and writes the event
     (x',y'') it to the relay log since it hasn't noticed anything is
     amiss.
  7. It then tries to read more bytes, which is either e (if the length
     given for X just happened to match the length given for Y, or just
     plain garbage because the slave is out of sync with what is
     actually sent.
  8. After a while, the SQL thread tries to execute the event (x',y''),
     which is very likely to be just nonsense.
  
  The problem can be fixed by not resetting net->error after the call of 
  mysql_binlog_send, so the error message will not be sent and the connection
  will be closed.

Pushed into mysql-5.0-rpl and mysql-5.1-new-rpl team trees.

For some reason the auto-bug-state-updater did not work, so I've set
the bug to "patch queued" manually.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41540

ChangeSet@1.2513, 2008-02-01 12:24:43+08:00, hezx@mail.hezx.com +1 -0
  BUG#26489 Corruption in relay logs
    
  Here is the scenario that causes the failure.(by Mats)
  
    1. The to-be corrupt log event (let's call it X), is split into two
       packets B and C on the network level (net_write_buff()). The parts
       are X = (x',x''). The part x' ends up in packet B and part x''
       ends up in packet C. Prior to the corrupt event X, the event Y has
       been written successfully, but has been split into two packets as
       well, which we call (y',y'').
    2. The master sends packet A = (y'',x') to the slave, increases the
       packet sequence number, the slave receives the packet, but fails
       to reply before the master gets a timeout.
    3. Since the master got a timeout, it reports failure, and aborts
       sending the binary log by exiting mysql_binlog_send(). However, it
       leaves the buffer intact, still holding y'' (but not x', since the
       write_pos is not increased).
    4. After exiting mysql_binlog_send(), the master does a
       disconnection of the client thread, which involves sending an
       error message e to the client (i.e., the slave).
    5. In this case, net_write_buff() is used again, but this time the
       old contents of the packet is used so that the new packet is
       D = (y'',e). Note that this will use a new packet sequence number,
       since the packet number was increased in step 2.
    6. The slave receives the tail y'' of the Y log event, concatenates
       this with x' (which it already received), and writes the event
       (x',y'') it to the relay log since it hasn't noticed anything is
       amiss.
    7. It then tries to read more bytes, which is either e (if the length
       given for X just happened to match the length given for Y, or just
       plain garbage because the slave is out of sync with what is
       actually sent.
    8. After a while, the SQL thread tries to execute the event (x',y''),
       which is very likely to be just nonsense.
  
  The problem can be fixed by not resetting net->error after the call of 
  mysql_binlog_send, so the error message will not be sent and the connection
  will be closed.

Pushed into 5.0.56

Pushed into 5.1.24-rc

Pushed into 6.0.5-alpha

Documented in the 5.0.56, 5.1.24, and 6.0.5 changelogs as follows:

        Network timeouts between the master and the slave could result
        in corruption of the relay log.

bug #34545 is a duplicate of this bug here.

Also documented fix for 5.1.23-ndb-6.2.14.

Also documented fix for 5.1.23-ndb-6.3.11.

You must upgrade your master server to get this fix. The bug fix is in the part of replication that runs on the master.

Bug #35629 is set as duplicate of this bug here.

Is it possible that this bug had been reintroduced somehow later? We're running mysql-5.1.45 and we experience exactly the same relay log corruption. The workaround proposed here (for refetching binary logs from master) works the same way.

Seva, probably not this one but there could be others that were less common and masked by this one. If you want an easy life try a later 5.1 version or look at 5.5, which has an option to automatically reload the logs from the master if corruption is found.

A little known option exists in 5.5 and higher that will automatically reread relay logs from the master upon startup, to avoid processing potentially corrupted relay logs on slave (thus causing repeating crashes in some cases):

--relay-log-recovery=1
           Enables automatic relay log recovery right after the
           database startup, which means that the IO Thread starts
           re-fetching from the master right after the last
           transaction processed

It may add another level of resistance in order to archive a more crash-proof slave :)

How to Recover from Replication Relay Log Corruption (Doc ID 1638949.1)