Bug #26489 Corruption in relay logs
Submitted: 20 Feb 2007 7:14 Modified: 6 Feb 2008 17:28
Reporter: Daniel Fiske
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:5.0.27 & 5.0.37, 5.1.22 OS:Any (Linux, Win32)
Assigned to: Zhenxing He Target Version:5.0.56,5.1.24
Tags: bfsm_2007_10_18, bfsm_2007_11_01, bfsm_2007_12_06
Triage: D1 (Critical) / R3 (Medium) / E5 (Major)

[20 Feb 2007 7:14] Daniel Fiske
Description:
I currently have replication set up between 2 innodb servers (the remote one being very
remote) over a VPN on a high latency connection. The relay logs are OFTEN corrupted with
odd characters very similar as described by Bug #14471. 

How to repeat:
It is hard to reproduce.....but I'll suggest a couple of ways.

Try and get a replication setup in which the slave has a fair amount of data to read on
the IO thread and where the connection speed is poor. 

Once the read connection is made from the slave, try the following

1. Start + stop the slave process a couple of times (stop slave/start slave)
2. Start + stop the slave mysql process (using service manager) a couple of times.
3. Kill the network connection.

...you should get some corruption.

Suggested fix:
I am relatively surprised that there can be corruption in the relay log, but I don't know
the replication internals very well, I'm guessing the replication IO protocol/binary log
needs some sort of error detection (CRC, checksum, etc, etc).
[21 Feb 2007 9:55] Daniel Fiske
Some additional information.

It seems to be down to the connection. Overnight the relay log cycled several times...and
there was OFTEN corruption during the last entry of the relay log.
[22 Feb 2007 3:19] Kevin Kwast
I have seen this problem regularly with Linux servers running 5.0.26.  I have just
upgraded some to 5.0.27 but don't have much faith that it's been solved if you're seeing
the same problem in 5.0.27 on Windows.  It happens frequently when the slave server loses
connection to the master.  The slave logs:

[ERROR] Error reading packet from server: Lost connection to MySQL server during query (
server_errno=2013)

And it looks like the network buffer gets written to the relay log as-is.  I've seen the
end of the SQL statement being replicated at that moment get replaced with fragments of
previous statements and various random garbage.  When the SQL slave thread tries to
execute the statement, it usually stops replication due to a SQL syntax error, but
yesterday the random garbage in the relay log was bad enough to cause a mysqld
segmentation fault.  This put the server into a crash loop since the SQL thread would
read the same garbage from the relay log every time it started.  I would not call this an
S3 -- it is a pretty severe problem!
[22 Feb 2007 6:38] Daniel Fiske
Thats pretty much what I'm getting (minus the crash!). We'll have to see when it gets
assigned what the developers say. I'm looking a the 5.0.33 source code, to see where
things might go wrong.

D.
[26 Feb 2007 18:01] Robert Stacey
I'm having the same problem (including the crash). Using MySQL 5.027 on windows, over an
unreliable link.

070226 15:12:06 [ERROR] Error reading packet from server: Lost connection to MySQL server
during query ( server_errno=2013)
070226 15:12:06 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry,
log '<hostname>-bin.000033' position 564503495
070226 15:12:06 [ERROR] Slave: Error 'You have an error in your SQL syntax; check the
manual that corresponds to your MySQL server version for the right syntax to use near
<query> , Error_code: 1064.

Also getting invalid table names, an error saying 'Could not parse relay log' due to
corruption in different parts of the packets.

Tried it with a self-compiled 5.0.33 and the problem is still there.
[28 Feb 2007 9:09] Daniel Fiske
Everyone at MySQL on holiday? 8 days....and it hasn't even been assigned! Pretty poor.
[9 Mar 2007 17:01] Robert Stacey
Setting my slave_net_timeout to a much smaller value (currently 60 seconds) has let my
slave run for over 6 hours now, whereas a copy with it set to default(3600) stops at
least once an hour (running at same time).
[6 Apr 2007 14:17] Sveta Smirnova
Thank you for the report.

Are you sure it is network problem? Only relay logs are corrupted? Don't master binary
log files?

We also have verified feature request about checksum for bilog events: Bug #25737
[6 Apr 2007 14:18] Sveta Smirnova
program to repeat network error

Attachment: bug26489.c (text/plain), 2.37 KiB.

[6 Apr 2007 14:23] Sveta Smirnova
Network error is easy to repeat with attached simple proxy. Just compile it, run as
`./bug26489 host master_port proxy_port` and setup replication via proxy.

Results are unpredictable.

For example.

master:
mysql> create table t1(f1 varchar(100));
Query OK, 0 rows affected (0.26 sec)

mysql> insert into t1 values('foo');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('bar');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('bug26489');
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values('baz');
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1;
+----------+
| f1       |
+----------+
| foo      | 
| bar      | 
| bug26489 | 
| baz      | 
+----------+
4 rows in set (0.26 sec)

slave:
mysql> show slave status\G
*************************** 1. row ***************************
...
              Relay_Log_Pos: 245
      Relay_Master_Log_File: mysqld501-apple-bin.000001
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
...
1 row in set (0.01 sec)

mysql> select * from t1;
+------+
| f1   |
+------+
| foo  | 
| bar  | 
+------+
2 rows in set (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Reconnecting after a failed master event read
                Master_Host: localhost
                Master_User: repl
                Master_Port: 33050
              Connect_Retry: 60
            Master_Log_File: mysqld501-apple-bin.000001
        Read_Master_Log_Pos: 471
             Relay_Log_File: mysqld502-apple-relay-bin.000003
              Relay_Log_Pos: 245
      Relay_Master_Log_File: mysqld501-apple-bin.000001
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB: 
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
                 Last_Errno: 1105
                 Last_Error: Error 'MyISAM table 't1' is in use (most likely by a MERGE
table). Try FLUSH TABLES.' on query. Default database: 'test'. Query: 'create table t1(f1
varchar(100))'
...
1 row in set (0.00 sec)
mysql> set global SQL_SLAVE_SKIP_COUNTER = 1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1;
+----------+
| f1       |
+----------+
| foo      | 
| bar      | 
| foo      | 
| bar      | 
| bug26489 | 
| baz      | 
+----------+
6 rows in set (0.00 sec)
[11 Apr 2007 10:06] Robert Stacey
I'm pretty sure nothings actually getting corrupted on the network - the link is
unreliable, but nothing else is getting corrupted, and changing the slave_net_timeout has
completly fixed it (over a month without stopping now - setting to default it stops every
hour or so). It's certainly not a server problem  (slave_net_timeout fixes it, if we do
'change master to  MASTER_LOG_POS=' with the current exec_log_pos on the slave to reset
the slave relay logs it also fixes it till the next time the connection dies). The error
is always the last query before  the connection drop is reported in the error log.
[7 May 2007 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[7 May 2007 1:10] Kevin Kwast
I still see this problem all the time in 5.0.37.  If the connection drops (due to problems
with our internet provider), the replication slave can write junk into its relay log file.
 I have some samples of the junk if it would be helpful, but it looks like it just writes
its buffer to the relay log file as-is, which frequently results in a malformed SQL
statement that requires me to fix it manually.
[7 May 2007 8:18] Sveta Smirnova
Kevin, thank you for the feedback.

Verified as described in the initial post and last coment.
[7 May 2007 8:19] Sveta Smirnova
Instructions how to repeat see in the comment "[6 Apr 14:23] Sveta Smirnova"
[7 May 2007 8:50] Daniel Fiske
It seems that waiting for #25737 to be implemented could take a LONG time, it could be
"fixed" simply, that when a disconnect is detected (When it would normally cycle the
relay-log) the slave relay log could be rolled back 1 transaction. i.e the last
transaction is not written *or* is actually deleted and the previous transaction pointer
used.

D.
[17 May 2007 10:29] Sveta Smirnova
Bug #26123 was marked as duplicate of this one.
[17 May 2007 14:06] Mark Leith
Upgrading to an S2
[6 Jun 2007 9:21] Daniel Fiske
7/5 NoFb->Ver
24/5 Ver->Prg
29/5 Prg->Ver
31/5 Ver->Prg
6/6 Prg->Ver

What's going on?

D.
[8 Jun 2007 16:44] Jeremy Cole
Daniel:

This happens because there is some sort of internal debate going on over at MySQL.  Their
notification script isn't smart enough to tell you that there *was* a comment left, but
you just can't see it.  So, instead, you get notifications of just the status change with
no explanation.
[14 Jun 2007 10:40] Georgi Kodinov
I have tried the proxy program but it doesn't demonstrate the error. So I'm returning this
to support for further analysis.
[22 Aug 2007 23:45] James Day
Workarounds:

1. Use SSL connections to get the extra level of connection checking.

2. If it does happen:

SHOW SLAVE STATUS
STOP SLAVE;
CHANGE MASTER TO master_log_file='insert the value from Relay_Master_Log_File',
  master_log_pos=insert the value from Exec_master_log_pos;
START SLAVE;
[3 Oct 2007 17:07] Shane Bester
opened bug #31374 while i was trying to repeat this bug.
[8 Oct 2007 22:08] Sveta Smirnova
All reporters!

Please, indicate which switch do you use. (Exact model of hardware.)
[9 Oct 2007 2:22] James Day
For the general issue of corruption of data in transit see bug #25737. We know that TCP's
CRC checks aren't good enough to protect the integrity of the data being transferred over
replication links and that corruption can happen and produce changed characters even with
a valid CRC.

This specific problem may not be due to that, unless the ending of the logs coincides
with a broken connection, which may have a period of bad connectivity and errors
preceding it.
[9 Oct 2007 2:52] Kevin Kwast
We worked around this bug by installing a point-to-point circuit between the master and
slave.  That has been rock solid, the replication TCP connection hasn't been interrupted,
and we haven't had any corrupted since.  The PowerConnect 2716 switch doesn't seem to be
involved.  I hope this bug is eventually solved because I'd like to be able to go back to
replicating across unstable VPNs in the future.
[10 Oct 2007 1:06] James Day
Kevin, thanks for confirming that improving network reliability solved the problem for
you; that helps us to confirm that TCPs 32 bit CRC isn't sufficiently good for our needs.
The work will be done, just the when part that's undecided at present.
[10 Oct 2007 1:14] Sveta Smirnova
Set to "Verified" again according to last comments. See also bug #25737 as well.
[13 Oct 2007 16:18] Daniel Fiske
As the original poster of this bug I thought I'd pipe in. I've been reading through the
posts again and I'm somewhat astounded. Here are some of my thoughts/comments/rants

1. Reported 20 Feb 2007....its been 8 months and we are still debating the cause. How
about someone actually DO some work on this?
2. I'm not sure who James Day is, but he seems to be involved on some level with MySQL
development (and James I apologise if you feel this is a harsh personal attack...please
believe me that I'm trying to be constructive). James' comments are alarming!!! 

"thanks for confirming that improving network reliability solved the problem for
you; that helps us to confirm that TCPs 32 bit CRC isn't sufficiently good for our
needs."

I'm sure James is experienced and I don't want to get into a debate about who has better
credentials, so I'll just say what I think is wrong with that statement and anyone is
free to comment on where they think I'm wrong.

This statement is questionable because I don't see how this confirms anything. This bug
is DIRECTLY related to network stability. Of course if you increase stability of the
network its not going to show up as often. Here a nice little analogy to drive my point
home.

You got a boat design. People complain it capsizes regularly in rough seas and perhaps
its an unstable design. So you say "sail it in good weather". So people do and they say
"it works perfectly when I sail on calm days". You say "See. That proves there is nothing
wrong with our design. Its the seas fault!".

Now I'm sure you are qualified and have experience, but I'm worried if you are taking
part in debugging a system if your logic is like that.

Secondly, lets look at our 2 likely points of failure. TCP/IP standard OR MySQL software.
Seriously? TCP/IP's error correction is FAILING....and MySQL's replication code is bomb
proof? VPN (with its own error detection and correction) over TCP/IP is FAILING....and
MySQL's replication code is working. Let's assume it is corruption on the line and the
corruption produces a valid CRC (anyone know the odds of producing a CRC32 collision?).
The odds of that happening and being repeated by multiple MySQL installations? I should
go play the lottery....I'm obviously very lucky.

...

Frankly people, we are barking up the wrong tree. 

Could TCP's CRC cause it to fail? Yes.
Is there still a HUGE probability that there is a bug in MySQL code? Yes.

D.
[14 Oct 2007 3:46] James Day
See bug #29813 "replication errors on a unstable network" for another report of corruption
on unstable VPN connections.
[14 Oct 2007 4:04] James Day
The publicly visible worklog item for checking the integrity of replication events is at
http://forge.mysql.com/worklog/task.php?id=2540 , which was created in April 2005.
[14 Oct 2007 5:19] James Day
Daniel, I'm one of the support engineers at MySQL. Proceeding with your analogy the boat
has a pump that's supposed to remove water that enters the boat in rough weather but
there are still reports that the boat takes on water and becomes unstable. It's possible
to correct the boat design so it's stable with lots of water but making sure that the
pump is working is also worth doing.

We've seen more than 200 relay log corruption incidents over the last few years. Yet even
with the bug fixes from those investigations it's still happening and it's now the most
common replication problem we see (because we've fixed so many of the others). One of the
things we quite often see in setups with several slaves is that a single slave has
corruption while the others are fine. Lots of possible causes for that but one of them is
what some people have proved does happen: corruption in transit. At the moment a high
proportion of the issues we're seeing where this is reported are for unreliable networks.
When people use SSL connections that tends to solve the problem for them. Since one of the
changes in SSL connections is improved error checking, that tends to support the view that
corruption in transit is one of the causes.

That doesn't rule out other bugs, though. Take a look at Bug #4053 that was fixed in
5.0.30 and Bug #24192 fixed in 5.0.50.

Since there will also be some bugs in the MySQL code the plan is to record the binary log
around any broken events as well as retry. That way the replication setup doesn't break
and we can still look for other bugs.

For your case, if you've changed your mind about "I'm guessing the replication IO
protocol/binary log needs some sort of error detection (CRC, checksum, etc, etc)" please
collect and upload a mysqldump of the relay log events just before and around the
corruption and ideally also the corresponding master event so we can see the nature of
the corruption and whether it's bad on the master as well as the slave. That'll let us
check for errors in the log but not bugs in MySQL's network layer handling.
[14 Oct 2007 12:21] Daniel Fiske
Hi James,

Thanks for the response.

1. Bug #29813, sounds identical to this bug. However I contest your assumption that
because its happening on an unstable VPN (with more ED/EC) that it is still a network
issue. If anything the data provided from a VPN endpoint is less likely to be corrupted,
which points to a problem in MySQL code.
2. I still believe that binlog/replication needs ED/EC. But perhaps thats more of a
feature request. My suspicion is that the fundamental cause of this bug will remain in
the code if ED is implemented. Any errors picked up by ED algorithms will in the future
be passed off as transmission errors and the actual problem will never be addressed.
Basically I'm saying I think there is 1 bug and its not transmission. It should be solved
AND ED/EC should be implemented. I think this would fit in line with your extended boat
analogy. I think we are blaming the wrong thing here....which brings me to my next
point.
3. I have a theory on what is causing this error and I've spent a fair amount of time
looking through MySQL code (on other bugs and general interest). However, I'll NEVER
presume to know more about that code than ANYONE who actually works at MySQL, so I've
been very hesitant to spew my theories to you guys until I can spent time in the code
really looking at this bug. The reality is I don't have the time. So I'm very cautiously
going to lay out what I think is going wrong and why I don't think its an error on the
line.

Have a look at bug #29813. It is a perfect example. The slave log is corrupted as such.

#070711 11:02:28 server id 2  end_log_pos 350   Query   thread_id=137801       
exec_time=0     error_code=0
SET TIMESTAMP=1184122948;
INSERT,^@^@^A^@^@^@^@^@^D^A^@^@^@+^@^@^@^@^@^@^@^@^@?ù^H^@^@^@^@^@mysql-bin.000075_^@^@^B
^@^Q?<93>F^O^A^@^@^@^^@^@^@^@^@^@^@^@^@^D^5.0.24a-standard-log^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^S8^M^@^H^@^R^@^D^D^D^D^R^@^@K^@^D^ZA^@^@^C^@<95>E
<94>F^B^B^@^@^@@^@^@^@?ù^H^@^@^@<8e>^Z^B^@^@^@^@^@^@^@^@^Z^@^@^@@^@^@^A^@^@^@^@^@^@^@^@^C

^@^B^@^D!^@!^@!^@^@BEGIN^]^@^@^D^@hE<94>F^E^B^@^@^@^\^@^@^@^\^@^@^@;
# at 14351

Now if this had been a line error the garbage i.e. everything after the INSERT, would
have to produce the exact same CRC as what should have been there, for it to be passed on
to MySQL. I don't buy that the CRC for INSERT INTO....would happen to be the same as the
CRC for INSERT,^@^@^A^@^...

My experience of this bug is that it occurs on a disconnect/timeout of the master-slave
connection, whether or not its protected by a VPN or SSL. That would suggest it may be
the way that MySQL is handling the last "received" packet from the master after a
disconnect. Looking at the "rubbish" part of that transaction I further can't believe
that corruption on the line would miraculously spell out "mysql-bin.000075"...unless
ethernet has become sentient and is have a little laugh at us. It looks strongly like the
end of a memory buffer (probably on the slave end), that was not written to....i.e. not
all the data was received (due to disconnect) but the packet/transaction was *still*
written to the binlog on the slave.

...comments/criticism/flaws/help...will be appreciated.

D.
[15 Oct 2007 20:05] James Day
Daniel, yes, I also think that there's a bug in that area. Various people here have tried
to reproduce it but haven't succeeded so far. The MySQL developers aren't perfect so if
you do think that you see a bug in the source code please do let us know.
[19 Oct 2007 17:46] Georgi Kodinov
We need a step-by-step procedure to repeat the problem reported in order to be able to
rectify it. Thanks in advance for your efforts.
[25 Oct 2007 19:31] James Day
Georgi, there are two different issues here:

1. Whether there's a specific corruption and test case for the original report. That may
lead to a solution for one of the causes of corruption.

2. The general experience of corruption that can be resolved by bug #25737, which is
currently misclassified as a feature request rather than a meta-bug report requesting the
solution to a common problem that can't be reproduced with practical test cases.

Regardless of what happens to this specific report, the general solution of a check for
integrity and automatic retry is required and is one of the highest support priorities
for the replication team because of the volume of issues we receive.
[26 Oct 2007 16:15] Mark Leith
I'm attaching a patch that Support would like to propose for this bug, which implements a
new variable called sync_relay_log - which acts just as syn_binlog does - by flushing and
performing an fsync() on the relay log every nth event (where n is set by the
sync_relay_log variable. 

Example:

mysql> show variables like 'sync_relay_log';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| sync_relay_log | 1     | 
+----------------+-------+
1 row in set (0.00 sec)

mysql> set global sync_relay_log = 2;
Query OK, 0 rows affected (0.00 sec)

mysql> show variables like 'sync_relay_log';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| sync_relay_log | 2     | 
+----------------+-------+
1 row in set (0.00 sec)

mysql> set session sync_relay_log = 3;
ERROR 1229 (HY000): Variable 'sync_relay_log' is a GLOBAL variable and should be set with
SET GLOBAL

Along with the patch, I'll also attach a kdump (OSX strace) to show that the fsync() does
happen every 2 events with the above setting (as it does with setting it to 1 and an
fsync() for every event).
[26 Oct 2007 16:16] Mark Leith
Patch to add sync_relay_log variable

Attachment: sync_relay_log.patch (application/octet-stream, text), 5.46 KiB.

[26 Oct 2007 16:17] Mark Leith
ktrace/kdump output verifying the fsync

Attachment: sync_relay_log.dump (application/octet-stream, text), 93.08 KiB.

[28 Oct 2007 20:16] James Day
Mark, I believe that it flushes and fsyncs but how does doing that fix the problem this
bug is about?
[31 Oct 2007 8:04] Shane Bester
The corruptions occur when a timeout is encountered.  So far, I found the most effective
way to reproduce the timeout is to change the master's date/time backwards and forwards a
few times.  Then the error is reproducible within a few minutes.
[31 Oct 2007 11:18] James Day
The patch was to handle the possible case where the cause may be OS or MySQL corruption of
an event, based on a case where previously seen replication events were encountered in the
buffer. That's a fair theory for at least some of the cases.

However, the tests by Shane from the MySQL Support team have now reproduced this in a way
that suggests that buffer corruption or incomplete data transfer due to timeout produces
this symptom as well, so his investigations continue and seem likely to produce a neater
solution based on locating and correcting that problem.

Shane's initial test successfully reproduced the report in bug #21923.
[31 Oct 2007 14:21] Shane Bester
Bug #21923 was marked as a duplicate of this one.
[2 Nov 2007 18:26] Sveta Smirnova
According to Shane's last findings see also bug #22047
[13 Nov 2007 2:43] Brandon Black
I'm being bitten by this bug as well.  My analysis so far is in agreement with the
comments earlier (re writing a garbage memory buffer to disk on a short read/error when
it shouldn't) and Shane's linking it to #21923.

One of the more interesting instances I've seen was a case where the slave lost the
connection to the master (I believe I induced this by having huge network buffers and
spamming events at it so fast that it couldn't possibly keep up, causing large send queue
backlogs on the master), but immediately succeeded on its first attempt to reconnect.  The
I/O thread continued on after reconnect as if nothing was wrong, but the SQL thread
noticed the corrupted event in the eventlog (last event in that particular file, I'm
assuming because a new one is opened on reconnect?).

These are the syslog entries from that event:

Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Error reading packet from
server: Lost connection to MySQL server during query ( server_errno=2013)
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [Note] Slave I/O thread: Failed
reading log event, reconnecting to retry, log 'x_binlog.000005' position 57280419
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Slave: Error 'You have an
error in your SQL syntax; check the manual that corresponds to your MySQL server version
for the right syntax to use near '^^' at line 1' on query. Default database: 'foo'.
Query: 'insert into f values (49418^^', Error_code: 1064
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [ERROR] Error running query, slave
SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START".
We stopped at log 'x_binlog.000005' position 57280253
Nov 12 16:03:19 db01 mysqld[22443]: 071112 16:03:19 [Note] Slave: connected to master
'dbmaster:3306',replication resumed in log 'x_binlog.000005' at position 57280419

I'm guessing somewhere in sql/slave.cc:handle_slave_io, there's some nefarious logic
errors what with all of the conditionals and gotos.  It seems that when we got a short
read of a master event due to disconnect, we wrote the partial event plus random leftover
gargbage from the buffer into the relaylog (and then (or afterwards?) reconnected to the
master and kept on grabbing new events).

I haven't been able to nail it down precisely yet, the code is rough to follow mentally,
but I'm sure it's in there somewhere (or in the underlying net_ and vio_ funcs).

(by the way, I'm using a debian etch build of 5.0.33 on x86_64, and it doesn't look like
they've patched this stuff in any substantial way).
[15 Nov 2007 9:30] Mats Kindahl
I suspected the same thing, so I made a simple test.

The error message given earlier in the bug report ("Lost connection to MySQL server
during query ( server_errno=2013)") is printed when an error is returned from
cli_safe_read() in slave.cc and that error is CR_SERVER_LOST (which is 2013). So, to test
it, I added error injection code just after the call to cli_safe_read() that corrupts
every 10th event by overwriting the tail of the event with "GARBAGE/TRASH" and changes
the returned error to CR_SERVER_LOST. The result is that the error is generated
(identical to the error mentioned earlier in the report), but no corrupt event is written
to the relay log.

In other word, a lost server where the end of the package is corrupt is handled
correctly, and a reconnection is made and the package re-fetched.

Since there is only a single thread working with the buffer where the event data is
stored, there is no multi-tasking issues involved. The only multi-tasking available is
between the I/O and SQL thread via the IO_CACHE that is used for the relay-log, so I am
now suspecting an error or a misuse of the IO_CACHE, causing the event to be written
short. There are however mutexes in place for serializing access to the relay-log
IO_CACHE, so I am leaning more towards a bug in the IO_CACHE.
[18 Dec 2007 0:07] Gerad Coles
We are also having this issue here. Identical behavior. We also get instances where the
relay log contains incomplete queries, like "UPDATE `table name` WHERE id=". This happens
only on a single slave at a time, and appears to be related to the relay log corruptions
issues, which are definitely occurring. We run Enterprise 5.0.45 or thereabouts on most
of our servers. We have upwards of 40-60 slaves replicating from the master at the
moment.
[16 Jan 2008 9:20] Shane Bester
see also bug #29232
[16 Jan 2008 20:23] Daniel Schneller
We have seen this problem as well sporadically. We have several hundred masters in
production, each of them with around 5 to 40 slaves. They are connected with a local area
network. Up to now we suspected the users to just flick the power switch - even though
they usually deny that. Maybe we were wrong to not believe them.
However it is not happening too often - but this might be credit to our very stable
networks.
[22 Jan 2008 7:59] Mark Callaghan
I get this error too after upgrading masters from 4.0.26 to 5.0.37. The slaves were
running 5.0.37 to the 4.0.26 master without problems. Since the upgrade, slaves
connecting to the master over a flaky network (high rate of packet loss) get this error.

On every occurrence of the error, the slave IO thread gets an error that makes it
disconnect/reconnect.

The corrupt SQL frequently includes text from an earlier binlog event.

When the error occurs, the master *.err log includes one of the following. There are a
few places that may print 'Failed on my_net_write()'.
  080119 10:04:51 [ERROR] Failed on my_net_write()
  080119 10:08:43 [ERROR] failed on net_flush()

Now my editorial, as I have earned it after a few days reading MySQL replication and
networking code. net_real_write() and my_real_read() really, really need to be
refactored. I don't know if they contribute to the problem, but they are not easy to
follow and using a read or write with a timeout is much simpler than the current approach
of first trying a non-blocking call (after changing socket options) and then setting an
alarm and doing blocking call (after again changing socket options).
[22 Jan 2008 8:04] Mark Callaghan
This can't be from a TCP checksum error. But I did have a problem once caused by bad
network switches where the SQL text sent to a server was modified by the switch and the
checksum did not catch the error. One character in a query was changed.
[22 Jan 2008 22:22] Shane Bester
update: in my tests, the slave breaks when the following error is found in net_serv.cc:
Error: Packets out of order (Found: 105, expected 182)
More details later.
[22 Jan 2008 23:35] Mark Callaghan
Can we change the code to log the value when errors like this occur? I assume this error
is found on the slave side.
[24 Jan 2008 0:51] Kevin Kwast
FYI, I've also been running into this problem when the slave mysqld crashes.  If it dies
due to SIGSEGV or is killed by an Out of Memory, the slave can write an incomplete
statement to the relay log.  After mysqld restarts, the SQL thread errors due to the
syntax error in the relay log and replication is down until someone fixes it manually. 
It should be smart enough to just grab the broken statement from the master again..
[25 Jan 2008 18:04] Lars Thalmann
Patch that is being investigated:

--- net_serv.cc.orig    2008-01-25 15:07:49.000000000 +0100
+++ net_serv.cc 2008-01-25 15:20:08.000000000 +0100
@@ -534,13 +534,13 @@
       vio_trace_print(net->vio, "flushing buffer left_length=%lu bytes",
                       left_length);
       memcpy((char*) net->write_pos,packet,left_length);
-      if (net_real_write(net, net->buff,
-                        (size_t) (net->write_pos - net->buff) + left_length)) {
+      size_t const bytes_to_write = (size_t) (net->write_pos - net->buff) + left_length
+      net->write_pos= net->buff;
+      if (net_real_write(net, net->buff, bytes_to_write) {
         vio_trace_print(net->vio,
                         "leave net_write_buff since net_real_write()!=0");
        return 1;
       }
-      net->write_pos= net->buff;
       packet+= left_length;
       len-= left_length;
     }
[26 Jan 2008 2:32] Mark Callaghan
I will get this running no test servers to see if it fixes the problem. But I would like
to know why this fixes a problem. The call chain is:
 my_net_write() -> net_write_buff() -> net_real_write()
When net_real_write() returns an error, it is passed back to the my_net_write() caller
without retrying.

When my_net_write() returns an error to the caller in mysql_binlog_send(), the code
branches to err: and the connection to the slave is closed.

What am I missing?
[26 Jan 2008 8:57] Michael Widenius
I am inclined to agree with Mark that the patch is probably wrong.

My reasoning for this is the following:

If net_real_write() fails with a timeout, it's setting net->error to 2.
After that, there will be no more writes on the socket, which can be verified
by the following code at start of net_real_write():

  if (net->error == 2)
    DBUG_RETURN(-1);				/* socket can't be used */

The idea is that if there has been a write error or timeout on the connection,
we will not send the client any error messages or other information.

In this context the slave should get a read error on the socket, skip the
last even for the readlog and reopen the connection.  In other words, the
master should not send any wrong information to the client.

Setting net->write_pos, as the patch does, has no effect in the
context of the above logic.

From this we can conclude that the bug is in likely to be in one of
the following areas:

- The master does not set net->error == 2 in the case of a net write error
  or timeout.
  Fix:  Set net->error to 2

- The master sets net->error to 2, but still it does somehow write to the
  connection.
  Fix: Ensure we don't write to socket if net->error is 2.

- The slave client doesn't properly ignore the last 'wrong' event in case
  of dropped connection and writes both the read part and the not yet
  read part to the relay log. (Note that a packet starts with packet length
  this is a likely bug if the error condition is not properly checked).
  Fix: Ignore the wrong packet

This analyse is somewhat conflicting with Lars analyse of the likely
source of the problem; Lars, can you use some of the test cases that
has been used to verify the bug to check the usage of net->error and see
why the code doesn't work as intended ?

Note also that the patch as such is sub optimal in any case:
- Not portable (as net_serv.cc needs to be compiled with a C compiler).
- Doesn't take into account other calls to net_real_write()
- Longer than needs to be; A better method is to save the value of
  net_real_write() in a value, set net->write_pos and then test
  the return value if there was an error.
[26 Jan 2008 20:04] Mark Callaghan
I get the same error message as Shane on the slave. This is immediately before the errror
occurs:
Packets out of order (Found: 46, expected 74)
[26 Jan 2008 22:54] Mark Callaghan
I did not have this problem when using masters running 4.0.26 and slaves running 5.0.37.
Once the masters were upgraded to 5.0.37, the error happens frequently. I have diff'd the
code for writing to the net (my_net_write, net_write_buff, net_real_write, vio_write)
between these versions. The diffs are minor:
 1) my_net_write return type changed to my_bool
 2) net_real_write in MySQL5 doesn't have an 'if' before query_cache_insert call
 3) net_real_write in MySQL5 sets net->report_error on error
 4) net_real_write calls vio_timeout() in MySQL4. MySQL5 has a comment that this is done
in net_set_write_timeout

For the write path, not using vio_timeout() is the only significant change I see for
MySQL5. I did not look at all of the vio and thread alarm functions that were called by
the functions above.

For the read path (read_event -> *_safe-read -> my_net_read -> my_real_read)
 1) read_event in MySQL5 checks for eof packet, mysql4 just checks len == 1
 2) there are many differences between cli_safe_read in MySQL5 and mc_net_safe_read in
MySQL4
 3) my_net_read for MySQL5 sets net->report_error when my_uncompress fails
 4) my_real_read calls vio_timeout for MySQL4
 5) my_real_read sets net->report_error on error for MySQL5
 6) my_real_read uses vio_should_retry where MySQl5 uses vio_errno() == SOCKET_EINTR to
determine whether to retry on interrupt
[28 Jan 2008 7:57] Zhenxing He
I think the problem is as Lars described, but I don't like the patch, I'd prefer to fix it
by this:

--- sql_parse.cc.orig   2008-01-28 14:54:29.000000000 +0800
+++ sql_parse.cc        2008-01-28 14:54:40.000000000 +0800
@@ -1166,7 +1166,6 @@
       unregister_slave(thd,1,1);
       /*  fake COM_QUIT -- if we get here, the thread needs to terminate */
       error = TRUE;
-      net->error = 0;
       break;
     }
 #endif

Since we have a network problem, we should not try to send the error message, just close
the network .
[28 Jan 2008 8:58] Mats Kindahl
Just for your information, here is the scenario that causes the failure.

1. The to-be corrupt log event (let's call it X), is split into two
   packets B and C on the network level (net_write_buff()). The parts
   are X = (x',x''). The part x' ends up in packet B and part x''
   ends up in packet C. Prior to the corrupt event X, the event Y has
   been written successfully, but has been split into two packets as
   well, which we call (y',y'').
2. The master sends packet A = (y'',x') to the slave, increases the
   packet sequence number, the slave receives the packet, but fails
   to reply before the master gets a timeout.
3. Since the master got a timeout, it reports failure, and aborts
   sending the binary log by exiting mysql_binlog_send(). However, it
   leaves the buffer intact, still holding y'' (but not x', since the
   write_pos is not increased).
4. After exiting mysql_binlog_send(), the master does a
   disconnection of the client thread, which involves sending an
   error message e to the client (i.e., the slave).
5. In this case, net_write_buff() is used again, but this time the
   old contents of the packet is used so that the new packet is
   D = (y'',e). Note that this will use a new packet sequence number,
   since the packet number was increased in step 2.
6. The slave receives the tail y'' of the Y log event, concatenates
   this with x' (which it already received), and writes the event
   (x',y'') it to the relay log since it hasn't noticed anything is
   amiss.
7. It then tries to read more bytes, which is either e (if the length
   given for X just happened to match the length given for Y, or just
   plain garbage because the slave is out of sync with what is
   actually sent.
8. After a while, the SQL thread tries to execute the event (x',y''),
   which is very likely to be just nonsense.
[28 Jan 2008 19:35] Lars Thalmann
New patch being investigated:

--- sql_parse.cc.orig   2008-01-28 14:54:29.000000000 +0800
+++ sql_parse.cc        2008-01-28 14:54:40.000000000 +0800
@@ -1166,7 +1166,6 @@
       unregister_slave(thd,1,1);
       /*  fake COM_QUIT -- if we get here, the thread needs to terminate */
       error = TRUE;
-      net->error = 0;
       break;
     }
 #endif
[29 Jan 2008 1:21] Michael Widenius
Patch looks ok. Lets go with this one!
[29 Jan 2008 5:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41342

ChangeSet@1.2565, 2008-01-29 11:56:48+08:00, hezx@mail.hezx.com +1 -0
  BUG#26489 Corruption in relay logs
  
  Here is the scenario that causes the failure.(by Mats)
  
  1. The to-be corrupt log event (let's call it X), is split into two
     packets B and C on the network level (net_write_buff()). The parts
     are X = (x',x''). The part x' ends up in packet B and part x''
     ends up in packet C. Prior to the corrupt event X, the event Y has
     been written successfully, but has been split into two packets as
     well, which we call (y',y'').
  2. The master sends packet A = (y'',x') to the slave, increases the
     packet sequence number, the slave receives the packet, but fails
     to reply before the master gets a timeout.
  3. Since the master got a timeout, it reports failure, and aborts
     sending the binary log by exiting mysql_binlog_send(). However, it
     leaves the buffer intact, still holding y'' (but not x', since the
     write_pos is not increased).
  4. After exiting mysql_binlog_send(), the master does a
     disconnection of the client thread, which involves sending an
     error message e to the client (i.e., the slave).
  5. In this case, net_write_buff() is used again, but this time the
     old contents of the packet is used so that the new packet is
     D = (y'',e). Note that this will use a new packet sequence number,
     since the packet number was increased in step 2.
  6. The slave receives the tail y'' of the Y log event, concatenates
     this with x' (which it already received), and writes the event
     (x',y'') it to the relay log since it hasn't noticed anything is
     amiss.
  7. It then tries to read more bytes, which is either e (if the length
     given for X just happened to match the length given for Y, or just
     plain garbage because the slave is out of sync with what is
     actually sent.
  8. After a while, the SQL thread tries to execute the event (x',y''),
     which is very likely to be just nonsense.
  
  The problem can be fixed by not resetting net->error after the call of 
  mysql_binlog_send, so the error message will not be sent and the connection
  will be closed.
[31 Jan 2008 4:35] Lars Thalmann
Pushed into mysql-5.0-rpl and mysql-5.1-new-rpl team trees.

For some reason the auto-bug-state-updater did not work, so I've set
the bug to "patch queued" manually.
[1 Feb 2008 5:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41540

ChangeSet@1.2513, 2008-02-01 12:24:43+08:00, hezx@mail.hezx.com +1 -0
  BUG#26489 Corruption in relay logs
    
  Here is the scenario that causes the failure.(by Mats)
  
    1. The to-be corrupt log event (let's call it X), is split into two
       packets B and C on the network level (net_write_buff()). The parts
       are X = (x',x''). The part x' ends up in packet B and part x''
       ends up in packet C. Prior to the corrupt event X, the event Y has
       been written successfully, but has been split into two packets as
       well, which we call (y',y'').
    2. The master sends packet A = (y'',x') to the slave, increases the
       packet sequence number, the slave receives the packet, but fails
       to reply before the master gets a timeout.
    3. Since the master got a timeout, it reports failure, and aborts
       sending the binary log by exiting mysql_binlog_send(). However, it
       leaves the buffer intact, still holding y'' (but not x', since the
       write_pos is not increased).
    4. After exiting mysql_binlog_send(), the master does a
       disconnection of the client thread, which involves sending an
       error message e to the client (i.e., the slave).
    5. In this case, net_write_buff() is used again, but this time the
       old contents of the packet is used so that the new packet is
       D = (y'',e). Note that this will use a new packet sequence number,
       since the packet number was increased in step 2.
    6. The slave receives the tail y'' of the Y log event, concatenates
       this with x' (which it already received), and writes the event
       (x',y'') it to the relay log since it hasn't noticed anything is
       amiss.
    7. It then tries to read more bytes, which is either e (if the length
       given for X just happened to match the length given for Y, or just
       plain garbage because the slave is out of sync with what is
       actually sent.
    8. After a while, the SQL thread tries to execute the event (x',y''),
       which is very likely to be just nonsense.
  
  The problem can be fixed by not resetting net->error after the call of 
  mysql_binlog_send, so the error message will not be sent and the connection
  will be closed.
[5 Feb 2008 13:45] Bugs System
Pushed into 5.0.56
[5 Feb 2008 14:04] Bugs System
Pushed into 5.1.24-rc
[5 Feb 2008 14:08] Bugs System
Pushed into 6.0.5-alpha
[6 Feb 2008 17:28] Jon Stephens
Documented in the 5.0.56, 5.1.24, and 6.0.5 changelogs as follows:

        Network timeouts between the master and the slave could result
        in corruption of the relay log.
[14 Feb 2008 17:07] Susanne Ebrecht
bug #34545 is a duplicate of this bug here.
[6 Mar 2008 6:42] Jon Stephens
Also documented fix for 5.1.23-ndb-6.2.14.
[30 Mar 2008 1:28] Jon Stephens
Also documented fix for 5.1.23-ndb-6.3.11.
[21 Aug 2008 4:02] James Day
You must upgrade your master server to get this fix. The bug fix is in the part of
replication that runs on the master.
[7 Oct 2008 14:43] Susanne Ebrecht
Bug #35629 is set as duplicate of this bug here.