MySQL Bugs: #33602: Relay log corruption when relay_log_space

Bug #33602	Relay log corruption when relay_log_space_limit reached
Submitted:	31 Dec 2007 18:03	Modified:	7 Oct 2008 11:43
Reporter:	Baron Schwartz (Basic Quality Contributor)	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.40	OS:	Linux (gentoo)
Assigned to:		CPU Architecture:	Any
Tags:	qc

Description:
We get relay log corruption when the slave falls far behind and the I/O thread must wait for the SQL thread to free up some space.  We can solve the problem by doing a CHANGE MASTER TO the current position, which throws away the relay logs and re-fetches them from the master.

Here is an example error message:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
                Master_Host: galax
                Master_User: repl
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: galax-bin.001165
        Read_Master_Log_Pos: 1059415844
             Relay_Log_File: halifax-relay-bin.000302
              Relay_Log_Pos: 3165821
      Relay_Master_Log_File: galax-bin.001162
           Slave_IO_Running: Yes
          Slave_SQL_Running: No
            Replicate_Do_DB: 
        Replicate_Ignore_DB: 
         Replicate_Do_Table: 
     Replicate_Ignore_Table: 
    Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: mysql.%,scratch.%
                 Last_Errno: 1064
                 Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''dbh=2time = last1time,
         last1bidcents = last0bidcents, last1time = last' at line 3' on query. Default database: 'rkmain'. Query: 'replace into rkdb.processes(connectionid, unixprocid, program, extra)
                  values(connection_id(), 13896, left('/usr/local/lib/site_perl/lib/Runner.pm', 100),
                     left('dbh=2time = last1time,
         last1bidcents = last0bidcents, last1time = last0time,
   '
               Skip_Counter: 0
        Exec_Master_Log_Pos: 3165684
            Relay_Log_Space: 4295943484
            Until_Condition: None
             Until_Log_File: 
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File: 
         Master_SSL_CA_Path: 
            Master_SSL_Cert: 
          Master_SSL_Cipher: 
             Master_SSL_Key: 
      Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

The query is a very straightforward one: replace(cols) values(vals).  I can recognize that there are two queries mixed together in the error message.

Here are the current logs:

-rw-rw----  1 mysql mysql    3166225 Dec 31 09:27 halifax-relay-bin.000302
-rw-rw----  1 mysql mysql 1073478433 Dec 31 10:10 halifax-relay-bin.000303
-rw-rw----  1 mysql mysql        192 Dec 31 10:10 halifax-relay-bin.000304
-rw-rw----  1 mysql mysql     429931 Dec 31 10:10 halifax-relay-bin.000305
-rw-rw----  1 mysql mysql 1072064260 Dec 31 11:26 halifax-relay-bin.000306
-rw-rw----  1 mysql mysql   10139581 Dec 31 11:26 halifax-relay-bin.000307
-rw-rw----  1 mysql mysql        192 Dec 31 11:26 halifax-relay-bin.000308
-rw-rw----  1 mysql mysql 1060705242 Dec 31 12:03 halifax-relay-bin.000309
-rw-rw----  1 mysql mysql       8968 Dec 31 12:13 halifax-relay-bin.000310
-rw-rw----  1 mysql mysql   16534287 Dec 31 12:13 halifax-relay-bin.000311
-rw-rw----  1 mysql mysql        192 Dec 31 12:13 halifax-relay-bin.000312
-rw-rw----  1 mysql mysql 1059415981 Dec 31 12:13 halifax-relay-bin.000313

How to repeat:
I can't repeat it reliably, but if the slave stays far behind, it will happen at least once every day.  I suspect something is wrong with beginning a new relay log when there's not enough space.  (There is *plenty* of free space on the disk).

I'll save the relay logs in case you want to inspect them.

Baron, is the corruption always at the end of the relay log?  If yes, I suspect bug #26489 is related.

I saw that report too, but I thought this was probably a different bug: the corruption is in the middle of the log, and it's not binary-looking garbage data that looks like an uninitialized buffer.  Instead, it always looks like two log events got mixed together.

My current thought is that this might be a related problem, though.  I can't understand the logging code well enough to make anything but a wild guess, but would the following be possible?

* I/O thread reads an event, tries to log it, and finds there's not enough space
* Perhaps it writes part of an event before finding not enough space?

Or maybe something like this:

* I/O thread stops, sleeps and wakes, then tries to write the event again
* Perhaps it has a buffer initialized to the wrong size, and filled with the last event's data?  If last event is bigger than the current event, and it copies the current event into the buffer, then part of last event's SQL would appear to be mixed with the current event's SQL.

I tested my theory about relay_log_space yesterday by setting up a system with a very small relay_log_space and binlog size, so it rotated relay logs constantly, and then just threw a bunch of queries at it.  I couldn't get it to corrupt :-(  but I was also using unrealistic test queries.

Please, check if problem like this ever happens on a newer version, 5.0.54a.

I have removed relay_log_space_limit from my configuration files.

I think this is a problem with the net write timeout on the master.

I had this problem and the corruption would always happen at the relay_log_space_limit.

I increased net_write_timeout on the master to 3600 and the problem went away.

It took me a while to track this down.

I haven't tried it on newer versions of the server, but the client is 5.0.45.

Many thanks for the feedback.

I will close the bug report now. Please feel free to re-open it when you have this problem after increasing net_write_timeout and by using actual version of MySQL 5.0 or 5.1 (at the moment actual are 5.0.67 and 5.1.28-rc).