| Bug #33602 | Relay log corruption when relay_log_space_limit reached | ||
|---|---|---|---|
| Submitted: | 31 Dec 2007 18:03 | Modified: | 7 Oct 2008 11:43 |
| Reporter: | Baron Schwartz (Basic Quality Contributor) | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
| Version: | 5.0.40 | OS: | Linux (gentoo) |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | qc | ||
[2 Jan 2008 16:26]
MySQL Verification Team
Baron, is the corruption always at the end of the relay log? If yes, I suspect bug #26489 is related.
[2 Jan 2008 17:16]
Baron Schwartz
I saw that report too, but I thought this was probably a different bug: the corruption is in the middle of the log, and it's not binary-looking garbage data that looks like an uninitialized buffer. Instead, it always looks like two log events got mixed together. My current thought is that this might be a related problem, though. I can't understand the logging code well enough to make anything but a wild guess, but would the following be possible? * I/O thread reads an event, tries to log it, and finds there's not enough space * Perhaps it writes part of an event before finding not enough space? Or maybe something like this: * I/O thread stops, sleeps and wakes, then tries to write the event again * Perhaps it has a buffer initialized to the wrong size, and filled with the last event's data? If last event is bigger than the current event, and it copies the current event into the buffer, then part of last event's SQL would appear to be mixed with the current event's SQL. I tested my theory about relay_log_space yesterday by setting up a system with a very small relay_log_space and binlog size, so it rotated relay logs constantly, and then just threw a bunch of queries at it. I couldn't get it to corrupt :-( but I was also using unrealistic test queries.
[18 Feb 2008 4:49]
Valeriy Kravchuk
Please, check if problem like this ever happens on a newer version, 5.0.54a.
[18 Feb 2008 12:37]
Baron Schwartz
I have removed relay_log_space_limit from my configuration files.
[4 Apr 2008 12:04]
Robert Schmidt
I think this is a problem with the net write timeout on the master. I had this problem and the corruption would always happen at the relay_log_space_limit. I increased net_write_timeout on the master to 3600 and the problem went away. It took me a while to track this down. I haven't tried it on newer versions of the server, but the client is 5.0.45.
[7 Oct 2008 11:43]
Susanne Ebrecht
Many thanks for the feedback. I will close the bug report now. Please feel free to re-open it when you have this problem after increasing net_write_timeout and by using actual version of MySQL 5.0 or 5.1 (at the moment actual are 5.0.67 and 5.1.28-rc).

Description: We get relay log corruption when the slave falls far behind and the I/O thread must wait for the SQL thread to free up some space. We can solve the problem by doing a CHANGE MASTER TO the current position, which throws away the relay logs and re-fetches them from the master. Here is an example error message: mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space Master_Host: galax Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: galax-bin.001165 Read_Master_Log_Pos: 1059415844 Relay_Log_File: halifax-relay-bin.000302 Relay_Log_Pos: 3165821 Relay_Master_Log_File: galax-bin.001162 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: mysql.%,scratch.% Last_Errno: 1064 Last_Error: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''dbh=2time = last1time, last1bidcents = last0bidcents, last1time = last' at line 3' on query. Default database: 'rkmain'. Query: 'replace into rkdb.processes(connectionid, unixprocid, program, extra) values(connection_id(), 13896, left('/usr/local/lib/site_perl/lib/Runner.pm', 100), left('dbh=2time = last1time, last1bidcents = last0bidcents, last1time = last0time, ' Skip_Counter: 0 Exec_Master_Log_Pos: 3165684 Relay_Log_Space: 4295943484 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL 1 row in set (0.00 sec) The query is a very straightforward one: replace(cols) values(vals). I can recognize that there are two queries mixed together in the error message. Here are the current logs: -rw-rw---- 1 mysql mysql 3166225 Dec 31 09:27 halifax-relay-bin.000302 -rw-rw---- 1 mysql mysql 1073478433 Dec 31 10:10 halifax-relay-bin.000303 -rw-rw---- 1 mysql mysql 192 Dec 31 10:10 halifax-relay-bin.000304 -rw-rw---- 1 mysql mysql 429931 Dec 31 10:10 halifax-relay-bin.000305 -rw-rw---- 1 mysql mysql 1072064260 Dec 31 11:26 halifax-relay-bin.000306 -rw-rw---- 1 mysql mysql 10139581 Dec 31 11:26 halifax-relay-bin.000307 -rw-rw---- 1 mysql mysql 192 Dec 31 11:26 halifax-relay-bin.000308 -rw-rw---- 1 mysql mysql 1060705242 Dec 31 12:03 halifax-relay-bin.000309 -rw-rw---- 1 mysql mysql 8968 Dec 31 12:13 halifax-relay-bin.000310 -rw-rw---- 1 mysql mysql 16534287 Dec 31 12:13 halifax-relay-bin.000311 -rw-rw---- 1 mysql mysql 192 Dec 31 12:13 halifax-relay-bin.000312 -rw-rw---- 1 mysql mysql 1059415981 Dec 31 12:13 halifax-relay-bin.000313 How to repeat: I can't repeat it reliably, but if the slave stays far behind, it will happen at least once every day. I suspect something is wrong with beginning a new relay log when there's not enough space. (There is *plenty* of free space on the disk). I'll save the relay logs in case you want to inspect them.