Bug #32228 | A disk full makes binary log corrupt. | ||
---|---|---|---|
Submitted: | 9 Nov 2007 13:18 | Modified: | 29 Oct 2011 10:27 |
Reporter: | Arnoud Witt | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.1, 5.0, 5.1 | OS: | Any |
Assigned to: | Sven Sandberg | CPU Architecture: | Any |
Tags: | bfsm_2007_11_15, bfsm_2007_12_06 |
[9 Nov 2007 13:18]
Arnoud Witt
[13 Nov 2007 15:52]
Sveta Smirnova
Thank you for the report. I can not repeat described behaviour with current version of MySQL. Please upgrade.
[14 Nov 2007 13:48]
David Shrewsbury
Verified against 4.1.12 using InnoDB and transactions. Could not reproduce using MyISAM only.
[5 May 2008 13:11]
Sam Tresler
Confirmed on Debian Etch, Mysql 5.0.32 Disk full on master, after space is freed up we get: May 5 08:29:17 colo2 mysqld[2306]: 080505 8:29:17 [Note] Slave I/O thread: connected to master 'repl@dbmaster:3306', replication started in log 'mysql-bin.010495' at position 693169273 May 5 08:29:17 colo2 mysqld[2306]: 080505 8:29:17 [ERROR] Error reading packet from server: binlog truncated in the middle of event ( server_errno=1236) May 5 08:29:17 colo2 mysqld[2306]: 080505 8:29:17 [ERROR] Got fatal error 1236: 'binlog truncated in the middle of event' from master when reading data from binary log May 5 08:29:17 colo2 mysqld[2306]: 080505 8:29:17 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.010495', position 693169273
[28 May 2008 11:31]
Sean Colombo
Confirmed in: mysql Ver 14.12 Distrib 5.0.22, for redhat-linux-gnu (i686) using readline 5.0 Is there a known workaround to get replication up and running again?
[28 May 2008 11:38]
Sean Colombo
It appears the workaround is to start replication from scratch. http://lists.mysql.com/mysql/208931
[26 Oct 2010 10:18]
Luis Soares
See also: WL#1040.
[26 Oct 2010 13:37]
Sven Sandberg
It's not completely clear what happens. Here, "partial event" refers to an event that is truncated, so it does not have full length. "corrupt event" refers to an event that has full length but contains wrong data. In principle, the server should be able to handle partial events because every event has a field that contains the event length. Here it seems that a partial event is at some point changes to a corrupt event. I think there are three possible scenarios. ==== Scenario 1: master client turns partial event to corrupt event ==== (S1-1) The disk is full on the master. This causes the master client that writes to the binlog to write a partial event and wait for disk space. When disk space is freed, there is a bug in the master client thread that causes it to write a corrupt event to the binlog. (S1-2) The master dump thread reads the corrupt event and sends to the slave. (S1-3) The slave IO thread receives the corrupt event and writes to the relay log. (S1-4) The slave SQL thread reads the corrupt event and fails to execute it. ==== Scenario 2: master dump thread turns partial event to corrupt event ==== (S2-1) The disk is full on master. This causes the master client that writes to the binlog to write a partial event and wait for disk space. (S2-2) The master dump thread cannot handle the partial event correctly, and sends a corrupt event to the slave. (S2-3) The slave IO thread receives the corrupt event and writes to the relay log. (S2-4) The slave SQL thread reads the corrupt event and fails to execute it. ==== Scenario 3: slave IO thread turns partial event to corrupt event ==== (S3-1) The disk is full on master. This causes the master client that writes to the binlog to write a partial event and wait for disk space. (S3-2) The master dump thread sends the partial event to the slave. (S3-3) The slave IO thread receives the partial event but cannot handle it correctly, and writes a corrupt event to the relay log. (S3-4) The slave SQL thread reads the corrupt event and fails to execute it.
[28 Oct 2010 14:00]
MySQL Verification Team
I can repeat this on 5.1.51 using myisam tables. master gets error: [ERROR] mysqld-debug: Disk is full writing '.\anon-bin.000012' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space) [ERROR] mysqld-debug: Retry in 60 secs. Message reprinted in 600 secs [Note] mysqld-debug: Normal shutdown After starting up master again, slave broken: 101028 15:53:38 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event', Error_code: 1236 101028 15:53:38 [Note] Slave I/O thread exiting, read up to log 'anon-bin.000012', position 227848188 Running mysqlbinlog on the binlog shows: [sbester@levovo 5.5]$ mysqlbinlog -v ~/anon-bin.000012 > out ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 405117, event_type: 2 ERROR: Could not read entry at offset 227848188: Error in log format or read error.
[28 Oct 2010 17:23]
MySQL Verification Team
I uploaded my binlog to the ftp.mysql.com/pub/mysql/upload/anon-bin.000012.bz2
[2 Nov 2010 20:42]
Sven Sandberg
The bug shows two different symptoms: (S1) The master seems to send truncated event to the slave: the slave error log contained: 071006 8:29:33 [ERROR] Error reading packet from server: binlog truncated in the middle of event (server_errno=1236) This is strange: it looks like the master knows that the event is truncated (because it sends an error code to the slave), but still the master send the event to slave. It would probably be better that the master did not let the slave know about partial events. In any case, I haven't been able to reproduce this. (S2) Somewhere along the way, a truncated event becomes a corrupt event: 071008 11:32:12 [ERROR] Slave: Error 'You have an error in your SQL syntax; [...] See also my comment from 26 Oct 15:37. I haven't been able to reproduce this either. I have tried to reproduce this bug using a test case similar to the one in BUG#57936, see below. By default, it hits the assertion in BUG#57936. However, if this assertion is commented out, it works gracefully: the master continues without problems and there are no errors on the slave. I don't know what else I could do to reproduce this bug. ==== BEGIN rpl_disk_full ==== --source include/master-slave.inc --source include/have_binlog_format_statement.inc --let $master_dir= `SELECT @@datadir` --let $master_dir= $master_dir/binlogs --let $master_file= $master_dir/bigfile --let $long_string= `SELECT REPEAT('a', 500000)` --eval SELECT REPEAT('a', 750000) INTO DUMPFILE '$master_file' CREATE TABLE t1 (a TEXT(1000000)); --disable_query_log --send --eval INSERT INTO t1 VALUES ('$long_string') --enable_query_log --sleep 25 --remove_file $master_file --reap SELECT COUNT(*) FROM t1; --sync_slave_with_master --let $diff_table_1= master:test.t1 --let $diff_table_2= slave:test.t1 --source include/diff_tables.inc --connection master DROP TABLE t1; --sync_slave_with_master ==== END ==== ==== BEGIN rpl_disk_full-master.sh ==== dir=$MYSQLTEST_VARDIR/mysqld.1/data/binlogs img=$dir.fs echo dir=$dir img=$img mkdir $dir head -c 1000000 /dev/zero > $img mkfs.ext2 -F $img sudo mount -o loop $img $dir ==== END ==== ==== BEGIN rpl_disk_full-master.opt ==== --log-bin=$MYSQLTEST_VARDIR/mysqld.1/data/binlogs/master-bin ==== END ==== ==== BEGIN patch to remove assertion ==== === modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2010-10-23 12:55:44 +0000 +++ sql/sql_class.cc 2010-11-02 20:36:59 +0000 @@ -496,11 +496,11 @@ void Diagnostics_area::set_ok_status(THD *thd, ha_rows affected_rows_arg, ulonglong last_insert_id_arg, const char *message_arg) { - DBUG_ASSERT(! is_set()); + //DBUG_ASSERT(! is_set()); #ifdef DBUG_OFF /* In production, refuse to overwrite an error or a custom response with an OK packet. */ ==== END ====
[4 Nov 2010 17:21]
Sven Sandberg
I can reproduce that the slave gets a truncated event from the master as follows: 1. run statement on master such that disk gets full in the middle of binlog event 2. shut down master while it waits for disk space 3. free up disk space on master 4. restart master Use the same test files as above, but replace rpl_disk_full with this file: ==== BEGIN rpl_disk_full.test (version 2) ==== --source include/master-slave.inc --source include/have_binlog_format_statement.inc SET SESSION debug = "+d,short_sleep_on_disk_full"; --let $master_dir= `SELECT @@datadir` --let $master_dir= $master_dir/binlogs --let $master_file= $master_dir/bigfile --let $s= `SELECT REPEAT('a', 100000)` --eval SELECT REPEAT('a', 900000) INTO DUMPFILE '$master_file' --echo # create table... CREATE TABLE t1 (a TEXT(1000000)); --echo # execute big statement... --disable_query_log --send --eval INSERT INTO t1 VALUES ('$s$s') --enable_query_log --sleep 5 --connection master1 --echo # shut down server... --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect wait EOF --shutdown_server 10 --echo # remove file... --remove_file $master_file --echo # removed all files --echo # restart server again... --append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect restart EOF --echo # reconnect master1 again... --enable_reconnect --source include/wait_until_connected_again.inc --disable_reconnect --echo # reconnect master again... --connection master # 2013=CR_SERVER_LOST --error 2013 --reap --enable_reconnect --source include/wait_until_connected_again.inc --disable_reconnect SELECT COUNT(*) FROM t1; --connection slave # 1594=ER_SLAVE_RELAY_LOG_READ_FAILURE --let $slave_sql_errno= 1594 --source include/wait_for_slave_sql_error.inc --let $slave_datadir= `SELECT @@datadir` --exec $MYSQL_BINLOG $slave_datadir/slave-relay-bin.000004 ==== END ====
[4 Nov 2010 17:45]
Sven Sandberg
The bug is in mysql_binlog_send, where it does this: if (!(error = Log_event::read_log_event(&log, packet, log_lock))) [...] else { if (test_for_non_eof_log_read_errors(error, &errmsg)) goto err; /* It's EOF, nothing to do, go on reading next events, the Format_description_log_event will be found naturally if it is written. */ } Here, read_log_event returns LOG_READ_TRUNC when the event is shorter than expected, and test_for_non_eof_log_read_errors returns 0 when error==LOG_READ_TRUNC. So the master dump thread detects that the event is truncated, but ignores the error and pretends everything is fine. It should probably just goto err instead. test_for_non_eof_log_read_errors is used in some other places in this file too. We should check if we need to fix those too.
[29 Oct 2011 10:27]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html
[29 Oct 2011 10:42]
Jon Stephens
Documented fix in the 5.518 and 5.6.4 changelogs as follows: A replication master could send damaged events to slaves after the binary log disk on the master became full. To correct this issue, only complete events are now pushed by the master dump thread to the slave I/O thread. In addition, the error text that the master sends to the slave when an incomplete event is found now states that the incomplete event may have been caused by running out of disk space on the master, and provides coordinates of the first and the last event bytes read. Closed.
[15 Jan 2012 6:12]
Jeremy Cole
The fix for this bug unfortunately introduced even worse bugs filed as Bug 64035. The server now crashes when generating several error messages related to replication.
[28 Jul 2016 14:48]
Laurynas Biveinis
It looks like the testcase for this bug (rpl.rpl_cant_read_event_incident) is only present in the 5.5 tree. Is that intentional? If 5.6 or 5.7 were to regress WRT current bug, would this be caught? Filed bug 82370.