MySQL Bugs: #75746: set log_bin_use_v1_row_events lead to corrupted row event

Bug #75746	set log_bin_use_v1_row_events lead to corrupted row event
Submitted:	3 Feb 2015 12:30	Modified:	10 Mar 2015 1:43
Reporter:	Fungo Wang (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.16, 5.6.22	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	binlog, log_bin_use_v1_row_events

Description:
In replication scenario, use row binlog format.

when there are frequently write/update/delete events on master, and we change the global variable log_bin_use_v1_row_events using `set global log_bin_use_v1_row_events = ON/OFF`, after a while, the slave IO and SQL thread stopped and report error as bellow：

Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'master-bin.000001' at 751, the last event read from './master-bin.000001' at 751, the last byte read from './master-bin.000001' at 770.'
Last_SQL_Errno: 1594
Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

and the error log show these ERROR

2015-02-03 09:55:42 40133 [ERROR] /path/to/mysqld: Out of memory (Needed 4294902024 bytes)
2015-02-03 09:55:42 40133 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 40, event_type: 30
2015-02-03 09:55:42 40133 [ERROR] Error reading relay log event for channel '': slave SQL thread aborted because of I/O error
2015-02-03 09:55:42 40133 [ERROR] Slave SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 1594

stop and start slave won't solve the problem.

How to repeat:

for the convenience of test case, I added a debug point to the source code, as below

--- a/sql/log_event.cc
+++ b/sql/log_event.cc
@@ -11507,6 +11507,8 @@ bool Rows_log_event::write_data_header(IO_CACHE *file)
   int6store(buf + RW_MAPID_OFFSET, m_table_id.id());
   int2store(buf + RW_FLAGS_OFFSET, m_flags);
   int rc = 0;
+
+  DBUG_EXECUTE_IF("change_log_bin_use_v1_row_events", my_sleep(5000000););
   if (likely(!log_bin_use_v1_row_events))
   {
     /*

the testcase is as below.
## test case
--source include/have_debug.inc
--source include/have_binlog_format_row.inc
--disable_warnings
--source include/master-slave.inc
--enable_warnings
connect(root_con,localhost,root,,test,$MASTER_MYPORT);

connection master;
CREATE TABLE t1(id INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
SET @saved_debug= @@global.debug;
SHOW VARIABLES LIKE 'log_bin_use_v1_row_events';
SET GLOBAL debug="+d,change_log_bin_use_v1_row_events";

connection root_con;
send INSERT INTO t1 VALUES (2);

connection master;
sleep 1;
SET GLOBAL log_bin_use_v1_row_events= ON;

connection root_con;
reap;
connection master;
--sync_slave_with_master

connection slave;
SELECT * FROM t1;

connection master;
sleep 5000;
SET GLOBAL debug= @saved_debug;
SET GLOBAL log_bin_use_v1_row_events= OFF;
DROP TABLE t1;
--sync_slave_with_master
--source include/rpl_end.inc

## master.opt
--log-bin --binlog-format=row

## slave.opt
--log-bin --binlog-format=row

===============================================================
Note: the test case above is for *debug* compile mode, and the master mysqld process crashed, the crash point and stack trace is:

mysqld: /home/xiangluo.wb/Projects/rds_5616/sql/binlog.cc:5708: int MYSQL_BIN_LOG::do_write_cache(IO_CACHE*): Assertion `!do_checksum || remains == 0' failed.
07:00:53 UTC - mysqld got signal 6 ;

Thread 2 (Thread 0x2b8aec040700 (LWP 3502)):
#0  0x00000030cbcda37d in read () from /lib64/libc.so.6
#1  0x00000030cbc711e8 in _IO_new_file_underflow () from /lib64/libc.so.6
#2  0x00000030cbc72cee in _IO_default_uflow_internal () from /lib64/libc.so.6
#3  0x00000030cbc674da in _IO_getline_info_internal () from /lib64/libc.so.6
#4  0x00000030cbc66339 in fgets () from /lib64/libc.so.6
#5  0x0000000000e1cd5b in rds_pstack ()
#6  0x000000000081d341 in handle_fatal_signal ()
#7  <signal handler called>
#8  0x00000030cbc328a5 in raise () from /lib64/libc.so.6
#9  0x00000030cbc34085 in abort () from /lib64/libc.so.6
#10 0x00000030cbc2ba1e in __assert_fail_base () from /lib64/libc.so.6
#11 0x00000030cbc2bae0 in __assert_fail () from /lib64/libc.so.6
#12 0x0000000000d4ec01 in MYSQL_BIN_LOG::do_write_cache(st_io_cache*) ()
#13 0x0000000000d4f764 in MYSQL_BIN_LOG::write_cache(THD*, binlog_cache_data*) ()
#14 0x0000000000d39ef9 in binlog_cache_data::flush(THD*, unsigned long long*, bool*) ()
#15 0x0000000000d609ff in binlog_cache_mngr::flush(THD*, unsigned long long*, bool*) ()
#16 0x0000000000d52316 in MYSQL_BIN_LOG::flush_thread_caches(THD*) ()
#17 0x0000000000d528c4 in MYSQL_BIN_LOG::process_flush_stage_queue(unsigned long long*, bool*, THD**, THD*) ()
#18 0x0000000000d543fd in MYSQL_BIN_LOG::ordered_commit(THD*, bool, bool) ()
#19 0x0000000000d5216c in MYSQL_BIN_LOG::commit(THD*, bool) ()
#20 0x0000000000681dd9 in ha_commit_trans(THD*, bool, bool) ()
#21 0x0000000000aab728 in trans_commit_stmt(THD*) ()
#22 0x000000000095e6bc in mysql_execute_command(THD*) ()
#23 0x000000000096391e in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#24 0x000000000094ca11 in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#25 0x000000000094acc8 in do_command(THD*) ()
#26 0x00000000008ed3e6 in do_handle_one_connection(THD*) ()
#27 0x00000000008ecb54 in handle_one_connection ()
#28 0x00000000013c6894 in pfs_spawn_thread ()
#29 0x00000030cc007851 in start_thread () from /lib64/libpthread.so.0
#30 0x00000030cbce767d in clone () from /lib64/libc.so.6

===============================================================
in *release* compile mode, if you want to repeat, the modification to the code is like:
--- a/sql/log_event.cc
+++ b/sql/log_event.cc
@@ -11507,6 +11507,8 @@ bool Rows_log_event::write_data_header(IO_CACHE *file)
   int6store(buf + RW_MAPID_OFFSET, m_table_id.id());
   int2store(buf + RW_FLAGS_OFFSET, m_flags);
   int rc = 0;
+
+  my_sleep(5000000);
   if (likely(!log_bin_use_v1_row_events))
   {
     /*

besides this comment out the debug related code in testcase.

I tried the testcase against 5.6.16 and 5.6.22.

Suggested fix:
the root of problem is inappropriate usage of global variable log_bin_use_v1_row_events, there are two references to this variable for Rows_log_event, one is at the ctor of Write/Delete/Update_rows_log_event, the other is at Rows_log_event::write_data_header(), when the global variable is changed just between these two places, inconsistency happens for the row event.
when slave parse this event in Rows_log_event::Rows_log_event(), the data_size is calculated to a minus value, whose type is unsigned, converted to a extremely large one, case malloc failed.

the fix is simple, use only one reference to the global variable

--- a/sql/log_event.cc
+++ b/sql/log_event.cc
@@ -11507,7 +11507,10 @@ bool Rows_log_event::write_data_header(IO_CACHE *file)
   int6store(buf + RW_MAPID_OFFSET, m_table_id.id());
   int2store(buf + RW_FLAGS_OFFSET, m_flags);
   int rc = 0;
-  if (likely(!log_bin_use_v1_row_events))
+
+  if (likely(!(m_type == WRITE_ROWS_EVENT_V1 ||
+               m_type == UPDATE_ROWS_EVENT_V1 ||
+               m_type == DELETE_ROWS_EVENT_V1 )))
   {
     /*
        v2 event, with variable header portion.

Back Trace

Attachment: backtrace_bug75746.txt (text/plain), 156.41 KiB.

Thank you for the bug report.