Bug #76491 IO thread stop with errno:1595 while rotate
Submitted: 26 Mar 2015 8:51 Modified: 20 Jan 2016 12:42
Reporter: zhang yingqiang (OCA) Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Replication Severity:S3 (Non-critical)
Version:5.6.16 OS:Linux
Assigned to: CPU Architecture:Any
Tags: binlog; rotate; io thread;

[26 Mar 2015 8:51] zhang yingqiang
Description:
When the pressure of the mysql server's IO is too heavy, it will cost a long time to flush the old binlog and create the new one.
So there will be several seconds between the ROTATE_EVENT and the FORMAT_DESCRIPTION_EVENT. And if the slave_net_timeout/MASTER_HEARTBEAT_PERIOD is
less then the cost time above, there will be a HEARTBEAT_LOG_EVENT between the ROTATE_EVENT and the FORMAT_DESCRIPTION_EVENT.
But the slave can not acccept that, slave process HEARTBEAT_LOG_EVENT in queue_old_event, and fail.

How to repeat:
$cat ../../mysql-test/t/bug_20150326.test 
--source include/master-slave.inc
connection slave;
set global slave_net_timeout=4;
stop slave;
change master to Master_Log_File='master-bin.000001', Master_Log_Pos=120, MASTER_HEARTBEAT_PERIOD=2;
start slave;
--sleep 1

connection master;
SET DEBUG="+d,binlog_after_rotate_before_format";
flush logs;

connection slave;
call mtr.add_suppression("Slave I/O: Relay log write failure: could not queue event from master, Error_code: 1595");
call mtr.add_suppression("Read invalid event from master:.*ound invalid event in binary log.*master could be corrupt but a more likely cause of this is a bug");
--source include/wait_for_slave_io_to_stop.inc
let $slave_io_errno= query_get_value(SHOW SLAVE STATUS, Last_IO_Errno, 1);
--echo !!!Here IO THREAD stoped!!!
--echo slave_io_errno: $slave_io_errno
start slave;

connection slave;
set global slave_net_timeout=120;
--source include/rpl_end.inc

Suggested fix:
Fix the slave logic to compat with the special order of binlog events in queue_event.
[26 Mar 2015 9:01] zhang yingqiang
need a patch to simulate the heavy IO

$git diff
diff --git a/sql/binlog.cc b/sql/binlog.cc
index 28d8c12..bb22d2c 100644
--- a/sql/binlog.cc
+++ b/sql/binlog.cc
@@ -4817,6 +4817,9 @@ int MYSQL_BIN_LOG::new_file_impl(bool need_lock_log, Format_description_log_even
   }
   update_binlog_end_pos();
 
+  /* For bug test by yingqiang.zyq */
+  DBUG_EXECUTE_IF("binlog_after_rotate_before_format", sleep(2););
+
   old_name=name;
   name=0;                              // Don't free name
   close(LOG_CLOSE_TO_BE_OPENED | LOG_CLOSE_INDEX);
[20 Jan 2016 12:42] MySQL Verification Team
Hello zhang.

Thank you for the report and test case.
I'm not seeing the issue with 5.6.28 and previous releases.
Could you please confirm with 5.6.28? If you can provide more information, feel free to add it to this bug and change the status back to 'Open'.

-- Build after patching(to simulate the heavy IO)

[umshastr@hod03]/export/umesh/server/source/bugs/mysql-5.6.28: cmake . -DCMAKE_BUILD_TYPE=Debug -DBUILD_CONFIG=mysql_release
[umshastr@hod03]/export/umesh/server/source/bugs/mysql-5.6.28:make -j16

--
==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
include/master-slave.inc
Warnings:
Note    ####    Sending passwords in plain text without SSL/TLS is extremely insecure.
Note    ####    Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[connection master]
set global slave_net_timeout=4;
Warnings:
Warning 1704    The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout.
stop slave;
change master to Master_Log_File='master-bin.000001', Master_Log_Pos=120, MASTER_HEARTBEAT_PERIOD=2;
start slave;
SET DEBUG="+d,binlog_after_rotate_before_format";
flush logs;
call mtr.add_suppression("Slave I/O: Relay log write failure: could not queue event from master, Error_code: 1595");
call mtr.add_suppression("Read invalid event from master:.*ound invalid event in binary log.*master could be corrupt but a more likely cause of this is a bug");
include/wait_for_slave_io_to_stop.inc
^^ was timing out after 300sec, replaced with wait_for_slave_io_error.inc but still not seeing IO issue.

Thanks,
Umesh