Bug #85158 | heartbeats/fakerotate cause a forced sync_master_info | ||
---|---|---|---|
Submitted: | 23 Feb 2017 18:43 | Modified: | 23 Feb 2017 18:46 |
Reporter: | Trey Raymond | Email Updates: | |
Status: | Open | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.7.17 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | GTID, heartbeat |
[23 Feb 2017 18:43]
Trey Raymond
[23 Feb 2017 18:46]
Trey Raymond
tested in v5.7.17 those links were to an old fork because I'm bad at moving between cli and ui. here are the correct links: https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L8264-L8295 https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L3345-L3378
[1 Aug 2017 21:50]
Sveta Smirnova
test case for MTR
Attachment: rpl_bug85158.test (application/octet-stream, text), 4.45 KiB.
[1 Aug 2017 21:50]
Sveta Smirnova
configuration file
Attachment: rpl_bug85158.cnf (application/octet-stream, text), 13.25 KiB.
[1 Aug 2017 21:53]
Sveta Smirnova
Attached test case can be used to repeat the issue. In the end it will print something like: [connection server_1] select count(*) from t1; count(*) 1325 <<<<<<<< ==== Number of rows on master [connection server_4] start slave; select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info'; OBJECT_NAME count_star AVG_TIMER_WAIT/1024/1024/1024 slave_master_info 36 11.149680108763 <<<< Remember IO numbers select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info'; table_name io_write_requests io_write slave_master_info 11 176.00 KiB <<<< Remeber IO numbers select * from performance_schema.replication_connection_status; CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP channel_1 922a4d56-7701-11e7-8781-30b5c2208a0f 64 ON 4 2017-08-02 00:43:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00 channel_2 925a4dbc-7701-11e7-8691-30b5c2208a0f 70 ON 4 2017-08-02 00:43:24 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53 0 0000-00-00 00:00:00 channel_3 92a8be79-7701-11e7-86b1-30b5c2208a0f 76 ON 4 2017-08-02 00:43:26 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57 0 0000-00-00 00:00:00 select count(*) from t1; count(*) 1325 <<<<<<<< ==== Slave D received all changes from A [connection server_2] start slave; [connection server_3] start slave; [connection server_4] select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info'; OBJECT_NAME count_star AVG_TIMER_WAIT/1024/1024/1024 slave_master_info 46 22.199227280915 <<<<< 46 > 36 select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info'; table_name io_write_requests io_write slave_master_info 15 240.00 KiB <<<<< 15 > 11 select * from performance_schema.replication_connection_status; CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP channel_1 922a4d56-7701-11e7-8781-30b5c2208a0f 64 ON 7 2017-08-02 00:46:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00 channel_2 925a4dbc-7701-11e7-8691-30b5c2208a0f 70 ON 10 2017-08-02 00:45:29 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53:58-81 0 0000-00-00 00:00:00 channel_3 92a8be79-7701-11e7-86b1-30b5c2208a0f 76 ON 7 2017-08-02 00:45:33 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57:59-80 0 0000-00-00 00:00:00 select count(*) from t1; count(*) 1325 <<<<<<<< ==== No new rows added
[21 Sep 2017 13:29]
Vlad Lesin
1) When slave connects to master, it sends COM_BINLOG_DUMP_GTID command with executed gtids set, see: 0 Binlog_sender::Binlog_sender (this=0x7ffff1179560, thd=0x7fff94000b70, start_file=0x7ffff117a1c0 "", start_pos=4, exclude_gtids=0x7ffff1179f90, flag=0) at ./sql/rpl_binlog_sender.cc:55 1 0x0000000001936fcf in mysql_binlog_send (thd=0x7fff94000b70, log_ident=0x7ffff117a1c0 "", pos=4, slave_gtid_executed=0x7ffff1179f90, flags=0) at ./sql/rpl_master.cc:410 2 0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff94000b70, packet=0x7fff9400a761 "", packet_length=30) at ./sql/rpl_master.cc:396 3 0x0000000001646634 in dispatch_command (thd=0x7fff94000b70, com_data=0x7ffff117ade0, command=COM_BINLOG_DUMP_GTID) at ./sql/sql_parse.cc:1709 4 0x00000000016448d6 in do_command (thd=0x7fff94000b70) at ./sql/sql_parse.cc:1021 5 0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312 6 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188 7 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff117b700) at pthread_create.c:333 8 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 2) Then master starts sending events, if it reads binary logs, and if gtid of some event belongs to already executed slaves gtids set, then HEARTBEAT is sent. See: 0 Binlog_sender::send_heartbeat_event (this=0x7ffff11fb560, log_pos=39340) at ./sql/rpl_binlog_sender.cc:1145 1 0x000000000193a362 in Binlog_sender::send_events (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, end_pos=39340) at ./sql/rpl_binlog_sender.cc:517 2 0x0000000001939b41 in Binlog_sender::send_binlog (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, start_pos=123) at ./sql/rpl_binlog_sender.cc:348 3 0x000000000193944a in Binlog_sender::run (this=0x7ffff11fb560) at ./sql/rpl_binlog_sender.cc:225 4 0x0000000001936fde in mysql_binlog_send (thd=0x7fff9c027e10, log_ident=0x7ffff11fc1c0 "", pos=4, slave_gtid_executed=0x7ffff11fbf90, flags=0) at ./sql/rpl_master.cc:412 5 0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff9c027e10, packet=0x7fff9c0110d1 "", packet_length=86) at ./sql/rpl_master.cc:396 6 0x0000000001646634 in dispatch_command (thd=0x7fff9c027e10, com_data=0x7ffff11fcde0, command=COM_BINLOG_DUMP_GTID) at ./sql/sql_parse.cc:1709 7 0x00000000016448d6 in do_command (thd=0x7fff9c027e10) at ./sql/sql_parse.cc:1021 8 0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312 9 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188 10 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333 11 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 If we look into: int Binlog_sender::send_events(IO_CACHE *log_cache, my_off_t end_pos) { ... if (m_exclude_gtid && (in_exclude_group= skip_event(event_ptr, event_len, in_exclude_group))) { ... } ... if (unlikely(in_exclude_group)) { if (send_heartbeat_event(log_pos)) DBUG_RETURN(1); } ... } we will see how hearbeat is being sent. See also Binlog_sender::skip_event() to understand how event is treated to be skipped and Binlog_sender::send_heartbeat_event() to see how hearbeat is sent. 3) Slave receives the heartbeat, see bool queue_event(Master_info* mi,const char* buf, ulong event_len) { ... case binary_log::HEARTBEAT_LOG_EVENT: ... } and writes 'rotate log' event into relay log, see write_ignored_events_info_to_relay_log() from which flush_master_info() is invoked with 'force' parameter equal to true, what means mi repository will be updated ignoring sync_master_info server option. See also the following backtrace: 0 Rpl_info_table::do_flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_table.cc:201 1 0x000000000196537c in Rpl_info_handler::flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_handler.h:94 2 0x0000000001964182 in Master_info::flush_info (this=0x7fffa007a2c0, force=true) at ./sql/rpl_mi.cc:263 3 0x0000000001940f32 in flush_master_info (mi=0x7fffa007a2c0, force=true) at ./sql/rpl_slave.cc:1444 4 0x0000000001946a1f in write_ignored_events_info_to_relay_log (thd=0x7fff9c027e70, mi=0x7fffa007a2c0) at ./sql/rpl_slave.cc:3428 5 0x0000000001957203 in queue_event (mi=0x7fffa007a2c0, buf=0x7fff9c00a9f1 "", event_len=40) at ./sql/rpl_slave.cc:8467 6 0x000000000194ef74 in handle_slave_io (arg=0x7fffa007a2c0) at ./sql/rpl_slave.cc:5861 7 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x7fffa0013800) at ./storage/perfschema/pfs.cc:2188 8 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333 9 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 So if we have the following topology 1->2, 2->3, 1->3 and then stop slave 1->2, generate events on 1, then sync 1->3, and start 1->2, we will see a lot of hearbeats(greater or equal to the number of events, generated on 1) on 3, and a lot of writes(greater or equal to the number of hearbeats) in master info reposirory. Solution: The solution is not to force mi repository flush if hearbeat is received.