Bug #85158 | heartbeats/fakerotate cause a forced sync_master_info | ||
---|---|---|---|
Submitted: | 23 Feb 2017 18:43 | Modified: | 21 Feb 2019 14:17 |
Reporter: | Trey Raymond | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.7.17 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | GTID, heartbeat |
[23 Feb 2017 18:43]
Trey Raymond
[23 Feb 2017 18:46]
Trey Raymond
tested in v5.7.17 those links were to an old fork because I'm bad at moving between cli and ui. here are the correct links: https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L8264-L8295 https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L3345-L3378
[1 Aug 2017 21:50]
Sveta Smirnova
test case for MTR
Attachment: rpl_bug85158.test (application/octet-stream, text), 4.45 KiB.
[1 Aug 2017 21:50]
Sveta Smirnova
configuration file
Attachment: rpl_bug85158.cnf (application/octet-stream, text), 13.25 KiB.
[1 Aug 2017 21:53]
Sveta Smirnova
Attached test case can be used to repeat the issue. In the end it will print something like: [connection server_1] select count(*) from t1; count(*) 1325 <<<<<<<< ==== Number of rows on master [connection server_4] start slave; select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info'; OBJECT_NAME count_star AVG_TIMER_WAIT/1024/1024/1024 slave_master_info 36 11.149680108763 <<<< Remember IO numbers select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info'; table_name io_write_requests io_write slave_master_info 11 176.00 KiB <<<< Remeber IO numbers select * from performance_schema.replication_connection_status; CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP channel_1 922a4d56-7701-11e7-8781-30b5c2208a0f 64 ON 4 2017-08-02 00:43:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00 channel_2 925a4dbc-7701-11e7-8691-30b5c2208a0f 70 ON 4 2017-08-02 00:43:24 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53 0 0000-00-00 00:00:00 channel_3 92a8be79-7701-11e7-86b1-30b5c2208a0f 76 ON 4 2017-08-02 00:43:26 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57 0 0000-00-00 00:00:00 select count(*) from t1; count(*) 1325 <<<<<<<< ==== Slave D received all changes from A [connection server_2] start slave; [connection server_3] start slave; [connection server_4] select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info'; OBJECT_NAME count_star AVG_TIMER_WAIT/1024/1024/1024 slave_master_info 46 22.199227280915 <<<<< 46 > 36 select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info'; table_name io_write_requests io_write slave_master_info 15 240.00 KiB <<<<< 15 > 11 select * from performance_schema.replication_connection_status; CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP channel_1 922a4d56-7701-11e7-8781-30b5c2208a0f 64 ON 7 2017-08-02 00:46:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00 channel_2 925a4dbc-7701-11e7-8691-30b5c2208a0f 70 ON 10 2017-08-02 00:45:29 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53:58-81 0 0000-00-00 00:00:00 channel_3 92a8be79-7701-11e7-86b1-30b5c2208a0f 76 ON 7 2017-08-02 00:45:33 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57:59-80 0 0000-00-00 00:00:00 select count(*) from t1; count(*) 1325 <<<<<<<< ==== No new rows added
[21 Sep 2017 13:29]
Vlad Lesin
1) When slave connects to master, it sends COM_BINLOG_DUMP_GTID command with executed gtids set, see: 0 Binlog_sender::Binlog_sender (this=0x7ffff1179560, thd=0x7fff94000b70, start_file=0x7ffff117a1c0 "", start_pos=4, exclude_gtids=0x7ffff1179f90, flag=0) at ./sql/rpl_binlog_sender.cc:55 1 0x0000000001936fcf in mysql_binlog_send (thd=0x7fff94000b70, log_ident=0x7ffff117a1c0 "", pos=4, slave_gtid_executed=0x7ffff1179f90, flags=0) at ./sql/rpl_master.cc:410 2 0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff94000b70, packet=0x7fff9400a761 "", packet_length=30) at ./sql/rpl_master.cc:396 3 0x0000000001646634 in dispatch_command (thd=0x7fff94000b70, com_data=0x7ffff117ade0, command=COM_BINLOG_DUMP_GTID) at ./sql/sql_parse.cc:1709 4 0x00000000016448d6 in do_command (thd=0x7fff94000b70) at ./sql/sql_parse.cc:1021 5 0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312 6 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188 7 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff117b700) at pthread_create.c:333 8 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 2) Then master starts sending events, if it reads binary logs, and if gtid of some event belongs to already executed slaves gtids set, then HEARTBEAT is sent. See: 0 Binlog_sender::send_heartbeat_event (this=0x7ffff11fb560, log_pos=39340) at ./sql/rpl_binlog_sender.cc:1145 1 0x000000000193a362 in Binlog_sender::send_events (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, end_pos=39340) at ./sql/rpl_binlog_sender.cc:517 2 0x0000000001939b41 in Binlog_sender::send_binlog (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, start_pos=123) at ./sql/rpl_binlog_sender.cc:348 3 0x000000000193944a in Binlog_sender::run (this=0x7ffff11fb560) at ./sql/rpl_binlog_sender.cc:225 4 0x0000000001936fde in mysql_binlog_send (thd=0x7fff9c027e10, log_ident=0x7ffff11fc1c0 "", pos=4, slave_gtid_executed=0x7ffff11fbf90, flags=0) at ./sql/rpl_master.cc:412 5 0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff9c027e10, packet=0x7fff9c0110d1 "", packet_length=86) at ./sql/rpl_master.cc:396 6 0x0000000001646634 in dispatch_command (thd=0x7fff9c027e10, com_data=0x7ffff11fcde0, command=COM_BINLOG_DUMP_GTID) at ./sql/sql_parse.cc:1709 7 0x00000000016448d6 in do_command (thd=0x7fff9c027e10) at ./sql/sql_parse.cc:1021 8 0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312 9 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188 10 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333 11 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 If we look into: int Binlog_sender::send_events(IO_CACHE *log_cache, my_off_t end_pos) { ... if (m_exclude_gtid && (in_exclude_group= skip_event(event_ptr, event_len, in_exclude_group))) { ... } ... if (unlikely(in_exclude_group)) { if (send_heartbeat_event(log_pos)) DBUG_RETURN(1); } ... } we will see how hearbeat is being sent. See also Binlog_sender::skip_event() to understand how event is treated to be skipped and Binlog_sender::send_heartbeat_event() to see how hearbeat is sent. 3) Slave receives the heartbeat, see bool queue_event(Master_info* mi,const char* buf, ulong event_len) { ... case binary_log::HEARTBEAT_LOG_EVENT: ... } and writes 'rotate log' event into relay log, see write_ignored_events_info_to_relay_log() from which flush_master_info() is invoked with 'force' parameter equal to true, what means mi repository will be updated ignoring sync_master_info server option. See also the following backtrace: 0 Rpl_info_table::do_flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_table.cc:201 1 0x000000000196537c in Rpl_info_handler::flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_handler.h:94 2 0x0000000001964182 in Master_info::flush_info (this=0x7fffa007a2c0, force=true) at ./sql/rpl_mi.cc:263 3 0x0000000001940f32 in flush_master_info (mi=0x7fffa007a2c0, force=true) at ./sql/rpl_slave.cc:1444 4 0x0000000001946a1f in write_ignored_events_info_to_relay_log (thd=0x7fff9c027e70, mi=0x7fffa007a2c0) at ./sql/rpl_slave.cc:3428 5 0x0000000001957203 in queue_event (mi=0x7fffa007a2c0, buf=0x7fff9c00a9f1 "", event_len=40) at ./sql/rpl_slave.cc:8467 6 0x000000000194ef74 in handle_slave_io (arg=0x7fffa007a2c0) at ./sql/rpl_slave.cc:5861 7 0x0000000001e7cd03 in pfs_spawn_thread (arg=0x7fffa0013800) at ./storage/perfschema/pfs.cc:2188 8 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333 9 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 So if we have the following topology 1->2, 2->3, 1->3 and then stop slave 1->2, generate events on 1, then sync 1->3, and start 1->2, we will see a lot of hearbeats(greater or equal to the number of events, generated on 1) on 3, and a lot of writes(greater or equal to the number of hearbeats) in master info reposirory. Solution: The solution is not to force mi repository flush if hearbeat is received.
[21 Feb 2019 14:17]
Margaret Fisher
Posted by developer: Changelog entry added for MySQL 8.0.16 and 5.7.26: With GTIDs in use on the server, the master info log on a replication slave was being synchronized every time the master skipped a transaction using the auto-skip function. The process ends with a dummy heartbeat which is sent to the slave and caused a forced flush to the log, and this could have a large cumulative impact on the write load on the slave. The same issue could occur in a circular replication topology with events that originated from the same server and were therefore ignored, which were also handled by the slave with a forced flush to the log. The slave handling code has now been changed to remove the forced flush for heartbeat events and for ignored events received through circular replication, so that the master info log is only synchronized when appropriate (for example, when a CHANGE MASTER statement is issued, or the binary log is rotated).