Bug #85158 heartbeats/fakerotate cause a forced sync_master_info
Submitted: 23 Feb 2017 18:43 Modified: 23 Feb 2017 18:46
Reporter: Trey Raymond Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.17 OS:Any
Assigned to:
Tags: GTID, heartbeat

[23 Feb 2017 18:43] Trey Raymond
Description:
With master_info_repository=TABLE and variable sync_master_info=0, we should only see table mysql.slave_master_info updated on log file rotation.  That is the documented and intended design, and can significantly reduce load on slaves, allowing much greater replication throughput.  Assuming a similar issue would come up with FILE repo, but that ends up as fsyncs of the file vs writes and commits to the table.  Using TABLE for context here.

However, there's a case where some code crosses paths.  The code handling [already-executed] gtid event skips uses a dummy heartbeat (treated as a fake rotate event) to signal the end of a skip period.  Sounds great, until you get to the slave handling code, which takes that event and forces a master info sync on every heartbeat - even these dummy ones.

This can, even in a somewhat simple setup, lead to a ~5X increase in write load on a slave.  Huge amount of wasted resources, very limiting to replication throughput.

fake hb handling:
https://github.com/aradapilot/mysql-server/blob/5.7/sql/rpl_slave.cc#L8091-L8122

which in turn calls:
https://github.com/aradapilot/mysql-server/blob/5.7/sql/rpl_slave.cc#L3196-L3229

note "flush_master_info(mi, TRUE)" - true here being force flush, regardless of sync settings.  force is only intended to be used for things like repo init, stop slave, and *real* log rotates.

How to repeat:
Generate a lot of nonsequential skips to demonstrate.  A very basic msr setup will do that under load:

gtid on, of course
- Server A, writable master.  msr replication from B and C.
- Servers B and C, backup masters.  msr replication from A,C and A,B respectively.
- Server D, msr slave of all three, no log_bin (keep it fast).

Now, apply a significant write load to A (sysbench works fine) - I found it nice to organically lag B and C with load to A, but you can stop their sql threads for a period to simulate lag.  If you do that, restart them both after a while.
I recommend stopping the writes to A here, so that D is caught up and not executing any row changes in the data - this helps to isolate load caused by these heartbeats.

Now, they'll be chugging through a lot of dupes and sending them down to D as heartbeats.  In a pretty short test, I saw ~60 million heartbeats received on D (from p_s.replication_connection_status) - ~30M from each of B and C, only a few from A.  Checking the same values on B and C, still just a few on each channel.

Now there's a lot of IO happening, considering D isn't actually executing any changes.  Checking p_s.table_io_waits_summary_by_table, the vast majority of load is on slave_master_info.  The wait count was 4X as much as for the actual data table being written to.  The total iowait time was 6X as much.

Suggested fix:
Two ways to go about this, depending on the intentions of this process.

First, is it really necessary to send so many heartbeats just for gtid skips?  I don't see that adding value, even if you do sync master info every trx.  They're skips, they're specifically intended to do nothing.  Eliminating them (on the rpl_master.cc side) would be the best solution - but that code was written for a purpose, someone must want them, a setting to control the behavior would be good.  Or it might be very important for a reason I don't see in the code!

Second, if it's actually important, and even turning it off via a setting is impractical...take away the force sync.  Change that "true" in rpl_slave.cc to "false" so it obeys the same rules other events do (the documented rules!)
[23 Feb 2017 18:46] Trey Raymond
tested in v5.7.17
those links were to an old fork because I'm bad at moving between cli and ui.  here are the correct links:

https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L8264-L8295
https://github.com/mysql/mysql-server/blob/5.7/sql/rpl_slave.cc#L3345-L3378
[1 Aug 2017 21:50] Sveta Smirnova
test case for MTR

Attachment: rpl_bug85158.test (application/octet-stream, text), 4.45 KiB.

[1 Aug 2017 21:50] Sveta Smirnova
configuration file

Attachment: rpl_bug85158.cnf (application/octet-stream, text), 13.25 KiB.

[1 Aug 2017 21:53] Sveta Smirnova
Attached test case can be used to repeat the issue.

In the end it will print something like:

[connection server_1]
select count(*) from t1;
count(*)
1325                         <<<<<<<< ==== Number of rows on master
[connection server_4]
start slave;
select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info';
OBJECT_NAME	count_star	AVG_TIMER_WAIT/1024/1024/1024
slave_master_info	36	11.149680108763      <<<< Remember IO numbers
select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info';
table_name	io_write_requests	io_write
slave_master_info	11	176.00 KiB           <<<< Remeber IO numbers
select * from performance_schema.replication_connection_status;
CHANNEL_NAME	GROUP_NAME	SOURCE_UUID	THREAD_ID	SERVICE_STATE	COUNT_RECEIVED_HEARTBEATS	LAST_HEARTBEAT_TIMESTAMP	RECEIVED_TRANSACTION_SET	LAST_ERROR_NUMBER	LAST_ERROR_MESSAGE	LAST_ERROR_TIMESTAMP
channel_1		922a4d56-7701-11e7-8781-30b5c2208a0f	64	ON	4	2017-08-02 00:43:22	922a4d56-7701-11e7-8781-30b5c2208a0f:1-357	0		0000-00-00 00:00:00
channel_2		925a4dbc-7701-11e7-8691-30b5c2208a0f	70	ON	4	2017-08-02 00:43:24	922a4d56-7701-11e7-8781-30b5c2208a0f:1-53	0		0000-00-00 00:00:00
channel_3		92a8be79-7701-11e7-86b1-30b5c2208a0f	76	ON	4	2017-08-02 00:43:26	922a4d56-7701-11e7-8781-30b5c2208a0f:1-57	0		0000-00-00 00:00:00
select count(*) from t1;
count(*)
1325                         <<<<<<<< ==== Slave D received all changes from A
[connection server_2]
start slave;
[connection server_3]
start slave;
[connection server_4]
select OBJECT_NAME, count_star, AVG_TIMER_WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME ='slave_master_info';
OBJECT_NAME	count_star	AVG_TIMER_WAIT/1024/1024/1024
slave_master_info	46	22.199227280915        <<<<< 46 > 36
select table_name, io_write_requests, io_write from sys.schema_table_statistics where table_name='slave_master_info';
table_name	io_write_requests	io_write
slave_master_info	15	240.00 KiB             <<<<< 15 > 11
select * from performance_schema.replication_connection_status;
CHANNEL_NAME	GROUP_NAME	SOURCE_UUID	THREAD_ID	SERVICE_STATE	COUNT_RECEIVED_HEARTBEATS	LAST_HEARTBEAT_TIMESTAMP	RECEIVED_TRANSACTION_SET	LAST_ERROR_NUMBER	LAST_ERROR_MESSAGE	LAST_ERROR_TIMESTAMP
channel_1		922a4d56-7701-11e7-8781-30b5c2208a0f	64	ON	7	2017-08-02 00:46:22	922a4d56-7701-11e7-8781-30b5c2208a0f:1-357	0		0000-00-00 00:00:00
channel_2		925a4dbc-7701-11e7-8691-30b5c2208a0f	70	ON	10	2017-08-02 00:45:29	922a4d56-7701-11e7-8781-30b5c2208a0f:1-53:58-81	0		0000-00-00 00:00:00
channel_3		92a8be79-7701-11e7-86b1-30b5c2208a0f	76	ON	7	2017-08-02 00:45:33	922a4d56-7701-11e7-8781-30b5c2208a0f:1-57:59-80	0		0000-00-00 00:00:00
select count(*) from t1;
count(*)
1325                         <<<<<<<< ==== No new rows added
[21 Sep 2017 13:29] Vlad Lesin
1) When slave connects to master, it sends COM_BINLOG_DUMP_GTID command with
    executed gtids set, see:
    
    0  Binlog_sender::Binlog_sender (this=0x7ffff1179560, thd=0x7fff94000b70, start_file=0x7ffff117a1c0 "", start_pos=4, exclude_gtids=0x7ffff1179f90, flag=0)
        at ./sql/rpl_binlog_sender.cc:55
    1  0x0000000001936fcf in mysql_binlog_send (thd=0x7fff94000b70, log_ident=0x7ffff117a1c0 "", pos=4, slave_gtid_executed=0x7ffff1179f90, flags=0)
        at ./sql/rpl_master.cc:410
    2  0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff94000b70, packet=0x7fff9400a761 "", packet_length=30)
        at ./sql/rpl_master.cc:396
    3  0x0000000001646634 in dispatch_command (thd=0x7fff94000b70, com_data=0x7ffff117ade0, command=COM_BINLOG_DUMP_GTID)
        at ./sql/sql_parse.cc:1709
    4  0x00000000016448d6 in do_command (thd=0x7fff94000b70) at ./sql/sql_parse.cc:1021
    5  0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312
    6  0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188
    7  0x00007ffff6f5e6ba in start_thread (arg=0x7ffff117b700) at pthread_create.c:333
    8  0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
    
    2) Then master starts sending events, if it reads binary logs, and if gtid of
    some event belongs to already executed slaves gtids set, then HEARTBEAT is sent.
    
    See:
    0  Binlog_sender::send_heartbeat_event (this=0x7ffff11fb560, log_pos=39340) at ./sql/rpl_binlog_sender.cc:1145
    1  0x000000000193a362 in Binlog_sender::send_events (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, end_pos=39340)
        at ./sql/rpl_binlog_sender.cc:517
    2  0x0000000001939b41 in Binlog_sender::send_binlog (this=0x7ffff11fb560, log_cache=0x7ffff11fafa0, start_pos=123)
        at ./sql/rpl_binlog_sender.cc:348
    3  0x000000000193944a in Binlog_sender::run (this=0x7ffff11fb560) at ./sql/rpl_binlog_sender.cc:225
    4  0x0000000001936fde in mysql_binlog_send (thd=0x7fff9c027e10, log_ident=0x7ffff11fc1c0 "", pos=4, slave_gtid_executed=0x7ffff11fbf90, flags=0)
        at ./sql/rpl_master.cc:412
    5  0x0000000001936e75 in com_binlog_dump_gtid (thd=0x7fff9c027e10, packet=0x7fff9c0110d1 "", packet_length=86)
        at ./sql/rpl_master.cc:396
    6  0x0000000001646634 in dispatch_command (thd=0x7fff9c027e10, com_data=0x7ffff11fcde0, command=COM_BINLOG_DUMP_GTID)
        at ./sql/sql_parse.cc:1709
    7  0x00000000016448d6 in do_command (thd=0x7fff9c027e10) at ./sql/sql_parse.cc:1021
    8  0x0000000001796a22 in handle_connection (arg=0x34bd880) at ./sql/conn_handler/connection_handler_per_thread.cc:312
    9  0x0000000001e7cd03 in pfs_spawn_thread (arg=0x3902460) at ./storage/perfschema/pfs.cc:2188
    10 0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333
    11 0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
    
    If we look into:
    
    int Binlog_sender::send_events(IO_CACHE *log_cache, my_off_t end_pos)
    {
    ...
        if (m_exclude_gtid && (in_exclude_group= skip_event(event_ptr, event_len,
                                                            in_exclude_group)))
        {
    ...
        }
    ...
      if (unlikely(in_exclude_group))
      {
        if (send_heartbeat_event(log_pos))
          DBUG_RETURN(1);
      }
    ...
    }
    
    we will see how hearbeat is being sent.
    
    See also Binlog_sender::skip_event() to understand how event is treated to be
    skipped and Binlog_sender::send_heartbeat_event() to see how hearbeat is sent.
    
    3) Slave receives the heartbeat, see
    
    bool queue_event(Master_info* mi,const char* buf, ulong event_len)
    {
    ...
      case binary_log::HEARTBEAT_LOG_EVENT:
    ...
    }
    
    and writes 'rotate log' event into relay log, see
    write_ignored_events_info_to_relay_log() from which flush_master_info() is
    invoked with 'force' parameter equal to true, what means mi repository will be
    updated ignoring sync_master_info server option.
    
    See also the following backtrace:
    
    0  Rpl_info_table::do_flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_table.cc:201
    1  0x000000000196537c in Rpl_info_handler::flush_info (this=0x7fffa0f39950, force=true) at ./sql/rpl_info_handler.h:94
    2  0x0000000001964182 in Master_info::flush_info (this=0x7fffa007a2c0, force=true) at ./sql/rpl_mi.cc:263
    3  0x0000000001940f32 in flush_master_info (mi=0x7fffa007a2c0, force=true) at ./sql/rpl_slave.cc:1444
    4  0x0000000001946a1f in write_ignored_events_info_to_relay_log (thd=0x7fff9c027e70, mi=0x7fffa007a2c0)
        at ./sql/rpl_slave.cc:3428
    5  0x0000000001957203 in queue_event (mi=0x7fffa007a2c0, buf=0x7fff9c00a9f1 "", event_len=40) at ./sql/rpl_slave.cc:8467
    6  0x000000000194ef74 in handle_slave_io (arg=0x7fffa007a2c0) at ./sql/rpl_slave.cc:5861
    7  0x0000000001e7cd03 in pfs_spawn_thread (arg=0x7fffa0013800) at ./storage/perfschema/pfs.cc:2188
    8  0x00007ffff6f5e6ba in start_thread (arg=0x7ffff11fd700) at pthread_create.c:333
    9  0x00007ffff63f33dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
    
    So if we have the following topology 1->2, 2->3, 1->3 and then stop slave 1->2,
    generate events on 1, then sync 1->3, and start 1->2, we will see a lot of
    hearbeats(greater or equal to the number of events, generated on 1) on 3,
    and a lot of writes(greater or equal to the number of hearbeats) in
    master info reposirory.
    
    Solution:
    
    The solution is not to force mi repository flush if hearbeat is received.