Bug #73727 Cannot do positions sync when auto_position is ON
Submitted: 26 Aug 2014 12:59 Modified: 16 Feb 2015 15:46
Reporter: Venkatesh Duggirala
Status: Closed
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: 5.7.2 OS: Any
Assigned to: CPU Architecture: Any

[26 Aug 2014 12:59] Venkatesh Duggirala
Description:
One of the inputs to "mysql-test/include/sync_slave_sql_with_master.inc" is  "ignore_gtids_on_sync"

#     Forces the use of master file and position, even if $use_gtids is set.
#     This might be used if the slave will not have all the GTIDs of the
#     master but have to read and apply all master events to the end.

This is broken in 5.7, although the variable works fine in 5.6.

Analysis:
=========

In WL#5721, we refactored the replication dump thread code extensively.
With the refactored code, it seems that previous_gtid_log_event is not
always sent.

In Binlog_sender::send_binlog:
{
  ....
  if (m_check_previous_gtid_event)
  {
    bool has_prev_gtid_ev;
    if (has_previous_gtid_log_event(log_cache, &has_prev_gtid_ev))
      return 1;

    if (!has_prev_gtid_ev)
      return 0;
  }
  ...
}

If the slave decides to wait on positions even though auto_position=1, the
sync times out: the master never sends previous_gtid_log_event, and the
slave keeps waiting until it receives one.

The above analysis was done as part of bug#19470658 (a test failure), which turned out to be a server issue.
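
The hang described above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the server code: the Event type, the last_position_known_to_slave function and the offsets 120, 150 and 400 are all hypothetical and chosen for illustration. The model assumes the slave learns a master offset only from events the dump thread actually sends.

```cpp
#include <cassert>
#include <vector>

// Event kinds that matter for this illustration (hypothetical, simplified).
enum class EventType { FormatDescription, PreviousGtids, Transaction };

// One binary log event; end_pos is the master's log offset after the event.
struct Event {
  EventType type;
  unsigned long end_pos;
};

// Toy dump thread (an assumption-laden model, not Binlog_sender itself).
// Returns the highest master offset the slave learns about from the events
// that are actually sent. When send_previous_gtids is false, Previous_gtids
// events are withheld, which models the reported behaviour.
unsigned long last_position_known_to_slave(const std::vector<Event>& binlog,
                                           bool send_previous_gtids)
{
  unsigned long known_pos = 0;
  for (const Event& ev : binlog) {
    if (ev.type == EventType::PreviousGtids && !send_previous_gtids)
      continue;  // withheld: the slave never sees this offset
    assert(ev.end_pos >= known_pos);  // offsets only grow within one file
    known_pos = ev.end_pos;
  }
  return known_pos;
}
```

In this model, a freshly restarted master whose log holds only a format description event (ending at 120) and a Previous_gtids event (ending at 150) leaves the slave at 120 when the event is withheld, so a wait for offset 150 cannot complete. Once a later transaction is sent, the slave moves past 150 regardless, which is consistent with the hang appearing only when nothing follows the Previous_gtids event.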

How to repeat:
1) Set up master and slave with Gtid_mode ON and auto_position=1
2) Restart the slave server with Gtid_mode ON and auto_position=1
3) Wait on positions to be synced => the wait hangs

If you skip step 2, there is no problem, because previous_gtid_log_event is sent in that case. After a restart, my guess is that the code pasted above causes previous_gtid_log_event to be skipped.

MTR Test script:

--source include/have_gtid.inc
--source include/have_binlog_format_statement.inc
#Step1: setup replication
--source include/master-slave.inc

#Step2: restart slave server
--let $rpl_server_number= 1
--source include/rpl_stop_server.inc
--let $rpl_start_with_gtids= 1
--source include/rpl_start_server.inc

# The sleep below is required to reproduce the hang reliably (100% probability).
sleep 10;

#Step3: Wait on positions by setting ignore_gtids_on_sync=1.
--connection master
--let $ignore_gtids_on_sync= 1
--source include/sync_slave_sql_with_master.inc
--let $ignore_gtids_on_sync= 0
--source include/rpl_end.inc

Suggested fix:

I am not sure whether it is a limitation that we cannot send previous_gtid_log_event, or whether it is a bug in our code.

If it is a limitation in 5.7 that we cannot sync with positions when auto_position=1 (which works in 5.6), then we should document this limitation and also remove the "ignore_gtids_on_sync" logic from the test framework, which relies on exactly that behaviour.

[16 Feb 2015 15:46] David Moss
Thanks for your feedback. The following was added to the 5.7.6 release notes:

In a replication topology where:

    the slave had GTID_MODE=ON and MASTER_AUTO_POSITION=1

    the master had GTID_MODE=ON and had not executed any transactions since it was started 

if the slave used the MASTER_POS_WAIT function to wait until it had received the full binary log from the master while the master had not executed any transactions, then the MASTER_POS_WAIT function would never finish, or would time out. This was caused because after a server restart, the master's binary log ends with a Previous_gtids_log_event, but this event was not being replicated, so the slave was not made aware of the master's binary log position. The fix ensures that the Previous_gtids_log_event is replicated correctly, so that the slave becomes aware of the correct binary log position on the master, ensuring that the MASTER_POS_WAIT function can finish.
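
To make the "never finish, or would time out" behaviour concrete, here is a minimal sketch of a bounded wait on binary log coordinates. It is not how MASTER_POS_WAIT is implemented: the Coordinates type, the helper names, the file names and the offsets are hypothetical. The sketch assumes coordinates are ordered first by the numeric suffix of the log file name and then by offset. Returning -1 when the wait gives up mirrors the documented timeout result of MASTER_POS_WAIT; the non-negative return value here is simply a poll index, not an event count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A binary log coordinate: file name plus byte offset within that file.
struct Coordinates {
  std::string file;
  unsigned long pos;
};

// Extracts the numeric suffix of a name such as "master-bin.000002".
unsigned long file_sequence(const std::string& name)
{
  const std::size_t dot = name.rfind('.');
  assert(dot != std::string::npos && dot + 1 < name.size());
  return std::stoul(name.substr(dot + 1));
}

// True when `current` is at or beyond `target` (later file, or same file
// with an offset that is at least the target offset).
bool has_reached(const Coordinates& current, const Coordinates& target)
{
  const unsigned long cur_seq = file_sequence(current.file);
  const unsigned long tgt_seq = file_sequence(target.file);
  if (cur_seq != tgt_seq)
    return cur_seq > tgt_seq;
  return current.pos >= target.pos;
}

// Hypothetical bounded wait: observed[i] is the coordinate the slave reports
// at poll i. Returns the index of the first poll that reaches the target,
// or -1 if none of the first max_polls polls does.
int wait_for_position(const std::vector<Coordinates>& observed,
                      const Coordinates& target, std::size_t max_polls)
{
  for (std::size_t i = 0; i < observed.size() && i < max_polls; ++i)
    if (has_reached(observed[i], target))
      return static_cast<int>(i);
  return -1;
}
```

A slave that keeps reporting an offset short of the target in the same file makes wait_for_position return -1, which corresponds to the timeout case. Once the slave is made aware of the master's position, the wait returns at the first poll that reaches it.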