Bug #69943 Transactions skipped on slave after "stop/start slave" using GTID replication
Submitted: 7 Aug 2013 0:46 Modified: 30 Oct 2013 8:35
Reporter: Yoshinori Matsunobu (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.6.13 OS:Any
Assigned to: CPU Architecture:Any

[7 Aug 2013 0:46] Yoshinori Matsunobu
Description:
This is similar to bug#69758, but caused by normal "stop/start slave io_thread", and I could repeat 100% on my test servers.

How to repeat:
1. Setup master with gtid.

2. Create table t1
create table t1 (id int auto_increment primary key, value int, value2 varchar(10000)) engine=innodb;

3. Run a very long transaction on master (generating hundreds of megabytes of binary logs). I used a simple perl script tran_insert.pl as attached.
perl tran_insert.pl 5613 root 1 500000 500000
 (insert 500000 rows within one transaction, into table t1. This script generates 500MB binlog)

4. Run a couple of short transactions on master
perl tran_insert.pl 5613 root 500001 501000 1
 (insert 1 row by 1 transaction, 1000 rows in total)

5. Setup a slave with gtid. Lower network bandwidth will make it easy to repeat the bug.
change master to master_host='master_host', master_port=5613, master_user='repl', master_auto_position=1;

6. Start replication, then stop immediately
start slave io_thread;
select sleep(3);
stop slave io_thread;

 (make sure not to fetch all 500MB binary logs. Decrease sleep seconds appropriately)

7. Restart replication
start slave io_thread;

8. Check relay logs or table t1. You'll be able to verify that the entire large transaction (500000 rows) is lost.

Suggested fix:
When stopping io_thread (manually on this bug report, accidentally on #69758), I/O thread may not have received a transaction to the end. When reconnecting with master_auto_position=1, it tries to fetch from the new transaction. This causes transaction loss and should be fixed.
[7 Aug 2013 0:47] Yoshinori Matsunobu
test script to repeat the bug

Attachment: tran_insert.pl (application/octet-stream, text), 943 bytes.

[7 Aug 2013 6:19] MySQL Verification Team
Hello Yoshinori,

Thank you for the bug report and the test case. 
Verified as described on MySQL version 5.6.13

Thanks,
Umesh
[7 Aug 2013 6:20] MySQL Verification Team
// At the end of all steps.. confirms that the entire large transaction (500000 rows) is lost

// On Master

mysql> show master status;
+------------------+-----------+--------------+------------------+---------------------------------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                           |
+------------------+-----------+--------------+------------------+---------------------------------------------+
| mysql-bin.000003 | 581988168 |              |                  | 4470d4c7-ffe1-11e2-8214-49f3e6a792e3:1-1007 |
+------------------+-----------+--------------+------------------+---------------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|   501000 |
+----------+
1 row in set (0.97 sec)

// On slave

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 581988168
               Relay_Log_File: B-relay-bin.000003
                Relay_Log_Pos: 1321401
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 581988168
              Relay_Log_Space: 156422612
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 4470d4c7-ffe1-11e2-8214-49f3e6a792e3
             Master_Info_File: /tmp/B/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 4470d4c7-ffe1-11e2-8214-49f3e6a792e3:1-1007
            Executed_Gtid_Set: 4470d4c7-ffe1-11e2-8214-49f3e6a792e3:1:3-1007
                Auto_Position: 1
1 row in set (0.00 sec)

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|     1000 |
+----------+
1 row in set (0.00 sec)

mysql>
[7 Aug 2013 15:37] MySQL Verification Team
Tried the workaround relay_log_recovery=1 as suggested by venkat, issue is still reproducible:

from step 5 onwards from test case..
// On slave

mysql> change master to master_host='127.0.0.1', master_port=3306, master_user='repl',MASTER_PASSWORD='slavepass', master_auto_position=1;
Query OK, 0 rows affected, 2 warnings (0.04 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> stop slave;
Query OK, 0 rows affected (0.15 sec)

mysql> start slave io_thread;
Query OK, 0 rows affected (0.00 sec)

mysql> select sleep(3);
+----------+
| sleep(3) |
+----------+
|        0 |
+----------+
1 row in set (3.00 sec)

mysql> stop slave io_thread;
Query OK, 0 rows affected (0.02 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 581988168
               Relay_Log_File: B-relay-bin.000003
                Relay_Log_Pos: 202408
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 580869175
              Relay_Log_Space: 63704965
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 119
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 0380caa2-002f-11e3-840f-531e9aab8cbb
             Master_Info_File: /tmp/B/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 0380caa2-002f-11e3-840f-531e9aab8cbb:1-1007
            Executed_Gtid_Set: 0380caa2-002f-11e3-840f-531e9aab8cbb:1:3-154
                Auto_Position: 1
1 row in set (0.04 sec)

mysql> select count(*) from t1;
ERROR 1046 (3D000): No database selected
mysql> use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|     1000 |
+----------+
1 row in set (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 581988168
               Relay_Log_File: B-relay-bin.000004
                Relay_Log_Pos: 448
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 581988168
              Relay_Log_Space: 1321898
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 0380caa2-002f-11e3-840f-531e9aab8cbb
             Master_Info_File: /tmp/B/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 0380caa2-002f-11e3-840f-531e9aab8cbb:1-1007
            Executed_Gtid_Set: 0380caa2-002f-11e3-840f-531e9aab8cbb:1:3-1007
                Auto_Position: 1
1 row in set (0.00 sec)

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|     1000 |
+----------+
1 row in set (0.00 sec)

mysql> show global variables like 'relay%';
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| relay_log                 |                |
| relay_log_basename        |                |
| relay_log_index           |                |
| relay_log_info_file       | relay-log.info |
| relay_log_info_repository | FILE           |
| relay_log_purge           | ON             |
| relay_log_recovery        | ON             |
| relay_log_space_limit     | 0              |
+---------------------------+----------------+
8 rows in set (0.00 sec)

mysql> \! more B.cnf
[mysqld]
port               = 3307
socket             = /tmp/B/mysql.sock
basedir            = /data/ushastry/server/mysql-5.6.13
datadir            = /tmp/B/
tmpdir             = /tmp/B/
pid-file           = /tmp/B/B.pid
log-bin            = mysql-bin
server-id          = 2
log-error          = /tmp/B/B.err
report-host=SlaveB
report-port=3307

#replicate-ignore-table=database.table
log-slave-updates = 1
gtid_mode=ON
enforce-gtid-consistency=ON
relay-log-recovery=1
[7 Aug 2013 16:07] MySQL Verification Team
// with "--relay_log_info_repository=TABLE --master_info_repository=TABLE --sync_master_info=1 --relay-log-recovery=1" on slave.

// from step 5 onwards from test case..

mysql> SHOW GLOBAL VARIABLES WHERE Variable_Name LIKE 'relay_log_info_repository' OR Variable_Name LIKE 'master_info_repository' OR Variable_Name LIKE 'sync_master_info' OR  Variable_Name LIKE 'relay_log_recovery';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| master_info_repository    | TABLE |
| relay_log_info_repository | TABLE |
| relay_log_recovery        | ON    |
| sync_master_info          | 1     |
+---------------------------+-------+
4 rows in set (0.00 sec)

mysql>
mysql> change master to master_host='127.0.0.1', master_port=3306, master_user='repl',MASTER_PASSWORD='slavepass', master_auto_position=1;
Query OK, 0 rows affected, 2 warnings (0.06 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> stop slave;
Query OK, 0 rows affected (0.07 sec)

mysql> start slave io_thread;
Query OK, 0 rows affected (0.00 sec)

mysql> select sleep(3);
+----------+
| sleep(3) |
+----------+
|        0 |
+----------+
1 row in set (3.00 sec)

mysql> stop slave io_thread;
Query OK, 0 rows affected (0.01 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: System lock
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000003
          Read_Master_Log_Pos: 581155742
               Relay_Log_File: B-relay-bin.000004
                Relay_Log_Pos: 448
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 581039455
              Relay_Log_Space: 488200
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 135
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 3d2babcf-0033-11e3-842a-5d4541789890
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 3d2babcf-0033-11e3-842a-5d4541789890:1-373
            Executed_Gtid_Set: 3d2babcf-0033-11e3-842a-5d4541789890:1:3-283
                Auto_Position: 1
1 row in set (0.01 sec)

mysql> use test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|      999 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) from t1;
+----------+
| count(*) |
+----------+
|      999 |
+----------+
1 row in set (0.00 sec)
[7 Aug 2013 16:07] MySQL Verification Team
// On master

mysql>  select count(*) from t1;
+----------+
| count(*) |
+----------+
|   501000 |
+----------+
1 row in set (0.99 sec)

mysql> show master status;
+------------------+-----------+--------------+------------------+---------------------------------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                           |
+------------------+-----------+--------------+------------------+---------------------------------------------+
| mysql-bin.000003 | 581988168 |              |                  | 3d2babcf-0033-11e3-842a-5d4541789890:1-1007 |
+------------------+-----------+--------------+------------------+---------------------------------------------+
1 row in set (0.00 sec)
[7 Aug 2013 16:22] MySQL Verification Team
Conf files used for testing..

[root@cluster-repo mysql-5.6.13]# more A.cnf B.cnf
::::::::::::::
A.cnf
::::::::::::::
[mysqld]
port               = 3306
socket             = /tmp/A/mysql.sock
basedir            = /data/ushastry/server/mysql-5.6.13
datadir            = /tmp/A/
tmpdir             = /tmp/A/
pid-file           = /tmp/A/A.pid
log-bin=mysql-bin
server-id=1
log-error          = /tmp/A/A.err
report-host=SlaveA
report-port=3306

#replicate-ignore-table=database.table
log-slave-updates = 1
gtid_mode=ON
enforce-gtid-consistency=ON
::::::::::::::
B.cnf
::::::::::::::
[mysqld]
port               = 3307
socket             = /tmp/B/mysql.sock
basedir            = /data/ushastry/server/mysql-5.6.13
datadir            = /tmp/B/
tmpdir             = /tmp/B/
pid-file           = /tmp/B/B.pid
log-bin            = mysql-bin
server-id          = 2
log-error          = /tmp/B/B.err
report-host=SlaveB
report-port=3307

#replicate-ignore-table=database.table
log-slave-updates = 1
gtid_mode=ON
enforce-gtid-consistency=ON

// This is how I started master/slave server

Master:

bin/mysqld_safe --defaults-file=./A.cnf --user=mysql &

Slave:

bin/mysqld_safe --defaults-file=./B.cnf --user=mysql --relay_log_info_repository=TABLE --master_info_repository=TABLE --sync_master_info=1 --relay-log-recovery=1 &
[30 Oct 2013 8:35] Jon Stephens
Fixed in 5.6+. Documented fix in the 5.6.15 and 5.7.3 changelogs, as follows:

    When stopping the I/O thread, it was possible with a very large
    transaction (equivalent to a binary log size greater than 100MB)
    that the thread did not receive the transaction to the end. When
    reconnecting with MASTER_AUTO_POSITION=1, it then tried to fetch 
    changes from the next transaction, which could lead to loss of 
    the incomplete transaction and its data.

Closed.
[30 Oct 2013 9:04] Venkatesh Duggirala
Bug #69758 marked as duplicate of this one.
[4 Dec 2013 12:10] Laurynas Biveinis
5.6$ bzr log -r 5567
------------------------------------------------------------
revno: 5567
committer: Venkatesh Duggirala<venkatesh.duggirala@oracle.com>
branch nick: mysql-5.6
timestamp: Wed 2013-10-30 11:04:38 +0530
message:
  Bug#17280176 TRANSACTIONS SKIPPED ON SLAVE AFTER
  "STOP/START SLAVE" USING GTID REPLICATION
  
  Analysis: Slave updates 'GTID_RETRIEVED' set upon receiving
  GTID_LOG_EVENT for a particular transaction which is first
  event in the event group. Say, I/O thread is stopped *after*
  adding GTID number to 'gtid_trieved' set and *before* it
  actually retrieves all the events from that GTID event
  group. Next time when this I/O thread is reconnected,
  it sends union of GTID_RETRIEVED + GTID_EXECUTED
  set to master. So Master thinks that slave is having all the
  events from this GTID set(which includes partially
  retrieved GTID) and it will not resend them again.
  Hence slave is missing some events for ever.
  
  Fix: It is not easy to find the end of a group of events.
  So mysql server is unsure whether I/O thread retrieved the
  last gtid transaction events completely or not
  (before it is going down because of a crash/normal
  shutdown/normal stop slave io_thread). It is possible that
  I/O thread would have retrieved and written only partial
  transaction events. So Server will request Master to send
  the last gtid event once again. We do this by removing
  the last I/O thread retrieved gtid event from
  "Retrieved_gtid_set".
  Possible cases:
  1) I/O thread would have retrieved full
  transaction already in the first time itself, but retrieving
  them again will not cause problem because GTID number is same,
  Hence SQL thread will not commit it again.
  2) I/O thread would have retrieved full transaction already and
  SQL thread would have already executed it. In that case,
  We are not going remove last retrieved gtid from
  "Retrieved_gtid_set" otherwise we will see gaps in "Retrieved set".
[3 Feb 2014 10:12] Laurynas Biveinis
5.6$ bzr log -r 5586
------------------------------------------------------------
revno: 5586
committer: Venkatesh Duggirala<venkatesh.duggirala@oracle.com>
branch nick: mysql-5.6
timestamp: Mon 2013-11-04 21:47:16 +0530
message:
  Bug#17280176 TRANSACTIONS SKIPPED ON SLAVE AFTER
  "STOP/START SLAVE" USING GTID REPLICATION
  
  Fixing post push solaris compiler warning
  "last_retrieved_gtid hides Relay_log_info::last_retrieved_gtid."