Bug #69758 Transactions skipped on slave after "Lost connection" using GTID replication
Submitted: 16 Jul 2013 21:16    Modified: 30 Oct 2013 9:08
Reporter: Kolbe Kegel
Status: Duplicate
Category: MySQL Server: Replication    Severity: S1 (Critical)
Version: 5.6.14    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: GTID

[16 Jul 2013 21:16] Kolbe Kegel
Description:
Several times, I have seen transactions skipped on the slave after an innocuous and automatically-recovered slave I/O connection problem.

global.gtid_executed on slave1:

216ab17d-9446-11e2-8465-0296f9f70ce0:1-175903064,
2c26d027-9461-11e2-8516-065e7e93e308:1-591025000:591025002-604755189,
59a0faf6-e9bb-11e2-b1a6-12171118b797:1-34878539,
59ce6087-e9b7-11e2-b18c-0296f9ea2b4d:1-49706943

Error log excerpt from slave1:

2013-07-13 21:03:18 11819 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2013-07-13 21:03:18 11819 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysqld-bin.002709' at position 2151119
2013-07-13 21:03:18 11819 [Warning] Storing MySQL user name or password information in the master.info repository is not secure and is therefore not recommended. Please see the MySQL Manual for more about this issue and possible alternatives.

From binary log:

#130713 14:02:39 server id 1364118399  end_log_pos 2151119 CRC32 0x27cde35c     GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '2c26d027-9461-11e2-8516-065e7e93e308:591025001'/*!*/;
# at 2151119

The error log shows the connection being lost and automatically re-established. However, you can see in global.gtid_executed on the slave that the transaction at that position was never executed.
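
The gap is easy to confirm with the GTID set functions. A minimal check on slave1 (the range passed in below is just a contiguous range built from the upper bound of slave1's own set; any range that covers :591025001 would do):

select gtid_subtract('2c26d027-9461-11e2-8516-065e7e93e308:1-604755189', @@global.gtid_executed);
-- returns 2c26d027-9461-11e2-8516-065e7e93e308:591025001, i.e. exactly the skipped transaction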

From slave2:

216ab17d-9446-11e2-8465-0296f9f70ce0:1-175903064,
2c26d027-9461-11e2-8516-065e7e93e308:1-604754697,
59a0faf6-e9bb-11e2-b1a6-12171118b797:1-34878210,
59ce6087-e9b7-11e2-b18c-0296f9ea2b4d:1-40778817:40778819-49399396:49399398-49706588

2013-07-14 04:03:17 4790 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2013-07-14 04:03:17 4790 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysqld-bin.000504' at position 2150529
2013-07-14 04:03:17 4790 [Warning] Storing MySQL user name or password information in the master.info repository is not secure and is therefore not recommended. Please see the MySQL Manual for more about this issue and possible alternatives.

#130713 21:02:42 server id 10140178  end_log_pos 2150348 CRC32 0x2affdfa7       GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '59ce6087-e9b7-11e2-b18c-0296f9ea2b4d:40778818'/*!*/;
# at 2150348
#130713 21:02:42 server id 10140178  end_log_pos 2150442 CRC32 0x5eadf3fe       Query   thread_id=310806        exec_time=0     error_code=0
SET TIMESTAMP=1373774562/*!*/;
BEGIN
/*!*/;
# at 2150442
#130713 21:02:42 server id 10140178  end_log_pos 2150529 CRC32 0x32524851       Table_map: `d1`.`t1` mapped to number 70
# at 2150529

2013-07-14 16:02:34 4790 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2013-07-14 16:02:34 4790 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mysqld-bin.000576' at position 91412639
2013-07-14 16:02:34 4790 [Warning] Storing MySQL user name or password information in the master.info repository is not secure and is therefore not recommended. Please see the MySQL Manual for more about this issue and possible alternatives.

#130714  9:01:19 server id 10140178  end_log_pos 91412545 CRC32 0xd6307679      GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '59ce6087-e9b7-11e2-b18c-0296f9ea2b4d:49399397'/*!*/;
# at 91412545
#130714  9:01:19 server id 10140178  end_log_pos 91412639 CRC32 0x18240e9b      Query   thread_id=368129        exec_time=0     error_code=0
SET TIMESTAMP=1373817679/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=100, @@session.auto_increment_offset=41/*!*/;
BEGIN
/*!*/;
# at 91412639

How to repeat:
I have not been able to artificially repeat this yet, as it seems to require a certain kind of network failure while the I/O thread is receiving a particular part of a transaction. However, I have seen this enough times to be certain that it is a serious and pervasive problem.

Suggested fix:
When the slave I/O thread automatically resumes replication after a connection problem, it must not do so in a way that allows transactions to be silently skipped by the SQL thread!

I think it's possible that the failure to properly execute/account for the transaction may be related to the fact that the SET GTID_NEXT statement is in a previous relay log and is lost when the relay log automatically rotates after the I/O thread disconnection.
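
For anyone hitting this in the meantime, a possible workaround (a sketch only, using the skipped GTID from the slave1 example above) is to inject an empty transaction for the missing GTID on the slave. This does not recover the lost data; it only records the GTID as executed so the sets are consistent again. The actual rows would still have to be re-applied from the master's binary log (for example, extracted with mysqlbinlog --include-gtids):

set gtid_next = '2c26d027-9461-11e2-8516-065e7e93e308:591025001';
begin; commit;
set gtid_next = 'AUTOMATIC';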
[18 Jul 2013 20:51] Kolbe Kegel
I managed to reproduce this with these steps:

1) attach gdb to mysqld process on master
2) create a breakpoint at rpl_master.cc:1147 (for 5.6.11)
3) insert something into a table on master
4) continue to one event *after* you get event_type==33 (GTID_LOG_EVENT) ... so keep continuing and printing event_type until you see event_type=GTID_LOG_EVENT and then continue once more
5) print net_flush(net) on master mysqld and leave it stopped at the breakpoint
6) STOP SLAVE on slave
7) disable breakpoint and continue master until it resumes normal execution
8) START SLAVE on slave
9) do another INSERT on the master
10) slave will have a gap

Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0)
    at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161
1161    in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc
(gdb) print event_type
$5 = ROTATE_EVENT
(gdb) c
Continuing.

Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0)
    at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161
1161    in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc
(gdb) print event_type
$6 = GTID_LOG_EVENT
(gdb) c
Continuing.

Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0)
    at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161
1161    in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc
(gdb) print event_type
$7 = QUERY_EVENT
(gdb) print net_flush(net)
$8 = 0 '\000'
(gdb) print coord_buf
$9 = {file_name = 0x7fbc354f7620 "./mysqld-bin.000003", pos = 1138}

slave> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.30.203
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000003
          Read_Master_Log_Pos: 1013
               Relay_Log_File: mysqld-relay-bin.000003
                Relay_Log_Pos: 451
        Relay_Master_Log_File: mysqld-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1013
              Relay_Log_Space: 1730
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 101201206
                  Master_UUID: 8af32317-efe9-11e2-99f4-080027afd7da
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4
            Executed_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4
                Auto_Position: 1
1 row in set (0.00 sec)

slave> stop slave;
Query OK, 0 rows affected (0.05 sec)

(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00000000008a76e3 in mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*)
                                                   at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1147
        breakpoint already hit 35 times
(gdb) disable breakpoint 1
(gdb) c
Continuing.

slave> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values ();
Query OK, 1 row affected (0.00 sec)

slave> select @@global.gtid_executed;
+--------------------------------------------+
| @@global.gtid_executed                     |
+--------------------------------------------+
| 8af32317-efe9-11e2-99f4-080027afd7da:1-4:6 |
+--------------------------------------------+
1 row in set (0.00 sec)

From slave relay logs...

The end of mysqld-relay-bin.000004:
# at 451
#130718 13:43:00 server id 101201206  end_log_pos 1061 CRC32 0x7c6c0c1f         GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '8af32317-efe9-11e2-99f4-080027afd7da:5'/*!*/;
# at 499
#130718 13:43:00 server id 101201206  end_log_pos 1138 CRC32 0x055d5f9f         Query   thread_id=3     exec_time=0     error_code=0
SET TIMESTAMP=1374180180/*!*/;
SET @@session.pseudo_thread_id=3/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1075838976/*!*/;
SET @@session.auto_increment_increment=100, @@session.auto_increment_offset=31/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 576
#130718 13:44:33 server id 101441248  end_log_pos 630 CRC32 0x42b934c3  Rotate to mysqld-relay-bin.000005  pos: 4
DELIMITER ;
# End of log file

And the beginning of mysqld-relay-bin.000005:
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#130718 13:44:33 server id 101441248  end_log_pos 120 CRC32 0x5e3eea03  Start: binlog v 4, server v 5.6.11-log created 130718 13:44:33
BINLOG '
sVPoUQ/g3gsGdAAAAHgAAABAAAQANS42LjExLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAQPq
Pl4=
'/*!*/;
# at 120
#130718 13:44:33 server id 101441248  end_log_pos 191 CRC32 0x1c9eb19e  Previous-GTIDs
# 8af32317-efe9-11e2-99f4-080027afd7da:1-5
# at 191
#691231 16:00:00 server id 101201206  end_log_pos 0 CRC32 0xb535c925    Rotate to mysqld-bin.000003  pos: 4
# at 239
#130718 13:35:11 server id 101201206  end_log_pos 120 CRC32 0x50cce632  Start: binlog v 4, server v 5.6.11-log created 130718 13:35:11 at startup
ROLLBACK/*!*/;
BINLOG '
f1HoUQ82NQgGdAAAAHgAAAAAAAQANS42LjExLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAB/UehREzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAATLm
zFA=
'/*!*/;
# at 355
#130718 13:44:33 server id 0  end_log_pos 403 CRC32 0xc82288fe  Rotate to mysqld-bin.000003  pos: 151
# at 403
#130718 13:44:33 server id 0  end_log_pos 451 CRC32 0xb6ddc136  Rotate to mysqld-bin.000003  pos: 1254
# at 451
#130718 13:45:49 server id 101201206  end_log_pos 1302 CRC32 0x1f20da1f         GTID [commit=yes]
SET @@SESSION.GTID_NEXT= '8af32317-efe9-11e2-99f4-080027afd7da:6'/*!*/;
# at 499
[18 Jul 2013 21:17] Kolbe Kegel
While this is certainly a pretty artificial way to reproduce this bug, it does show that if the connection between the slave and master is severed between the time the slave receives the GTID_LOG_EVENT and the time it receives the corresponding QUERY_EVENT, the slave may skip a transaction.

Here are a couple of lines from SHOW SLAVE STATUS after restarting the slave:

           Retrieved_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-6
            Executed_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4:6

Note that Retrieved_Gtid_Set includes 8af32317-efe9-11e2-99f4-080027afd7da:5 while Executed_Gtid_Set does not. Perhaps the issue here is as simple as the slave's Retrieved_Gtid_Set counter being advanced before the entire transaction is *actually* received by the slave and written to the relay log?

As can be seen in the relay log excerpts included above, the content of 8af32317-efe9-11e2-99f4-080027afd7da:5 was never actually written to the slave's binary log.
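
To make the mismatch concrete: 5.6 only exposes Retrieved_Gtid_Set via SHOW SLAVE STATUS, but once that value is pasted in by hand, GTID_SUBSET() shows directly that something was retrieved but never executed (a sketch using the sets from the repro above):

slave> select gtid_subset('8af32317-efe9-11e2-99f4-080027afd7da:1-6', @@global.gtid_executed) as fully_applied;
-- returns 0 here, because :5 is missing from the executed set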
[22 Jul 2013 15:11] MySQL Verification Team
Hello Kolbe,

Thank you for the bug report and reproducible test case. 
I was able to reproduce this about 1 out of 10 times.

Thanks,
Umesh
[22 Jul 2013 17:55] MySQL Verification Team
The MySQL bugs team would like to thank Mr. Kolbe Kegel very much for his contribution.
[22 Jul 2013 18:13] Kolbe Kegel
Umesh, can you please share the details of how you repeated this bug? Did you run mysqld under gdb or did you reproduce it more organically by applying a workload to the master and then somehow severing the connection to the slave? I'm interested to know how/why you are able to reproduce this only 1 out of 10 times.

Regardless, losing transactions for 1 out of every 10 connection failures still seems like a very severe problem, so I am glad to see that this bug is Verified and will be getting attention from Oracle!
[23 Jul 2013 10:43] MySQL Verification Team
Hello Kolbe,

Yes, I ran mysqld (master) under gdb and followed exactly what you outlined in your earlier note (18 Jul 20:51), except that I killed/started the slave server in steps (6)/(8) instead of running STOP SLAVE/START SLAVE on the slave.

Retrieved_Gtid_Set: 99691660-f0a6-11e2-9ec5-0800272e2cce:6-22
Executed_Gtid_Set: 99691660-f0a6-11e2-9ec5-0800272e2cce:6-21

// slave
mysql> select @@global.gtid_executed;
+-------------------------------------------+
| @@global.gtid_executed                    |
+-------------------------------------------+
| 99691660-f0a6-11e2-9ec5-0800272e2cce:6-21 |
+-------------------------------------------+

Note that Retrieved_Gtid_Set includes 99691660-f0a6-11e2-9ec5-0800272e2cce:6-22 while Executed_Gtid_Set does not, and despite stopping and starting the slave a couple of times, Executed_Gtid_Set never showed the last event that appears in Retrieved_Gtid_Set. I also noticed that "99691660-f0a6-11e2-9ec5-0800272e2cce:6-22" was logged into the relay log (unfortunately the entire datadir was overwritten in my next attempt to reproduce this).

But I was able to repeat this behavior only once; in later attempts, Executed_Gtid_Set and Retrieved_Gtid_Set matched every time.

Thanks,
Umesh
[7 Aug 2013 12:28] MySQL Verification Team
// 69943 - repeatable test case:

How to repeat:
1. Set up the master with GTID.

2. Create table t1
create table t1 (id int auto_increment primary key, value int, value2 varchar(10000)) engine=innodb;

3. Run a very long transaction on master (generating hundreds of megabytes of binary logs). I used a simple perl script tran_insert.pl as attached.
perl tran_insert.pl 5613 root 1 500000 500000
 (inserts 500000 rows into table t1 within one transaction; the script generates about 500MB of binary log)

4. Run a couple of short transactions on master
perl tran_insert.pl 5613 root 500001 501000 1
 (inserts 1 row per transaction, 1000 rows in total)

5. Set up a slave with GTID. Lower network bandwidth will make it easier to repeat the bug.
change master to master_host='master_host', master_port=5613, master_user='repl', master_auto_position=1;

6. Start replication, then stop immediately
start slave io_thread;
select sleep(3);
stop slave io_thread;

 (make sure not to fetch all 500MB binary logs. Decrease sleep seconds appropriately)

7. Restart replication
start slave io_thread;

8. Check relay logs or table t1. You'll be able to verify that the entire large transaction (500000 rows) is lost; a quick verification sketch follows below.
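
A rough verification sketch for step 8, on the slave (assuming t1 was created in the default database used in step 2; the exact counts depend on the arguments passed to the attached script):

slave> select count(*) from t1;
-- roughly 1000 rows (the short transactions from step 4) instead of 501000,
-- because the 500000-row transaction from step 3 was silently skipped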
[25 Sep 2013 22:57] Kolbe Kegel
This issue still exists in MySQL 5.6.14.
[30 Oct 2013 9:08] Venkatesh Duggirala
Thank you for the bug report. This is a duplicate of Bug #69943.
[22 Feb 2014 1:53] Arvind Sharma
This issue still exists in MySQL Community 5.6.16, and it's getting more frequent in my setup (CentOS 6.5, MySQL 5.6.16). This server is a slave used only for backups (it is not even used to query any of the tables). During nightly backups with Percona XtraBackup, when the SQL thread is stopped momentarily, transactions are skipped and eventually replication breaks. I think this should be a high-priority bug.