Bug #69758 | Transactions skipped on slave after "Lost connection" using GTID replication | ||
---|---|---|---|
Submitted: | 16 Jul 2013 21:16 | Modified: | 30 Oct 2013 9:08 |
Reporter: | Kolbe Kegel | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
Version: | 5.6.14 | OS: | Linux |
Assigned to: | CPU Architecture: | Any | |
Tags: | GTID |
[16 Jul 2013 21:16]
Kolbe Kegel
[18 Jul 2013 20:51]
Kolbe Kegel
I managed to reproduce this with these steps: 1) attach gdb to mysqld process on master 2) create a breakpoint at rpl_master.cc:1147 (for 5.6.11) 3) insert something into a table on master 4) continue to one event *after* you get event_type==33 (GTID_LOG_EVENT) ... so keep continuing and printing query_type until you see query_type=GTID_LOG_EVENT and then continue once more 5) print net_flush(net) on master mysqld and leave it stopped at the breakpoint 6) STOP SLAVE on slave 7) disable breakpoint and continue master until it resumes normal execution 8) START SLAVE on slave 9) do another INSERT on the master 10) slave will have a gap Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0) at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161 1161 in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc (gdb) print event_type $5 = ROTATE_EVENT (gdb) c Continuing. Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0) at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161 1161 in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc (gdb) print event_type $6 = GTID_LOG_EVENT (gdb) c Continuing. Breakpoint 1, mysql_binlog_send (thd=0x2adc130, log_ident=<optimized out>, pos=<optimized out>, slave_gtid_executed=0x7fbc354f83a0) at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1161 1161 in /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc (gdb) print event_type $7 = QUERY_EVENT (gdb) print net_flush(net) $8 = 0 '\000' (gdb) print coord_buf $9 = {file_name = 0x7fbc354f7620 "./mysqld-bin.000003", pos = 1138} slave> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 192.168.30.203 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysqld-bin.000003 Read_Master_Log_Pos: 1013 Relay_Log_File: mysqld-relay-bin.000003 Relay_Log_Pos: 451 Relay_Master_Log_File: mysqld-bin.000003 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1013 Relay_Log_Space: 1730 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 101201206 Master_UUID: 8af32317-efe9-11e2-99f4-080027afd7da Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4 Executed_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4 Auto_Position: 1 1 row in set (0.00 sec) slave> stop slave; Query OK, 0 rows affected (0.05 sec) (gdb) info breakpoints Num Type Disp Enb Address What 1 breakpoint keep y 0x00000000008a76e3 in mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*) at /export/home/pb2/build/sb_0-8905638-1365165249.38/mysql-5.6.11/sql/rpl_master.cc:1147 breakpoint already hit 35 times (gdb) disable breakpoint 1 (gdb) c Continuing. slave> start slave; Query OK, 0 rows affected (0.01 sec) mysql> insert into t1 values (); Query OK, 1 row affected (0.00 sec) slave> select @@global.gtid_executed; +--------------------------------------------+ | @@global.gtid_executed | +--------------------------------------------+ | 8af32317-efe9-11e2-99f4-080027afd7da:1-4:6 | +--------------------------------------------+ 1 row in set (0.00 sec) From slave relay logs... The end of mysqld-relay-bin.000004: # at 451 #130718 13:43:00 server id 101201206 end_log_pos 1061 CRC32 0x7c6c0c1f GTID [commit=yes] SET @@SESSION.GTID_NEXT= '8af32317-efe9-11e2-99f4-080027afd7da:5'/*!*/; # at 499 #130718 13:43:00 server id 101201206 end_log_pos 1138 CRC32 0x055d5f9f Query thread_id=3 exec_time=0 error_code=0 SET TIMESTAMP=1374180180/*!*/; SET @@session.pseudo_thread_id=3/*!*/; SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/; SET @@session.sql_mode=1075838976/*!*/; SET @@session.auto_increment_increment=100, @@session.auto_increment_offset=31/*!*/; /*!\C utf8 *//*!*/; SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/; SET @@session.lc_time_names=0/*!*/; SET @@session.collation_database=DEFAULT/*!*/; BEGIN /*!*/; # at 576 #130718 13:44:33 server id 101441248 end_log_pos 630 CRC32 0x42b934c3 Rotate to mysqld-relay-bin.000005 pos: 4 DELIMITER ; # End of log file And the beginning of mysqld-relay-bin.000005: /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!40019 SET @@session.max_insert_delayed_threads=0*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 4 #130718 13:44:33 server id 101441248 end_log_pos 120 CRC32 0x5e3eea03 Start: binlog v 4, server v 5.6.11-log created 130718 13:44:33 BINLOG ' sVPoUQ/g3gsGdAAAAHgAAABAAAQANS42LjExLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAQPq Pl4= '/*!*/; # at 120 #130718 13:44:33 server id 101441248 end_log_pos 191 CRC32 0x1c9eb19e Previous-GTIDs # 8af32317-efe9-11e2-99f4-080027afd7da:1-5 # at 191 #691231 16:00:00 server id 101201206 end_log_pos 0 CRC32 0xb535c925 Rotate to mysqld-bin.000003 pos: 4 # at 239 #130718 13:35:11 server id 101201206 end_log_pos 120 CRC32 0x50cce632 Start: binlog v 4, server v 5.6.11-log created 130718 13:35:11 at startup ROLLBACK/*!*/; BINLOG ' f1HoUQ82NQgGdAAAAHgAAAAAAAQANS42LjExLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAB/UehREzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAATLm zFA= '/*!*/; # at 355 #130718 13:44:33 server id 0 end_log_pos 403 CRC32 0xc82288fe Rotate to mysqld-bin.000003 pos: 151 # at 403 #130718 13:44:33 server id 0 end_log_pos 451 CRC32 0xb6ddc136 Rotate to mysqld-bin.000003 pos: 1254 # at 451 #130718 13:45:49 server id 101201206 end_log_pos 1302 CRC32 0x1f20da1f GTID [commit=yes] SET @@SESSION.GTID_NEXT= '8af32317-efe9-11e2-99f4-080027afd7da:6'/*!*/; # at 499
[18 Jul 2013 21:17]
Kolbe Kegel
While this is certainly a pretty artificial way to reproduce this bug, it does show that if the connection the slave and master is severed between the time a slave has received the GTID_LOG_NEXT event and the corresponding QUERY_EVENT, the slave may skip a transaction. Here's a couple lines from SHOW SLAVE STATUS after restarting the slave: Retrieved_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-6 Executed_Gtid_Set: 8af32317-efe9-11e2-99f4-080027afd7da:1-4:6 Note that Retrieved_Gtid_Set includes 8af32317-efe9-11e2-99f4-080027afd7da:5 while Executed_Gtid_Set does not. Perhaps the issue here is as simple as the slave's Retrieved_Gtid_Set counter being advanced before the entire transaction is *actually* received by the slave and written to the relay log? As can be seen in the relay log excerpts included above, the content of 8af32317-efe9-11e2-99f4-080027afd7da:5 was never actually written to the slave's binary log.
[22 Jul 2013 15:11]
MySQL Verification Team
Hello Kolbe, Thank you for the bug report and reproducible test case. I was able to reproduce this 1/10 times. Thanks, Umesh
[22 Jul 2013 17:55]
MySQL Verification Team
MySQL bugs team would like to thank very much Mr. Kolbe Kegel, for his contribution.
[22 Jul 2013 18:13]
Kolbe Kegel
Umesh, can you please share the details of how you repeated this bug? Did you run mysqld under gdb or did you reproduce it more organically by applying a workload to the master and then somehow severing the connection to the slave? I'm interested to know how/why you are able to reproduce this only 1 out of 10 times. Regardless, losing transactions for 1 of every 10 connection failures seems still like a very severe problem, so I am glad to see that this bug is Verified and will be getting attention from Oracle!
[23 Jul 2013 10:43]
MySQL Verification Team
Hello Kolbe, Yes, I ran mysqld(master) under gdb and I followed exactly what you outlined in your earlier note(18 Jul 20:51), except that I killed/started slave server in steps (6)/(8) instead of STOP SLAVE/START SLAVE on slave. Retrieved_Gtid_Set: 99691660-f0a6-11e2-9ec5-0800272e2cce:6-22 Executed_Gtid_Set: 99691660-f0a6-11e2-9ec5-0800272e2cce:6-21 // slave mysql> select @@global.gtid_executed; +-------------------------------------------+ | @@global.gtid_executed | +-------------------------------------------+ | 99691660-f0a6-11e2-9ec5-0800272e2cce:6-21 | +-------------------------------------------+ Note that Retrieved_Gtid_Set includes 99691660-f0a6-11e2-9ec5-0800272e2cce:6-22 while Executed_Gtid_Set does not and despite SLAVE start/stop couple of times .Executed_Gtid_Set didn't show last event which was shown in Retrieved_Gtid_Set.. I also, noticed "99691660-f0a6-11e2-9ec5-0800272e2cce:6-22" was logged into the relay log(mistakenly all datadir overwritten in my next attempt to reproduce this again). But I was able to repeat this behavior only once and in later attempts Executed_Gtid_Set and Retrieved_Gtid_Set were matching all the time. Thanks, Umesh
[7 Aug 2013 12:28]
MySQL Verification Team
// 69943 - repeatable test case: How to repeat: 1. Setup master with gtid. 2. Create table t1 create table t1 (id int auto_increment primary key, value int, value2 varchar(10000)) engine=innodb; 3. Run a very long transaction on master (generating hundreds of megabytes of binary logs). I used a simple perl script tran_insert.pl as attached. perl tran_insert.pl 5613 root 1 500000 500000 (insert 500000 rows within one transaction, into table t1. This script generates 500MB binlog) 4. Run a couple of short transactions on master perl tran_insert.pl 5613 root 500001 501000 1 (insert 1 row by 1 transaction, 1000 rows in total) 5. Setup a slave with gtid. Lower network bandwidth will make it easy to repeat the bug. change master to master_host='master_host', master_port=5613, master_user='repl', master_auto_position=1; 6. Start replication, then stop immediately start slave io_thread; select sleep(3); stop slave io_thread; (make sure not to fetch all 500MB binary logs. Decrease sleep seconds appropriately) 7. Restart replication start slave io_thread; 8. Check relay logs or table t1. You'll be able to verify that the entire large transaction (500000 rows) is lost.
[25 Sep 2013 22:57]
Kolbe Kegel
This issue still exists in MySQL 5.6.14.
[30 Oct 2013 9:08]
Venkatesh Duggirala
Thank you for the bug report, duplicate of Bug #69943
[22 Feb 2014 1:53]
Arvind Sharma
This issue still exists in MySQL Community 5.6.16. It's getting more frequent in my setup (CentOS 6.5, MySQL 5.6.16). This server is a slave and only used for backups (not used even to query any of the tables). During nightly backups using Percona Xtrabackup, when the SQL thread is stopped momentarily, transactions are skipped and eventually replication breaks. I think this should be a high priority bug.