Bug #69873 | Replication stop with "Error in Xid_log_event: Commit could not be completed" | ||
---|---|---|---|
Submitted: | 30 Jul 2013 16:17 | Modified: | 2 Sep 2014 14:57 |
Reporter: | Alexander Du | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Row Based Replication ( RBR ) | Severity: | S2 (Serious) |
Version: | 5.6.12, 5.6.13 | OS: | Linux (CentOS 5.7 x86_64) |
Assigned to: | CPU Architecture: | Any | |
Tags: | replication, stop |
[30 Jul 2013 16:17]
Alexander Du
[8 Aug 2013 10:18]
Alexander Du
Upgrade master to 5.6.12 did not change anything. The error remains the same
[14 Aug 2013 7:12]
Boris Uchitel
This is what I see in error log in my case: 2013-08-13 22:04:08 27852 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0 2013-08-13 22:04:08 27852 [Warning] Slave: Unknown table 'gm.tmp_rank_res' Error_code: 1051 2013-08-13 22:04:08 27852 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000377' position 416194823 where `gm.tmp_rank_res` is a temporary table which is created on master in one of my stored procedures
[22 Aug 2013 12:09]
NOT_FOUND NOT_FOUND
We are seeing the same after upgrading from 5.6.10 to 5.6.12, running on FreeBSD 8.3. No mention of temporary tables in our logs: 2013-08-16 23:21:49 55009 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0 2013-08-16 23:21:49 55009 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db02-bin.006200' position 2364009690 This happens at random intervals and requires a 'start slave' to recover. Until this issue is fixed, we need to run periodic 'start slave' commands through cron to avoid being woken up in the middle of the night..
[12 Sep 2013 19:50]
MySQL Verification Team
Hello! Thank you for the report. I'm able to reproduce this on slave box.. How to repeat: if the slave thread is in "Waiting for commit lock" state and if you kill that thread then it can be observed.. slave1 [localhost] {msandbox} ((none)) > show processlist; +----+-------------+-----------+------+---------+------+----------------------------------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-------------+-----------+------+---------+------+----------------------------------+------------------+ | 70 | msandbox | localhost | NULL | Query | 0 | init | show processlist | | 71 | system user | | NULL | Connect | 425 | Waiting for master to send event | NULL | | 72 | system user | | NULL | Connect | 109 | Waiting for commit lock | NULL | +----+-------------+-----------+------+---------+------+----------------------------------+------------------+ slave1 [localhost] {msandbox} ((none)) > kill 72; Query OK, 0 rows affected (0.00 sec) slave1 [localhost] {msandbox} ((none)) > show processlist; +----+-------------+-----------+------+---------+------+----------------------------------+-------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-------------+-----------+------+---------+------+----------------------------------+-------------------+ | 70 | msandbox | localhost | NULL | Query | 919 | init | show slave status | | 71 | system user | | NULL | Connect | 1353 | Waiting for master to send event | NULL | | 72 | system user | | NULL | Killed | 1037 | Reading event from the relay log | NULL | | 73 | msandbox | localhost | NULL | Query | 911 | init | show slave status | | 74 | msandbox | localhost | NULL | Query | 904 | init | show slave status | | 75 | msandbox | localhost | NULL | Killed | 895 | init | show slave status | | 78 | msandbox | localhost | NULL | Query | 0 | init | show processlist | +----+-------------+-----------+------+---------+------+----------------------------------+-------------------+ 7 rows in set (0.00 sec) slave1 [localhost] {msandbox} ((none)) > show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: rsandbox Master_Port: 19278 Connect_Retry: 60 Master_Log_File: mysql-bin.000001 Read_Master_Log_Pos: 27786 Relay_Log_File: mysql_sandbox19279-relay-bin.000009 Relay_Log_Pos: 2945 Relay_Master_Log_File: mysql-bin.000001 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Error in Xid_log_event: Commit could not be completed, '' Skip_Counter: 0 Exec_Master_Log_Pos: 27786 Relay_Log_Space: 9949 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Error in Xid_log_event: Commit could not be completed, '' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_UUID: eb76981c-1c58-11e3-bbb4-324defc71619 Master_Info_File: /data/ushastry/sandboxes/rsandbox_5_6_13/node1/data/master.info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 130913 23:14:29 Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: Auto_Position: 0 1 row in set (0.61 sec) slave1 [localhost] {msandbox} ((none)) > // from error log 2013-09-13 22:43:46 19001 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0 ## Workaround restart slave
[12 Sep 2013 19:52]
MySQL Verification Team
Exact steps are in http://bugs.mysql.com/70307
[3 Jul 2014 9:40]
MySQL Verification Team
With the latest build and steps from http://bugs.mysql.com/70307 , this time I see below error and text instead of ''. Restarting slave would fix the issue but endup losing the transaction which was executing at the time of killing the thread which was in "Waiting for commit lock" state. Would shortly add conf file used to the bug. mysql> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 127.0.0.1 Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: master-bin.000003 Read_Master_Log_Pos: 2163 Relay_Log_File: 70307S-relay-bin.000002 Relay_Log_Pos: 1634 Relay_Master_Log_File: master-bin.000003 Slave_IO_Running: Yes Slave_SQL_Running: No Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 1317 Last_Error: Error in Xid_log_event: Commit could not be completed, 'Query execution was interrupted' Skip_Counter: 0 Exec_Master_Log_Pos: 2163 Relay_Log_Space: 1808 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: NULL Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 1317 Last_SQL_Error: Error in Xid_log_event: Commit could not be completed, 'Query execution was interrupted' Replicate_Ignore_Server_Ids: Master_Server_Id: 1 Master_UUID: da0467b5-03d7-11e4-a140-00163e44510c Master_Info_File: /tmp/70307S/master.info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 140705 05:39:38 Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: Auto_Position: 0 1 row in set (0.00 sec) mysql> select version(); +---------------------------------------+ | version() | +---------------------------------------+ | 5.6.20-enterprise-commercial-advanced | +---------------------------------------+ 1 row in set (0.00 sec)
[2 Sep 2014 14:57]
David Moss
The following was added to the 5.6.21 and 5.7.5 changelog with commit 4114: When an SQL thread which was waiting for a commit lock was killed and restarted it caused a transaction to be skipped on slave. This fix ensures that thread positions are correctly persisted and transactions resume at the correct position.
[2 Sep 2014 15:40]
David Moss
Posted by developer: The following was added to the 5.6.21 and 5.7.5 changelog with commit 4114: When an SQL thread which was waiting for a commit lock was killed and restarted it caused a transaction to be skipped on slave. This fix ensures that thread positions are correctly persisted and transactions resume at the correct position.
[25 Sep 2014 12:17]
Laurynas Biveinis
revno: 6118 committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com> branch nick: Bug17450876_mysql-5.6 timestamp: Tue 2014-08-19 09:57:08 +0530 message: Bug#17450876:REPLICATION STOP WITH "ERROR IN XID_LOG_EVENT: COMMIT COULD NOT BE COMPLETED" Problem: ======== When a SQL thread which is waiting for commit lock is killed and restarted it causes a transaction to be skipped on slave. Analysis: ======== when SQL thread is at a state where a DML is waiting for MDL commit lock if SQL thread is killed then position are getting updated in memory. i.e in the existing design positions are flushed before the actual commit because of this rli object will have its positions updated but the transaction is yet to be committed. When the SQL thread is restarted it reads position from the rli object and hence the last transaction gets skipped on slave. Fix: === When SQL thread is killed at a stage where it is waiting for commit lock, the commit fails and an error is reported back saying "Commit could not be completed and Query execution was interrupted". As part of fix SQL threads positions that existed before the commit are persisted and they are restored back on error. Similar symptoms exist in case of MTS as well. In MTS "The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state" error is reported. In MTS a bitmap is maintained for successful commits. This bit map is cleared on error and the old positions are retrieved from the checkpoint which points to old positions.
[30 Sep 2014 13:44]
David Moss
Posted by developer: Reopened by merge, no new changelog entries needed. Set back to Closed.