MySQL Bugs: #69873: Replication stop with "Error in Xid_log

Bug #69873	Replication stop with "Error in Xid_log_event: Commit could not be completed"
Submitted:	30 Jul 2013 16:17	Modified:	2 Sep 2014 14:57
Reporter:	Alexander Du	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S2 (Serious)
Version:	5.6.12, 5.6.13	OS:	Linux (CentOS 5.7 x86_64)
Assigned to:		CPU Architecture:	Any
Tags:	replication, stop

Description:
I have had several stops of replication with following messages in error.log
Server run as slave and as master for another slave in chain of replications.
This happend after upgrade from 5.5.28 to 5.6.12
Master runing MySQL 5.5.28

2013-07-30 16:53:50 32726 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0
2013-07-30 16:53:50 32726 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'log-bin.012468' position 408143139

show slave status\G
...
Connect_Retry: 60
              Master_Log_File: log-bin.012471
          Read_Master_Log_Pos: 72952613
               Relay_Log_File: relay-bin.013534
                Relay_Log_Pos: 408143296
        Relay_Master_Log_File: log-bin.012468
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: Error in Xid_log_event: Commit could not be completed, ''
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 408143139
              Relay_Log_Space: 3295080657
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error: Error in Xid_log_event: Commit could not be completed, ''
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 3
                  Master_UUID:
             Master_Info_File: master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 130730 16:53:50
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0

How to repeat:
Problem occurs randomly

Upgrade master to 5.6.12 did not change anything. The error remains the same

This is what I see in error log in my case:

2013-08-13 22:04:08 27852 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0
2013-08-13 22:04:08 27852 [Warning] Slave: Unknown table 'gm.tmp_rank_res' Error_code: 1051
2013-08-13 22:04:08 27852 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000377' position 416194823

where `gm.tmp_rank_res` is a temporary table which is created on master in one of my stored procedures

We are seeing the same after upgrading from 5.6.10 to 5.6.12, running on FreeBSD 8.3. No mention of temporary tables in our logs:

2013-08-16 23:21:49 55009 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0
2013-08-16 23:21:49 55009 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db02-bin.006200' position 2364009690

This happens at random intervals and requires a 'start slave' to recover. Until this issue is fixed, we need to run periodic 'start slave' commands through cron to avoid being woken up in the middle of the night..

Hello!

Thank you for the report.
I'm able to reproduce this on slave box..

How to repeat:

if the slave thread is in "Waiting for commit lock" state and if you kill that thread then it can be observed..

slave1 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+
| Id | User        | Host      | db   | Command | Time | State                            | Info             |
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+
| 70 | msandbox    | localhost | NULL | Query   |    0 | init                             | show processlist |
| 71 | system user |           | NULL | Connect |  425 | Waiting for master to send event | NULL             |
| 72 | system user |           | NULL | Connect |  109 | Waiting for commit lock          | NULL             |
+----+-------------+-----------+------+---------+------+----------------------------------+------------------+

slave1 [localhost] {msandbox} ((none)) > kill 72;
Query OK, 0 rows affected (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------+------+---------+------+----------------------------------+-------------------+
| Id | User        | Host      | db   | Command | Time | State                            | Info              |
+----+-------------+-----------+------+---------+------+----------------------------------+-------------------+
| 70 | msandbox    | localhost | NULL | Query   |  919 | init                             | show slave status |
| 71 | system user |           | NULL | Connect | 1353 | Waiting for master to send event | NULL              |
| 72 | system user |           | NULL | Killed  | 1037 | Reading event from the relay log | NULL              |
| 73 | msandbox    | localhost | NULL | Query   |  911 | init                             | show slave status |
| 74 | msandbox    | localhost | NULL | Query   |  904 | init                             | show slave status |
| 75 | msandbox    | localhost | NULL | Killed  |  895 | init                             | show slave status |
| 78 | msandbox    | localhost | NULL | Query   |    0 | init                             | show processlist  |
+----+-------------+-----------+------+---------+------+----------------------------------+-------------------+
7 rows in set (0.00 sec)

slave1 [localhost] {msandbox} ((none)) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19278
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 27786
               Relay_Log_File: mysql_sandbox19279-relay-bin.000009
                Relay_Log_Pos: 2945
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error: Error in Xid_log_event: Commit could not be completed, ''
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 27786
              Relay_Log_Space: 9949
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error: Error in Xid_log_event: Commit could not be completed, ''
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: eb76981c-1c58-11e3-bbb4-324defc71619
             Master_Info_File: /data/ushastry/sandboxes/rsandbox_5_6_13/node1/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 130913 23:14:29
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.61 sec)

slave1 [localhost] {msandbox} ((none)) >

// from error log

2013-09-13 22:43:46 19001 [ERROR] Slave SQL: Error in Xid_log_event: Commit could not be completed, '', Error_code: 0

## Workaround
restart slave

Exact steps are in http://bugs.mysql.com/70307

With the latest build and steps from http://bugs.mysql.com/70307 , this time I see below error and text instead of ''.
Restarting slave would fix the issue but endup losing the transaction which was executing at the time of killing the thread which was in "Waiting for commit lock" state.
Would shortly add conf file used to the bug.

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin.000003
          Read_Master_Log_Pos: 2163
               Relay_Log_File: 70307S-relay-bin.000002
                Relay_Log_Pos: 1634
        Relay_Master_Log_File: master-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1317
                   Last_Error: Error in Xid_log_event: Commit could not be completed, 'Query execution was interrupted'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2163
              Relay_Log_Space: 1808
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1317
               Last_SQL_Error: Error in Xid_log_event: Commit could not be completed, 'Query execution was interrupted'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: da0467b5-03d7-11e4-a140-00163e44510c
             Master_Info_File: /tmp/70307S/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 140705 05:39:38
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)

mysql> select version();
+---------------------------------------+
| version()                             |
+---------------------------------------+
| 5.6.20-enterprise-commercial-advanced |
+---------------------------------------+
1 row in set (0.00 sec)

The following was added to the 5.6.21 and 5.7.5 changelog with commit 4114:
When an SQL thread which was waiting for a commit lock was killed and restarted it caused a transaction to be skipped on slave. This fix ensures that thread positions are correctly persisted and transactions resume at the correct position.

Posted by developer:
 
The following was added to the 5.6.21 and 5.7.5 changelog with commit 4114:
When an SQL thread which was waiting for a commit lock was killed and restarted it caused a transaction to be skipped on slave. This fix ensures that thread positions are correctly persisted and transactions resume at the correct position.

revno: 6118
committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com>
branch nick: Bug17450876_mysql-5.6
timestamp: Tue 2014-08-19 09:57:08 +0530
message:
  Bug#17450876:REPLICATION STOP WITH "ERROR IN XID_LOG_EVENT:
  COMMIT COULD NOT BE COMPLETED"
  
  Problem:
  ========
  When a SQL thread which is waiting for commit lock is killed
  and restarted it causes a transaction to be skipped on slave.
  
  Analysis:
  ========
  when SQL thread is at a state where a DML is waiting for MDL
  commit lock if SQL thread is killed then position are getting
  updated in memory. i.e in the existing design positions are
  flushed before the actual commit because of this rli object
  will have its positions updated but the transaction is yet
  to be committed.  When the SQL thread is restarted it reads
  position from the rli object and hence the last transaction
  gets skipped on slave.
  
  Fix:
  ===
  When SQL thread is killed at a stage where it is waiting for
  commit lock, the commit fails and an error is reported back
  saying "Commit could not be completed and Query execution
  was interrupted".  As part of fix SQL threads positions that
  existed before the commit are persisted and they are
  restored back on error.
  
  Similar symptoms exist in case of MTS as well. In MTS
  "The slave coordinator and worker threads are stopped,
  possibly leaving data in inconsistent state" error is
  reported. In MTS a bitmap is maintained for successful
  commits. This bit map is cleared on error and the old
  positions are retrieved from the checkpoint which points to
  old positions.

Posted by developer:
 
Reopened by merge, no new changelog entries needed. Set back to Closed.