Bug #91633 Replication failure (errno 1399) on update in XA tx after deadlock
Submitted: 13 Jul 2018 21:14 Modified: 28 May 2021 7:35
Reporter: Lukas Sydorowski Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.22, 5.7.33 OS:Linux
Assigned to: CPU Architecture:x86
Tags: deadlock, replication, xa

[13 Jul 2018 21:14] Lukas Sydorowski
Description:
It appears that replication may break upon executing a row update immediately after a forced transaction rollback due to a deadlock being detected while in an XA transaction.

Regardless of whether or not the offending update is considered valid on the existing connection, it is expected that replication should still continue to work despite any errors due to deadlock or transaction rollback.

How to repeat:
1. Set up two MySQL instances.

2. Set up replication from one (master) to the other (slave).

3. Set up a test database and insert test data (executed on master node):
mysql> create schema TestDatabase;
mysql> use TestDatabase;
mysql> create table Table1 (id bigint auto_increment primary key not null, str varchar(50));
mysql> insert into Table1(str) values("a");
mysql> insert into Table1(str) values("b");

4. Run the attached scripts concurrently, started at about the same time (within one second of each other) on the master node, which will execute the statements in t1.sql and t2.sql, respectively (files also attached):
./run1.sh &
./run2.sh &

5. Observe replication failure.

Slave status immediately before offending update is called:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.162.0.2
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 1
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 1775
               Relay_Log_File: dbtest2-relay-bin.000005
                Relay_Log_Pos: 1988
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: mysql.user
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1775
              Relay_Log_Space: 2237
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
                  Master_UUID: 0af4ea96-8397-11e8-99cc-42010aa20002
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 0af4ea96-8397-11e8-99cc-42010aa20002:1-10
            Executed_Gtid_Set: 0af4ea96-8397-11e8-99cc-42010aa20002:1-10
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

Slave status immediately after offending update is called:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.162.0.2
                  Master_User: replicator
                  Master_Port: 3306
                Connect_Retry: 1
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 2210
               Relay_Log_File: dbtest2-relay-bin.000005
                Relay_Log_Pos: 1988
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: mysql.user
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1399
                   Last_Error: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'TestDatabase'. Query: 'COMMIT'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1775
              Relay_Log_Space: 2672
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1399
               Last_SQL_Error: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'TestDatabase'. Query: 'COMMIT'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
                  Master_UUID: 0af4ea96-8397-11e8-99cc-42010aa20002
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 0
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 180713 19:45:23
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 0af4ea96-8397-11e8-99cc-42010aa20002:1-11
            Executed_Gtid_Set: 0af4ea96-8397-11e8-99cc-42010aa20002:1-10
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)
[13 Jul 2018 21:16] Lukas Sydorowski
run1.sh

Attachment: run1.sh (application/x-sh, text), 70 bytes.

[13 Jul 2018 21:16] Lukas Sydorowski
run2.sh

Attachment: run2.sh (application/x-sh, text), 70 bytes.

[13 Jul 2018 21:16] Lukas Sydorowski
t1.sql

Attachment: t1.sql (application/octet-stream, text), 354 bytes.

[13 Jul 2018 21:16] Lukas Sydorowski
t2.sql

Attachment: t2.sql (application/octet-stream, text), 435 bytes.

[16 Jul 2018 12:45] MySQL Verification Team
Hello Lukas,

Thank you for the report!

Thanks,
Umesh
[5 Oct 2018 23:14] Keith Larson
Running these same queries... I was able to duplicate this on Mysql to Percona 5.7 versions. 
I was also able to duplicate steps and not get replication errors 5.7 to 8.0
[30 Mar 2021 16:50] Ramakrishna Muppaneni
Hi,

I got similar issue, do we have any workaround? 
By decreasing no XA transactions does this will work as expected?
We have around 
All requestsXA commit 1200/min
All requestsXA END 1190/min
All requestsXA PREPARE 400/min

Thanks,
Rama
[28 May 2021 7:35] MySQL Verification Team
- Base bug closed with below change log.
Fixed in 5.7.35, 8.0.26.

Replication could fail if a DML statement was executed immediately
after an XA transaction was rejected or forced to rollback due to a
deadlock.