Bug #83983 innodb fail to recover the prepared xa transaction
Submitted: 28 Nov 2016 8:50 Modified: 18 Feb 2019 16:20
Reporter: dennis gao Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.13 OS:CentOS (6.5)
Assigned to: CPU Architecture:Any
Tags: xa

[28 Nov 2016 8:50] dennis gao
Description:
We use sysbench to test the failover of mysql semi-sync replication with XA and gtid for mysql5.7.

We use kill -9 to kill the master, then do xa recover on slave, and commit all can recovered xa transaction on slave.
Then start the killed master, and let it change master to the slave. 
Then we get the following replication error:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.16.90.12
                  Master_User: root
                  Master_Port: 13311
                Connect_Retry: 60
              Master_Log_File: s12-bin.000001
          Read_Master_Log_Pos: 1389155
               Relay_Log_File: s247-relay-bin.000003
                Relay_Log_Pos: 1465
        Relay_Master_Log_File: s12-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1397
                   Last_Error: Error 'XAER_NOTA: Unknown XID' on query. Default database: ''. Query: 'XA COMMIT X'3933342d302d322d31',X'',1'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 598946
              Relay_Log_Space: 791925
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1397
               Last_SQL_Error: Error 'XAER_NOTA: Unknown XID' on query. Default database: ''. Query: 'XA COMMIT X'3933342d302d322d31',X'',1'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 13311
                  Master_UUID: 66c93c46-aca4-11e6-bf95-c81f66de7923
             Master_Info_File: /usr/local/mysql57/data13310/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 161118 04:53:18
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 66c93c46-aca4-11e6-bf95-c81f66de7923:1-2182
            Executed_Gtid_Set: 66c93c46-aca4-11e6-bf95-c81f66de7923:1-6,
781f8f7f-aca3-11e6-a3f7-5254006a2711:1-1292
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

And we check the binlog of the master, we can find the xa prepare of that xid:

# at 603168
#161118  4:51:58 server id 13310  end_log_pos 603233 CRC32 0xb8cdb1f2   GTID    last_committed=1263     sequence_number=1279
SET @@SESSION.GTID_NEXT= '781f8f7f-aca3-11e6-a3f7-5254006a2711:1279'/*!*/;
# at 603233
#161118  4:51:58 server id 13310  end_log_pos 603343 CRC32 0xfb782c4d   Query   thread_id=147   exec_time=0     error_code=0
SET TIMESTAMP=1479462718/*!*/;
XA START X'3933342d302d322d31',X'',1
/*!*/;
# at 603343
#161118  4:51:58 server id 13310  end_log_pos 603471 CRC32 0x1d45d355   Query   thread_id=147   exec_time=0     error_code=0
use `test`/*!*/;
SET TIMESTAMP=1479462718/*!*/;
UPDATE test.norm_table SET c2 = c2 + 1 where c1 = 2489
/*!*/;
# at 603471
#161118  4:51:58 server id 13310  end_log_pos 603638 CRC32 0x11cad517   Query   thread_id=147   exec_time=0     error_code=0
SET TIMESTAMP=1479462718/*!*/;
INSERT INTO dbscale.TRANSACTIONS_SQL_2 VALUES ('934-0-2-1', 0, '1-934', 1, 'par2_rep:rep')
/*!*/;
# at 603638
#161118  4:51:58 server id 13310  end_log_pos 603739 CRC32 0xcdec8031   Query   thread_id=147   exec_time=0     error_code=0
SET TIMESTAMP=1479462718/*!*/;
XA END X'3933342d302d322d31',X'',1
/*!*/;
# at 603739
#161118  4:51:58 server id 13310  end_log_pos 603784 CRC32 0x4cb842e0   XA PREPARE X'3933342d302d322d31',X'',1
XA PREPARE X'3933342d302d322d31',X'',1
/*!*/; 

-------------------------------------------------------------------------------------------------------------------------------------------------

but after the master start-up, we can not find the xid (934-0-2-1) in the xa recover:
mysql> xa recover;
+----------+--------------+--------------+-----------+
| formatID | gtrid_length | bqual_length | data      |
+----------+--------------+--------------+-----------+
|        1 |            9 |            0 | 927-0-2-1 |
|        1 |            9 |            0 | 933-0-2-1 |
|        1 |            9 |            0 | 931-0-2-1 |
|        1 |            9 |            0 | 926-0-2-1 |
|        1 |            9 |            0 | 863-0-2-1 |
|        1 |            9 |            0 | 930-0-2-1 |
|        1 |            9 |            0 | 922-0-2-1 |
|        1 |            9 |            0 | 921-0-2-1 |
|        1 |            9 |            0 | 906-0-2-1 |
+----------+--------------+--------------+-----------+
9 rows in set (0.00 sec) 

-----------------------------------------------------------------------------------------------------------------------------------

And the mysql conf file can be found in the attached file.

----------------------------------------------------------------------------------------------------------------------

why the prepared xa transaction missing in the xa recover?

How to repeat:
We use a middle layer sharding program before the mysql, which will assign the xa id for each transaction.
Then we use the sysbench to press the middle layer sharding program.

We can reproduce this bug in our test environment.

But we are not sure how to reproduce it directly by using sysbench without our middle layer sharding program.
[28 Nov 2016 8:50] dennis gao
-------------------------------------------------------------------------------------------------------------------------------------------------

During the master start-up, we can find the following log:

2016-11-18T09:53:09.955426Z 0 [Note] Found 12 prepared transaction(s) in InnoDB
2016-11-18T09:53:09.955606Z 0 [Warning] Found 12 prepared XA transactions
2016-11-18T09:53:09.955633Z 0 [Note] Crash recovery finished.
2016-11-18T09:53:09.955777Z 0 [Note] InnoDB: Starting recovery for XA transactions...
2016-11-18T09:53:09.955805Z 0 [Note] InnoDB: Transaction 49261 in prepared state after recovery
2016-11-18T09:53:09.955823Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.955840Z 0 [Note] InnoDB: Transaction 49257 in prepared state after recovery
2016-11-18T09:53:09.955857Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.955873Z 0 [Note] InnoDB: Transaction 49253 in prepared state after recovery
2016-11-18T09:53:09.955889Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.955906Z 0 [Note] InnoDB: Transaction 49252 in prepared state after recovery
2016-11-18T09:53:09.955922Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.955972Z 0 [Note] InnoDB: Transaction 49251 in prepared state after recovery
2016-11-18T09:53:09.955997Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.956014Z 0 [Note] InnoDB: Transaction 49240 in prepared state after recovery
2016-11-18T09:53:09.956030Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956048Z 0 [Note] InnoDB: Transaction 49238 in prepared state after recovery
2016-11-18T09:53:09.956064Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956080Z 0 [Note] InnoDB: Transaction 49236 in prepared state after recovery
2016-11-18T09:53:09.956138Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956159Z 0 [Note] InnoDB: Transaction 49234 in prepared state after recovery
2016-11-18T09:53:09.956176Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956192Z 0 [Note] InnoDB: Transaction 49225 in prepared state after recovery
2016-11-18T09:53:09.956208Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956225Z 0 [Note] InnoDB: Transaction 49216 in prepared state after recovery
2016-11-18T09:53:09.956241Z 0 [Note] InnoDB: Transaction contains changes to 1 rows
2016-11-18T09:53:09.956257Z 0 [Note] InnoDB: Transaction 49156 in prepared state after recovery
2016-11-18T09:53:09.956293Z 0 [Note] InnoDB: Transaction contains changes to 2 rows
2016-11-18T09:53:09.956313Z 0 [Note] InnoDB: 12 transactions in prepared state after recovery
2016-11-18T09:53:09.956327Z 0 [Note] Found 12 prepared transaction(s) in InnoDB
2016-11-18T09:53:09.956356Z 0 [Warning] Found 12 prepared XA transactions
2016-11-18T09:53:10.053821Z 0 [Note] InnoDB: Rollback of trx with id 49263 completed
2016-11-18T09:53:10.053909Z 0 [Note] InnoDB: Rolling back trx with id 49262, 2 rows to undo
2016-11-18T09:53:10.321974Z 0 [Note] InnoDB: Rollback of trx with id 49262 completed
2016-11-18T09:53:10.322247Z 0 [Note] InnoDB: Rolling back trx with id 49239, 1 rows to undo
2016-11-18T09:53:10.554457Z 0 [Note] InnoDB: Rollback of trx with id 49239 completed
2016-11-18T09:53:10.554564Z 0 [Note] InnoDB: Rolling back trx with id 49231, 2 rows to undo
2016-11-18T09:53:10.721856Z 0 [Note] InnoDB: Rollback of trx with id 49231 completed
2016-11-18T09:53:10.721986Z 0 [Note] InnoDB: Rolling back trx with id 49132, 1 rows to undo
2016-11-18T09:53:10.838489Z 0 [Note] InnoDB: Rollback of trx with id 49132 completed
2016-11-18T09:53:10.838645Z 0 [Note] InnoDB: Rollback of non-prepared transactions completed
[28 Nov 2016 8:52] dennis gao
the mysql conf file of master

Attachment: my13310.cnf (application/octet-stream, text), 2.42 KiB.

[28 Nov 2016 11:26] dennis gao
And we test again, and get more critical problem:

1. set up replicaion, master (m1) -> slave (m2)
2. we press the xa transaction on master(m1)
3. use kill -9 to stop the m1 during the test
4. do the xa recover  on m2 and do xa rollback for all xa transaction
5. start-up the m1 and let it work as a slave of m2, and it success
6. we press the xa transaction on the new master (m2)
7. use kill -9 to stop the m2 during the test
8. do the xa recover  on m1 and do xa rollback for all xa transaction
9. we start the m2 and let it work as a slave of m1

Again we get the similar error from the "show slave status" of m2

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.16.90.247
                  Master_User: root
                  Master_Port: 13310
                Connect_Retry: 60
              Master_Log_File: s247-bin.000002
          Read_Master_Log_Pos: 2404
               Relay_Log_File: s247-relay-bin.000002
                Relay_Log_Pos: 411
        Relay_Master_Log_File: s247-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1397
                   Last_Error: Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA ROLLBACK X'33323435312d302d322d31',X'',1'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2229
              Relay_Log_Space: 832
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1397
               Last_SQL_Error: Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA ROLLBACK X'33323435312d302d322d31',X'',1'
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 13310
                  Master_UUID: 781f8f7f-aca3-11e6-a3f7-5254006a2711
             Master_Info_File: /usr/local/mysql57/data13313/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 161128 05:35:23
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 781f8f7f-aca3-11e6-a3f7-5254006a2711:15
            Executed_Gtid_Set: 66c93c46-aca4-11e6-bf95-c81f66de7923:1-6,
781f8f7f-aca3-11e6-a3f7-5254006a2711:1-15,
b3af6b30-ad4f-11e6-88ca-5254006a2711:1-18
                Auto_Position: 1
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
1 row in set (0.00 sec)

-------------------------------------------------------------------------------------------------------------------------

We analysis the s1's binlog, and find out:

The xid 33323435312d302d322d31 is generated by s1 before it be killed, and this transaction is xa committed

# at 37477
#161128 18:32:49 server id 13310  end_log_pos 37542 CRC32 0xf18132b1  GTID  last_committed=12 sequence_number=13
SET @@SESSION.GTID_NEXT= '781f8f7f-aca3-11e6-a3f7-5254006a2711:13'/*!*/;
# at 37542
#161128 18:32:49 server id 13310  end_log_pos 37656 CRC32 0x40412d00  Query thread_id=220 exec_time=359 error_code=0
SET TIMESTAMP=1480329169/*!*/;
XA START X'33323435312d302d322d31',X'',1
/*!*/;
# at 37656
#161128 18:32:49 server id 13310  end_log_pos 37783 CRC32 0x41a6f7a8  Query thread_id=220 exec_time=359 error_code=0
use `test`/*!*/;
SET TIMESTAMP=1480329169/*!*/;
UPDATE test.norm_table SET c2 = c2 + 1 where c1 = 990
/*!*/;
# at 37783
#161128 18:32:49 server id 13310  end_log_pos 37879 CRC32 0x9b1bbed2  Query thread_id=220 exec_time=359 error_code=0
SET TIMESTAMP=1480329169/*!*/;
XA END X'33323435312d302d322d31',X'',1
/*!*/;
# at 37879
#161128 18:32:49 server id 13310  end_log_pos 37926 CRC32 0xeb7ae697  XA PREPARE X'33323435312d302d322d31',X'',1
XA PREPARE X'33323435312d302d322d31',X'',1
/*!*/;
# at 37926
#161128 18:32:49 server id 13310  end_log_pos 37991 CRC32 0xc0dbc688  GTID  last_committed=13 sequence_number=14
SET @@SESSION.GTID_NEXT= '781f8f7f-aca3-11e6-a3f7-5254006a2711:14'/*!*/;
# at 37991
#161128 18:32:49 server id 13310  end_log_pos 38099 CRC32 0x6f488705  Query thread_id=220 exec_time=359 error_code=0
SET TIMESTAMP=1480329169/*!*/;
XA COMMIT X'33323435312d302d322d31',X'',1

But in the test step "8. do the xa recover  on m1 and do xa rollback for all xa transaction", we still can find this xa transaction id, and can do xa rollbck!!!
This is the rollback binlog of m1:

#161128  5:35:22 server id 13310  end_log_pos 2294 CRC32 0x7e85b1cd     GTID    last_committed=6        sequence_number=7
SET @@SESSION.GTID_NEXT= '781f8f7f-aca3-11e6-a3f7-5254006a2711:15'/*!*/;
# at 2294
#161128  5:35:22 server id 13310  end_log_pos 2404 CRC32 0xafa374ea     Query   thread_id=78    exec_time=0     error_code=0
SET TIMESTAMP=1480329322/*!*/;
SET @@session.sql_mode=1436549152/*!*/;
XA ROLLBACK X'33323435312d302d322d31',X'',1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

It means the xa committed transaction is losted after recover, but we has already set 
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
innodb_doublewrite=1
innodb_support_xa=1

The committed transaction is lost!!!
[18 Feb 2019 16:20] MySQL Verification Team
Hi,

Sorry this took while to reproduce but I did so I'm verifying this bug.

Thanks for the report
Bogdan