Bug #94130 XA COMMIT may lead replication broken
Submitted: 30 Jan 7:22 Modified: 12 Apr 18:39
Reporter: phoenix Zhang Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: XA transactions Severity:S3 (Non-critical)
Version:5.7.23 OS:Any
Assigned to: CPU Architecture:Any

[30 Jan 7:22] phoenix Zhang
Description:
In master-slave replication, when execute xa transactions, if multiple clients execute XA COMMIT at the same time, then the replication may broken.

How to repeat:
use semi-sync will repeat 100%.

First, A is master server, and B is slave server. In A, B, all open the semi-sync. In the master server, the semi-sync timeout set large enough.

mysql> set global rpl_semi_sync_master_enabled=1, rpl_semi_sync_master_timeout=31560000000;

Then, In the master server A, execute below SQL to start a XA PREPARE transaction.

mysql> xa start '11';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values(0,0);
Query OK, 1 row affected (0.00 sec)

mysql> xa end '11';
Query OK, 0 rows affected (0.00 sec)

mysql> xa prepare '11';
Query OK, 0 rows affected (0.04 sec)

mysql> quit
Bye

Then,  B net-out. A will not COMMIT any transaction, cause it should wait semi-ack FROM B.
Open multiple clients, all execute below command:
mysql> xa commit '11';

All clients will not return OK packet to client.

Then B net-in again,  all return OK packet to client. This time, check the binlog events from A, it will have multiple same XA COMMIT

mysql> show binlog events in 'mysql-bin.000002';
+------------------+------+----------------+-----------+-------------+--------------------------------------------------------------------+
| Log_name         | Pos  | Event_type     | Server_id | End_log_pos | Info                                                               |
+------------------+------+----------------+-----------+-------------+--------------------------------------------------------------------+
| mysql-bin.000002 |  891 | Gtid           |        12 |         956 | SET @@SESSION.GTID_NEXT= 'e9eca58e-6437-11e6-b93d-d8cb8aee97da:11' |
| mysql-bin.000002 |  956 | Query          |        12 |        1045 | XA START X'3131',X'',1                                             |
| mysql-bin.000002 | 1045 | Table_map      |        12 |        1091 | table_id: 111 (test.t1)                                            |
| mysql-bin.000002 | 1091 | Write_rows     |        12 |        1135 | table_id: 111 flags: STMT_END_F                                    |
| mysql-bin.000002 | 1135 | Query          |        12 |        1222 | XA END X'3131',X'',1                                               |
| mysql-bin.000002 | 1222 | XA_prepare     |        12 |        1260 | XA PREPARE X'3131',X'',1                                           |
| mysql-bin.000002 | 1260 | Gtid           |        12 |        1325 | SET @@SESSION.GTID_NEXT= 'e9eca58e-6437-11e6-b93d-d8cb8aee97da:12' |
| mysql-bin.000002 | 1325 | Query          |        12 |        1415 | XA COMMIT X'3131',X'',1                                            |
| mysql-bin.000002 | 1415 | Gtid           |        12 |        1480 | SET @@SESSION.GTID_NEXT= 'e9eca58e-6437-11e6-b93d-d8cb8aee97da:13' |
| mysql-bin.000002 | 1480 | Query          |        12 |        1570 | XA COMMIT X'3131',X'',1                                            |
| mysql-bin.000002 | 1570 | Gtid           |        12 |        1635 | SET @@SESSION.GTID_NEXT= 'e9eca58e-6437-11e6-b93d-d8cb8aee97da:14' |
| mysql-bin.000002 | 1635 | Query          |        12 |        1725 | XA COMMIT X'3131',X'',1                                            |
| mysql-bin.000002 | 1725 | Gtid           |        12 |        1790 | SET @@SESSION.GTID_NEXT= 'e9eca58e-6437-11e6-b93d-d8cb8aee97da:15' |
| mysql-bin.000002 | 1790 | Query          |        12 |        1880 | XA COMMIT X'3131',X'',1                                            |
+------------------+------+----------------+-----------+-------------+--------------------------------------------------------------------+
28 rows in set (0.00 sec)

Now, we check the slave status from B, it will be sql-error, which say unknown xid.

The reason why sql-error in slave thread is: 
when it execute the first XA COMMIT, it will success. But for the next one, the xid will not exist any more.

Suggested fix:
Before XA COMMIT success after commit-stage, it should block all later XA COMMIT/XA ROLLBACK with the same xid.
[30 Jan 13:37] Bogdan Kecman
Hi,

I'm not sure I follow properly here. From what I see you did on master

XA START
XA END
XA PREPARE

and on SLAVE?! you did XA COMMIT

of course it will cause the error, this is not a bug, please explain what exactly are you doing and what you expect

thanks
Bogdan
[30 Jan 15:04] phoenix Zhang
All i execute sql on the master server. The slave just do nothing.

But what u see above, the master may generate multiple XA COMMIT binlog events with the same xid. And the slave will replicate those events and execute.  So, after it execute the first one, the next one will throw Unknown XID sql error
[30 Jan 15:09] phoenix Zhang
The step is:
1. master open the semi-sync master enabled, and set the timeout big enough
2. master execute
   XA START
   INSERT/UPDATE/DELETE
   XA END
   XA PREPARE
   quit
3. let slave net-out
4. In master, open multiple client, all execute
   XA COMMIT
   this time, all will not return 
5. let slave net-in again
   now, all XA COMMIT in step 4 will return. And when u check binlog events, all  will write the events in binlog file.
6. And, now, the slave is slave sql-error
[30 Jan 15:23] Bogdan Kecman
the "let slave net-out" in step3 - you disconnect slave from the network or you stop slave?
[31 Jan 2:30] phoenix Zhang
i disconnect slave from network.
[1 Feb 0:35] Bogdan Kecman
Thanks for the report I verified the behavior and I do believe it is a bug.

all best
Bogdan
[12 Apr 18:39] Paul Dubois
Posted by developer:
 
Fixed in 8.0.18.

In replication scenarios, if multiple clients concurrently executed
XA COMMIT or XA ROLLBACK statements that used the same XID value,
replication inconsistency could occur.