Bug #87560 XA PREPARE log order error in replication and binlog recovery
Submitted: 28 Aug 2017 9:14 Modified: 13 Dec 2017 7:44
Reporter: Wei Zhao (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: XA transactions Severity:S2 (Serious)
Version:mysql-5.7.17 OS:Any
Assigned to: CPU Architecture:Any
Tags: binlog, recovery, replication, xa

[28 Aug 2017 9:14] Wei Zhao
Description:
The XA PREPARE command is executed by first writing and flushing binlog, then writing storage engine logs, which will only be flushed at beginning of next group commit.This is wrong and different from regular transaction commits where storage engine logs are written and flushed, then binlog of the group commit transactions are written. A simple test case to prove it wrong is listed in below section.

How to repeat:
An example to prove it's wrong is this: set up a master-slave replication, and start a lot of concurrent XA transaction branches, and then kill the mysqld process regularly, overtime, one could see the slave is blocked.

The reason is that when master's mysqld is killed, there can be a XA transaction branch X1 whose binlog is written onto the binlog file and happen to be received by the slave, but X1's engine log isn't written yet, so the slave will execute X1's binlog and lock the rows X1 has updated/inserted/deleted, and will have prepared transaction on slave. But when the master's mysqld is restarted, its innodb doesn't know about X1 at all, thus X1 doesn't exist for the master(although it's on the binlog file, that doens't matter), so the master can execute other transactions that would otherwise conflict with X1, and when binlogs of such transactions reach the slave, the slave worker threads will be blocked by X1 when executing such conflicting tranctions, and overtime the slave's all worker threads are blocked by X1.

Suggested fix:
I will upload my patches to fix the whole issues in a second.

The fix is to write engine logs first, then write binlogs, just like the procedures of commiting regular transactions. this is what the xa_prepare_swap_order.diff patch does. However, I did so in a brutal force way, there can be more elegant and consistent ways, probably by modifying the binlog_prepare() function.

And this isn't sufficent yet, because binlog recovery doesn't consider XA transactions at all. We have to rollback engine(e.g. innodb) transactions that don't exist on binlog at server startup, and keep those existing in binlog file as prepared, so that they can be committed later. And this is what the recover_xa_trx_in_binlog_recovery.diff  patch does.
[28 Aug 2017 9:15] Wei Zhao
flush engine log before writing binlog

Attachment: xa_prepare_swap_order.diff (application/octet-stream, text), 3.22 KiB.

[28 Aug 2017 9:16] Wei Zhao
recover xa transactions in binlog recovery

Attachment: recover_xa_trx_in_binlog_recovery.diff (application/octet-stream, text), 8.84 KiB.

[28 Aug 2017 9:20] Wei Zhao
add version&tags.
[28 Aug 2017 17:31] MySQL Verification Team
Hi,

Thanks for the report

all best
Bogdan
[17 Nov 2017 1:57] Wei Zhao
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa_prepare_swap_order.diff (application/octet-stream, text), 3.22 KiB.

[17 Nov 2017 1:58] Wei Zhao
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: recover_xa_trx_in_binlog_recovery.diff (application/octet-stream, text), 8.84 KiB.

[13 Dec 2017 7:44] Wei Zhao
The same problem exist also for XA COMMIT ... ONE PHASE, and my above patch doesn't solve it --- because in XA COMMIT ... ONE PHASE execution, ha_prepare_low() is called instead of ha_prepare(), so I have to swap order in ha_prepare_low() in a very similar way too, to make sure engine prepare is done BEFORE binlog prepare.

I'm not gonna provide this new patch again because it's very similar and you probably will implement the fix in a more elegant way.
[13 Dec 2017 14:30] MySQL Verification Team
Please do

Thanks
Bogdan
[14 Dec 2017 1:53] Wei Zhao
this patch fixes the 'XA COMMIT ... ONE PHASE' bug I appended above.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa_cop_swap_order.diff (application/octet-stream, text), 2.04 KiB.

[14 Dec 2017 6:21] Wei Zhao
This patch has to be applied after applying the 'recover_xa_trx_in_binlog_recovery.diff' patch I submitted earlier. 

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa_cop_recovery.diff (application/octet-stream, text), 1.94 KiB.

[18 Jan 2020 6:35] phoenix Zhang
There still have some problem, when do binlog recover, it will only read the last binlog file. 

Check below case:
Session 1:
XA START 'xid1';
INSERT INTO t1 values (1);
XA END 'xid1';
XA PREPARE 'xid1';

Session 2:
FLUSH logs;

Session 1:
XA COMMIT 'xid1';

When do xa commit, mysqld crash. If now, binlog already write, but trx not commit in engine.

When restart, from engine, it can find xid1 in prepared state, but it cannot find any xa prepare event in the last binlog file. Then it will rollback this trx.
[19 Jan 2020 4:13] phoenix Zhang
another problem, this patch only use the gtrid part for compare. While, in real xid, it contains (gtrid, bqual, format_id).
[19 Apr 2021 4:25] jingbo zhao
Recently, our team(ksyun) also discovered an XA Transaction problem. When using Xtrabackup for backup, if XA transaction is executed halfway and write XA prepare to the binlog. The Xtrabackup immediately records the current binlog position, and the following will be reported when recovery the backup to establish master-slave replication. error:

Last_Error: Error 'XAER_NOTA: Unknown XID' on query. Default database: 'zbdba'. Query: 'XA COMMIT X'1a',X'a1',1'

This is because half of the XA transaction is executed on the slave.