Bug #90959 XA transactions can lock forever if a gap lock is also taken on the slave
Submitted: 22 May 2018 6:40 Modified: 11 Jun 2018 11:23
Reporter: Andreas Wederbrand
Status: Won't fix
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 5.7.22 OS: Any
Assigned to:
CPU Architecture: Any

[22 May 2018 6:40] Andreas Wederbrand
Description:
Environment:
A simple master-slave replication environment where the slave accepts local deletes, using STATEMENT based replication.
Or a master-master / active-active replication environment using STATEMENT based replication.

Error:
Two parallel XA transactions are started and prepared (XA START, XA END, XA PREPARE) on the master but not yet committed (XA COMMIT).
Both try to "insert ignore" the same row.
On the slave, that row has been deleted.

When the two transactions reach the slave, both claim a gap lock. If, between the two XA PREPARE statements and the XA COMMIT statements, a transaction on the slave also locks this gap, and both XA COMMIT statements arrive before the lock wait timeout expires, the replication thread hangs forever.

The only way forward is to manually run XA RECOVER and XA COMMIT for those transactions and then restart the slave.
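
As a rough sketch of that manual recovery, statements along these lines could be run on the slave. The xid values 'xa1' and 'xa2' are hypothetical placeholders (use whatever XA RECOVER actually returns), and reading "restart the slave" as restarting the replication threads is an assumption; a full server restart may be what is needed instead.

```sql
-- Run on the slave. The xids 'xa1' and 'xa2' are hypothetical placeholders.

-- List XA transactions that are in the PREPARED state.
XA RECOVER;

-- Commit each prepared transaction, using the xid values reported above.
XA COMMIT 'xa1';
XA COMMIT 'xa2';

-- Assumption: "restart the slave" means restarting the replication threads.
STOP SLAVE;
START SLAVE;
```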

How to repeat:

Create a table with a gap between two rows on the slave.
In two separate XA transactions on the master, run "insert ignore" into that gap.
Before issuing XA COMMIT on either of them, take an X-lock on the gap on the slave.

When the XA COMMIT statements are issued, the replication thread hangs forever, and the only way forward is to XA RECOVER and XA COMMIT the first transaction on the slave.
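
The sequence described here might look roughly like the following. This is only an illustrative sketch, not the script from the gist: the table t, its rows, the xids 'xa1' and 'xa2', and the locking read used to take the gap lock are all assumptions, and both servers are assumed to run with binlog_format=STATEMENT.

```sql
-- Hypothetical schema and xids, for illustration only.

-- On the master: create a table with three rows (replicated to the slave).
CREATE TABLE t (id INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t VALUES (1), (2), (3);

-- On the slave: delete the middle row locally, leaving a gap between 1 and 3.
DELETE FROM t WHERE id = 2;

-- On the master, connection A: prepare but do not commit.
XA START 'xa1';
INSERT IGNORE INTO t VALUES (2);
XA END 'xa1';
XA PREPARE 'xa1';

-- On the master, connection B: prepare but do not commit.
XA START 'xa2';
INSERT IGNORE INTO t VALUES (2);
XA END 'xa2';
XA PREPARE 'xa2';

-- On the slave: a locking read intended to take an X lock on the gap
-- (assumed form). Leave this transaction open.
START TRANSACTION;
SELECT * FROM t WHERE id = 2 FOR UPDATE;

-- On the master, connection A:
XA COMMIT 'xa1';

-- On the master, connection B:
XA COMMIT 'xa2';
```

Replication progress on the slave can then be watched with SHOW SLAVE STATUS.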

Steps to reproduce this can be found in this gist:
https://gist.github.com/wederbrand/e423048ea8bf7d81bca730dee9583c22

It creates all files needed and recreates the error.
It uses official docker images and nothing more.

Suggested fix:
One option would be to revert to the old behavior of replicating XA transactions as normal (non-XA) transactions once they are committed, or at least to offer that behavior as an option.
[23 May 2018 5:19] MySQL Verification Team
Hello Andreas,

Thank you for the report and test case.

Thanks,
Umesh
[23 May 2018 5:21] MySQL Verification Team
test results

Attachment: 90959.results (application/octet-stream, text), 56.63 KiB.

[25 May 2018 15:55] Sveta Smirnova
Test case without X-lock on the gap on slave

Attachment: rpl_bug90959.test (application/octet-stream, text), 1.48 KiB.

[25 May 2018 15:55] Sveta Smirnova
Option file for master, copy same file for slave

Attachment: rpl_bug90959-master.opt (application/octet-stream, text), 26 bytes.

[25 May 2018 15:56] Sveta Smirnova
The bug is repeatable even without an X-lock on the gap on the slave, and even if XA COMMIT is issued on the same connection that started the XA transaction. It looks more serious than at first glance.
[28 May 2018 7:35] Andreas Wederbrand
Sveta is correct. I've updated my gist if anyone wants to test this using docker instead.
[11 Jun 2018 11:23] Venkatesh Duggirala
Post by Developer:
==================

Statement Based Replication + XA is not supported. Please see 
https://dev.mysql.com/doc/refman/8.0/en/xa-restrictions.html (last paragraph).

Also, starting with version 5.7.20, the MySQL server generates an
unsafe-statement warning if an XA statement is executed in statement-based replication mode.

Please see https://bugs.mysql.com/bug.php?id=85639 for more information.

Regards,
Venkatesh.