Bug #84442 XA PREPARE inconsistent with XTRABACKUP
Submitted: 9 Jan 2017 5:00 Modified: 5 Mar 2018 14:59
Reporter: Wei Zhao (OCA)
Status: Closed
Category:MySQL Server: XA transactions Severity:S2 (Serious)
Version:5.7.16 OS:Any
Assigned to: CPU Architecture:Any
Tags: Enterprise Backup, FTWRL, XA PREPARE, XTRABACKUP

[9 Jan 2017 5:00] Wei Zhao
Description:
This bug was initially reported to Percona because it's related to Percona Xtrabackup: https://bugs.launchpad.net/percona-server/+bug/1651941

But Percona suggested I report it here because "MySQL Enterprise Backup is affected the same". Hence this bug report.

XTRABACKUP does this to make sure the innodb redo log it copies and the binlog position it notes down are consistent:

FLUSH TABLES WITH READ LOCK; ---- (1)
....
FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS; ---- (2)
....
--- copy innodb redo log files ---- (3)
SHOW MASTER STATUS; ---- (4)
UNLOCK TABLES; ---- (5)

Every transaction commit begins by acquiring the global COMMIT lock in intention-exclusive mode, and FTWRL acquires the same lock in shared mode. The FTWRL at (1) is therefore intended to ensure that, between steps (1) and (5), no transaction is prepared in InnoDB and no transaction's binlog events are flushed to the binlog file.
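The lock interaction described above can be sketched with a toy model. This is only an illustration in Python, not MySQL's metadata-locking code: the class `CommitLock`, its methods, and the owner names are hypothetical, and the compatibility table covers only the two modes mentioned in this report (shared and intention exclusive).

```python
# Toy model of the global COMMIT lock; NOT MySQL's actual MDL implementation.
# "S" = shared (taken by FTWRL), "IX" = intention exclusive (taken by a commit).
COMPATIBLE = {
    ("IX", "IX"): True,   # concurrent commits do not block each other
    ("IX", "S"): False,   # a commit must wait while FTWRL holds the lock
    ("S", "IX"): False,   # FTWRL must wait for in-flight commits
    ("S", "S"): True,
}


class CommitLock:
    """Hypothetical helper that tracks who holds the lock and in which mode."""

    def __init__(self):
        self.held = []  # list of (owner, mode) pairs

    def try_acquire(self, owner, mode):
        # Grant the request only if it is compatible with every current holder.
        if all(COMPATIBLE[(mode, held_mode)] for _, held_mode in self.held):
            self.held.append((owner, mode))
            return True
        return False

    def release(self, owner):
        self.held = [(o, m) for o, m in self.held if o != owner]


lock = CommitLock()
lock.try_acquire("backup", "S")                  # FTWRL at step (1)
commit_blocked = not lock.try_acquire("commit", "IX")
lock.release("backup")                           # UNLOCK TABLES at step (5)
commit_proceeds = lock.try_acquire("commit", "IX")
```

In this model an ordinary commit cannot proceed while the backup session holds the lock. A statement that never calls `try_acquire` is not held back at all, which is the situation the report goes on to describe for XA PREPARE.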

However, XA PREPARE doesn't acquire the COMMIT lock, yet it prepares the transaction in InnoDB and also flushes the transaction's binlog events to the binlog file (in the flush stage). This behavior can lose prepared transactions:

Suppose transaction T1 is prepared via XA PREPARE between statements (2) and (3). Statement (4) then returns a binlog position right after T1's binlog events, but T1's InnoDB redo log records are still in the redo log buffer: they were not flushed by statement (2) and so are not copied at (3). T1 is therefore lost when the DB instance is restored later, and the restored instance's InnoDB data and binlog data are inconsistent: T1 exists in the binlog but not in InnoDB.
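The scenario can be replayed with a small simulation. This is a sketch under simplifying assumptions, not server or XtraBackup code: the `Server` class and its fields are hypothetical, the redo log is reduced to a buffer and a file list, and the `takes_commit_lock` flag assumes that a fixed XA PREPARE would simply wait while FTWRL is in effect.

```python
# Toy replay of the race; all names here are hypothetical illustrations.
class Server:
    def __init__(self):
        self.redo_buffer = []  # redo records not yet written to the log files
        self.redo_files = []   # redo records present in the log files
        self.binlog = []       # transactions written to the binary log
        self.ftwrl = False     # True while FLUSH TABLES WITH READ LOCK is held

    def flush_engine_logs(self):
        # Models step (2): move buffered redo records into the log files.
        self.redo_files.extend(self.redo_buffer)
        self.redo_buffer.clear()

    def xa_prepare(self, xid, takes_commit_lock):
        # Assumption: with the commit lock, the statement waits during FTWRL.
        if takes_commit_lock and self.ftwrl:
            return False
        self.redo_buffer.append(xid)  # prepared in InnoDB, redo still buffered
        self.binlog.append(xid)       # binlog events flushed to the binlog file
        return True


def backup_with_concurrent_prepare(takes_commit_lock):
    server = Server()
    server.ftwrl = True                          # (1)
    server.flush_engine_logs()                   # (2)
    server.xa_prepare("T1", takes_commit_lock)   # other session, between (2) and (3)
    copied_redo = list(server.redo_files)        # (3)
    binlog_position = len(server.binlog)         # (4)
    server.ftwrl = False                         # (5)
    return copied_redo, server.binlog[:binlog_position]


redo_unfixed, binlog_unfixed = backup_with_concurrent_prepare(False)
redo_fixed, binlog_fixed = backup_with_concurrent_prepare(True)
```

With `takes_commit_lock=False` the backup's binlog range contains T1 while the copied redo log does not, matching the inconsistency described above. With `takes_commit_lock=True` neither contains T1, so the two stay consistent.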

I have attached a patch.

How to repeat:
As detailed above.

Suggested fix:
I have attached a patch.
[9 Jan 2017 5:01] Wei Zhao
The patch works for me.

Attachment: xa-prepare-commit-lock.2.diff (application/octet-stream, text), 2.63 KiB.

[9 Jan 2017 7:27] Umesh Shastry
Hello David Zhao,

Thank you for the report and contribution.
Please note that in order to submit contributions you must first sign the Oracle Contribution Agreement (OCA). For additional information please check http://www.oracle.com/technetwork/community/oca-486395.html.
If you have any questions, please contact the MySQL community team - http://www.mysql.com/about/contact/?topic=community

Thanks,
Umesh
[28 Mar 2017 4:59] Umesh Shastry
Hello David Zhao,

I see you have already created a patch for this. Could you please submit it as a contribution so that it can be used? In order to submit contributions you must first sign the Oracle Contribution Agreement (OCA).
For additional information please check http://www.oracle.com/technetwork/community/oca-486395.html.
If you have any questions, please contact the MySQL community team.

Thanks,
Umesh
[13 Apr 2017 16:29] Paul Dubois
Posted by developer:
 
Noted in 5.7.19, 8.0.2 changelogs.

XA PREPARE, XA ROLLBACK, and XA COMMIT for a transaction from a
disconnected session did not take a global commit lock and modified
the binary log and InnoDB redo log even when FLUSH TABLES WITH READ
LOCK was in effect. This could lead to inconsistent backups when
backup tools assumed that the server was in a read-only state.
[26 Feb 2018 9:51] Wei Zhao
I'm opening the bug to add my previous patch as a contribution. I didn't have an OCA yet in April 2017 and forgot about this bug until now; hopefully I'm not too late.
[26 Feb 2018 9:55] Wei Zhao
This patch fixes the bug.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa-prepare-commit-lock.2.diff (application/octet-stream, text), 2.63 KiB.

[26 Feb 2018 14:25] Ståle Deraas
Posted by developer:
 
Hi Wei Zhao, the bug is fixed in 5.7.19. But thanks again for the contribution...even if it came too late for us to use it.
[5 Mar 2018 14:59] Miguel Solorzano
Closing, since according to the prior comment from development this is already fixed.