Bug #98616 XA PREPARE/XA COMMIT/XA ROLLBACK lost if mysql crash just after binlog flush
Submitted: 15 Feb 2020 15:20 Modified: 22 Feb 2020 7:20
Reporter: dennis gao Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: XA transactions Severity:S3 (Non-critical)
Version:5.7.29 OS:Any
Assigned to: CPU Architecture:Any

[15 Feb 2020 15:20] dennis gao
Description:
For "xa commit" and "xa rollback", mysql does not use binlog as a Transaction Coordinator Log during the start-up recover.

In MYSQL_BIN_LOG::recover, the query_event of "xa commit" and "xa rollback" is ignored, and does not record the xa commit xid and xa rollback xid.

And in xarecover_handlerton, there is no code to commit/rollback the recovred prepared transaction according to binlog.

So the xa commit and xa rollback transaction may just lost if mysql crash.

For "xa prepare" and "xa commit one phase" are more complex and worse.

Take "xa prepare" for example, "xa prepare" will modify the undo page to set the "undo->state = TRX_UNDO_PREPARED" and "undo->xid   = *trx->xid" (see code trx_undo_set_state_at_prepare in trx0undo.cc). But these modifications are done after binlog flushed and may lost if mysql crash just after binlog flushed.

When mysql start-up recover, the innodb engine will only recover the transaction in prepared state, which means the undo->state of the recovered transaction must be TRX_UNDO_PREPARED (check trx_rollback_resurrected in trx0roll.cc). So if a "xa prepare" transaction only flush binlog, it will just be rollback during start-up recover and lost. 

How to repeat:
Config mysql with innodb_flush_log_at_trx_commit=1 and sync_binlog=1.

Use gdb to startup mysql, such as "gdb --args ./bin/mysqld --defaults-file=my.cnf", and set a break-point in ordered_commit (binlog.cc) and just after "DEBUG_SYNC(thd, "bgc_after_sync_stage_before_commit_stage");".

The do the following test:

test1:
use test;
create table t1 (c1 int) engine=innodb;
xa start '1';
insert into t1 values (1);
xa end '1';
xa prepare '1'; --------------------here gdb will catch the break-point

Then quit the gdb to simulate a mysql crash, and just restart the mysql, and login to execute "xa recover", we will not find the xid '1', but if we check the binlog, we will find the "xa prepare '1'".

For "xa commit one phase", "xa commit", "xa rollback" are similar, just set break-points in the same point as "xa prepare", and quit the gdb to simulate a mysql crash just after mysql flushed the binlog to disk; Then restart mysql, we will find "xa commit one phase", "xa commit", "xa rollback" lost, but the binlog contains the related transaction events.

Suggested fix:
For "xa commit" and "xa rollback", we should:
1. check the "xa commit" and "xa rollback" query events and maintain two sets to record the commited x-id and rollbacked x-id in MYSQL_BIN_LOG::recover.
2. in xarecover_handlerton, we should loop all xids from engines which are generated by external TM, and find-out whether the xid is in the sets of xa commit and xa rollback; if inside, do commit_by_xid or rollback_by_xid respectively.

For "xa prepare" and "xa commit one phase", we should:
1. write the "undo->xid   = *trx->xid" into unde page before binlog flushed, so that these undo page modification can be flushed into related redo log. So that the mysql can know the external xid of this xa transaction during start-up recover.
2. During the innodb start-up recover, check trx_rollback_resurrected should keep all external TM transactions which has real xid, and these transcations should be handled in server layer.
3. For MYSQL_BIN_LOG::recover, should record all the xa transaction which is prepared but not commit/rollback, as a prepared_xid_set
4. In xarecover_handlerton, we should loop all xids from engines which are generated by external TM, and find-out whether the xid is in the prepared_xid_set, if inside should do xa prepare for it; if not inside, we should check whether this transaction is a prepared xa transaction before this start-up recover, if not, just do rollback_by_xid.
[15 Feb 2020 15:52] dennis gao
Patch to fix these bugs.

Attachment: xa_recover_bugs_for_prepare_commit_rollback-v3.diff (text/x-patch), 28.36 KiB.

[15 Feb 2020 15:53] dennis gao
The patch is developed on mysql 5.7.25, but can be patched on 5.7.29 and work.

The mysql-test test case will be added later.
[16 Feb 2020 16:26] dennis gao
Patch with test case file.

Attachment: xa_recover_bugs_for_prepare_commit_rollback-v5.diff (text/x-patch), 28.47 KiB.

[16 Feb 2020 16:31] dennis gao
Plz ignore the previous patch, and check this one for test case file.

Attachment: xa_recover_bugs_for_prepare_commit_rollback-v6.diff (text/x-patch), 38.29 KiB.

[17 Feb 2020 17:15] MySQL Verification Team
Hi Mr. gao,

Thank you for your bug report.

I have repeated the behaviour. However, I am in doubt whether this is intended behaviour or not. If it is intended behaviour, then your report would constitute a feature request. I am going to leave that decision to the team in charge of XA transactions.

Verified as reported.
[21 Feb 2020 5:05] MySQL Verification Team
Hello Dennis gao ,

Thank you very much for your patch contribution, we appreciate it!
In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.

regards,
Umesh
[22 Feb 2020 7:20] dennis gao
Hello Umesh Shastry,

Thanks for the response.
I have sent the signed OCA application to oracle-ca_us@oracle.com, and waiting for the approval. 

regards,

dennis gao
[22 Feb 2020 7:53] MySQL Verification Team
Thank you Dennis gao.

regards,
Umesh
[22 Feb 2020 13:45] MySQL Verification Team
Bug #98709 marked as duplicate of this one
[5 Mar 2020 11:36] dennis GAO
The patch after the approved of OCA

Attachment: xa_recover_bugs_for_prepare_commit_rollback-v7.diff (text/x-patch), 38.49 KiB.

[5 Mar 2020 13:51] MySQL Verification Team
Thank you Mr. Gao !!!!!
[7 Mar 2020 5:25] dennis GAO
Fix some mistake

Attachment: xa_recover_bugs_for_prepare_commit_rollback-v8.diff (text/x-patch), 38.87 KiB.

[7 Mar 2020 5:31] dennis GAO
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa_recover_bugs_for_prepare_commit_rollback-v8.diff (text/x-patch), 38.87 KiB.

[9 Mar 2020 13:01] MySQL Verification Team
Thank you Mr. gao.
[24 Nov 2020 9:27] Nico Liu
Hi Mr.Gao.

I am glad to view your patch. But I have two questions I'd like to ask you.

1. in trx_set_undo_xid_for_prepare_for_mysql 

TrxInInnoDB	trx_in_innodb(trx, true);

trx->in_innodb will be set to TRX_FORCE_ROLLBACK_DISABLE

What is the purpose of this? Or is it necessary?

2. in trx_recover_for_mysql

if (trx_state_eq(trx, TRX_STATE_PREPARED) || (trx->xid && !trx->xid->get_my_xid())) {

in some cases, the xid of a common transaction is not NULL, I think we can change it to

if (trx_state_eq(trx, TRX_STATE_PREPARED) || 
    (trx->xid && !trx->xid->get_my_xid() && !trx->xid->is_null())){

These are my two questions. I look forward to your reply.

Best Regards.
[24 Nov 2020 10:34] dennis GAO
Hello Nico Liu,

For "trx->in_innodb will be set to TRX_FORCE_ROLLBACK_DISABLE", I think it is not necessary, trx_set_undo_xid_for_prepare_for_mysql func is before doing prepare, not the prepare.  I think the transaction still can be rollback in this point.

For the second one, I think you are correct. "!trx->xid->is_null()" should be added.

regards,

dennis gao
[24 Nov 2020 10:36] dennis GAO
Postfix for the preview patch "xa_recover_bugs_for_prepare_commit_rollback-v8.diff". 

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: postfix-xa_recover_bugs_for_prepare_commit_rollback.diff (text/x-patch), 1.39 KiB.

[24 Nov 2020 10:37] dennis GAO
postfix-xa_recover_bugs_for_prepare_commit_rollback.diff will fix one bug of preview patch:

If a trx in trx_sys->rw_trx_list has been rollbacked in trx_rollback_resurrected during start-up
     recover due to trx->xid->get_gtrid_length()==0, xarecover_handlerton should skip this trx rather
     than do hton->rollback_by_xid, which will lead mysql crash.
[24 Nov 2020 13:34] Nico Liu
Hello Dennis gao ,

Thanks for your reply.
In the previous process, binlog is written before innodb. As a result, the crash safe capability is unavailable. Your patch writes undo before writing binlog, but does not modify subsequent actions in the previous process. As a result, innodb_trx increases. Can the sequence of the previous process be changed? Or what's the idea of your patch?

regards,
nico Liu.
[25 Nov 2020 15:47] dennis GAO
Hello Nico Liu,

I see your name maybe a Chinese name, maybe you can check my Chinese blog for the problem analysis: http://blog.sina.com.cn/s/blog_4673e6030102zdtd.html

So that you will known why need write xid to undo before flush binlog.
[30 Nov 2020 14:39] MySQL Verification Team
Thank you for your discussion.
[17 Dec 2020 4:45] nico liu
Hello Dennis gao ,

I'm sorry to ask you a few more questions. If an active undo is created in the InnoDB and the binlog is rotated during the recovery process, the patch will be degraded and invalid. Active transactions will be rolled back in xareocver. It does not seem to satisfy the rules of group submission.

regards,
Nico liu