Bug #94803 rpl sql_thread may break due to XAER_RMFAIL error for unfinished XA transaction
Submitted: 28 Mar 2019 2:47 Modified: 5 Mar 13:45
Reporter: dennis gao
Status: Verified
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.24 OS:Any
Assigned to: CPU Architecture:Any

[28 Mar 2019 2:47] dennis gao
Description:
If the slave MySQL server is left with a partial relay log that contains an unfinished XA transaction, the slave's sql_thread stops with the following error:
"Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: ''. Query: 'ROLLBACK'".

How to repeat:
1. Run a large XA transaction on the master; call this transaction trx1.
2. Shut down the slave before it finishes writing the relay log for trx1, so that the relay log for trx1 is partial.
3. Restart the slave without auto-starting replication, and start only the sql_thread.
4. Run show slave status\G to check the slave error.

Suggested fix:
In coord_handle_partial_binlogged_transaction, MySQL directly injects a "ROLLBACK" Query_log_event when it needs to finish the partial transaction:

static bool coord_handle_partial_binlogged_transaction(Relay_log_info *rli, 
                                                       const Log_event *ev)
{
  DBUG_ENTER("coord_handle_partial_binlogged_transaction");
  /*
    This function is called holding the rli->data_lock.
    We must return it still holding this lock, except in the case of returning
    error.
  */
  mysql_mutex_assert_owner(&rli->data_lock);
  THD *thd= rli->info_thd;

  if (!rli->curr_group_seen_begin)
  {
    DBUG_PRINT("info",("Injecting QUERY(BEGIN) to rollback worker"));
    Log_event *begin_event= new Query_log_event(thd,
                                                STRING_WITH_LEN("BEGIN"),
                                                true, /* using_trans */
                                                false, /* immediate */
                                                true, /* suppress_use */
                                                0, /* error */
                                                true /* ignore_command */);
    ((Query_log_event*) begin_event)->db= "";
    begin_event->common_header->data_written= 0;
    begin_event->server_id= ev->server_id;

If the transaction is an XA transaction, this breaks the slave sql_thread.

To fix this:
MySQL should check whether the ongoing relay log transaction is an XA transaction. If it is, it should inject the following query events:
1. xa end
2. xa rollback
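
To illustrate why the plain ROLLBACK fails and why the two XA statements are needed, here is a minimal standalone sketch. It is not MySQL source code: XaState and apply_statement are hypothetical names, and the model is an assumption that only covers the states mentioned in this report (it leaves out PREPARED and the other XA states).

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified model of the XA states discussed in this report.
// Real MySQL has more states (for example PREPARED); they are omitted here.
enum class XaState { NOTR, ACTIVE, IDLE };

// Returns true if the statement is accepted in the current state and, when it
// is accepted, advances the state. Returns false when the server would reject
// the statement (in this model, the XAER_RMFAIL case).
bool apply_statement(XaState &state, const std::string &stmt) {
  if (stmt == "ROLLBACK") {
    // A plain ROLLBACK is only accepted when no XA transaction is open.
    return state == XaState::NOTR;
  }
  if (stmt == "XA END") {
    // XA END moves an ACTIVE transaction to IDLE.
    if (state != XaState::ACTIVE) return false;
    state = XaState::IDLE;
    return true;
  }
  if (stmt == "XA ROLLBACK") {
    // In this model XA ROLLBACK is only accepted from IDLE.
    if (state != XaState::IDLE) return false;
    state = XaState::NOTR;
    return true;
  }
  return false;  // Any other statement is outside the scope of this sketch.
}
```

Starting from XaState::ACTIVE, apply_statement rejects "ROLLBACK" and leaves the state unchanged, which corresponds to the error shown in the description. Applying "XA END" and then "XA ROLLBACK" is accepted and returns the state to NOTR.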
[28 Mar 2019 2:50] dennis gao
The code in the previous post does not include the ROLLBACK event; please see this post:

static bool coord_handle_partial_binlogged_transaction(Relay_log_info *rli,
                                                       const Log_event *ev)
{
....

  DBUG_PRINT("info",("Injecting QUERY(ROLLBACK) to rollback worker"));
  Log_event *rollback_event= new Query_log_event(thd,
                                                 STRING_WITH_LEN("ROLLBACK"),
                                                 true, /* using_trans */
                                                 false, /* immediate */
                                                 true, /* suppress_use */
                                                 0, /* error */
                                                 true /* ignore_command */);

.....
[28 Mar 2019 13:59] MySQL Verification Team
Hi,

Thank you for your bug report.

I have searched the entire source code of 5.7.24, but I was not able to find the function coord_handle_partial_binlogged_transaction().

Please, let me know the name of the source code file where it is located. It is possible that my `grep` program is broken .....
[28 Mar 2019 14:55] dennis gao
In sql/rpl_slave.cc .
[28 Mar 2019 15:10] MySQL Verification Team
Hi,

I am verifying your bug, solely on the basis of the code analysis. Indeed, in case of the XA transactions, those two statements are missing.

Verified as reported.
[26 Feb 16:27] dennis gao
Patch to fix this bug:

Attachment: xa_rollback_rpl-v4.diff (text/x-patch), 6.85 KiB.

[26 Feb 16:27] dennis gao
Suggested fix:
    1. In coord_handle_partial_binlogged_transaction, check whether the transaction is an external XA transaction.
    2. If it is, check the xid_state.
    3. If xid_state is XA_ACTIVE, generate and apply the Query_log_event "XA END", then generate and apply the Query_log_event "XA ROLLBACK".
    4. If xid_state is XA_IDLE, generate and apply the Query_log_event "XA ROLLBACK".

    When the XA ROLLBACK event is executed, the xid_state should be checked again after ha_rollback_low has run:
    1. If xid_state is XA_ACTIVE, invoke gtid_state->update_on_rollback to avoid an unexpected modification of executed_gtids.
    2. If xid_state is XA_IDLE, invoke gtid_state->update_on_commit.
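
The decision logic in the steps above can be sketched roughly as follows. This is a standalone illustration, not the attached patch: XidState, GtidAction, plan_rollback_events and gtid_action_after_rollback are hypothetical names, and the behaviour for states other than XA_ACTIVE and XA_IDLE is an assumption, since the steps do not cover them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the xid_state values named in the steps above.
enum class XidState { XA_NOTR, XA_ACTIVE, XA_IDLE };

// Which gtid_state call the steps above ask for after ha_rollback_low.
enum class GtidAction { UPDATE_ON_ROLLBACK, UPDATE_ON_COMMIT, NOT_SPECIFIED };

// Returns the statements to inject, in order, to finish a partial transaction.
std::vector<std::string> plan_rollback_events(bool is_external_xa,
                                              XidState state) {
  if (!is_external_xa) {
    return {"ROLLBACK"};  // Existing behaviour for non-XA transactions.
  }
  switch (state) {
    case XidState::XA_ACTIVE:
      return {"XA END", "XA ROLLBACK"};
    case XidState::XA_IDLE:
      return {"XA ROLLBACK"};
    default:
      return {};  // Assumption: other states are not covered by the steps.
  }
}

// Maps the xid_state observed after ha_rollback_low to the gtid_state call.
GtidAction gtid_action_after_rollback(XidState state) {
  switch (state) {
    case XidState::XA_ACTIVE:
      return GtidAction::UPDATE_ON_ROLLBACK;
    case XidState::XA_IDLE:
      return GtidAction::UPDATE_ON_COMMIT;
    default:
      return GtidAction::NOT_SPECIFIED;  // Assumption: not described above.
  }
}
```

For example, plan_rollback_events(true, XidState::XA_ACTIVE) yields "XA END" followed by "XA ROLLBACK", while plan_rollback_events(true, XidState::XA_IDLE) yields only "XA ROLLBACK".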
[26 Feb 16:31] dennis gao
To use this patch, please remove the "-#ifndef DBUG_OFF" for XID::xid_to_str in xa.cc.

This patch was developed on MySQL 5.7.25 and can also be used with MySQL 5.7.29.
[26 Feb 16:34] dennis gao
Patch with test case file

Attachment: xa_rollback_rpl-v5.diff (text/x-patch), 11.44 KiB.

[27 Feb 12:51] MySQL Verification Team
Hello Mr. gao,

Thank you for your contribution.

However, we can not use your patch until you have signed the OCA agreement.

In the next comment you will find all the info that is needed for that formality.

Thanks in advance.
[27 Feb 12:51] MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it!

In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.
[28 Feb 1:32] dennis gao
Hello Sinisa Milivojevic,

Thanks for the response!
I sent the signed OCA application to oracle-ca_us@oracle.com on 2020-2-21 and have not yet received any feedback.
Do I need to send it again, or where can I check whether my OCA has been approved?

Regards,

Dennis GAO
[28 Feb 12:44] MySQL Verification Team
Hi Mr. Gao,

I do not see you among the list of the official contributors.

Please follow the procedure once again and, if it does not work, let us know and we shall enquire for you.

Thanks.
[28 Feb 13:03] dennis gao
Hi Sinisa Milivojevic,

I have re-sent the application to oracle-ca_us@oracle.com with the email title "Oracle Contributor Agreement of XiaoxinGAO".

Regards!

Dennis GAO
[28 Feb 13:08] MySQL Verification Team
Hi Mr. Gao,

Let us know if you get a response.

I do hope you have signed the contract. No need to answer that if you have.

If you do not get response in 10 - 15 days, also let us know ........
[5 Mar 11:33] dennis GAO
The patch, after approval of the OCA

Attachment: xa_rollback_rpl-v5.diff (text/x-patch), 11.44 KiB.

[5 Mar 13:45] MySQL Verification Team
Thank you, Mr. gao .....
[7 Mar 5:32] dennis GAO
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: xa_rollback_rpl-v5.diff (text/x-patch), 11.44 KiB.

[9 Mar 13:03] MySQL Verification Team
Thank you Mr. GAO.