Bug #94803 rpl sql_thread may break due to XAER_RMFAIL error for unfinished XA transaction
Submitted: 28 Mar 2:47 Modified: 28 Mar 15:10
Reporter: dennis gao
Status: Verified
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: 5.7.24 OS: Any
Assigned to: CPU Architecture: Any

[28 Mar 2:47] dennis gao
Description:
If a slave MySQL server writes a partial relay log that ends in the middle of an unfinished XA transaction, the sql_thread of this slave will stop with the following error:
" Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: ''. Query: 'ROLLBACK'".

How to repeat:
1. Run a big XA transaction on the master MySQL server; we call this transaction trx1.
2. Shut down the slave MySQL server before it finishes writing the relay log of trx1, so that the relay log of trx1 is partial.
3. Restart the slave MySQL server without auto-starting the slave, and start only the sql_thread.
4. Use show slave status\G to check the slave error.

Suggested fix:
In coord_handle_partial_binlogged_transaction, MySQL directly adds a "ROLLBACK" Query_log_event when it needs to finish the partial transaction:

static bool coord_handle_partial_binlogged_transaction(Relay_log_info *rli, 
                                                       const Log_event *ev)
{
  DBUG_ENTER("coord_handle_partial_binlogged_transaction");
  /*
    This function is called holding the rli->data_lock.
    We must return it still holding this lock, except in the case of returning
    error.
  */
  mysql_mutex_assert_owner(&rli->data_lock);
  THD *thd= rli->info_thd;

  if (!rli->curr_group_seen_begin)
  {
    DBUG_PRINT("info",("Injecting QUERY(BEGIN) to rollback worker"));
    Log_event *begin_event= new Query_log_event(thd,
                                                STRING_WITH_LEN("BEGIN"),
                                                true, /* using_trans */
                                                false, /* immediate */
                                                true, /* suppress_use */
                                                0, /* error */
                                                true /* ignore_command */);
    ((Query_log_event*) begin_event)->db= "";
    begin_event->common_header->data_written= 0;
    begin_event->server_id= ev->server_id;

If the transaction is an XA transaction, this causes the slave SQL thread to fail.

Fix:
MySQL should check whether the ongoing relay log transaction is an XA transaction. If it is an XA transaction, it should add the following query events:
1. XA END
2. XA ROLLBACK
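
To illustrate the suggested fix, here is a minimal, standalone sketch. It is not server code: XaState, closing_statements, apply and closes_cleanly are hypothetical names, and the state model is a simplification that assumes the XA states described in the MySQL manual (a plain ROLLBACK is rejected while an XA transaction is open, XA END moves ACTIVE to IDLE, and XA ROLLBACK is accepted from IDLE or PREPARED). The xid string is also an assumption; in the server it would have to come from the XA START event of the partial group.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified XA transaction states (illustrative model, not the server's types).
enum class XaState { NOTR, ACTIVE, IDLE, PREPARED };

// Hypothetical helper: the statements to inject in order to close a partial
// transaction group. For an XA transaction, a plain ROLLBACK is replaced by
// XA END followed by XA ROLLBACK, as proposed in this report.
std::vector<std::string> closing_statements(bool is_xa, const std::string &xid) {
  if (!is_xa) return {"ROLLBACK"};
  return {"XA END " + xid, "XA ROLLBACK " + xid};
}

static bool starts_with(const std::string &s, const std::string &prefix) {
  return s.compare(0, prefix.size(), prefix) == 0;
}

// Toy model of the state checks: applies one statement and returns false if
// the statement would be rejected in the current state.
bool apply(XaState &state, const std::string &stmt) {
  if (stmt == "ROLLBACK") {
    // Assumed rule: plain ROLLBACK is only accepted outside an XA transaction.
    return state == XaState::NOTR;
  }
  if (starts_with(stmt, "XA END")) {
    if (state != XaState::ACTIVE) return false;
    state = XaState::IDLE;
    return true;
  }
  if (starts_with(stmt, "XA ROLLBACK")) {
    if (state != XaState::IDLE && state != XaState::PREPARED) return false;
    state = XaState::NOTR;
    return true;
  }
  return false;  // Statement not covered by this model.
}

// True if every statement is accepted and no XA transaction remains open.
bool closes_cleanly(XaState state, const std::vector<std::string> &stmts) {
  for (const std::string &stmt : stmts) {
    if (!apply(state, stmt)) return false;
  }
  return state == XaState::NOTR;
}

// Usage example: the partial relay log leaves the XA transaction ACTIVE.
void check_suggested_fix() {
  // Current behaviour: the injected plain ROLLBACK is rejected.
  assert(!closes_cleanly(XaState::ACTIVE, closing_statements(false, "")));
  // Suggested behaviour: XA END + XA ROLLBACK closes the transaction.
  assert(closes_cleanly(XaState::ACTIVE, closing_statements(true, "'trx1'")));
}
```

In this model the plain ROLLBACK fails for the same reason as the XAER_RMFAIL error above, while the two XA statements bring the transaction back to the NOTR state.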
[28 Mar 2:50] dennis gao
The code in the previous post did not include the ROLLBACK event; please check this post:

static bool coord_handle_partial_binlogged_transaction(Relay_log_info *rli,
                                                       const Log_event *ev)
{
....

  DBUG_PRINT("info",("Injecting QUERY(ROLLBACK) to rollback worker"));
  Log_event *rollback_event= new Query_log_event(thd,
                                                 STRING_WITH_LEN("ROLLBACK"),
                                                 true, /* using_trans */
                                                 false, /* immediate */
                                                 true, /* suppress_use */
                                                 0, /* error */
                                                 true /* ignore_command */);

.....
[28 Mar 13:59] Sinisa Milivojevic
Hi,

Thank you for your bug report.

I have searched the entire source code of 5.7.24, but I was not able to find the function coord_handle_partial_binlogged_transaction().

Please, let me know the name of the source code file where it is located. It is possible that my `grep` program is broken .....
[28 Mar 14:55] dennis gao
In sql/rpl_slave.cc.
[28 Mar 15:10] Sinisa Milivojevic
Hi,

I am verifying your bug solely on the basis of code analysis. Indeed, in the case of XA transactions, those two statements are missing.

Verified as reported.