MySQL Bugs: #83588: Slave retries (slave_transaction_retries) does not work with XA transactions

Bug #83588	Slave retries (slave_transaction_retries) does not work with XA transactions
Submitted:	27 Oct 2016 22:57	Modified:	15 Dec 2016 12:58
Reporter:	Jesper wisborg Krogh	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: XA transactions	Severity:	S2 (Serious)
Version:	5.7.13	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
If you use XA transactions and a lock wait timeout or deadlock occurs for the SQL thread on the replication slave, the automatic retry will not work.

The reason is that while the SQL thread will do a ROLLBACK, it will not roll the XA transaction back. This means that when you retry the transaction, the first event will be "XA START ..." which is invalid as the XA transaction is already in progress giving the error:

  LAST_ERROR_MESSAGE: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1'

How to repeat:
master> CREATE TABLE t1 (id int unsigned NOT NULL PRIMARY KEY, val int DEFAULT 0) ENGINE=InnoDB;
Query OK, 0 rows affected (0.05 sec)

master> INSERT INTO t1 VALUES (1, 0), (2, 0), (3, 0), (4, 0);
Query OK, 4 rows affected (0.02 sec)
Records: 4  Duplicates: 0  Warnings: 0

slave> START TRANSACTION;
Query OK, 0 rows affected (0.01 sec)

slave> SELECT * FROM t1 FOR UPDATE;
+----+------+
| id | val  |
+----+------+
|  1 |    0 |
|  2 |    0 |
|  3 |    0 |
|  4 |    0 |
+----+------+
4 rows in set (0.00 sec)

-- Keep the slave transaction open

master> XA START '1';
Query OK, 0 rows affected (0.00 sec)

master> UPDATE t1 SET val = 1 WHERE id = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> XA END '1';
Query OK, 0 rows affected (0.00 sec)

master> XA PREPARE '1';
Query OK, 0 rows affected (0.00 sec)

master> XA COMMIT '1';
Query OK, 0 rows affected (0.01 sec)

-- Wait for a little longer than innodb_wait_lock_timeout (defaults to 50 seconds)

slave> SELECT * FROM performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
         CHANNEL_NAME: 
            WORKER_ID: 0
            THREAD_ID: NULL
        SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 87b55630-9b1c-11e6-9930-08002715584a:40
    LAST_ERROR_NUMBER: 1399
   LAST_ERROR_MESSAGE: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1'
 LAST_ERROR_TIMESTAMP: 2016-10-28 09:29:04
1 row in set (0.00 sec)

From the slave's error log:

2016-10-27T22:29:04.833600Z 2 [Warning] Slave SQL for channel '': Could not execute Update_rows event on table db1.t1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log ol7-bin.000004, end_log_pos 1281, Error_code: 1205
2016-10-27T22:29:04.833792Z 2 [Note] Slave SQL for channel '': Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is partial transaction left on relay log because of restarting IO thread with auto-positioning protocol. Error_code: 0
2016-10-27T22:29:04.833848Z 2 [ERROR] Slave SQL for channel '': Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1', Error_code: 1399
2016-10-27T22:29:04.833858Z 2 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state Error_code: 1399
2016-10-27T22:29:04.833861Z 2 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'ol7-bin.000004' position 1031

Suggested fix:
If an XA transaction is in progress, the XA transaction must also be rolled back before retrying.

Posted by developer:
 
The workaround is to start the SQL thread again - this will be safe as the lock wait timeout happens at XA PREPARE, i.e. at the point where the
changes are written to the binary log. So it is not possible for another transaction to be interleaved with the part that's being retried.

Fixed in 5.7.18, 8.0.1 under the heading of

Bug#24764800 REPLICATION FAILING ON SLAVE WITH XAER_RMFAIL ERROR

Posted by developer:
 
Correction - fixed in 5.7.17 and 8.0.1. Here's the changelog entry:

Replication: When using XA transactions, if a lock wait
timeout or deadlock occurred for the applier (SQL) thread
on a replication slave, the automatic retry did not work.
The cause was that while the SQL thread would do a
rollback, it would not roll the XA transaction back. This
meant that when the transaction was retried, the first
event was XA START which was invalid as the XA
transaction was already in progress, leading to an
XAER_RMFAIL error. (Bug #24764800)
References: See also: Bug #24923091, Bug #24966941.