Description:
If you use XA transactions and a lock wait timeout or deadlock occurs for the SQL thread on the replication slave, the automatic retry will not work.
The reason is that while the SQL thread will do a ROLLBACK, it will not roll the XA transaction back. This means that when you retry the transaction, the first event will be "XA START ..." which is invalid as the XA transaction is already in progress giving the error:
LAST_ERROR_MESSAGE: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1'
How to repeat:
master> CREATE TABLE t1 (id int unsigned NOT NULL PRIMARY KEY, val int DEFAULT 0) ENGINE=InnoDB;
Query OK, 0 rows affected (0.05 sec)
master> INSERT INTO t1 VALUES (1, 0), (2, 0), (3, 0), (4, 0);
Query OK, 4 rows affected (0.02 sec)
Records: 4 Duplicates: 0 Warnings: 0
slave> START TRANSACTION;
Query OK, 0 rows affected (0.01 sec)
slave> SELECT * FROM t1 FOR UPDATE;
+----+------+
| id | val |
+----+------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
+----+------+
4 rows in set (0.00 sec)
-- Keep the slave transaction open
master> XA START '1';
Query OK, 0 rows affected (0.00 sec)
master> UPDATE t1 SET val = 1 WHERE id = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
master> XA END '1';
Query OK, 0 rows affected (0.00 sec)
master> XA PREPARE '1';
Query OK, 0 rows affected (0.00 sec)
master> XA COMMIT '1';
Query OK, 0 rows affected (0.01 sec)
-- Wait for a little longer than innodb_wait_lock_timeout (defaults to 50 seconds)
slave> SELECT * FROM performance_schema.replication_applier_status_by_worker\G
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 0
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_SEEN_TRANSACTION: 87b55630-9b1c-11e6-9930-08002715584a:40
LAST_ERROR_NUMBER: 1399
LAST_ERROR_MESSAGE: Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1'
LAST_ERROR_TIMESTAMP: 2016-10-28 09:29:04
1 row in set (0.00 sec)
From the slave's error log:
2016-10-27T22:29:04.833600Z 2 [Warning] Slave SQL for channel '': Could not execute Update_rows event on table db1.t1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log ol7-bin.000004, end_log_pos 1281, Error_code: 1205
2016-10-27T22:29:04.833792Z 2 [Note] Slave SQL for channel '': Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is partial transaction left on relay log because of restarting IO thread with auto-positioning protocol. Error_code: 0
2016-10-27T22:29:04.833848Z 2 [ERROR] Slave SQL for channel '': Error 'XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state' on query. Default database: 'db1'. Query: 'XA START X'31',X'',1', Error_code: 1399
2016-10-27T22:29:04.833858Z 2 [Warning] Slave: XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state Error_code: 1399
2016-10-27T22:29:04.833861Z 2 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'ol7-bin.000004' position 1031
Suggested fix:
If an XA transaction is in progress, the XA transaction must also be rolled back before retrying.