Bug #85639 XA transactions are 'unsafe' for RPL using SBR
Submitted: 27 Mar 2017 9:11 Modified: 4 Oct 2017 14:33
Reporter: João Gramacho Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[27 Mar 2017 9:11] João Gramacho
Description:
Before WL#6860, a GTID DML transaction represented the full transaction,
containing the events to change the database content, to prepare the 
transaction on the storage engine and to commit it.

WL#6860 made possible to have DML transaction being prepared with one
GTID and committed with other.

Having two XA transactions committed in parallel on master may lead to
have them prepared in the inverse order on the slave applier.

An example of such situation is described in the test case present in
"How To Repeat".

The test case also shows that, when the transactions are applied using
RBR instead of SBR statements, as the storage engine acquired no gap
locks for pure RBR changes when the transaction isolation level used by
the applier thread is READ COMMITTED, there would be no blocking issues.

Unfortunately when the transactions are logged using SBR, the gap locks
create a condition that the replication applier cannot solve easily.

When the workload is not based on XA transactions, it might be possible
to identify the transaction that is blocking, waiting for it to be
committed or just killing it in a way that the replication applier will
retry it (just like it is done for deadlocks).

The problem is that when we use XA transactions, the already prepared
transaction that is blocking consumed a GTID. Rolling it back would not
guarantee it to be retried. And on retrying it, as its GTID was already
consumed, it will be skipped.

How to repeat:
Execute the test case below using mixed binary log format:

#
# Consider:
#   E=execute, P=prepare, C=commit;
#   1=first transaction, 2=second transaction;
#
# Master does: E1, E2, P2, P1, C1, C2
# Slave does:  E2, P2, E1, P1, C1, C2
#
--source include/have_debug.inc
# The test case only make sense for RBR or mixed mode binary log format
--source include/have_binlog_format_mixed_or_row.inc
--source include/master-slave.inc

--source include/rpl_connection_slave.inc
# To hit the issue, we need to split the data in two pages.
# This global variable will help us.
SET @saved_innodb_limit_optimistic_insert_debug =
    @@GLOBAL.innodb_limit_optimistic_insert_debug;
SET @@GLOBAL.innodb_limit_optimistic_insert_debug = 2;

# Let's generate the workload on the master
--source include/rpl_connection_master.inc
CREATE TABLE t1 (
  c1 INT NOT NULL,
  KEY(c1)
) ENGINE=InnoDB;

CREATE TABLE t2 (
  c1 INT NOT NULL,
  FOREIGN KEY(c1) REFERENCES t1(c1)
) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1), (3), (4);

--source include/rpl_connection_master1.inc
XA START 'XA1';
INSERT INTO t1 values(2);
XA END 'XA1';

# This transaction will reference the gap where XA1
# was inserted, and will be prepared and committed
# before XA1, so the slave will prepare it (but will
# not commit it) before preparing XA1.
--source include/rpl_connection_master.inc
XA START 'XA2';
INSERT INTO t2 values(3);
XA END 'XA2';

# The XA2 prepare should be binary logged first
XA PREPARE 'XA2';

# The XA1 prepare should be binary logged
# after XA2 prepare and before XA2 commit.
--source include/rpl_connection_master1.inc
XA PREPARE 'XA1';

# The commit order doesn't matter much for the issue being tested.
XA COMMIT 'XA1';
--source include/rpl_connection_master.inc
XA COMMIT 'XA2';

# Everything is fine if the slave can sync with the master.
--source include/sync_slave_sql_with_master.inc

#
# Cleanup
#
SET @@GLOBAL.innodb_limit_optimistic_insert_debug = @saved_innodb_limit_optimistic_insert_debug;

--source include/rpl_connection_master.inc
DROP TABLE t2, t1;

--source include/rpl_end.inc

Suggested fix:
Make XA transactions unsafe for SBR when using mixed binary log format.

Also, we should document well the issue and stop supporting sbr+xa.
[24 Jul 2017 12:49] Margaret Fisher
Posted by developer:
 
Changelog entry for 8.0.2 and 5.7.20:
Now that XA transactions are prepared and committed in two parts,
an issue with statement-based replication has been identified. 
If two XA transactions committed on the master are being prepared 
on the slave in the inverse order, locking dependencies can occur that cannot
be safely resolved. The issue is not present with row-based replication.

XA transactions are therefore now considered unsafe for statement-based 
replication.
- When binlog_format = STATEMENT, a warning is issued for 
DML statements inside XA transactions, and replication might fail with deadlock on slaves.
- When binlog_format = MIXED, DML statements inside 
XA transactions are logged using row-based replication. 
- When binlog_format = ROW, DML statements inside 
XA transactions are logged as before. 

Documentation changes for 8.0.2:
https://dev.mysql.com/doc/refman/8.0/en/xa-restrictions.html
C.6 Restrictions on XA Transactions
... The following restrictions exist for using XA transactions: ...
>> Add:
XA transactions are considered unsafe for statement-based replication. If two XA transactions committed in parallel on the master are being prepared on the slave in the inverse order, locking dependencies can occur that cannot be safely resolved, and it is possible for replication to fail with deadlock on the slave. From MySQL 8.0.2, when <literal>binlog_format = STATEMENT</literal> is set, a warning is issued for DML statements inside XA transactions. When <literal>binlog_format = MIXED</literal> or <literal>binlog_format = ROW</literal> is set, DML statements inside XA transactions are logged using row-based replication, and the potential issue is not present.

https://dev.mysql.com/doc/refman/8.0/en/replication-features-transactions.html
18.4.1.33 Replication and Transactions
>> Add a link to XA transactions restrictions topic above:
For restrictions that apply specifically to XA transactions, see C.6 Restrictions on XA Transactions.

https://dev.mysql.com/doc/refman/8.0/en/replication-rbr-safe-unsafe.html
18.2.1.3 Determination of Safe and Unsafe Statements in Binary Logging
>> Add:
XA transactions. If two XA transactions committed in parallel on the master are being prepared on the slave in the inverse order, locking dependencies can occur with statement-based replication that cannot be safely resolved, and it is possible for replication to fail with deadlock on the slave. From MySQL 8.0.2, when <literal>binlog_format = STATEMENT</literal> is set, DML statements inside XA transactions are flagged as being unsafe and generate a warning. When <literal>binlog_format = MIXED</literal> or <literal>binlog_format = ROW</literal> is set, DML statements inside XA transactions are logged using row-based replication, and the potential issue is not present.
[16 Nov 2017 10:02] Erlend Dahl
Bug#86486 XA Transactions in binlog cause replication broken

was marked as a duplicate
[26 Apr 2021 3:42] jingbo zhao
Nice Job! But your repeat case is hard to repeat(need complie MySQL to use debug mode) and understand. Our team(ksyun) has repeated this error use a sample case.

create table:

CREATE TABLE `test1` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(36) DEFAULT NULL,
  KEY `idx_id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

make data:
mysql> insert into test1() values(1, "zbdba"),(3, "zbdba"),(5, "zbdba"),(6, "zbdba"),(10, "zbdba"),(10, "zbdba"),(15, "zbdba"),(18, "zbdba");
Query OK, 8 rows affected (0.03 sec)
Records: 8  Duplicates: 0  Warnings: 0

start repeat:

transaction 1:
mysql> xa start '111';
Query OK, 0 rows affected (0.00 sec)
mysql> insert into test1() values(8, "zbdba");
Query OK, 1 row affected (0.01 sec)
mysql> xa end '111';
Query OK, 0 rows affected (0.01 sec)

transaction 2:
mysql> xa start '222';
Query OK, 0 rows affected (0.00 sec)
mysql> delete from test1 where id = 10;
Query OK, 2 rows affected (0.00 sec)
mysql> xa end '222';
Query OK, 0 rows affected (0.00 sec)
mysql> xa prepare '222';
Query OK, 0 rows affected (0.00 sec)

transaction 1:
mysql> xa prepare '111';
Query OK, 0 rows affected (0.00 sec)
mysql> xa commit '111';
Query OK, 0 rows affected (0.00 sec)

transaction 2:
mysql> xa commit '222';
Query OK, 0 rows affected (0.00 sec)