Bug #85447 Slave SQL thread locking issue on a certain XA workload on master
Submitted: 14 Mar 2017 13:09 Modified: 14 Mar 2017 13:43
Reporter: Laurynas Biveinis (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7, 5.7.17 OS:Any
Assigned to: CPU Architecture:Any

[14 Mar 2017 13:09] Laurynas Biveinis
Description:
A certain workload involving large rows, foreign keys and XA transcations on interleaved user connections on master will leave slave unable to replay it. 

Run the MTR testcase in "How to repeat", and:
- sync_slave_sql_with_master.inc will block on a lock:

SHOW SLAVE STATUS\G
...
      Slave_SQL_Running_State: System lock
...
Finally giving up:
Last_Errno: 1205
Last_Error: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

There is information in InnoDB status too, but I haven't saved it right now. The testcase is easily perturbed, i.e. insert one less row or 10 smaller row to t1 and the issue goes away.

How to repeat:
-master.opt, -slave.opt:
--gtid-mode=ON --log-slave-updates --enforce-gtid-consistency --lower-case-table-names --transaction-isolation=read-committed

.test:
--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc
--source include/count_sessions.inc

CREATE TABLE t1 (t1_pk INT PRIMARY KEY, t1_blob BLOB) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1, REPEAT("a", 65000));
INSERT INTO t1 VALUES (2, REPEAT("a" ,65000));
INSERT INTO t1 VALUES (3, REPEAT("a", 5165));
INSERT INTO t1 VALUES (4, REPEAT("a", 5158));

CREATE TABLE t2 (t2_pk INT PRIMARY KEY, t1_pk INT,
  FOREIGN KEY (t1_pk) REFERENCES t1 (t1_pk)) ENGINE=InnoDB;

XA START '1';

--connect(con2,localhost,root)
XA START '2';
INSERT INTO t1 VALUES (5, NULL);

--connection master
INSERT INTO t1 VALUES (6, REPEAT("a", 8000));

INSERT INTO t2 VALUES (4, 6);

XA END '1';
XA PREPARE '1';

--connection con2
XA END '2';
XA PREPARE '2';

--connection master
XA COMMIT '1';

--connection con2
XA COMMIT '2';

--source include/sync_slave_sql_with_master.inc

--disconnect con2

--connection master
DROP TABLE t2, t1;
--source include/wait_until_count_sessions.inc

--source include/rpl_end.inc
[14 Mar 2017 13:43] Umesh Shastry
Hello Laurynas,

Thank you for the report and test case.

Thanks,
Umesh
[30 Aug 2017 23:16] Kenny Gryp
This is what you would get in INNODB_LOCKS:
perconaserver mysql> select * from information_schema.INNODB_LOCKS;
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+
| lock_id          | lock_trx_id | lock_mode | lock_type | lock_table | lock_index | lock_space | lock_page | lock_rec | lock_data              |
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+
| 2238925:347:13:1 | 2238925     | X         | RECORD    | `xa`.`t1`  | PRIMARY    |        347 |        13 |        1 | supremum pseudo-record |
| 2238923:347:13:1 | 2238923     | S         | RECORD    | `xa`.`t1`  | PRIMARY    |        347 |        13 |        1 | supremum pseudo-record |
+------------------+-------------+-----------+-----------+------------+------------+------------+-----------+----------+------------------------+

SHOW ENGINE INNODB TRANSACTIONS output:

---TRANSACTION 2238938, ACTIVE 2 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 18, OS thread handle 140294237603584, query id 96216 System lock
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238938 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

------------------
TABLE LOCK table `xa`.`t1` trx id 2238938 lock mode IX
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238938 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

---TRANSACTION 2238923, ACTIVE (PREPARED) 1085 sec recovered trx
4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 2
TABLE LOCK table `xa`.`t1` trx id 2238923 lock mode IX
TABLE LOCK table `xa`.`t2` trx id 2238923 lock mode IX
RECORD LOCKS space id 347 page no 13 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238923 lock mode S
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

RECORD LOCKS space id 347 page no 14 n bits 72 index PRIMARY of table `xa`.`t1` trx id 2238923 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 80000006; asc     ;;
 1: len 6; hex 0000002229cb; asc    ") ;;
 2: len 7; hex ee000001d80110; asc        ;;
 3: len 30; hex 616161616161616161616161616161616161616161616161616161616161; asc aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa; (total 8000 bytes);
[30 Aug 2017 23:17] Kenny Gryp
Given the fact that this is also a 'supremum pseudo-record' issue. This bug might be very likely related: https://bugs.mysql.com/bug.php?id=86486
[30 Aug 2017 23:19] Kenny Gryp
NOTE FOR PEOPLE HAVING THIS ISSUE: 

You can 'workaround' this issue by restarting the database, which causes the locks by the XA transaction that is already prepared to be gone and replication will just continue.

(no need to restore slaves from backups, skip binlog events...)
[27 Sep 2017 23:21] Vlad Lesin
The issue has been already fixed in 5.7.18 with the following commit:

commit c90aba97ee76f65770f6b4641edef1c32db58279
Author: Debarun Banerjee <debarun.banerjee@oracle.com>
Date: Sun Nov 13 10:31:35 2016 +0530

    BUG#25082593 FOREIGN KEY VALIDATION DOESN'T NEED TO ACQUIRE GAP LOCK
                 IN READ COMMITTED
[18 Oct 2017 11:27] Kenny Gryp
The test case from Laurynas does not fail in 5.7.19. IMO this bug can be closed as fixed.
[31 Oct 2017 0:48] Jesper wisborg Krogh
Posted by developer:
 
The test case does not reproduce for me in 5.7.18. I agree, this is most likely a duplicate of Bug 25082593 which was fixed in 5.7.18 with the following change log entry:

A gap lock was taken unnecessarily during foreign key validation while
using the READ COMMITTED isolation level.