Bug #74177 | --slave-preserve-commit-order causes slave to deadlock and break for some querie | ||
---|---|---|---|
Submitted: | 1 Oct 2014 13:51 | Modified: | 26 Jun 2015 16:01 |
Reporter: | Kristian Nielsen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.7.5, 5.7.6 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[1 Oct 2014 13:51]
Kristian Nielsen
[3 Dec 2014 13:23]
MySQL Verification Team
Hello Kristian Nielsen, Thank you for the report and test case. With the provided test case, observed a timeout on slave and failure. Thanks, Umesh
[3 Dec 2014 13:23]
MySQL Verification Team
// 5.7.6 [umshastr@hod03]/export/umesh/mysql-5.7.6/mysql-test: ./mtr --debug rpl_knielsen Logging: ./mtr --debug rpl_knielsen MySQL Version 5.7.6 Checking supported features... - SSL connections supported - binaries are debug compiled Collecting tests... Checking leftover processes... Removing old var directory... Creating var directory '/export/umesh/mysql-5.7.6/mysql-test/var'... Installing system database... ============================================================================== TEST RESULT TIME (ms) or COMMENT -------------------------------------------------------------------------- connection slave; ERROR: sync_slave_sql.inc failed on connection 'slave' ERROR: use_gtids='' ERROR: _saved_gtids='' ERROR: _saved_file='master-bin.000001' ERROR: _saved_pos='5536' ERROR: timeout='300' ERROR: result='-1' ERROR: error type: Timeout after 300 seconds. main.rpl_knielsen [ fail ] Test ended at 2014-12-03 13:10:40 CURRENT_TEST: main.rpl_knielsen mysqltest: [Warning] /export/umesh/mysql-5.7.6/bin/mysqltest: Option 'debug=d:t:A,/export/umesh/mysql-5.7.6/mysql-test/var/log/mysqltest.trace' used, but is disabled mysqltest: In included file ./include/sync_slave_sql.inc at line 136: included from /export/umesh/mysql-5.7.6/mysql-test/t/rpl_knielsen.test at line 52: At line 135: Error in sync_with_master.inc - the logfile can be found in '/export/umesh/mysql-5.7.6/mysql-test/var/log/main.rpl_knielsen/rpl_knielsen.log' -------------------------------------------------------------------------- The servers were restarted 0 times Spent 0.000 of 310 seconds executing testcases Completed: Failed 1/1 tests, 0.00% were successful. Failing test(s): main.rpl_knielsen ## build used // build used commit: 1982dd7ab813e3cdb28d3bb119cc7b706cf01b15 date: 2014-11-27 13:27:39 +0300 build-date: 2014-11-27 12:38:45 +0100 short: 1982dd7 branch: mysql-trunk MySQL source 5.7.6
[3 Dec 2014 13:28]
MySQL Verification Team
// In the second attempt extracted transaction details from slave // ------------ TRANSACTIONS ------------ Trx id counter 1368 Purge done for trx's n:o < 1364 undo n:o < 0 state: running but idle History list length 24 Total number of lock structs in row lock hash table 10 LIST OF TRANSACTIONS FOR EACH SESSION: ---TRANSACTION 421154230976672, not started 0 lock struct(s), heap size 1160, 0 row lock(s) ---TRANSACTION 421154230969904, not started 0 lock struct(s), heap size 1160, 0 row lock(s) ---TRANSACTION 421154230967648, not started 0 lock struct(s), heap size 1160, 0 row lock(s) ---TRANSACTION 421154230966520, not started 0 lock struct(s), heap size 1160, 0 row lock(s) ---TRANSACTION 1367, ACTIVE 29 sec updating or deleting mysql tables in use 1, locked 1 LOCK WAIT 3 lock struct(s), heap size 1160, 2 row lock(s), undo log entries 1 MySQL thread id 10, OS thread handle 139679249843968, query id 267 updating UPDATE t3 SET b=NULL WHERE a=6 ------- TRX HAS BEEN WAITING 29 SEC FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 22 page no 4 n bits 80 index b_idx of table `test`.`t3` trx id 1367 lock_mode X locks gap before rec insert intention waiting Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 32 0: len 4; hex 80000002; asc ;; 1: len 4; hex 80000002; asc ;; ------------------ ---TRANSACTION 1366, ACTIVE 29 sec updating or deleting mysql tables in use 1, locked 1 LOCK WAIT 3 lock struct(s), heap size 1160, 2 row lock(s), undo log entries 1 MySQL thread id 6, OS thread handle 139679250908928, query id 268 updating UPDATE t1 SET b=NULL WHERE a=6 ------- TRX HAS BEEN WAITING 29 SEC FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 20 page no 4 n bits 80 index b_idx of table `test`.`t1` trx id 1366 lock_mode X locks gap before rec insert intention waiting Record lock, heap no 3 PHYSICAL RECORD: n_fields 2; compact format; info bits 32 0: len 4; hex 80000002; asc ;; 1: len 4; hex 80000002; asc ;; ------------------ ---TRANSACTION 1333, ACTIVE (PREPARED) 244 sec 3 lock struct(s), heap size 1160, 4 row lock(s), undo log entries 1 MySQL thread id 9, OS thread handle 139679250110208, query id 239 Waiting for its turn to commit. ---TRANSACTION 1332, ACTIVE (PREPARED) 244 sec 3 lock struct(s), heap size 1160, 4 row lock(s), undo log entries 1 MySQL thread id 11, OS thread handle 139679249577728, query id 243 Waiting for its turn to commit. ---TRANSACTION 1331, ACTIVE (PREPARED) 244 sec 3 lock struct(s), heap size 1160, 4 row lock(s), undo log entries 1 MySQL thread id 7, OS thread handle 139679250642688, query id 235 Waiting for its turn to commit. ---TRANSACTION 1330, ACTIVE (PREPARED) 244 sec 1 lock struct(s), heap size 1160, 0 row lock(s), undo log entries 1 MySQL thread id 8, OS thread handle 139679250376448, query id 237 Waiting for its turn to commit.
[22 May 2015 12:13]
Marko Mäkelä
Posted by developer: The scenario in the bug description demonstrates that InnoDB record locking is not symmetric. First of all, we should keep in mind that InnoDB does not have a concept of table row locks. Instead, it has two primary kinds of locks: table locks, and index record locks. We have a possible conflict between two single-statement transactions: UPDATE and DELETE. Internally, both transactions are implemented as UPDATE inside InnoDB. We will first x-lock the matching index records, and then update them. Let me show it in detail: The two single-statement transactions that we have are: T1: UPDATE t4 SET b=NULL WHERE a=6; T2: DELETE FROM t4 WHERE b <= 3; T1 would do the following: 1. Look up a=6 in the clustered index. 2. X-lock the matching clustered index record (a,b)=(6,6). 3. X-lock the matching secondary index record (b,a)=(6,6). 4. Write an update_undo record covering the changes in 5,6,7. 5. Update the clustered index record (6,6) to (6,NULL). 6. Delete-mark the secondary index record (6,6). 7. Insert the secondary index record (6,NULL). Steps 2,3,4 may be in a different order. T2 could do the following: 1. Search for b<=3 in the secondary index. This might not match any recs locked or modified by T1. 2. X-lock the matching records in the secondary index. Note: Bug#3300 introduced the sloppy 'semi-consistent' semantics that non-matching rows will not be locked! 3. X-lock the corresponding records in the clustered index. 4. For each match: 4.1. write update_undo records for the DELETE operation 4.2. delete-mark the clustered index record 4.3. delete-mark the secondary index record
[22 May 2015 12:19]
Marko Mäkelä
Posted by developer: The asymmetry in my example is that the UPDATE will not block the DELETE, but the DELETE would block the UPDATE. The DELETE is only matching the record (2,2). Before the UPDATE and the DELETE, the internal order of records in the secondary index (b,a) would be (NULL,1),(NULL,3),(NULL,5),(2,2),(4,4),(6,6). Internally, the secondary index record inserted by the UPDATE is (NULL,6), and it would be ordered right before the (2,2) that was probably record-and-gap locked by the DELETE. Maybe this asymmetry would go away (and the UPDATE would block the DELETE) if the fix of Bug#3300 was disabled?
[26 Jun 2015 16:01]
David Moss
The following was added to the 5.7.8 changelog: When using a multi-threaded slave with slave_preserve_commit_order=1, certain combinations of transactions being applied in parallel could cause a deadlock and stop the slave responding. The fix introduces a check for such deadlocks, so that when a transaction needs to wait for another transaction to release a row lock, InnoDB checks if there is a deadlock caused by the commit order. If it finds a deadlock caused by the commit order, it sets a deadlock flag for the slave worker which is holding the row lock. Then the worker rolls back its transaction and tries again.