Bug #65326 Deadlock situations that are not resolved in MySQL 5.6.5-m8
Submitted: 16 May 2012 2:53 Modified: 25 Jan 2013 18:26
Reporter: Huck Han Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:MySQL 5.6.5-m8, InnoDB 1.2.5 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution, deadlock, innodb

[16 May 2012 2:53] Huck Han
Description:
MySQL 5.6.5-m8 meets deadlock situations that are not resolved when many clients access small number of records via transactions concurrently.

How to repeat:
We used an OLTP benchmark (DBT-2) from OSDLDBT [1]. DBT-2 is known as an implementation of the TPC-C benchmark specification. The MySQL server is running on a 24-core Intel Xeon server that has a total of four 1.86 GHz Intel Xeon MP Processor 8000 series chips (each chip with 6 cores).The server is further equipped with 512 GB of RAM. We used MySQL 5.6.5-m8 (currently, the newest version) and Linux 3.1.5 as operating system. A memory-based file system (tmpfs) was used for storage of DBT-2.

We used the default configuration except for the following configurations :

max_connections : 1500 
table_cache : 1000 
innodb_buffer_pool_size : 1GB 
innodb_log_file_size : 256 M 
innodb_log_buffer_size : 16 M 
innodb_flush_method : fsync 
innodb_flush_log_at_trx_commit : 2 
innodb_lock_wait_timeout : 7200 
transaction-isolation :  SERIALIZABLE

We first performed a benchmark test under 1 warehouse (WH) and 10 clients (C). In the test, we observed extremely many rollbacks. You can see results of the test through downloading a text file from [2].

Then, we checked internal states of InnoDB through a MySQL command (“show engine innodb status”). You can see the states of InnoDB from [3]. We observed that several transactions (3385, 3377, and 3376 txns) were waiting for a long time (203s).  They are active and concurrent, and they tried to update the only row of warehouse table. We think that transactions made a deadlock situation but InnoDB did not resolve it.  

We did detailed analysis of the deadlock. To this end, we injected some debug code to the InnoDB source. Code modification was done only in two functions of the storage/innobase/lock/lock0lock.cc file, and you can download the modified source file from [4]. The first function where we left debug messages is "lock_rec_enqueue_ waiting", and it handles a waiting request for a lock which cannot be granted immediately. We left debug messages for all return values (DB_DEADLOCK, DB_WAIT, and DB_SUCCESS_LOCKED_REC) and their corresponding data (transaction id and lock mode). DB_DEADLOCK means that a deadlock situation is detected (and the deadlock will be resolved via rollback), and DB_WAIT means that the transaction requesting a lock is safe (the lock request will be added to a lock waiting list). DB_ SUCCESS_LOCKED_REC means that there was a deadlock, but another transaction was chosen as a victim. The other function is "lock_rec_dequeue_from_page" that removes a record lock request, waiting or granted, from the waiting list. We also left debug messages which transactions release their lock requests. 

You can see the MySQL log file from [5]. If you open the file, you can see many messages starting with DEADLOCK, ENQ, or DEQ. DEADLOCK is for "DB_DEADLOCK" and ENQ is for "DB_WAIT." DEQ is a debug message in the lock_rec_dequeue_from_ page function. The first and second number of DEADLOCK and ENQ is transaction id and lock mode (1026: shared and 1027: exclusive). All debug messages that related only to a single record of the warehouse table were left. 

From line 131, you can see the following: 

ENQ 3376 1026 --> the shared lock for txn 3376 is successfully added to waiting list. 
ENQ 3377 1026 --> the shared lock for txn 3377 is successfully added to waiting list. 
DEADLOCK 3371 1027 
DEQ 3371 
DEQ 3371 
ENQ 3378 1026 
DEQ 3360 
ENQ 3380 1026 
DEQ 3367 
ENQ 3382 1026 
ENQ 3383 1026 
DEQ 3364 
ENQ 3385 1026 --> the shared lock for txn 3385 is successfully added to waiting list. 
DEQ 3373 
ENQ 3387 1026 
DEQ 3356 
DEQ 3356 
ENQ 3376 1027 --> the exclusive lock for txn 3376 is successfully added to waiting list and the txn waits for txn 3377 and txn 3385.
DEADLOCK 3383 1027 
DEQ 3383 
DEQ 3383 
ENQ 3385 1027  --> the exclusive lock for txn 3385 is successfully added to waiting list and the txn waits for txn 3376 and txn 3377.
ENQ 3377 1027  --> the exclusive lock for txn 3377 is successfully added to waiting list and the txn waits for txn 3376 and txn 3385.

Above log messages show that the three transactions were waiting for each other (a deadlock situation) and that the deadlock situation was not resolved. We think that the deadlock situation should be detected when handling the exclusive lock request for txn 3385 and txn 3377. And after that, all txns after 3377 failed. 

[1] http://sourceforge.net/apps/mediawiki/osdldbt/index.php?title=Main_Page
[2] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/dbt2-result.txt
[3] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/innodb-state.txt
[4] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/lock0lock.cc.log.txt
[5] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/mysql-log.txt

Suggested fix:
The “lock_deadlock_search” function detects deadlock situations by traversing wait-for graph. However, some stack operations in the procedure make side effects and this prevents the system from traversing the wait-for graph fully. We changed to the code so that MySQL does not lead to the unexpected situation, and you can download the patch file from [6]. However, we are not sure whether our modification reflects on intentions of InnoDB developers fully or not. 

We performed additional tests of 1 warehouse with varying the number of clients. We used modified MySQL 5.6.5-m8, original MySQL 5.6.5-m8, and MySQL 5.5.23, and you can see performance results from [7].

[6] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/lock0lock.cc.diff.txt
[7] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/throughput-wh-1.png
[26 May 2012 16:47] Huck Han
patch file

Attachment: lock0lock.cc.diff.txt (text/plain), 2.20 KiB.

[1 Jun 2012 12:23] Sveta Smirnova
Thank you for the report.

You have quite big innodb_lock_wait_timeout : 7200. Will behavior change if change it to default value: 50?
[4 Jun 2012 7:09] Huck Han
In previous experiments, all transactions that make a deadlock situation are aborted due to the lock timeout. Then, all transactions will be normally processed until another deadlock situation occurs. So, deadlock situations cannot be avoided even though the lock timout changes to the default value.
[25 Jan 2013 15:51] MySQL Verification Team
Looks like internal bug:
Bug 16169638 - INNODB DEADLOCK DETECTION RETURNS FALSE NEGATIVES IN SOME CASES
[25 Jan 2013 17:03] Sveta Smirnova
Thank you for the report.

This is duplicate of internal bug.
[25 Jan 2013 17:41] Valeriy Kravchuk
Sveta,

Let me ask: was this internal bug reported before this one, before "16 May 2012 2:53"? I truly hope so... 

Can you share the bug number for internal bug, so that we can check release notes and know if it is already fixed?