Description:
MySQL 5.6.5-m8 meets deadlock situations that are not resolved when many clients access small number of records via transactions concurrently.
How to repeat:
We used an OLTP benchmark (DBT-2) from OSDLDBT [1]. DBT-2 is known as an implementation of the TPC-C benchmark specification. The MySQL server is running on a 24-core Intel Xeon server that has a total of four 1.86 GHz Intel Xeon MP Processor 8000 series chips (each chip with 6 cores).The server is further equipped with 512 GB of RAM. We used MySQL 5.6.5-m8 (currently, the newest version) and Linux 3.1.5 as operating system. A memory-based file system (tmpfs) was used for storage of DBT-2.
We used the default configuration except for the following configurations :
max_connections : 1500
table_cache : 1000
innodb_buffer_pool_size : 1GB
innodb_log_file_size : 256 M
innodb_log_buffer_size : 16 M
innodb_flush_method : fsync
innodb_flush_log_at_trx_commit : 2
innodb_lock_wait_timeout : 7200
transaction-isolation : SERIALIZABLE
We first performed a benchmark test under 1 warehouse (WH) and 10 clients (C). In the test, we observed extremely many rollbacks. You can see results of the test through downloading a text file from [2].
Then, we checked internal states of InnoDB through a MySQL command (“show engine innodb status”). You can see the states of InnoDB from [3]. We observed that several transactions (3385, 3377, and 3376 txns) were waiting for a long time (203s). They are active and concurrent, and they tried to update the only row of warehouse table. We think that transactions made a deadlock situation but InnoDB did not resolve it.
We did detailed analysis of the deadlock. To this end, we injected some debug code to the InnoDB source. Code modification was done only in two functions of the storage/innobase/lock/lock0lock.cc file, and you can download the modified source file from [4]. The first function where we left debug messages is "lock_rec_enqueue_ waiting", and it handles a waiting request for a lock which cannot be granted immediately. We left debug messages for all return values (DB_DEADLOCK, DB_WAIT, and DB_SUCCESS_LOCKED_REC) and their corresponding data (transaction id and lock mode). DB_DEADLOCK means that a deadlock situation is detected (and the deadlock will be resolved via rollback), and DB_WAIT means that the transaction requesting a lock is safe (the lock request will be added to a lock waiting list). DB_ SUCCESS_LOCKED_REC means that there was a deadlock, but another transaction was chosen as a victim. The other function is "lock_rec_dequeue_from_page" that removes a record lock request, waiting or granted, from the waiting list. We also left debug messages which transactions release their lock requests.
You can see the MySQL log file from [5]. If you open the file, you can see many messages starting with DEADLOCK, ENQ, or DEQ. DEADLOCK is for "DB_DEADLOCK" and ENQ is for "DB_WAIT." DEQ is a debug message in the lock_rec_dequeue_from_ page function. The first and second number of DEADLOCK and ENQ is transaction id and lock mode (1026: shared and 1027: exclusive). All debug messages that related only to a single record of the warehouse table were left.
From line 131, you can see the following:
ENQ 3376 1026 --> the shared lock for txn 3376 is successfully added to waiting list.
ENQ 3377 1026 --> the shared lock for txn 3377 is successfully added to waiting list.
DEADLOCK 3371 1027
DEQ 3371
DEQ 3371
ENQ 3378 1026
DEQ 3360
ENQ 3380 1026
DEQ 3367
ENQ 3382 1026
ENQ 3383 1026
DEQ 3364
ENQ 3385 1026 --> the shared lock for txn 3385 is successfully added to waiting list.
DEQ 3373
ENQ 3387 1026
DEQ 3356
DEQ 3356
ENQ 3376 1027 --> the exclusive lock for txn 3376 is successfully added to waiting list and the txn waits for txn 3377 and txn 3385.
DEADLOCK 3383 1027
DEQ 3383
DEQ 3383
ENQ 3385 1027 --> the exclusive lock for txn 3385 is successfully added to waiting list and the txn waits for txn 3376 and txn 3377.
ENQ 3377 1027 --> the exclusive lock for txn 3377 is successfully added to waiting list and the txn waits for txn 3376 and txn 3385.
Above log messages show that the three transactions were waiting for each other (a deadlock situation) and that the deadlock situation was not resolved. We think that the deadlock situation should be detected when handling the exclusive lock request for txn 3385 and txn 3377. And after that, all txns after 3377 failed.
[1] http://sourceforge.net/apps/mediawiki/osdldbt/index.php?title=Main_Page
[2] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/dbt2-result.txt
[3] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/innodb-state.txt
[4] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/lock0lock.cc.log.txt
[5] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/mysql-log.txt
Suggested fix:
The “lock_deadlock_search” function detects deadlock situations by traversing wait-for graph. However, some stack operations in the procedure make side effects and this prevents the system from traversing the wait-for graph fully. We changed to the code so that MySQL does not lead to the unexpected situation, and you can download the patch file from [6]. However, we are not sure whether our modification reflects on intentions of InnoDB developers fully or not.
We performed additional tests of 1 warehouse with varying the number of clients. We used modified MySQL 5.6.5-m8, original MySQL 5.6.5-m8, and MySQL 5.5.23, and you can see performance results from [7].
[6] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/lock0lock.cc.diff.txt
[7] http://dcslab.snu.ac.kr/~hhyuck/mysql-bug-report/throughput-wh-1.png