Bug #49047 InnoDB deadlock detection is CPU intensive with many locks on a single row
Submitted: 24 Nov 2009 17:55 Modified: 6 Oct 2010 18:11
Reporter: Harrison Fisk Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.0,5.1,plugin OS:Any
Assigned to: John Russell CPU Architecture:Any
Tags: deadlock, innodb
Triage: Triaged: D3 (Medium) / R4 (High) / E3 (Medium)

[24 Nov 2009 17:55] Harrison Fisk
Description:
When there are many locks on a single row in InnoDB, CPU usage will get very large due to deadlock checking.  This is due to the fact that in this case, the complexity will grow at exponential time.  There are defined limits for the depth of the deadlock checking, however it doesn't really resolve the issue entirely.

The lock_deadlock_recursive function is called and loops over every lock previous in the lock list.  For each entry that conflicts, lock_deadlock_recursive is called again, which in turn loops over every lock, etc...  In the case of 200 locks on the same row, this gets very excessive.

I have done some dtrace tests (I will attach the dtrace script) to validate this assumption.

lock_rec_get_prev 203338240
Total queries: 10349

It is worth noting that in the case, that each row only has a few locks (such as 2), the algorithm works fine, it is just when a single locks starts conflicting with lots of locks it goes down hill significantly.

Example oprofile output (note you need to a low optimization compiled server or else the functions get inlined):

samples  %        app name                 symbol name
46873    34.6171  mysqld                   lock_rec_get_next_on_page
35001    25.8493  mysqld                   lock_rec_get_nth_bit
21753    16.0653  mysqld                   lock_rec_get_prev
17929    13.2411  mysqld                   ut_bit_get_nth
7432      5.4888  no-vmlinux               (no symbols)
2519      1.8604  opreport                 (no symbols)
415       0.3065  mysqld                   ut_delay
260       0.1920  mysqld                   lock_rec_add_to_queue
222       0.1640  libstdc++.so.6.0.8       (no symbols)
217       0.1603  mysqld                   lock_get_wait
201       0.1484  oprofiled                (no symbols)
172       0.1270  mysqld                   ut_hash_ulint
116       0.0857  libc-2.5.so              _int_free
107       0.0790  libc-2.5.so              _int_malloc
104       0.0768  mysqld                   ut_rnd_interval
94        0.0694  libc-2.5.so              strchr
86        0.0635  mysqld                   lock_deadlock_recursive
83        0.0613  libc-2.5.so              malloc_consolidate
77        0.0569  mysqld                   lock_has_to_wait
59        0.0436  mysqld                   lock_rec_get_first_on_page_addr
59        0.0436  mysqld                   ut_rnd_gen_next_ulint
56        0.0414  mysqld                   lock_rec_has_to_wait

How to repeat:
CREATE TABLE test.t1 (a int) ENGINE=InnoDB;
INSERT INTO test.t1 VALUES (5);
SET GLOBAL max_connections = 500;

for i in {1..199}; do ( yes "UPDATE t1 SET a=$i;" | mysql -S /tmp/mysql.sock test & ) ;
done

Run oprofile and watch the large CPU usage by the deadlock checking.

Run dtrace and count the calls to lock_rec_get_prev.

Suggested fix:
I see two options:

1.  Allow tuning of the deadlock code via the following options:

   * Allow it to be turned off and only rely on waits
   * Allow LOCK_MAX_DEPTH_IN_DEADLOCK_CHECK and LOCK_MAX_N_STEPS_IN_DEADLOCK_CHECK to be tuned

2.  Find some better way to handle it which doesn't degrade as poorly
[24 Nov 2009 17:58] Harrison Fisk
Dtrace script which counts the number of queries being done and the calls to lock_rec_get_prev

Attachment: count_deadlock_walks.d (text/x-dsrc), 566 bytes.

[24 Nov 2009 18:06] Harrison Fisk
Discussed a bit at:

http://www.facebook.com/MySQLatFacebook?v=wall&ref=nf#/MySQLatFacebook?v=feed&story_fbid=2...
[24 Nov 2009 18:45] Domas Mituzas
It may make sense to have separate error message for this case of lock interruption ("ur chain is too long").
Then tuning the search depth can be much more aggressive, as application could have specific remedies.
[25 Nov 2009 13:22] Mikhail Izioumtchenko
an application that has this kind of hot rows isn't going to get very far anyway. otoh a specific error code that Domas suggests means that the server tells the application that it has hot rows which is an interesting idea. Just writing a warning message in mysqld log could also be useful as long as it sufficiently identifies the offending row.
[25 Nov 2009 14:36] Mark Callaghan
Michael,

That is speculation on your part. My speculative reply is that a DBMS with as many global mutexes as InnoDB isn't likely to be useful for high-throughput OLTP. Shall I begin migration to PBXT or NoSQL today?

Neither your statement nor mine is useful.

The overhead for deadlock detection within InnoDB is too high. This bug is open so that the InnoDB team will look at this overhead.
[25 Nov 2009 20:17] Domas Mituzas
Just noting, that:

* A workload of hundred transactions a second on a row isn't unlikely or extremely bad application design
* A transaction that is held up for various reasons for one second isn't unlikely either. 
* These two issues in combination may cause extremely expensive pileups in under two seconds.
[26 Nov 2009 8:48] Marko Mäkelä
It might help a little to replace the procedure call stack with an explicit search stack or queue, as is often done in graph traversal algorithms. It would essentially be the same algorithm, just a more efficient implementation. This might not buy much in the case of lock_deadlock_recursive(), which only has about 10 register-width parameters and local variables, but it could save a lot of stack space in cascaded foreign key constraints (row_ins_foreign_check_on_constraint() and friends).
[26 Nov 2009 13:57] Heikki Tuuri
In many cases there is NO OTHER transaction in the state: "waiting for a lock". Then no search is needed at all!

There are probable lots of ways to prune the search space. The current InnoDB search is as simple as possible.
[27 Nov 2009 19:05] Harrison Fisk
I was thinking a bit more about this, and the original dtrace numbers do appear correct.

If you think of the pending locks as a lock-wait directed graph, then you have 199 edges from the last node to the previous ones, 198 for the lock before that, etc...

If you sum up the values 199+198+197+...+1, then you get 19900 edges for a graph with 200 entries connected in that way.  General graph cycle detection requires O(E) where E is the number of edges, so this makes sense.
 
Using my original dtrace numbers:

lock_rec_get_prev 203338240
Total queries: 10349

Average per:  19648

So those numbers seem correct for the number of calls.

To get improvements, we need to look at the lock_rec_get_prev call.  From looking at this, it looks like the lock records are a single direction linked list, so the list needs to be traversed each time to find each previous entry.  I suspect this is a large part of the total time spent.

Another option would be to change the direction of the search.  

Currently, the deadlock detection looks for any locks that are blocking the new lock and follows down the graph like that.  A different way to search would be to look for any trx's that are waiting on the current transaction and follow back in that direction.  This seems like it might be inferior when there are very few locks on a given row, but should be superior in this case with many locks on a single row.  In addition, it should be able to use forward traversal of the list which would be faster, than the previous one.
[3 Dec 2009 14:18] Heikki Tuuri
Harrison,

you found the way to prune the search tree :). Search the waits-for graph backwards. There are often no transactions waiting behind the transaction that joins a long lock queue.
[3 Dec 2009 14:25] Harrison Fisk
Hi Heikki,

I'm not 100% sure that searching backwards will always be better.  A common scenario is to have a long running transaction that is doing a lot of changes.  It might have 10-20 transactions waiting on it, but add to a row queue with only 1-2 locks.  In this case the forward would probably still be better.  

The only reason it would always be better would be if the speed of lock_rec_get_next vs. lock_rec_get_prev was much different.  This could be a good thing to do a micro-benchmark of, 20000 calls of lock_rec_get_prev vs. lock_rec_get_next.

I think the best solution is a hybrid of what was discussed:

*  Compare number of locks ahead in row lock queue to number of locks waiting on the trx
*  Decide to do forward or reverse searching based on which is smaller
   -  This automatically handles the case with no locks waiting on it that you mentioned as an optimization since the reverse search with 0 pending would be picked
[4 Dec 2009 14:42] Heikki Tuuri
Harrison,

you are right. If a transaction T already has 1,000,000 locks, it is highly inefficient to search the waits-for graph backwards. The backward search should be limited to relatively few steps, and then switch to a forward search if it does not yield a result.

Some further analysis of the one-row update example: a transaction T does have the table IX lock. The code has to scan the list of table locks. Only a table S-lock or a table X-lock can wait for an IX-lock. We could keep a count of S- and X-locks on a table. Or a count of waiting locks on a table.

Another observation: the current code in lock_deadlock_recursive() uses lock_rec_get_prev() to scan the locks that our transaction T has to wait for. And lock_rec_get_prev() is vey inefficient because the record lock list is a singly linked list. It becomes an O(n^2) complexity problem. We should in the current code start from the FIRST lock on the record and scan the list using lock_rec_get_next() which is very fast.

lock_deadlock_recursive(
...
        /* Look at the locks ahead of wait_lock in the lock queue */

        for (;;) {
                if (lock_get_type_low(lock) & LOCK_TABLE) {

                        lock = UT_LIST_GET_PREV(un_member.tab_lock.locks,
                                                lock);
                } else {
                        ut_ad(lock_get_type_low(lock) == LOCK_REC);
                        ut_a(bit_no != ULINT_UNDEFINED);

                        lock = (lock_t*) lock_rec_get_prev(lock, bit_no);
                }
...
[4 Dec 2009 14:48] Heikki Tuuri
Ok, Harrison had already commented on the get_prev inefficiency :). I should have used lock_rec_get_next() in the first place. The deadlock search algorithm allows us to scan the locks ahead of our lock L in any order we like.
[4 Dec 2009 20:36] Harrison Fisk
I tried a simple modification to switch from using lock_rec_get_prev to lock_rec_get_next and saw a huge improvement for the worst case test.

This patch isn't complete and shouldn't be used for production, but I will attach it. 

I then performed a test as follows:

mysql test -e "UPDATE t1 SET a = 0" && for i in {1..199}; do ( yes "UPDATE t1 SET a=a+1;" | mysql -S /tmp/mysql.sock test & ) ; done && sleep 120 && killall mysql && sleep 5 && mysql -e "select a, a/120 AS per_sec from test.t1"

This does as many updates as it can against a single row test.t1 table.  It then sees how many were done and what the rate per second was.

Without my patch:

+------+---------+
| a    | per_sec |
+------+---------+
| 9264 | 77.2000 |
+------+---------+

With my patch:

+-------+----------+
| a     | per_sec  |
+-------+----------+
| 68462 | 570.5167 | 
+-------+----------+

Test was done against 5.1.43 bzr with 1.0.5 InnoDB plugin.  Settings were as follows:

--ignore-builtin-innodb 
--plugin-load=innodb=ha_innodb_plugin.so 
--plugin-dir=../storage/innodb_plugin/.libs/ 
--skip-grant-tables 
--max-connections=500 
--table-cache=512 
--innodb-flush-log-at-trx-commit=0
[4 Dec 2009 20:37] Harrison Fisk
Small patch to switch from using lock_rec_read_prev to lock_rec_read_next

Attachment: deadlock_test.diff (text/x-patch), 981 bytes.

[4 Dec 2009 20:50] Harrison Fisk
Hrm, after further testing, it seems like my patch doesn't entirely work, but it is certainly fast ;)
[18 Dec 2009 16:28] Mark Callaghan
See http://bugs.mysql.com/bug.php?id=42644
Someone else reported this problem.
[18 Dec 2009 16:31] Mark Callaghan
And I am here today because I am dealing with another pileup from this problem. These are not easy to recognize (unless you have dtrace or are willing to run gdb on a production server).
[21 Dec 2009 14:25] Domas Mituzas
Please allow me to  cross-reference a pseudo-benchmark at:
http://mituzas.lt/2009/12/21/on-deadlock-detection/
[6 Jan 2010 14:48] Harrison Fisk
Corrected patch to switch from using lock_rec_read_prev to lock_rec_read_next

Attachment: deadlock_cpu_v2.diff (text/x-patch), 1.88 KiB.

[6 Jan 2010 14:55] Harrison Fisk
Found and fixed the problem with my small patch (it was skipping the first entry in the lock list).  Table locks are still scanned backwards since it has a double linked list for that.  All InnoDB tests pass in the test suite.

The patch was developed against the plugin, but should work with either version.

I did some more testing with the patch, numbers are updates/second on same row, higher is better:

Runs with default deadlock detection:
118.15
79.95

Runs with my patched version:
732.85
644.32

Runs with deadlock detection completely disabled as a control:
1596.94
1540.30
[21 Jan 2010 2:04] Mark Callaghan
On which version does the patch work? It has this code:

                } else {
                        ut_ad(lock_get_type(lock) == LOCK_REC);
                        ut_a(bit_no != ULINT_UNDEFINED);

                        lock = (lock_t*) lock_rec_get_next(bit_no, lock);

The interface for lock_rec_get_next() in 5.0.84 and 5.1.42 innodb plugin expects rec_t* as the first arg and bit_no is not rec_t.

lock_t*
lock_rec_get_next(
/*==============*/
                        /* out: next lock, NULL if none exists */
        rec_t*  rec,    /* in: record on a page */
        lock_t* lock)   /* in: lock */
[21 Jan 2010 3:02] Harrison Fisk
I developed it against the InnoDB plugin (innodb_plugin directory of 5.1).  I also tested it against mysql-next-mr as well.

In the InnoDB plugin from the most recent 5.1 bzr tree I see it defined as using an ulint heap_no and not rec_t* rec:

UNIV_INLINE
lock_t*
lock_rec_get_next(
/*==============*/
        ulint   heap_no,/*!< in: heap number of the record */
        lock_t* lock)   /*!< in: lock */
{
     ...
}

According to bzr gannotate it has been that way since the initial import in plugin version 1.0.4.  InnoDB built-in has it defined the old way.  

Looking at my patch, I see it is going against the innobase directory and not the innodb_plugin, so it might apply to the wrong directory by default in 5.1.  This is most likely due to the fact that I exported it from mysql-next-mr, rather than 5.1.
[21 Jan 2010 3:43] Mark Callaghan
You are right. The patch works for the plugin. I copied back the missing function to test it in 5.0.84. Thanks.
[19 Mar 2010 18:35] Omer Barnir
- Adding SR55RC tag (had only SRMR03)
[1 Apr 2010 11:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/104826

3473 Sergey Vojtovich	2010-04-01
      Applying InnoDB snapshot, fixes BUG#49047.
      
      Detailed revision comments:
      
      r6534 | sunny | 2010-01-29 23:42:49 +0200 (Fri, 29 Jan 2010) | 15 lines
      branches/zip: Two changes to fix the problem:
      
      1. First scan the joining transaction's locks and check if no other
      transaction is waiting for a lock held by the joining transaction.
      If no other transaction is waiting then  no deadlock an occur and
      we avoid doing an exhaustive search.
      
      2. Change the direction of the lock traversal from backward to forward.
      Previously we traversed backward from the lock that has to wait, the function
      to that fetched the previous node was very inefficient resulting in O(n^2)
      access to the rec lock list.
      
      Fix Bug #49047 InnoDB deadlock detection is CPU intensive with many locks on a single row.
      
      rb://218
[6 Apr 2010 7:58] Bugs System
Pushed into 5.1.46 (revid:sergey.glukhov@sun.com-20100405111026-7kz1p8qlzglqgfmu) (version source revid:svoj@sun.com-20100401151005-c6re90vdvutln15d) (merge vers: 5.1.46) (pib:16)
[24 Apr 2010 20:04] Mark Callaghan
was this pushed to 5.1.46 + 1.0.7? I assume it was from the comment but I don't see it using 5.1 launchpad branch with the 5.1.46 tag
[5 May 2010 15:09] Bugs System
Pushed into 5.1.47 (revid:joro@sun.com-20100505145753-ivlt4hclbrjy8eye) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[19 May 2010 21:03] Harrison Fisk
I see this now in the current 5.1 tree, so it will be part of 5.1.47.  

Any further work would be done in future trees, such as 5.5 or beyond.
[28 May 2010 5:47] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (pib:16)
[28 May 2010 6:17] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[28 May 2010 6:44] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[15 Jun 2010 8:12] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100615080459-smuswd9ooeywcxuc) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (merge vers: 5.1.47) (pib:16)
[15 Jun 2010 8:27] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100615080558-cw01bzdqr1bdmmec) (version source revid:mmakela@bk-internal.mysql.com-20100415070122-1nxji8ym4mao13ao) (pib:16)
[17 Jun 2010 11:48] Bugs System
Pushed into 5.1.47-ndb-7.0.16 (revid:martin.skold@mysql.com-20100617114014-bva0dy24yyd67697) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 12:25] Bugs System
Pushed into 5.1.47-ndb-6.2.19 (revid:martin.skold@mysql.com-20100617115448-idrbic6gbki37h1c) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[17 Jun 2010 13:12] Bugs System
Pushed into 5.1.47-ndb-6.3.35 (revid:martin.skold@mysql.com-20100617114611-61aqbb52j752y116) (version source revid:vasil.dimov@oracle.com-20100331130613-8ja7n0vh36a80457) (merge vers: 5.1.46) (pib:16)
[30 Jun 2010 21:14] John Russell
Text for the change log:

        Deadlock detection could be a bottleneck in InnoDB processing,
        if many transactions attempted to update the same row simultaneously.
        The algorithm has been improved to enhance performance and scalability.
[19 Sep 2010 19:56] Philip Tsai
Hi,

Is this bug really fixed or we should still not have too high an expectation?  It looks like the solution has been pushed to 5.1.47 (or 5.1.46) and beyond.  However, with the latest 5.1.50 (on redhat-linux-gnu x86_64), I still see the deadlock happening with about only 250 Threads_connected where many of them are updating one the same row in the same table, i.e.  "UPDATE [some table] SET [some column] WHERE id = [some id]".

Thanks.
[20 Sep 2010 0:37] James Day
Philip, are you using the plugin version of InnoDB or the built in one?
[20 Sep 2010 18:35] Philip Tsai
The built-in one.  Should I try the plug-in version instead?
[20 Sep 2010 19:23] James Day
Philip, the improvement is only in the plugin version. I'll ask that the release note be updated to say this.
[20 Sep 2010 19:25] James Day
Please update the change log entry to reflect this improvement only being in the InnoDB plugin version, perhaps changing to:

Deadlock detection could be a bottleneck in InnoDB processing, if many transactions attempted to update the same row simultaneously. The algorithm in the InnoDB plugin has been improved to enhance performance and scalability.
[20 Sep 2010 22:18] Philip Tsai
Got it. I will test the plug-in version and see how high the concurrency I can push to. Thanks.
[20 Sep 2010 23:32] James Day
Philip, please do let us know how you get on with it. It's very good to have success, failure or more work still needed feedback about improvements like this one.
[25 Sep 2010 1:48] Philip Tsai
Using the plugin (1.0.11) that comes with 5.1.50, I am able to test up to 1024 transactions limit concurrently without seeing the deadlock exception -- in that sense, the fix looks pretty good.  Too bad that it doesn't look like the Facebook patch of being able to set innodb_deadlock_detection=0 is present in 5.1.x and Innodb plugin as of 1.0.11; otherwise, I like the idea of being to turn it off selectively and see potentially much more performance gain (it makes sense to penalize those deadlock violating apps and let lock wait timeout take care of the rest instead; the overall potential gain looks worthy having this configuration).  On the other hand, perhaps 5.5 offers already more performance improvement, which I will try out later to see how much better using 5.5 is overall on top of 1.0.11 plugin.   Thanks.
[25 Sep 2010 4:08] James Day
Philip, thanks for confirming that it's effective. If you can find a case where it remains a cause of bottlenecks we can revisit this later, in a new bug report. 5.5 looks pretty impressive in many ways but keep watching, we're not done yet.
[6 Oct 2010 18:11] John Russell
Added to change log:

Deadlock detection could be a bottleneck in InnoDB processing, if
many transactions attempted to update the same row simultaneously.
The algorithm has been improved to enhance performance and
scalability, in the InnoDB Plugin for MySQL 5.1, and in InnoDB 1.1
for MySQL 5.5.
[17 Nov 2010 2:46] Mark Callaghan
This commit is described by the commit list entry:
           ------------------------------------------------------------
            revno: 0.3.1987
            committer: sunny
            timestamp: Fri 2010-01-29 21:42:49 +0000
            message:
              branches/zip: Two changes to fix the problem:
              
              1. First scan the joining transaction's locks and check if no other
              transaction is waiting for a lock held by the joining transaction.
              If no other transaction is waiting then  no deadlock an occur and
              we avoid doing an exhaustive search.
              
              2. Change the direction of the lock traversal from backward to forward.
              Previously we traversed backward from the lock that has to wait, the function
              to that fetched the previous node was very inefficient resulting in O(n^2)
              access to the rec lock list.
              
              Fix Bug #49047 InnoDB deadlock detection is CPU intensive with many locks on a single row.
              
              rb://218

A lot of it was undone by this commit including removing lock_trx_has_no_waiters but that was not mentioned here. We are here to help with lots of performance testing capability.

            ------------------------------------------------------------
            revno: 0.3.1994
            committer: sunny
            timestamp: Thu 2010-02-04 14:11:23 +0000
            message:
              branches/zip: Remove the additional check introduced in r6534 which tries
              to check if the joining transaction has any other transactions waiting on
              its locks. This optimization results in excessive deadlocks when running
              Sysbench with a large number of threads. The function seems to return
              FALSE positives.
              
              rb://250
[17 Nov 2010 3:06] Mark Callaghan
diff for the first change: bzr diff -c0.3.1987 
diff for the second change that undoes the first: bzr diff -c0.3.1994