Bug #51093 Crash (possibly stack overflow) in MDL_lock::find_deadlock
Submitted: 11 Feb 2010 13:45
Modified: 7 Mar 2010 1:00
Reporter: John Embretsen
Status: Closed
Category: MySQL Server: Locking
Severity: S1 (Critical)
Version: mysql-next-4284
OS: Solaris (SPARC)
Assigned to: Dmitry Lenev
CPU Architecture: Any
Tags: pushbuild, rqg_pb2, test failure

[11 Feb 2010 13:45] John Embretsen
Description:
The Random Query Generator test 'rqg_mdl_stability' fails with a segmentation fault on the Solaris 10 SPARC platform in Pushbuild, with the following stack trace:

=>[1] MDL_object_lock::incompatible_granted_types_bitmap(this = <bad address 0x734fff60>), at 0x1006eb1e4
  ---- 1 following frame from gwindows -- possible stack overflow
  [2] MDL_ticket::is_incompatible_when_granted(this = 0x1016b5db0, type = MDL_SHARED_WRITE), at 0x1006e6bb4
  [3] MDL_lock::find_deadlock(this = 0x103281040, waiting_ticket = 0x1033a2030, deadlock_ctx = 0xffffffff73539310), at 0x1006e8840
  [4] MDL_context::find_deadlock(this = 0x102fc3880, deadlock_ctx = 0xffffffff73539310), at 0x1006e8df8
  [5] MDL_lock::find_deadlock(this = 0x103281040, waiting_ticket = 0x1016b5b80, deadlock_ctx = 0xffffffff73539310), at 0x1006e8af8
  [6] MDL_context::find_deadlock(this = 0x103229f80, deadlock_ctx = 0xffffffff73539310), at 0x1006e8df8
  [7] MDL_lock::find_deadlock(this = 0x103281040, waiting_ticket = 0x1033a2030, deadlock_ctx = 0xffffffff73539310), at 0x1006e8c30
  [8] MDL_context::find_deadlock(this = 0x102fc3880, deadlock_ctx = 0xffffffff73539310), at 0x1006e8df8
  [9] MDL_lock::find_deadlock(this = 0x103281040, waiting_ticket = 0x1016b5b80, deadlock_ctx = 0xffffffff73539310), at 0x1006e8af8
  [10] MDL_context::find_deadlock(this = 0x103229f80, deadlock_ctx = 0xffffffff73539310), at 0x1006e8df8
  [11] MDL_lock::find_deadlock(this = 0x103281040, waiting_ticket = 0x1033a2030, deadlock_ctx = 0xffffffff73539310), at 0x1006e8c30

(...) [repeated 100+ times]

The test appears to have failed this way since 2010-02-01, though possibly not in every run, and the stack trace has varied slightly. For example, the top two stack frames (referencing "incompatible when granted") were not always included on the 32-bit platform.

A crash in the same area of the code is reported in http://bugs.mysql.com/bug.php?id=50787.

The same test has also been run on Linux 32-bit, Windows 32-bit and Solaris x86 64-bit, but the issue has not been observed there. It has been seen only on Solaris 10 SPARC (32- and 64-bit).

Reproduced with bzr branch mysql-next-4284 and lp:randgen as of 2010-02-10.

How to repeat:
Set the environment variable N4284 to the top-level directory of a set of
binaries built from bzr branch mysql-next-4284.

Obtain a recent version of the Random Query Generator, e.g. by:
bzr branch lp:randgen

cd randgen

Run:

perl ./runall.pl \
--grammar=conf/metadata_stability.yy \
--gendata=conf/metadata_stability.zz \
--validator=SelectStability,QueryProperties \
--engine=Innodb \
--mysqld=--loose-innodb-lock-wait-timeout=5 \
--mysqld=--table-lock-wait-timeout=5 \
--mysqld=--loose-skip-safemalloc \
--mysqld=--innodb \
--mysqld=--default-storage-engine=Innodb \
--mysqld=--transaction-isolation=SERIALIZABLE \
--mysqld=--innodb-flush-log-at-trx-commit=2 \
--mysqld=--table-lock-wait-timeout=1 \
--mysqld=--innodb-lock-wait-timeout=1 \
--mysqld=--log-output=file \
--queries=1M \
--duration=600 \
--reporters=Deadlock,ErrorLog,Backtrace,Shutdown \
--basedir=$N4284
[11 Feb 2010 13:59] John Embretsen
Stacktraces from more threads (from dbx in Pushbuild, sol10 sparc64).

Attachment: bug51093_stacktraces.txt (text/plain), 48.02 KiB.

[15 Feb 2010 8:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100327

3096 Dmitry Lenev	2010-02-15
      Fix for bug #51093 "Crash (possibly stack overflow) in 
      MDL_lock::find_deadlock".
      
      On some platforms deadlock detector in metadata locking
      subsystem under certain conditions might have exhausted
      stack space causing server crashes.
      
      Particularly this caused failures of rqg_mdl_stability
      test on Solaris in PushBuild.
      
      During a search for a deadlock, the MDL deadlock detector could
      sometimes encounter a loop in the waiters graph in which the
      MDL_context that started the search does not participate.
      In such a case our algorithm will continue looping, assuming
      that either this deadlock will be resolved by the MDL_context
      that created it (i.e. by one of the loop participants) or the
      maximum search depth will be reached.
      Since the maximum search depth was set to 1000, in the latter
      case we might have exhausted stack space on platforms where
      each iteration of the deadlock search algorithm needs more
      than DEFAULT_STACK_SIZE/1000 bytes of stack (around 192 bytes
      for 32-bit and around 256 bytes for 64-bit platforms).
      
      This patch solves this problem by reducing maximum search
      depth for MDL deadlock detector to 100. This should be safe
      at the moment as it is unlikely that each iteration of the 
      current deadlock detector algorithm will consume more than 
      512 bytes of stack (thus total amount of stack required 
      can't be more than 512*100 bytes) and we require at least 
      80K of stack in order to open any table.
      
      Additional research should be conducted in the future to
      determine a more optimal value for the maximum search depth.
      
      This patch does not include test case as existing
      rqg_mdl_stability test can serve as one.
[15 Feb 2010 12:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100368

3098 Dmitry Lenev	2010-02-15
      Fix for bug #51093 "Crash (possibly stack overflow) in 
      MDL_lock::find_deadlock".
      
      On some platforms deadlock detector in metadata locking 
      subsystem under certain conditions might have exhausted
      stack space causing server crashes.
      
      Particularly this caused failures of rqg_mdl_stability
      test on Solaris in PushBuild.
      
      During a search for a deadlock, the MDL deadlock detector could
      sometimes encounter a loop in the waiters graph in which the
      MDL_context that started the search does not participate.
      In such a case our algorithm will continue looping, assuming
      that either this deadlock will be resolved by the MDL_context
      that created it (i.e. by one of the loop participants) or the
      maximum search depth will be reached.
      Since the maximum search depth was set to 1000, in the latter
      case we might have exhausted stack space on platforms where
      each iteration of the deadlock search algorithm needs more
      than DEFAULT_STACK_SIZE/1000 bytes of stack (around 192 bytes
      for 32-bit and around 256 bytes for 64-bit platforms).
      
      This patch solves this problem by reducing maximum search
      depth for MDL deadlock detector to 32. This should be safe
      at the moment as it is unlikely that each iteration of the 
      current deadlock detector algorithm will consume more than 
      1K of stack (thus total amount of stack required can't be
      more than 32K) and we require at least 80K of stack in order
      to open any table. Also, this value should (hopefully) be big
      enough not to cause too many false deadlock errors (there
      is anecdotal evidence that real-life deadlocks are
      typically shorter than that).
      
      Additional research should be conducted in the future to
      determine a more optimal value for the maximum search depth.
      
      This patch does not include test case as existing
      rqg_mdl_stability test can serve as one.
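
The stack budget reasoning in the commit messages above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are made up for this example, and the 192K and 256K default stack sizes are an assumption inferred from the "around 192 bytes" and "around 256 bytes" figures, not values taken from the server source.

```cpp
#include <cstddef>

// Assumption: default thread stack sizes implied by the per-iteration figures
// quoted in the commit message (about 192 and 256 bytes at depth 1000).
constexpr std::size_t kStack32 = 192 * 1024;
constexpr std::size_t kStack64 = 256 * 1024;

// Figure quoted in the commit message: stack needed to open any table.
constexpr std::size_t kMinStackToOpenTable = 80 * 1024;

// Bytes each iteration may use before a search of max_depth overflows.
constexpr std::size_t frame_budget(std::size_t stack_bytes,
                                   std::size_t max_depth) {
  return stack_bytes / max_depth;
}

// Stack consumed if every iteration uses frame_bytes and max_depth is reached.
constexpr std::size_t worst_case_stack(std::size_t frame_bytes,
                                       std::size_t max_depth) {
  return frame_bytes * max_depth;
}

// Depth 1000 leaves only a couple of hundred bytes per iteration.
static_assert(frame_budget(kStack32, 1000) == 196, "32-bit budget at depth 1000");
static_assert(frame_budget(kStack64, 1000) == 262, "64-bit budget at depth 1000");

// First revision of the patch: depth 100, at most 512 bytes per iteration.
static_assert(worst_case_stack(512, 100) < kMinStackToOpenTable,
              "100 x 512 bytes fits in 80K");

// Later revision: depth 32, at most 1K per iteration.
static_assert(worst_case_stack(1024, 32) == 32 * 1024, "32 x 1K is 32K");
static_assert(worst_case_stack(1024, 32) < kMinStackToOpenTable,
              "32K fits in 80K");
```

All of these assertions hold at compile time: at depth 1000 the per-iteration budget is under 200 bytes on the assumed 32-bit stack, while the worst cases the commit messages allow for (100 x 512 bytes, then 32 x 1K) both stay below 80K.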
[15 Feb 2010 12:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100373

3099 Dmitry Lenev	2010-02-15
      Fix for bug #51093 "Crash (possibly stack overflow) in 
      MDL_lock::find_deadlock".
      
      On some platforms deadlock detector in metadata locking 
      subsystem under certain conditions might have exhausted
      stack space causing server crashes.
      
      Particularly this caused failures of rqg_mdl_stability
      test on Solaris in PushBuild.
      
      During a search for a deadlock, the MDL deadlock detector could
      sometimes encounter a loop in the waiters graph in which the
      MDL_context that started the search does not participate.
      In such a case our algorithm will continue looping, assuming
      that either this deadlock will be resolved by the MDL_context
      that created it (i.e. by one of the loop participants) or the
      maximum search depth will be reached.
      Since the maximum search depth was set to 1000, in the latter
      case we might have exhausted stack space on platforms where
      each iteration of the deadlock search algorithm needs more
      than DEFAULT_STACK_SIZE/1000 bytes of stack (around 192 bytes
      for 32-bit and around 256 bytes for 64-bit platforms).
      
      This patch solves this problem by reducing maximum search
      depth for MDL deadlock detector to 32. This should be safe
      at the moment as it is unlikely that each iteration of the 
      current deadlock detector algorithm will consume more than 
      1K of stack (thus total amount of stack required can't be
      more than 32K) and we require at least 80K of stack in order
      to open any table. Also, this value should (hopefully) be big
      enough not to cause too many false deadlock errors (there
      is anecdotal evidence that real-life deadlocks are
      typically shorter than that).
      
      Additional research should be conducted in the future to
      determine a more optimal value for the maximum search depth.
      
      This patch does not include test case as existing
      rqg_mdl_stability test can serve as one.
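
The failure mode described in these commit messages can be reproduced in miniature. The following is a simplified sketch, not the server's implementation: the WaitGraph type, the SearchResult values and the function names are hypothetical, and each context is modelled as a plain index with a list of the contexts it waits for. The search only recognises a cycle that passes through the context that started it, so a loop among other contexts is followed until the depth limit stops the recursion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: graph[c] lists the contexts that context c waits for.
// Every index stored in the graph must be smaller than graph.size().
using WaitGraph = std::vector<std::vector<std::size_t>>;

enum class SearchResult { NoDeadlock, Deadlock, DepthExceeded };

struct SearchOutcome {
  SearchResult result;
  unsigned deepest;  // greatest recursion depth reached during the search
};

// Depth-first walk. A deadlock is reported only when an edge leads back to
// `start`; any other loop is simply followed again and again.
static SearchResult visit(const WaitGraph& graph, std::size_t start,
                          std::size_t current, unsigned depth,
                          unsigned max_depth, unsigned& deepest) {
  if (depth > deepest) deepest = depth;
  // The depth limit is the only thing that ends a walk around a foreign loop.
  if (depth >= max_depth) return SearchResult::DepthExceeded;
  for (std::size_t next : graph[current]) {
    if (next == start) return SearchResult::Deadlock;
    SearchResult r = visit(graph, start, next, depth + 1, max_depth, deepest);
    if (r != SearchResult::NoDeadlock) return r;  // stop at the first finding
  }
  return SearchResult::NoDeadlock;
}

// Starts a search from `start` and reports how deep the recursion went.
SearchOutcome search_from(const WaitGraph& graph, std::size_t start,
                          unsigned max_depth) {
  assert(start < graph.size() && max_depth > 0);
  unsigned deepest = 0;
  SearchResult result = visit(graph, start, start, 0, max_depth, deepest);
  return SearchOutcome{result, deepest};
}
```

With the graph {{1}, {2}, {1}} (context 0 waits for 1, while 1 and 2 wait for each other), search_from(graph, 0, 32) returns DepthExceeded once the recursion reaches depth 32, and with a limit of 1000 it recurses to depth 1000, which is the behaviour the patch reins in. The mention of false deadlock errors in the commit message suggests that the server reports such a case as a deadlock; the sketch keeps it as a separate result for clarity.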
[15 Feb 2010 14:10] Dmitry Lenev
The fix for this bug was pushed into the mysql-next-4284 tree. Since the bug was not repeatable outside of this non-public tree, there is nothing to document, so I am simply closing it.

Please feel free to reopen it if the problem reoccurs!
[16 Feb 2010 9:26] John Embretsen
Fix looks good (read: issue not seen in Pushbuild) so far. Thanks for fixing so quickly!
[16 Feb 2010 16:50] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100216101445-2ofzkh48aq2e0e8o) (version source revid:alik@sun.com-20100215140849-b9fal65nwvrzczh4) (merge vers: 6.0.14-alpha) (pib:16)
[16 Feb 2010 16:59] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100216101208-33qkfwdr0tep3pf2) (version source revid:alik@sun.com-20100215140838-olj0kdt5rps9wgec) (pib:16)
[6 Mar 2010 11:08] Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:vvaintroub@mysql.com-20100216221947-luyhph0txl2c5tc8) (merge vers: 5.5.99-m3) (pib:16)
[7 Mar 2010 1:00] Paul DuBois
No changelog entry needed.