Bug #50998 Deadlock in MDL code during test rqg_mdl_stability
Submitted: 8 Feb 2010 15:00 Modified: 7 Mar 2010 1:01
Reporter: John Embretsen Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Locking Severity:S2 (Serious)
Version:mysql-next-4284 OS:Any
Assigned to: Dmitry Lenev CPU Architecture:Any
Tags: pushbuild, rqg_pb2, test failure

[8 Feb 2010 15:00] John Embretsen
Description:
The test 'rqg_mdl_stability' fails in Pushbuild (mysql-next-4284 branch) with what seems to be a deadlock between MDL threads.

The test, which is a concurrent (10 threads) Random Query Generator test, fails on both Linux, Solaris and Windows. Further analysis is required to determine if these failures are all the same or different issues.

Failure symptoms (linux, Feb 06):

(...)
# 13:17:20 10 stalled queries detected, declaring deadlock at DSN dbi:mysql:host=127.0.0.1:port=19300:user=root:database=test.
# 13:17:20 Executing SHOW PROCESSLIST:
(...)
# 13:17:45 #4  0x0844f02a in MDL_context::timed_wait (this=0xb237d08, timeout=1)
# 13:17:45     at mdl.cc:810
# 13:17:45 #5  0x084502e3 in MDL_context::acquire_lock_impl (this=0xb237d08, 
# 13:17:45     mdl_request=0xa6adeef8) at mdl.cc:1467
# 13:17:45 #6  0x0845057d in MDL_context::upgrade_shared_lock_to_exclusive (
# 13:17:45     this=0xb237d08, mdl_ticket=0xb2b33e8) at mdl.cc:1617
# 13:17:45 #7  0x082adeb3 in wait_while_table_is_used (thd=0xb237c98, table=0xa691f3d0, 
# 13:17:45     function=HA_EXTRA_FORCE_REOPEN) at sql_base.cc:2151
# 13:17:45 #8  0x0842bda7 in mysql_create_or_drop_trigger (thd=0xb237c98, 
# 13:17:45     tables=0xb241ad0, create=true) at sql_trigger.cc:477
# 13:17:45 #9  0x0825f384 in mysql_execute_command (thd=0xb237c98) at sql_parse.cc:4312
# 13:17:45 #10 0x0825fe61 in mysql_parse (thd=0xb237c98, 
# 13:17:45     inBuf=0xb241930 "CREATE TRIGGER trigger_1 AFTER INSERT ON table10_innodb FOR EACH ROW  UPDATE table_3 SET `col_int_key` = 1", length=106, 
# 13:17:45     found_semicolon=0xa6ae0258) at sql_parse.cc:5589
# 13:17:45 #11 0x08261186 in dispatch_command (command=COM_QUERY, thd=0xb237c98, 
# 13:17:45     packet=0xb239901 "  CREATE TRIGGER trigger_1 AFTER INSERT ON table10_innodb FOR EACH ROW  UPDATE table_3 SET `col_int_key` = 1", packet_length=108)
# 13:17:45     at sql_parse.cc:1023
(...)

# 13:17:45 #3  0x0844f0c8 in inline_mysql_cond_wait (that=0xb290384, mutex=0xb290340)
# 13:17:45     at ../include/mysql/psi/mysql_thread.h:784
# 13:17:45 #4  0x0844f14b in MDL_context::wait (this=0xb290308) at mdl.cc:786
# 13:17:45 #5  0x0844fe8f in MDL_context::wait_for_lock (this=0xb290308, 
# 13:17:45     mdl_request=0xb2ca9d4) at mdl.cc:1858
# 13:17:45 #6  0x082afd65 in Open_table_context::recover_from_failed_open (
# 13:17:45     this=0xa6a7cc10, thd=0xb290298, mdl_request=0xb2ca9d4, table=0xb2ca7e0)
# 13:17:45     at sql_base.cc:3846
# 13:17:45 #7  0x082b22fb in open_tables (thd=0xb290298, start=0xa6a7cca4, 
# 13:17:45     counter=0xa6a7cc8c, flags=0, prelocking_strategy=0xa6a7ccd0)
# 13:17:45     at sql_base.cc:4545
# 13:17:45 #8  0x082b2785 in open_and_lock_tables_derived (thd=0xb290298, 
# 13:17:45     tables=0xb2ca7e0, derived=true, flags=0, prelocking_strategy=0xa6a7ccd0)
# 13:17:45     at sql_base.cc:5116
# 13:17:45 #9  0x08263886 in open_and_lock_tables_derived (thd=0xb290298, 
# 13:17:45     tables=0xb2ca7e0, derived=true, flags=0) at mysql_priv.h:1572
# 13:17:45 #10 0x082638c4 in open_and_lock_tables (thd=0xb290298, tables=0xb2ca7e0)
# 13:17:45     at mysql_priv.h:1582
# 13:17:45 #11 0x08257254 in execute_sqlcom_select (thd=0xb290298, all_tables=0xb2ca7e0)
# 13:17:45     at sql_parse.cc:4521
# 13:17:45 #12 0x082583ee in mysql_execute_command (thd=0xb290298) at sql_parse.cc:2060
# 13:17:45 #13 0x0825fe61 in mysql_parse (thd=0xb290298, 
# 13:17:45     inBuf=0xb2c9988 "SELECT /* QUERY_ID: 10226944 */ /* RESULTSET_SAME_DATA_IN_EVERY_ROW */ func_1 ( 9 ) AS `col_int_key` FROM  table_2 WHERE func_2 ( 1 ) < func_2 ( `col_int_key` )", length=160, found_semicolon=0xa6a7e258)
# 13:17:45     at sql_parse.cc:5589
(...)

# 13:17:45 #2  0x084512be in MDL_context::stop_waiting (this=0xb33da90) at mdl.h:671
# 13:17:45 #3  0x0844feb0 in MDL_context::wait_for_lock (this=0xb33da90, 
# 13:17:45     mdl_request=0xb3479ac) at mdl.cc:1860
# 13:17:45 #4  0x082afd65 in Open_table_context::recover_from_failed_open (
# 13:17:45     this=0xa68fe080, thd=0xb33da20, mdl_request=0xb3479ac, table=0xb3477b8)
# 13:17:45     at sql_base.cc:3846
# 13:17:45 #5  0x082b22fb in open_tables (thd=0xb33da20, start=0xa68fe114, 
# 13:17:45     counter=0xa68fe0fc, flags=0, prelocking_strategy=0xa68fe140)
# 13:17:45     at sql_base.cc:4545
# 13:17:45 #6  0x082b2785 in open_and_lock_tables_derived (thd=0xb33da20, 
# 13:17:45     tables=0xb3477b8, derived=true, flags=0, prelocking_strategy=0xa68fe140)
# 13:17:45     at sql_base.cc:5116
# 13:17:45 #7  0x08263886 in open_and_lock_tables_derived (thd=0xb33da20, 
# 13:17:45     tables=0xb3477b8, derived=true, flags=0) at mysql_priv.h:1572
# 13:17:45 #8  0x082638c4 in open_and_lock_tables (thd=0xb33da20, tables=0xb3477b8)
# 13:17:45     at mysql_priv.h:1582
# 13:17:45 #9  0x0825ac11 in mysql_execute_command (thd=0xb33da20) at sql_parse.cc:3010
# 13:17:45 #10 0x0825fe61 in mysql_parse (thd=0xb33da20, 
# 13:17:45     inBuf=0xb3476b8 "INSERT INTO table11_innodb SELECT * FROM table11_innodb LIMIT 0", length=63, found_semicolon=0xa68ff258) at sql_parse.cc:5589
# 13:17:45 #11 0x08261186 in dispatch_command (command=COM_QUERY, thd=0xb33da20, 

etc.

Full stack trace of all threads will be attached.

How to repeat:
Refer to the top level of a set of binaries from bzr branch mysql-next-4284 as environment variable N4284.

Obtain a recent version of the Random Query Generator, e.g. by:
bzr branch lp:randgen

cd randgen

Run:

perl ./runall.pl \ 
--grammar=conf/metadata_stability.yy \ 
--gendata=conf/metadata_stability.zz \ 
--validator=SelectStability,QueryProperties \ 
--engine=Innodb \ 
--mysqld=--loose-innodb-lock-wait-timeout=5 \ 
--mysqld=--table-lock-wait-timeout=5 \ 
--mysqld=--loose-skip-safemalloc \ 
--mysqld=--innodb \ 
--mysqld=--default-storage-engine=Innodb \ 
--mysqld=--transaction-isolation=SERIALIZABLE \ 
--mysqld=--innodb-flush-log-at-trx-commit=2 \ 
--mysqld=--table-lock-wait-timeout=1 \ 
--mysqld=--innodb-lock-wait-timeout=1 \ 
--mysqld=--log-output=file \ 
--queries=1M \ 
--duration=600 \ 
--reporters=Deadlock,ErrorLog,Backtrace,Shutdown \ 
--basedir=$N4284

Reproduced against current mysql-next-4284 branch, revision kostja@sun.com-20100206103007-2te50xtpwd7o3spx.

Tested in Pushbuild using RQG framework revision bernt.johnsen@sun.com-20100203130304-fh8mvl2eta4gjau5.
[8 Feb 2010 15:07] John Embretsen
GDB backtrace from all threads, linux x86

Attachment: bug50998_stacktrace-all-threads_linux.txt (text/plain), 32.46 KiB.

[10 Feb 2010 14:19] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99830

3088 Dmitry Lenev	2010-02-10
      Fix for bug #50998 "Deadlock in MDL code during test 
      rqg_mdl_stability".
      
      When start of statement's waiting on a metadata lock 
      created more than one loop in waiters graph server might 
      have entered deadlock condition.
      
      The problem was that in the case described above MDL deadlock 
      detector had to perform several searches for deadlock but
      forgot to reset Deadlock_detection_context before performing 
      new search. 
      Failure to do so has broken assumption in code resposible for 
      choosing victim that if Deadlock_detection_context::victim
      is set we also have read lock on m_waiting_for_lock for this
      context. As result this lock could have been unlocked more
      times than it was acquired which corrupted rwlock's state
      which led to server deadlock.
      
      This fix ensures that such reset is done before each attempt
      to find a deadlock.
     @ mysql-test/r/mdl_sync.result
        Added test for bug #50998 "Deadlock in MDL code during test
        rqg_mdl_stability" as well as coverage for the case when
        addition of statement waiting for metadata lock adds several
        loops in the waiters graph and therefore several searches
        for deadlock should be performed by MDL deadlock detector.
     @ mysql-test/t/mdl_sync.test
        Added test for bug #50998 "Deadlock in MDL code during test
        rqg_mdl_stability" as well as coverage for the case when
        addition of statement waiting for metadata lock adds several
        loops in the waiters graph and therefore several searches
        for deadlock should be performed by MDL deadlock detector.
     @ sql/mdl.cc
        Ensure that in cases when MDL deadlock detector had to
        perform several searches for deadlock because several loops
        in waiters graph are possible we reset
        Deadlock_detection_context before performing each search.
        Failure to do so has broken assumption in code resposible
        for choosing victim that if Deadlock_detection_context::victim
        is set we also have read lock on m_waiting_for_lock for this
        context. As result this lock could have been unlocked more
        times than it was acquired which corrupted rwlock's state
        (no one was able to acquire write lock on it anymore).
[10 Feb 2010 15:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99842

3088 Dmitry Lenev	2010-02-10
      Fix for bug #50998 "Deadlock in MDL code during test 
      rqg_mdl_stability".
      
      When start of statement's waiting on a metadata lock 
      created more than one loop in waiters graph server might 
      have entered deadlock condition.
      
      The problem was that in the case described above MDL deadlock 
      detector had to perform several searches for deadlock but
      forgot to reset Deadlock_detection_context before performing 
      new search. 
      Failure to do so has broken assumption in code resposible for 
      choosing victim that if Deadlock_detection_context::victim
      is set we also have read lock on m_waiting_for_lock for this
      context. As result this lock could have been unlocked more
      times than it was acquired which corrupted rwlock's state
      which led to server deadlock.
      
      This fix ensures that such reset is done before each attempt
      to find a deadlock.
     @ mysql-test/r/mdl_sync.result
        Added test for bug #50998 "Deadlock in MDL code during test
        rqg_mdl_stability" as well as coverage for the case when
        addition of statement waiting for metadata lock adds several
        loops in the waiters graph and therefore several searches
        for deadlock should be performed by MDL deadlock detector.
     @ mysql-test/t/mdl_sync.test
        Added test for bug #50998 "Deadlock in MDL code during test
        rqg_mdl_stability" as well as coverage for the case when
        addition of statement waiting for metadata lock adds several
        loops in the waiters graph and therefore several searches
        for deadlock should be performed by MDL deadlock detector.
     @ sql/mdl.cc
        Ensure that in cases when MDL deadlock detector had to
        perform several searches for deadlock because several loops
        in waiters graph are possible we reset
        Deadlock_detection_context before performing each search.
        Failure to do so has broken assumption in code resposible
        for choosing victim that if Deadlock_detection_context::victim
        is set we also have read lock on m_waiting_for_lock for this
        context. As result this lock could have been unlocked more
        times than it was acquired which corrupted rwlock's state
        (no one was able to acquire write lock on it anymore).
[10 Feb 2010 15:49] Dmitry Lenev
Fix for this bug was pushed into mysql-next-4284 tree. Since this issue was not repeatable outside of this non-public-available tree there is nothing to document.
So I am simply closing this report.
[11 Feb 2010 14:11] John Embretsen
The test no longer deadlocks on Linux, so the it looks like the fix was good.

However, the same test still deadlocks on Windows, so that may be a different issue. I intend to file a new bug for that once tests have been run for today's pushes.

The test does not deadlock the same way on Solaris (SPARC), but instead crashes with what seems like some kind of infinite loop in MDL_Lock::find_deadlock(), see http://bugs.mysql.com/bug.php?id=51093 (reported today).
[11 Feb 2010 15:08] John Embretsen
Windows deadlock now reported as http://bugs.mysql.com/bug.php?id=51105.
[16 Feb 2010 16:46] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100216101445-2ofzkh48aq2e0e8o) (version source revid:jon.hauglid@sun.com-20100211140522-unpky24gmq8fkhhj) (merge vers: 6.0.14-alpha) (pib:16)
[16 Feb 2010 16:55] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100216101208-33qkfwdr0tep3pf2) (version source revid:dlenev@mysql.com-20100210154603-9hux05vnrgxonb9t) (pib:16)
[6 Mar 2010 11:00] Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:vvaintroub@mysql.com-20100216221947-luyhph0txl2c5tc8) (merge vers: 5.5.99-m3) (pib:16)
[7 Mar 2010 1:01] Paul DuBois
No changelog entry needed.