Bug #43529 Adding DBUG_SYNC_POINT cause DROP TABLE and KILL dead lock
Submitted: 10 Mar 2009 8:08 Modified: 3 Jun 2009 6:58
Reporter: Zhenxing He Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.0 OS:Any
Assigned to: Christoffer Hall CPU Architecture:Any
Tags: debug

[10 Mar 2009 8:08] Zhenxing He
Description:
When working on BUG#37145, a DBUG_SYNC_POINT is added to the constructor of Query_log_event, this works fine with other DDL statements, but may cause dead lock when a DROP TABLE is killed. This only happens on 5.0, there is no such problem for 5.1 and 6.0.

Here is the explanation of the scenario:

thd1:  DROP TABLE                                thd2:  KILL QUERY thd1
-------------------------------------------------------------------------------------
thd->current_mutex= &LOCK_open;
thd->current_cond= &COND_refresh;
.
.                                                thd->awake()
.                                                   pthread_mutex_lock(&mysys_var->mutex);
.                                                   ...
.                                                   pthread_mutex_lock(mysys_var->current_mutex);
. 
DBUG_SYNC_POINT();
   thd->current_mutex= &LOCK_user_locks;
   thd->current_cond= &ull->cond;
   pthread_cond_timedwait(&ull->cond, 
                          &LOCK_user_locks, &abstime);
   pthread_mutex_unlock(&LOCK_user_locks);
   pthread_mutex_lock(&thd->mysys_var->mutex);

And now, th1 is trying to lock thd->mysys_var->mutex, which is held by thd2, and thd2 is trying to lock LOCK_open, which is held by thd1.

How to repeat:
Apply the following patch to 5.0 code:

--- sql/log_event.cc	2009-03-05 10:10:44 +0000
+++ sql/log_event.cc	2009-03-09 05:20:24 +0000
@@ -1358,6 +1358,8 @@
 {
   time_t end_time;
 
+  DBUG_SYNC_POINT("debug_lock.before_query_log_event", 10);
+
   if (killed_status_arg == THD::KILLED_NO_VALUE)
     killed_status_arg= thd_arg->killed;
   error_code=

And then run the following test, it will cause dead lock sometimes:

---------------------------------------------------------------------
source include/master-slave.inc;

CREATE TABLE t1 (a int);

connection master;
SELECT GET_LOCK("debug_lock.before_query_log_event", 10);

connection master1;
let $connection_id= `SELECT CONNECTION_ID()`;
send DROP TABLE t1;

connection master;
eval kill query $connection_id;

SELECT RELEASE_LOCK("debug_lock.before_query_log_event");

connection master1;
reap;
-----------------------------------------------------------------------

Suggested fix:
By looking into the code, the problem is because in mysql_rm_table, it sets thd->current_mutex and thd->current_cond before calling mysql_rm_table_part2. I think one should only set these members right before the thread is about to wait for the condition, and reset it to 0 after been waken up. If some codes in between the set of these members and waiting for the condition also set these members and waiting (as DBUG_SYNC_POINT or other functions that calls thd->enter_cond() does), then KILL this thread will probably to cause a dead lock.

Move setting thd->current_mutex and thd->current_cond to 'remove_table_from_cache' (I am not sure if it is used in other places or not), right before start to waiting, and reset them to 0 after been waken up.
[10 Mar 2009 8:15] Zhenxing He
Hmm, rewrite the example, (hope it works this time):

thd1:  DROP TABLE                 thd2:  KILL QUERY thd1
------------------------------------------------------------------------
thd->current_mutex
   = &LOCK_open;
thd->current_cond
   = &COND_refresh;
...
...                                thd->awake()
...                                  lock(&mysys_var->mutex);
...                                  ...
...                                  lock(mysys_var->current_mutex);
...
DBUG_SYNC_POINT();
   thd->current_mutex
          = &LOCK_user_locks;
   thd->current_cond
          = &ull->cond;
   cond_wait(&ull->cond, 
             &LOCK_user_locks);
   unlock(&LOCK_user_locks);
   lock(&thd->mysys_var->mutex);
[10 Mar 2009 9:45] Sveta Smirnova
Thank you for the report.

Verified as described.
[3 Jun 2009 7:02] Christoffer Hall
After discussion with Davi and Zhenxing He we have agreed not to fix it in fear of breaking something unnecessarily. The tests should work anyway.