Bug #10600 Locking (or scheduling ?) unreliable / Shutdown without reason ?
Submitted: 12 May 2005 14:26 Modified: 26 Jul 2005 5:24
Reporter: Joerg Bruehe Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.12, 5.0 OS:Other (IRIX (32 + 64 bit), Intel 64)
Assigned to: Mikael Ronström

[12 May 2005 14:26] Joerg Bruehe
Description:
Build of release 4.1.12, based on ChangeSet
  1.2261 05/05/10 14:44:03 lenz@mysql.com +2 -0
  Applied two patches already commited to the main tree to resolve bugs found during the build:
   - removed an "#error PRAGMA" from sql/item_strfunc.h (msvensson)
   - Fixed BUG#10070 (range test failed if InnoDB is not available) (sergefp)

Test "lock_multi" fails in 32 bit ("pro", PS protocol) and 64 bit ("standard", default mode) with the same symptom. (The other 22 runs of this test pass.) Failure message:

lock_multi                     [ fail ]

Errors are (from /usr/people/mysqldev/octane2-64bit/test/mysql-standard-4.1.12-sgi-irix6.5-mips-64bit/mysql-test/var/log/mysqltest-time) :
/usr/people/mysqldev/octane2-64bit/test/mysql-standard-4.1.12-sgi-irix6.5-mips-64bit/bin/mysqltest: At line 91: query 'drop table t2' failed: 1053: Server shutdown in progress
(the last lines may be the most important ones)

How to repeat:
Try the test (several times ?).
[20 May 2005 8:00] Michael Widenius
On 5.0 I have several times seen lock_multi hang and never complete on Intel 64 bit
(sorry, haven't had time to see if I could repeat this somehow)
[20 May 2005 10:03] Antony Curtis
Deadlock hang can occur in FreeBSD 5.4 when using the libpthread library. Unable to reproduce it when using libthr or libc_r. Tail of log file would look like:

050520 12:55:19      18 Query       lock table t1 write, t2 write
                     19 Query       insert t1 select * from t2
                     18 Query       drop table t2

Forcing mysqld to dump core and running gdb reveals:

(gdb) thread 2
[Switching to thread 2 (Thread 2 (LWP 100173))]#0  0x2847f2fb in pthread_testcancel () from /usr/lib/libpthread.so.1
(gdb) bt
#0  0x2847f2fb in pthread_testcancel () from /usr/lib/libpthread.so.1
#1  0x284777fb in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#2  0x284771ca in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#3  0x28477169 in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#4  0x2847cbc9 in _pthread_cond_wait () from /usr/lib/libpthread.so.1
#5  0x2847cdee in pthread_cond_wait () from /usr/lib/libpthread.so.1
#6  0x083250ec in safe_cond_wait (cond=0x17f, mp=0xa49a25c,
    file=0x83e28d8 "thr_lock.c", line=389) at thr_mutex.c:202
#7  0x08322e95 in wait_for_lock (wait=0xa49a284, data=0xa4f1dec,
    in_wait_list=0 '\0') at thr_lock.c:389
#8  0x083235eb in thr_lock (data=0xa4f1dec,
    lock_type=TL_WRITE_CONCURRENT_INSERT) at thr_lock.c:613
#9  0x083240d5 in thr_multi_lock (data=0xa48e668, count=2) at thr_lock.c:853
#10 0x080c1c74 in mysql_lock_tables (thd=0xa4d9018, tables=0xa4f4400, count=2,
    ignore_global_read_lock=false) at lock.cc:126
#11 0x081076e5 in lock_tables (thd=0xa4d9018, tables=0xa4f4098, count=383)
    at sql_base.cc:2240
#12 0x08107285 in open_and_lock_tables (thd=0xa4d9018, tables=0xa4f4098)
    at sql_base.cc:2082
#13 0x080e0086 in mysql_execute_command (thd=0xa4d9018) at sql_parse.cc:3174
#14 0x080e5e99 in mysql_parse (thd=0xa4d9018,
    inBuf=0xa4f4028 "insert t1 select * from t2", length=172855384)
    at sql_parse.cc:5252
#15 0x080dbe22 in dispatch_command (command=COM_QUERY, thd=0xa4d9018,
    packet=0xa4e7019 " insert t1 select * from t2", packet_length=28)
    at sql_parse.cc:1651
#16 0x080db6b1 in do_command (thd=0xa4d9018) at sql_parse.cc:1454
#17 0x080da935 in handle_one_connection (arg=0x17f) at sql_parse.cc:1114
#18 0x28477902 in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#19 0x2848276c in _thread_state_running () from /usr/lib/libpthread.so.1
(gdb) thread 3
[Switching to thread 3 (Thread 3 (LWP 100177))]#0  0x2847f2fb in pthread_testcancel () from /usr/lib/libpthread.so.1
(gdb) bt
#0  0x2847f2fb in pthread_testcancel () from /usr/lib/libpthread.so.1
#1  0x284777fb in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#2  0x284771ca in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#3  0x28477169 in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#4  0x2847cbc9 in _pthread_cond_wait () from /usr/lib/libpthread.so.1
#5  0x2847cdee in pthread_cond_wait () from /usr/lib/libpthread.so.1
#6  0x083250ec in safe_cond_wait (cond=0x17f, mp=0x84f463c,
    file=0x83ad1b8 "sql_table.cc", line=215) at thr_mutex.c:202
#7  0x0819ba1d in mysql_rm_table_part2 (thd=0xa4b0018, tables=0xa4d4090,
    if_exists=false, drop_temporary=false, drop_view=false,
    dont_log_query=false) at sql_table.cc:215
#8  0x0819b635 in mysql_rm_table (thd=0xa4b0018, tables=0xa4d4090,
    if_exists=0 '\0', drop_temporary=0 '\0') at sql_table.cc:88
#9  0x080e0630 in mysql_execute_command (thd=0xa4b0018) at sql_parse.cc:3297
#10 0x080e5e99 in mysql_parse (thd=0xa4b0018, inBuf=0xa4d4028 "drop table t2",
    length=172687448) at sql_parse.cc:5252
#11 0x080dbe22 in dispatch_command (command=COM_QUERY, thd=0xa4b0018,
    packet=0xa4c9019 "drop table t2", packet_length=14) at sql_parse.cc:1651
#12 0x080db6b1 in do_command (thd=0xa4b0018) at sql_parse.cc:1454
#13 0x080da935 in handle_one_connection (arg=0x17f) at sql_parse.cc:1114
#14 0x28477902 in pthread_mutexattr_init () from /usr/lib/libpthread.so.1
#15 0x2848276c in _thread_state_running () from /usr/lib/libpthread.so.1
[25 May 2005 15:04] Ingo Strüwing
I was not able to reproduce it on IRIX with self-compiled binaries, neither with octane2-4.1-debug, nor with octane2-64bit-4.1-debug. However kill.test failed with wrong results everytime and needed to be disabled. Will now try on BSD.
[26 May 2005 7:06] Ingo Strüwing
I was not able to reproduce it on FreeBSD with self-compiled binaries, neither with bsd53-4.1-debug, nor with an additional --with-pthread. On this machine, kill.test passed.
[26 May 2005 13:20] Ingo Strüwing
I was not able to reproduce it on Linux/IA64 with self-compiled binaries, neither
with quadita2-glibc23-4.1-debug, nor with quadita2-icc-glibc23-4.1-debug.
On this machine, kill.test passed too.
[22 Jun 2005 19:49] Igor Ivanov
hello

I can reproduce this bug in libpthread from different programms.
It happend ONLY during very intensive work with pthreads and
not depend directly on mysql.

Final stability of mysql with bug in this lib is not best - it is like 
toy  now and can be crashed very well under extremal load with
backtraces that ends inside pthread_testcancel/libpthread.

I think nobody will reproduce it on something like single sql requests.
Must be done some high load (like sql crash test) to reproduce 
this bug.

Crash in pthread_testcancel () inside libpthread is very usual story...

If anybody can help me to solve this bug in library - I can
join this task and create crashes and backtraces for him in
fast way (just run any programm with intensive thread create
and deletion)

Igor
[30 Jun 2005 19:07] Ingo Strüwing
My test script.

Attachment: bug10600-1.sh (application/x-sh, text), 3.90 KiB.

[30 Jun 2005 19:56] Ingo Strüwing
I tried again and succeded in repeating this on Linux 2.4.27 with the attached shell script (normally I work on linux 2.6 only).

I inserted sleeps and DBUG_PRINTs so that I forced a certain call pattern for the two threads. The result is that the pattern works thousands of times until it fails. The problem seems to be that suddenly the attempt to read lock the write locked 't2' leads to a 'wait_for_lock' inside of 'thr_lock' instead of a rejection, which would lead to a 'wait_for_tables' inside of 'mysql_lock_tables'. The latter pattern is the correct one as the tables gets closed so that the drop can succeed.

I did not yet have enough time to track down what happens inside 'thr_lock', but it might have something to do with the pthreads problem mentioned by Igor. It would be nice to have one of his test programs, though I can't promise that I will try to fix the pthreads library. At least it would be nice to check if it behaves better with Linux 2.6 than with 2.4.
[2 Jul 2005 1:51] Mikael Ronström
Analysis of code around the bug

Attachment: docs.txt (text/plain), 21.88 KiB.

[11 Jul 2005 22:38] Mikael Ronström
Also managed to reproduce it on 4-way Linux box by running the part where it sometimes
hangs in a loop for a couple of thousand times.
So fix should hopefully be close now.
[12 Jul 2005 16:00] Mikael Ronström
Thx to patch by Ingo it was really easy to reproduce it and I now have it even nailed
in a debugger and can start looking at what has happened that caused the threads to hang.
[13 Jul 2005 13:32] Mikael Ronström
I think I know the problem and the solution as well now (however tests to verify this still
remains).

Assume t2 is the table to be dropped and t1 the other table.
The problem is that remove_table_from_cache has found that another thread has the table t2
to be dropped open. Before calling the remove_table_from_cache the write lock was converted
to a WRITE_ONLY that aborts also future lock requests.

However our problem is that we also hold a lock on another table t1 through a lock table request.
This table doesn't get its lock converted to WRITE_ONLY. Thus if another thread has opened
the tables but not yet started locking them and starts by locking the other table t1 it will block on
this request waiting for the other thread to wake it up whilst the other thread hasn't even started
locking yet, when it comes to the locking part it will try to lock the other table t1 and will block on
this since the other thread holds the lock and will not signal anything and
boom we have an eternal wait with two threads both waiting for each other (it is not really a
deadlock, it is more an assumption that the other will wake me up when he hasn't gotten any
such message).

The probably best fix is for remove_table_from_cache to only report back false in the case when
it actually has succeeded in waking some other thread from sleep. The other possible solution
is to convert all locks to WRITE_ONLY but this isn't very safe since it relies too much on the
other thread and will fail if the other lock is a READ lock.
[18 Jul 2005 21:17] Mikael Ronström
Problem analysed, fixed, patched and now ready for review
Bug goes all the way from 4.0 and up
[21 Jul 2005 0:51] Mikael Ronström
Has been pushed to 4.0, 4.1, 5.0 and soon to 5.1 as well
[21 Jul 2005 7:58] Mikael Ronström
The versions for it to be included in is
4.0.26
4.1.14
5.0.10
5.1.1
[21 Jul 2005 8:02] Mikael Ronström
Attempt on a short description:

When two threads compete for the same table there could occur a deadlock if one thread has
also a lock on another table through LOCK TABLE and the thread is attempting to remove the
table in some manner and the other thread want locks on both tables.
[21 Jul 2005 10:59] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/27418
[21 Jul 2005 10:59] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/27419
[26 Jul 2005 5:24] Paul Dubois
Noted in 4.0.26, 4.1.14, 5.0.10, 5.1.1 changelogs.

Mikael, thanks for the description of the problem.
Very helpful.