Bug #20850 Assert during slave shutdown in many rpl_* tests
Submitted: 4 Jul 2006 15:48 Modified: 28 Jul 2006 6:11
Reporter: Kristian Nielsen
Status: Closed
Category:Server: Replication Severity:S3 (Non-critical)
Version:5.0.24, 5.1.12 OS:Linux (Linux)
Assigned to: Target Version:

[4 Jul 2006 15:48] Kristian Nielsen
Description:
Occasionally, during shutdown the slave crashes in rpl_* tests.

The crash is an assert (actually fprintf()+abort()) like this:

safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 3535 at 
slave.h, line 209
mysqld got signal 6;

Here is a stack trace obtained with gdb:

#0  0xffffe410 in ?? ()
#1  0x40700434 in ?? ()
#2  0x00000006 in ?? ()
#3  0x083b0509 in write_core (sig=24205) at stacktrace.c:220
#4  0x08217650 in handle_segfault (sig=6) at mysqld.cc:2175
#5  <signal handler called>
#6  0xffffe410 in ?? ()
#7  0x4070076c in ?? ()
#8  0x00000006 in ?? ()
#9  0x400e6ce9 in abort () from /lib/tls/libc.so.6
#10 0x0866e589 in safe_mutex_destroy (mp=0x8c6d520, 
    file=0x5f4b <Address 0x5f4b out of bounds>, line=0) at thr_mutex.c:281
#11 0x083a16ce in end_slave () at slave.h:209
#12 0x0821a0b3 in kill_server_thread (arg=0x4063c9cc) at mysqld.cc:887
#13 0x40047a13 in start_thread () from /lib/tls/libpthread.so.0
#14 0x401759da in clone () from /lib/tls/libc.so.6

The mutex that is destroyed while still locked is this one:

        pthread_mutex_destroy(&run_lock);

How to repeat:
The failures appear quite random. However, I was able to reproduce reliably by running
two tests in a loop:

(cd mysql-test && for i in `seq 1 1000`; do echo XXX $i XXX; MTR_BUILD_THREAD=4 perl
mysql-test-run.pl --vardir=/dev/shm/v4 --tmpdir=/dev/shm/t4 --timer --ps-protocol
--mysqld=--binlog-format=row rpl_loaddatalocal rpl_row_stop_middle_update || exit 1;
done)

Had to let it run for quite a while though, failed on iteration 36 in one attempt, for
example.
[5 Jul 2006 8:58] Greg Lehey
Running the test case on my machine (Fedora Core 5/i386) works fine.  Will investigate on
other platforms if necessary.
[6 Jul 2006 3:46] Greg Lehey
I have now run the test over 6000 times, about equally on i386/Fedora Core 5 and
amd64/Fedora <mumble>.  In each case, the software was compiled with compile-*-debug-max.
 I shall now test it on the machine where it originally failed.
[6 Jul 2006 12:02] Kristian Nielsen
I found a way to reproduce it easily and reliably, with this patch:

===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc	2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc	2006-07-06 11:55:33 +02:00
@@ -3550,6 +3550,7 @@
   delete thd;
   pthread_mutex_unlock(&LOCK_thread_count);
   pthread_cond_broadcast(&mi->stop_cond);	// tell the world we are done
+sleep(5);
   pthread_mutex_unlock(&mi->run_lock);
   my_thread_end();
   pthread_exit(0);

So if thread scheduling happens at exactly the spot where the sleep(5) is inserted above,
the other thread may attempt to destroy the mi->run_lock mutex before it is unlocked.

This also immediately suggests the following fix:

===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc	2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc	2006-07-06 11:58:02 +02:00
@@ -866,6 +866,8 @@
     */
     terminate_slave_threads(active_mi,SLAVE_FORCE_ALL);
     end_master_info(active_mi);
+    pthread_mutex_lock(&active_mi->run_lock);
+    pthread_mutex_unlock(&active_mi->run_lock);
     delete active_mi;
     active_mi= 0;
   }

This makes sure that we do not destroy the mutex before the other thread has had the time
to unlock it after signalling "we are done". But I am not sure if that is the correct way
to fix it.
[7 Jul 2006 9:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8889
[7 Jul 2006 9:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8890
[7 Jul 2006 9:56] Greg Lehey
The problem was caused by a race condition at the end of
handle_slave_io(), the slave I/O thread: on termination, it was
clearing the valieabls slave_running and io_thd before final
housekeeping, opening the possibility of a race condition with the
server termination code (end_slave(), called from close_connection()).

Fixed by moving the code to the very end of the thread.
[7 Jul 2006 18:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8920
[25 Jul 2006 6:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9528
[25 Jul 2006 6:23] Greg Lehey
By oversight the fix to this bug was only committed to the 5.1 tree.  This patch is for
the 5.0(.25) tree.
[25 Jul 2006 12:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9532
[25 Jul 2006 13:03] Joerg Bruehe
The patch has been pushed to the build clone of 5.0.24.
[28 Jul 2006 6:11] Paul DuBois
Noted in 5.0.24, 5.1.12 changelogs.

A race condition during slave server shutdown caused an assert failure.