MySQL Bugs: #20850: Assert during slave shutdown in many rpl

Bug #20850	Assert during slave shutdown in many rpl_* tests
Submitted:	4 Jul 2006 13:48	Modified:	28 Jul 2006 4:11
Reporter:	Kristian Nielsen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.0.24, 5.1.12	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any

Description:
Occasionally, during shutdown the slave crashes in rpl_* tests.

The crash is an assert (actually fprintf()+abort()) like this:

safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 3535 at 
slave.h, line 209
mysqld got signal 6;

Here is a stack trace obtained with gdb:

#0  0xffffe410 in ?? ()
#1  0x40700434 in ?? ()
#2  0x00000006 in ?? ()
#3  0x083b0509 in write_core (sig=24205) at stacktrace.c:220
#4  0x08217650 in handle_segfault (sig=6) at mysqld.cc:2175
#5  <signal handler called>
#6  0xffffe410 in ?? ()
#7  0x4070076c in ?? ()
#8  0x00000006 in ?? ()
#9  0x400e6ce9 in abort () from /lib/tls/libc.so.6
#10 0x0866e589 in safe_mutex_destroy (mp=0x8c6d520, 
    file=0x5f4b <Address 0x5f4b out of bounds>, line=0) at thr_mutex.c:281
#11 0x083a16ce in end_slave () at slave.h:209
#12 0x0821a0b3 in kill_server_thread (arg=0x4063c9cc) at mysqld.cc:887
#13 0x40047a13 in start_thread () from /lib/tls/libpthread.so.0
#14 0x401759da in clone () from /lib/tls/libc.so.6

The mutex that is destroyed while still locked is this one:

        pthread_mutex_destroy(&run_lock);

How to repeat:
The failures appear quite random. However, I was able to reproduce reliably by running two tests in a loop:

(cd mysql-test && for i in `seq 1 1000`; do echo XXX $i XXX; MTR_BUILD_THREAD=4 perl mysql-test-run.pl --vardir=/dev/shm/v4 --tmpdir=/dev/shm/t4 --timer --ps-protocol --mysqld=--binlog-format=row rpl_loaddatalocal rpl_row_stop_middle_update || exit 1; done)

Had to let it run for quite a while though, failed on iteration 36 in one attempt, for example.

Running the test case on my machine (Fedora Core 5/i386) works fine.  Will investigate on other platforms if necessary.

I have now run the test over 6000 times, about equally on i386/Fedora Core 5 and amd64/Fedora <mumble>.  In each case, the software was compiled with compile-*-debug-max.  I shall now test it on the machine where it originally failed.

I found a way to reproduce it easily and reliably, with this patch:

===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc	2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc	2006-07-06 11:55:33 +02:00
@@ -3550,6 +3550,7 @@
   delete thd;
   pthread_mutex_unlock(&LOCK_thread_count);
   pthread_cond_broadcast(&mi->stop_cond);	// tell the world we are done
+sleep(5);
   pthread_mutex_unlock(&mi->run_lock);
   my_thread_end();
   pthread_exit(0);

So if thread scheduling happens at exactly the spot where the sleep(5) is inserted above, the other thread may attempt to destroy the mi->run_lock mutex before it is unlocked.

This also immediately suggests the following fix:

===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc	2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc	2006-07-06 11:58:02 +02:00
@@ -866,6 +866,8 @@
     */
     terminate_slave_threads(active_mi,SLAVE_FORCE_ALL);
     end_master_info(active_mi);
+    pthread_mutex_lock(&active_mi->run_lock);
+    pthread_mutex_unlock(&active_mi->run_lock);
     delete active_mi;
     active_mi= 0;
   }

This makes sure that we do not destroy the mutex before the other thread has had the time to unlock it after signalling "we are done". But I am not sure if that is the correct way to fix it.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8889

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8890

The problem was caused by a race condition at the end of
handle_slave_io(), the slave I/O thread: on termination, it was
clearing the valieabls slave_running and io_thd before final
housekeeping, opening the possibility of a race condition with the
server termination code (end_slave(), called from close_connection()).

Fixed by moving the code to the very end of the thread.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8920

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9528

By oversight the fix to this bug was only committed to the 5.1 tree.  This patch is for the 5.0(.25) tree.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9532

The patch has been pushed to the build clone of 5.0.24.

Noted in 5.0.24, 5.1.12 changelogs.

A race condition during slave server shutdown caused an assert failure.