| Bug #20850 | Assert during slave shutdown in many rpl_* tests | ||
|---|---|---|---|
| Submitted: | 4 Jul 2006 13:48 | Modified: | 28 Jul 2006 4:11 |
| Reporter: | Kristian Nielsen | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
| Version: | 5.0.24, 5.1.12 | OS: | Linux (Linux) |
| Assigned to: | CPU Architecture: | Any | |
[5 Jul 2006 6:58]
Greg Lehey
Running the test case on my machine (Fedora Core 5/i386) works fine. Will investigate on other platforms if necessary.
[6 Jul 2006 1:46]
Greg Lehey
I have now run the test over 6000 times, about equally on i386/Fedora Core 5 and amd64/Fedora <mumble>. In each case, the software was compiled with compile-*-debug-max. I shall now test it on the machine where it originally failed.
[6 Jul 2006 10:02]
Kristian Nielsen
I found a way to reproduce it easily and reliably, with this patch:
===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc 2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc 2006-07-06 11:55:33 +02:00
@@ -3550,6 +3550,7 @@
delete thd;
pthread_mutex_unlock(&LOCK_thread_count);
pthread_cond_broadcast(&mi->stop_cond); // tell the world we are done
+sleep(5);
pthread_mutex_unlock(&mi->run_lock);
my_thread_end();
pthread_exit(0);
So if thread scheduling happens at exactly the spot where the sleep(5) is inserted above, the other thread may attempt to destroy the mi->run_lock mutex before it is unlocked.
This also immediately suggests the following fix:
===== sql/slave.cc 1.275 vs edited =====
--- 1.275/sql/slave.cc 2006-06-30 14:08:17 +02:00
+++ edited/sql/slave.cc 2006-07-06 11:58:02 +02:00
@@ -866,6 +866,8 @@
*/
terminate_slave_threads(active_mi,SLAVE_FORCE_ALL);
end_master_info(active_mi);
+ pthread_mutex_lock(&active_mi->run_lock);
+ pthread_mutex_unlock(&active_mi->run_lock);
delete active_mi;
active_mi= 0;
}
This makes sure that we do not destroy the mutex before the other thread has had the time to unlock it after signalling "we are done". But I am not sure if that is the correct way to fix it.
[7 Jul 2006 7:41]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8889
[7 Jul 2006 7:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8890
[7 Jul 2006 7:56]
Greg Lehey
The problem was caused by a race condition at the end of handle_slave_io(), the slave I/O thread: on termination, it was clearing the valieabls slave_running and io_thd before final housekeeping, opening the possibility of a race condition with the server termination code (end_slave(), called from close_connection()). Fixed by moving the code to the very end of the thread.
[7 Jul 2006 16:31]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8920
[25 Jul 2006 4:18]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/9528
[25 Jul 2006 4:23]
Greg Lehey
By oversight the fix to this bug was only committed to the 5.1 tree. This patch is for the 5.0(.25) tree.
[25 Jul 2006 10:30]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/9532
[25 Jul 2006 11:03]
Joerg Bruehe
The patch has been pushed to the build clone of 5.0.24.
[28 Jul 2006 4:11]
Paul DuBois
Noted in 5.0.24, 5.1.12 changelogs. A race condition during slave server shutdown caused an assert failure.

Description: Occasionally, during shutdown the slave crashes in rpl_* tests. The crash is an assert (actually fprintf()+abort()) like this: safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 3535 at slave.h, line 209 mysqld got signal 6; Here is a stack trace obtained with gdb: #0 0xffffe410 in ?? () #1 0x40700434 in ?? () #2 0x00000006 in ?? () #3 0x083b0509 in write_core (sig=24205) at stacktrace.c:220 #4 0x08217650 in handle_segfault (sig=6) at mysqld.cc:2175 #5 <signal handler called> #6 0xffffe410 in ?? () #7 0x4070076c in ?? () #8 0x00000006 in ?? () #9 0x400e6ce9 in abort () from /lib/tls/libc.so.6 #10 0x0866e589 in safe_mutex_destroy (mp=0x8c6d520, file=0x5f4b <Address 0x5f4b out of bounds>, line=0) at thr_mutex.c:281 #11 0x083a16ce in end_slave () at slave.h:209 #12 0x0821a0b3 in kill_server_thread (arg=0x4063c9cc) at mysqld.cc:887 #13 0x40047a13 in start_thread () from /lib/tls/libpthread.so.0 #14 0x401759da in clone () from /lib/tls/libc.so.6 The mutex that is destroyed while still locked is this one: pthread_mutex_destroy(&run_lock); How to repeat: The failures appear quite random. However, I was able to reproduce reliably by running two tests in a loop: (cd mysql-test && for i in `seq 1 1000`; do echo XXX $i XXX; MTR_BUILD_THREAD=4 perl mysql-test-run.pl --vardir=/dev/shm/v4 --tmpdir=/dev/shm/t4 --timer --ps-protocol --mysqld=--binlog-format=row rpl_loaddatalocal rpl_row_stop_middle_update || exit 1; done) Had to let it run for quite a while though, failed on iteration 36 in one attempt, for example.