Bug #20850 | Assert during slave shutdown in many rpl_* tests | ||
---|---|---|---|
Submitted: | 4 Jul 2006 13:48 | Modified: | 28 Jul 2006 4:11 |
Reporter: | Kristian Nielsen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.0.24, 5.1.12 | OS: | Linux (Linux) |
Assigned to: | CPU Architecture: | Any |
[4 Jul 2006 13:48]
Kristian Nielsen
[5 Jul 2006 6:58]
Greg Lehey
Running the test case on my machine (Fedora Core 5/i386) works fine. Will investigate on other platforms if necessary.
[6 Jul 2006 1:46]
Greg Lehey
I have now run the test over 6000 times, about equally on i386/Fedora Core 5 and amd64/Fedora <mumble>. In each case, the software was compiled with compile-*-debug-max. I shall now test it on the machine where it originally failed.
[6 Jul 2006 10:02]
Kristian Nielsen
I found a way to reproduce it easily and reliably, with this patch: ===== sql/slave.cc 1.275 vs edited ===== --- 1.275/sql/slave.cc 2006-06-30 14:08:17 +02:00 +++ edited/sql/slave.cc 2006-07-06 11:55:33 +02:00 @@ -3550,6 +3550,7 @@ delete thd; pthread_mutex_unlock(&LOCK_thread_count); pthread_cond_broadcast(&mi->stop_cond); // tell the world we are done +sleep(5); pthread_mutex_unlock(&mi->run_lock); my_thread_end(); pthread_exit(0); So if thread scheduling happens at exactly the spot where the sleep(5) is inserted above, the other thread may attempt to destroy the mi->run_lock mutex before it is unlocked. This also immediately suggests the following fix: ===== sql/slave.cc 1.275 vs edited ===== --- 1.275/sql/slave.cc 2006-06-30 14:08:17 +02:00 +++ edited/sql/slave.cc 2006-07-06 11:58:02 +02:00 @@ -866,6 +866,8 @@ */ terminate_slave_threads(active_mi,SLAVE_FORCE_ALL); end_master_info(active_mi); + pthread_mutex_lock(&active_mi->run_lock); + pthread_mutex_unlock(&active_mi->run_lock); delete active_mi; active_mi= 0; } This makes sure that we do not destroy the mutex before the other thread has had the time to unlock it after signalling "we are done". But I am not sure if that is the correct way to fix it.
[7 Jul 2006 7:41]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8889
[7 Jul 2006 7:49]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8890
[7 Jul 2006 7:56]
Greg Lehey
The problem was caused by a race condition at the end of handle_slave_io(), the slave I/O thread: on termination, it was clearing the valieabls slave_running and io_thd before final housekeeping, opening the possibility of a race condition with the server termination code (end_slave(), called from close_connection()). Fixed by moving the code to the very end of the thread.
[7 Jul 2006 16:31]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/8920
[25 Jul 2006 4:18]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/9528
[25 Jul 2006 4:23]
Greg Lehey
By oversight the fix to this bug was only committed to the 5.1 tree. This patch is for the 5.0(.25) tree.
[25 Jul 2006 10:30]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/9532
[25 Jul 2006 11:03]
Joerg Bruehe
The patch has been pushed to the build clone of 5.0.24.
[28 Jul 2006 4:11]
Paul DuBois
Noted in 5.0.24, 5.1.12 changelogs. A race condition during slave server shutdown caused an assert failure.