Bug #38716 | slave crashed after 'stop slave' during concurrent stop/start/reset slave | ||
---|---|---|---|
Submitted: | 11 Aug 2008 11:08 | Modified: | 13 May 2009 12:34 |
Reporter: | Shane Bester (Platinum Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
Version: | 5.0.66a-debug, 5.1.26-debug | OS: | Any |
Assigned to: | Andrei Elkin | CPU Architecture: | Any |
[11 Aug 2008 11:08]
Shane Bester
[11 Aug 2008 11:11]
MySQL Verification Team
full stack trace from 5.0.66a
Attachment: bug38716_some_info.txt (text/plain), 1.17 KiB.
[20 Aug 2008 2:58]
MySQL Verification Team
to repeat: -------------- setup a self-replicating debug master: mysqld-debug --console --skip-grant-tables --server-id=5 --log-bin --port=3306 --replicate-same-server-id --slave-skip-errors=1050 --skip-innodb --skip-name-resolve --slave-net-timeout=1 --net-read-timeout=1 --net-write-timeout=1 change master to master_host='127.0.0.1', master_port=3306, master_user='root', master_password=''; start slave; run the attached bug38716.c against the server. if it hangs, kill server and start again (although the hang could be considered part of the bug too)
[20 Aug 2008 2:58]
MySQL Verification Team
testcase that will expose a few different assertions if run as described above.
Attachment: bug38716.c (text/plain), 7.43 KiB.
[20 Aug 2008 3:15]
MySQL Verification Team
infinite hangs may also be seen instead of crashes.
[21 Aug 2008 10:55]
Lars Thalmann
Shane Bester wrote: > The following may very well all be part of the same overall bug: > > Bug#38240 > Bug#38715 > Bug#38716 > > The above 3 all relate to various locking/mutex issues in the > replication code during concurrent flush logs/stop/start > slave/reset slave. > > A reason for different bug reports was slightly different crashes in > each case. I'm guessing the devs will eventually set them to > duplicates and fix everything in one go.
[14 Apr 2009 1:30]
rancpine cui
hi, Does Bug #31374 relate to this bug too?
[15 Apr 2009 17:40]
Andrei Elkin
Rancpine, thanks for the suggestion. Indeed Bug #31374 failure description fits to this bug. bug38716.c triggers an assert in safe_mutex_lock () called from terminate_slave_thread(). The happens on the 6.0 server as well.
[16 Apr 2009 15:11]
Andrei Elkin
thd == 0 could not be caught with my tests on 6.0 version. There have been two issues found: 1. terminate_slave_thread tried to acquire run_lock mutex for the 2nd time when start_slave_threads() executed if (error) terminate_slave_threads(mi, thread_mask & SLAVE_IO, 1); Error must have caused by killing (bug38716.c). In this case the stack is close to the reported but it contains start_slave() 2. a deadlock reported as a separate Bug #44312 It's not clear if it was really thd == 0 in the original report, as it's not explained a method to find out the value. If that was an assert crash we should think on other reason. I examined 5.0.66 code to find out that slave.cc:710 is IF_DBUG(int err= ) pthread_kill(thd->real_id, thr_client_alarm); and it follows with DBUG_ASSERT(err != EINVAL);
[17 Apr 2009 10:21]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72381 2842 Andrei Elkin 2009-04-17 Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset Bug #44312 deadlock between IO thread and SLAVE START the issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly different form. terminate_slave_threads() should not acquire run_lock when it's called from or it will face an assert in terminate_slave_thread() because of the term_lock. No other crashes has been found using the regression test program. Another issues is a deadlock as described separately in Bug #44312. It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in reverse order. Fixed: terminate_slave_threads() does not request locking in start_slave_threads(); start_slave() grabs run_lock:s for lesser time that avoids the deadlock with IO thread.
[24 Apr 2009 16:22]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72790 2842 Andrei Elkin 2009-04-24 Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset Bug #44312 deadlock between IO thread and SLAVE START the issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly different form. terminate_slave_threads() should not acquire run_lock when it's called from start_slave_threads(); that leads to an assertion on run_lock mutex. No other crashes has been found using the regression test program. Another issues is a deadlock as described separately in Bug #44312. It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in reverse order. Fixed: terminate_slave_threads() does not request locking in start_slave_threads(); rotate_relay_log() does not acquire mi->run_lock which is safe. The rli->inited is not guarded by this mutex, and locking of the mutex in the function contradicts the safe pattern of locking with run_lock.
[24 Apr 2009 16:35]
Andrei Elkin
Notice, the reported stack occured with 5.0 version that has not got yet bug#29968 fixes (unguarded access to slave_running status variable, that is a matter of bug#38694 that is about to commit). Therefore 6.0 is much more stable to the stress testing (the last patch sustained it about half an hour).
[27 Apr 2009 21:18]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72858 2842 Andrei Elkin 2009-04-28 Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset Bug #44312 deadlock between IO thread and SLAVE START The issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly different form. terminate_slave_threads() should not acquire run_lock when it's called from slave_thread()->start_slave_threads(); that leads to an assertion on run_lock mutex. OTH, init_slave()->start_slave_threads() path requires terminate_slave_threads() to be invoked with skip_lock == false. No other crashes has been found using the regression stress test program. Another issues is a deadlock as described separately in Bug #44312. It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in reverse order. Fixed: terminate_slave_threads() does not request locking in start_slave_threads() when it's called from start_slave() and does it when it's called from init_slave(); rotate_relay_log() does not acquire mi->run_lock which is safe. The rli->inited is not guarded by this mutex, and locking of the mutex in the function contradicts the safe pattern of locking with run_lock.
[13 May 2009 3:31]
Bugs System
Pushed into 6.0.12-alpha (revid:alik@sun.com-20090513032549-rxa73jbxd1qv09xc) (version source revid:aelkin@mysql.com-20090427211821-4lne3342gva8ghzt) (merge vers: 6.0.11-alpha) (pib:6)
[13 May 2009 12:34]
Jon Stephens
Documented bugfix in the 6.0.12 changelog as follows: Issuing concurrent STOP SLAVE, START SLAVE, and RESET SLAVE statements using different connections caused the replication slave to crash.
[3 Dec 2009 13:50]
Jon Stephens
Also documented in the 5.6.0 changelog. See BUG#48463.
[7 Mar 2010 1:50]
Paul DuBois
Moved 5.6.0 changelog entry to 5.5.3.