Bug #38694 | Race condition in replication thread shutdown | ||
---|---|---|---|
Submitted: | 9 Aug 2008 18:24 | Modified: | 19 May 2009 10:03 |
Reporter: | Mark Callaghan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.0.37, 5.0.68, 5.0.76-debug | OS: | Any |
Assigned to: | Andrei Elkin | CPU Architecture: | Any |
Tags: | replication, shutdown |
[9 Aug 2008 18:24]
Mark Callaghan
[9 Aug 2008 18:32]
Mark Callaghan
changed synopsis
[9 Aug 2008 19:08]
MySQL Verification Team
looks like bug #27675?
[9 Aug 2008 19:19]
Mark Callaghan
Yes, I think it is a duplicate. But it has more details on the cause of the failure.
[11 Aug 2008 8:41]
MySQL Verification Team
Thanks for the analysis Mark. Verified. Build a --debug=full slave with a small patch to put a sleep before unlocking rli->run_lock. Start the slave with --skip-innodb --server-id=1 So in handle_slave_io we have this: printf("sleeping before unlocking!\n"); sleep(10); pthread_mutex_unlock(&rli->run_lock); // tell the world we are done then run inserts against the master. While this happens, run mysqladmin shutdown on the slave. crash happens on the line pthread_mutex_unlock(&rli->run_lock); mysqld.exe!handle_slave_sql(void * arg=0x04e0bf50) Line 4108 mysqld.exe!pthread_start(void * param=0x04d7ffc8) Line 85 mysqld.exe!_callthreadstart() Line 293 mysqld.exe!_threadstart(void * ptd=0x05d66530) Line 277
[3 Dec 2008 9:04]
Mats Kindahl
Is this bug verified towards 5.1 or 5.0? I suspect it is already fixed in 5.1.
[17 Jan 2009 16:10]
MySQL Verification Team
crashes in 5.0.76-debug: safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 3835 at slave.h, line 462 0x81c15d4 handle_segfault + 671 0x110416 (?) 0xb4b028 (?) 0x84bb181 safe_mutex_destroy + 162 0x831e504 st_master_info::~st_master_info() + 38 0x831deac end_slave() + 214 0x81c577b close_connections() + 1172 0x81c5bce kill_server(void*) + 451 0x81c5c15 kill_server_thread + 23 0xcc232f (?) I couldn't repeat this in 5.1.31.
[26 Jan 2009 8:40]
Mats Kindahl
This seems to be yet another version of the following bugs: BUG#20850, BUG#27675, BUG#25306, BUG#25245.
[22 Apr 2009 16:34]
Andrei Elkin
Note, that [11 Aug 2008 10:41] comments mentioning crash in pthread_mutex_unlock(&rli->run_lock) executing by the SQL thread is a subject of bug#25306 fixed in 5.0.38. The bug description scenaio, reported for 5.0.37, and most probably the comments of [17 Jan 17:10] claim mi[->rli]->run_lock is being taken by a repliction thread at time of delete active_mi in end_slave(). Considering the latest source code, the terminator thread invokes terminate_slave_threads(active_mi, SLAVE_FORCE_ALL, [0]); terminate_slave_thread(); [0] of terminate_slave_threads() stands for acquiring `run_lock' prior to read `slave_running' value. Once `slave_running' is found to be 0 `run_lock' is released and end_slave() gets to `delete active_mi'. slave_running is reset to 0 at the end of handle_slave_* and within a critical section guarded by run_lock mutex. That means the terminatior can't get the reset case of slave_running == 0, However it actually could catch a unset case of the equality. I.e the slave threads are in the initialization stage having not set yet slave_running := 1. That means a race condition between `start slave' and `shutdown'.
[23 Apr 2009 17:12]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72723 2730 Andrei Elkin 2009-04-23 Bug #38694 Race condition in replication thread shutdown There were two issues: - a race condition between init_slave() and end_slave(). The thread running init_slave() does not wait for the slave threads signalling and it unlock LOCK_active_mi for the end_slave() running terminator to find the being started slave threads running status 0, i.e unchanged yet from the inititial value. That lets it to run delete active_mi without any obstacles. The slave threads crash e.g at trying to lock run_lock that belong in active_mi. - unguarded access to mi->slave_running by the shutdown thread calling end_slave(). Fixed: - to wait for the slave threads ack by init_slave(); - to remove the unguarded read of the running status and perform reading it in terminate_slave_thread() at time run_lock is taken.
[24 Apr 2009 9:33]
Andrei Elkin
Setting BUG#44312 + BUG#38716, BUG#38694 in-progress back, after getting couple of *new* issues at testing The issues of the 3 bugs remains valid and hopefully the patches have solved them. However, I think we would like to not see any failures with restart of shutdown ever.
[24 Apr 2009 17:54]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72798 2730 Andrei Elkin 2009-04-24 Bug #38694 Race condition in replication thread shutdown Two issues were identified out of the bug report: - a race condition between init_slave() and end_slave(). The thread running init_slave() does not wait for the being started slave threads' signalling. It unlocks LOCK_active_mi to let the end_slave() running terminator to find the slave threads running status as 0, i.e unchanged yet from the inititial value. The terminator in that case goes straight to delete active_mi without any ado. The slave threads will crash e.g at trying to lock run_lock that belong in the deleted active_mi. - unguarded access to mi->slave_running by the shutdown thread calling end_slave() that is bug#29968 (alas happened not to be cross-linked with the current bug) Fixed: - to wait for the slave threads ack by init_slave(); - to remove the unguarded read of the running status and perform reading it in terminate_slave_thread() at time run_lock is taken (mostly bug#29968 backporting, still with some improvements over that patch - see the error reporting from terminate_slave_thread()).
[24 Apr 2009 18:44]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72804 2730 Andrei Elkin 2009-04-24 Bug #38694 Race condition in replication thread shutdown The issue of the current bug is unguarded access to mi->slave_running by the shutdown thread calling end_slave() that is bug#29968 (alas happened not to be cross-linked with the current bug) Fixed: - to remove the unguarded read of the running status and perform reading it in terminate_slave_thread() at time run_lock is taken (mostly bug#29968 backporting, still with some improvements over that patch - see the error reporting from terminate_slave_thread()). Issue of bug38716 is fixed here for 5.0 branch as well. Note: There has been a separate artifact identified - a race condition between init_slave() and end_slave() - reported as Bug#44467.
[28 Apr 2009 11:46]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/72901 2730 Andrei Elkin 2009-04-28 Bug #38694 Race condition in replication thread shutdown The issue of the current bug is unguarded access to mi->slave_running by the shutdown thread calling end_slave() that is bug#29968 (alas happened not to be cross-linked with the current bug) Fixed: with removing the unguarded read of the running status and perform reading it in terminate_slave_thread() at time run_lock is taken (mostly bug#29968 backporting, still with some improvements over that patch - see the error reporting from terminate_slave_thread()). Issue of bug#38716 is fixed here for 5.0 branch as well. Note: There has been a separate artifact identified - a race condition between init_slave() and end_slave() - reported as Bug#44467.
[30 Apr 2009 13:21]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/73158 2885 Andrei Elkin 2009-04-30 moving bug#38694 test files into rpl suite
[5 May 2009 18:52]
Bugs System
Pushed into 5.0.82 (revid:davi.arnaut@sun.com-20090505184158-dvmedh8n472y8np5) (version source revid:davi.arnaut@sun.com-20090505184158-dvmedh8n472y8np5) (merge vers: 5.0.82) (pib:6)
[5 May 2009 19:39]
Bugs System
Pushed into 5.1.35 (revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (version source revid:davi.arnaut@sun.com-20090505190206-9xmh7dlc6kom8exp) (merge vers: 5.1.35) (pib:6)
[6 May 2009 14:06]
Bugs System
Pushed into 6.0.12-alpha (revid:svoj@sun.com-20090506125450-yokcmvqf2g7jhujq) (version source revid:joro@sun.com-20090430145106-gsb4j17nl4yf9htf) (merge vers: 6.0.12-alpha) (pib:6)
[7 May 2009 12:57]
Jon Stephens
Questions for Docs: 1. Was this issue confined to 5.0 only? 2. If not, what were the affected and fixed versions? Thanks!
[19 May 2009 10:03]
Jon Stephens
Documented bugfix in the 5.0.82 changelog as follows: Restarting the replication slave -- either by using STOP SLAVE plus START SLAVE, or by restarting the slave mysqld process -- could sometimes cause the slave to crash when using a debug version of the server.
[15 Jun 2009 8:26]
Bugs System
Pushed into 5.1.35-ndb-6.3.26 (revid:jonas@mysql.com-20090615074202-0r5r2jmi83tww6sf) (version source revid:jonas@mysql.com-20090615070837-9pccutgc7repvb4d) (merge vers: 5.1.35-ndb-6.3.26) (pib:6)
[15 Jun 2009 9:06]
Bugs System
Pushed into 5.1.35-ndb-7.0.7 (revid:jonas@mysql.com-20090615074335-9hcltksp5cu5fucn) (version source revid:jonas@mysql.com-20090615072714-rmfkvrbbipd9r32c) (merge vers: 5.1.35-ndb-7.0.7) (pib:6)
[15 Jun 2009 9:46]
Bugs System
Pushed into 5.1.35-ndb-6.2.19 (revid:jonas@mysql.com-20090615061520-sq7ds4yw299ggugm) (version source revid:jonas@mysql.com-20090615054654-ebgpz7elwu1xj36j) (merge vers: 5.1.35-ndb-6.2.19) (pib:6)