Bug #29968 | rpl_ndb_circular.test and rpl_ndb_log.test fail | ||
---|---|---|---|
Submitted: | 23 Jul 2007 7:22 | Modified: | 6 Sep 2007 12:18 |
Reporter: | Ingo Strüwing | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Replication | Severity: | S2 (Serious) |
Version: | 5.1.21 | OS: | Linux (valgrind) |
Assigned to: | Mats Kindahl | CPU Architecture: | Any |
Tags: | pbfail |
[23 Jul 2007 7:22]
Ingo Strüwing
[23 Jul 2007 8:06]
Ingo Strüwing
Same problem does also show up with rpl_ndb_log. Warnings file contains: slave.err: rpl_ndb.rpl_ndb_log: safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 2565 at rpl_rli.cc, line 68 slave.err: rpl_ndb.rpl_ndb_log: Attempting backtrace. You can use the following information to find out Seen in sam push, platform 'debx86-b' 'double whopper'. There are also two server logs, but without additional information.
[28 Jul 2007 11:06]
Ingo Strüwing
Also seen in push Fri Jul 27 18:48:53 2007 acurtis, platform 'sapsrv1' Autopush, https://intranet.mysql.com/secure/pushbuild/showdir.pl?dir=mysql-5.1 slave.err: rpl.rpl_truncate_3innodb: safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 2565 at rpl_rli.cc, line 68 slave.err: rpl.rpl_truncate_3innodb: Attempting backtrace. You can use the following information to find out
[1 Aug 2007 6:58]
Timothy Smith
Just a brief "me too" comment. rpl_ndb_log: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=mysql-5.1-runtime&order=409 On sapsrv1 and pb-valgrind, the same error: slave.err: rpl_ndb.rpl_ndb_log: safe_mutex: Trying to destroy a mutex that was locked at slave.cc, line 2565 at rpl_rli.cc, line 68 slave.err: rpl_ndb.rpl_ndb_log: Attempting backtrace. You can use the following information to find out rpl_ndb_circular: https://intranet.mysql.com/secure/pushbuild/showpush.pl?dir=mysql-5.1-new-maint&order=710 Same error as above, in pb-valgrind. Ramil tried to reproduce the problem by introducing various delays, but wasn't able to reproduce it.
[16 Aug 2007 13:41]
Mats Kindahl
Here is a summary of the findings. 1. The assertion fires because the ``mi->rli.run_lock`` mutex is still locked when destroying the mutex. For this assertion, the mutex was locked is in the shutdown sequence inside the ``handle_slave_sql`` function and the mutex is destroyed in the ``~RELAY_LOG_INFO()`` destructor. 2. The only place where ``~RELAY_LOG_INFO()`` is called is as part of the destruction of ``~MASTER_INFO()``. 3. The only place where a master info instance is destroyed is in the following code in ``end_slave()``:: terminate_slave_threads(active_mi,SLAVE_FORCE_ALL); end_master_info(active_mi); delete active_mi; active_mi= 0; 4. The function ``terminate_slave_threads()`` call ``terminate_slave_thread()`` on each of the slave threads. 5. For the SQL thread, the call to ``terminate_slave_thread()`` does the following (code simplified to match this use only):: pthread_mutex_lock(&mi->rli.run_lock); while (mi->rli.slave_running) pthread_cond_timedwait(&mi->rli.stop_cond, &mi->rli.run_lock, ...); pthread_mutex_unlock(&mi->rli.run_lock); This call is executed last (i.e., after terminating the I/O thread). 6. According to the bug report, the run lock is aquired at line 2565 in slave.cc and from reading the code, we see that it is released at line 2603 in the same file. This is at the end of the ``handle_slave_sql()`` function. 7. From 5 and 6, we know that ``mi->rli.run_lock`` is not aquired again after exiting the ``terminate_slave_thread()`` call, and furthat that the destructor is called while the SQL thread is executing somewhere between 2565 and 2603. 8. From 5, we know that the while loop cannot terminate until ``mi->rli.slave_running`` is set to 0, which occurs at line 2570 in slave.cc. From 7 and 1, this means that the destructor is called in the range 2570 to 2603. After removing uninteresting lines that obviously cannot affect the terminating thread, we get the following lines (rewritten with the correct variables for this case):: rli->slave_running= 0; ... pthread_mutex_unlock(&mi->rli.data_lock); ... pthread_cond_broadcast(&mi->rli.data_cond); ... pthread_cond_broadcast(&mi->rli.stop_cond); pthread_mutex_unlock(&mi->rli.run_lock); 9. According to the standard, ``pthread_cond_timedwait`` *shall* acquire the mutex even if a timeout or a spurious wake-up occurs, so from 5 and 8, we can see that ``terminate_slave_thread()`` shall not be able to exit before the mutex is unlocked by ``handle_slave_sql``. I have problems seeing anything but a bug in the PThreads library causing this crash. Since we are about to terminate the slave threads and shut down the server, we might just as well execute an pthread_mutex_unlock() before destroying the corresponding mutex inside the destructor. This will stop the assertion from firing, but since this defeats the purpose of having an assertion at all, I'm reluctant to implement the bugfix this way.
[21 Aug 2007 14:00]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/32811 ChangeSet@1.2555, 2007-08-21 16:00:21+02:00, mats@kindahl-laptop.dnsalias.net +2 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): There appears to be a bug in some implementations of pthread_cond_timedwait() in that it does not lock the associated mutex when returning, so this patch attempts a workaround to acquire the mutex even in that case to prevent the mutex from being destroyed while still being locked by the the slave thread.
[23 Aug 2007 12:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/32942 ChangeSet@1.2555, 2007-08-23 14:42:21+02:00, mats@kindahl-laptop.dnsalias.net +4 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): There appears to be a bug in some implementations of pthread_cond_timedwait() in that it does not lock the associated mutex when returning, so this patch attempts a workaround to acquire the mutex even in that case to prevent the mutex from being destroyed while still being locked by the the slave thread. Also implementing some change to the safe_mutex_lock() to allow pthread_mutex_trylock() to return immediately when the lock is busy.
[23 Aug 2007 12:48]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/32944 ChangeSet@1.2555, 2007-08-23 14:48:21+02:00, mats@kindahl-laptop.dnsalias.net +4 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): There appears to be a bug in some implementations of pthread_cond_timedwait() in that it does not lock the associated mutex when returning, so this patch attempts a workaround to acquire the mutex even in that case to prevent the mutex from being destroyed while still being locked by the the slave thread. Also implementing some change to the safe_mutex_lock() to allow pthread_mutex_trylock() to return immediately when the lock is busy.
[28 Aug 2007 18:08]
Brian Aker
From Tomas via IRC <tomas> slave.err: rpl_ndb.rpl_ndb_log: Attempting backtrace. You can use the following information to find out --- kostja|dinner is now known as kostja <tomas> brian, this looks like an old bug reported long time ago... and I thought it was fixed...
[29 Aug 2007 7:13]
Mats Kindahl
It seems Andrei managed to find the problem, which seems to be an unguarded read of ``slave_killed``. Look at this code from ``slave.cc``:: if ((thread_mask & (SLAVE_IO|SLAVE_FORCE_ALL)) && mi->slave_running) { DBUG_PRINT("info",("Terminating IO thread")); mi->abort_slave=1; if ((error=terminate_slave_thread(mi->io_thd,io_lock, &mi->stop_cond, &mi->slave_running, skip_lock)) && !force_all) DBUG_RETURN(error); } As can be seen, the ``slave_running`` variable is checked without taking the mutex ``run_lock``, which guards all access to the slave run status. Now, suppose that the slave I/O thread for some reason begins to terminate execution by itself (e.g., because it lost connection with the master, which is quite common). Then we have the following sequence diagram:: terminating I/O thread thread | | -- end_slave() --> | | | -------- I/O failure ------->| | | |--+ | | | mutex_lock(&mi->run_lock) | |<-+ |--+ | | terminate_slave_threads() |<-+ | | | |--+ | | | slave_killed = 0 | |<-+ | | |--+ | | if (... && mi->slave_killed) |<-+ | | | | |--+ | | terminate_slave_threads() return |<-+ | | | | |--+ | | delete active_mi |<-+ | | | | |--+ | | mutex_destroy(&mi->run_lock) (CRASH!) |<-+ | | | |--+ | | | mutex_unlock(&mi->run_lock) | |<-+ The fix is to just remove the unguarded read (this is already checked inside the functions), since that will allow the ``terminate_slave_threads()`` function to exit before the run_lock mutexes have been released.
[29 Aug 2007 7:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/33271 ChangeSet@1.2555, 2007-08-29 09:15:00+02:00, mats@kindahl-laptop.dnsalias.net +2 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): Removing unguarded read of slave_running field from inside terminate_slave_threads(). This could cause premature exit in the event that the slave thread already were shutting down, but isn't finished yet.
[29 Aug 2007 7:30]
Rafal Somla
Good to push. Great job!
[29 Aug 2007 11:30]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/33295 ChangeSet@1.2556, 2007-08-29 13:29:22+02:00, mats@kindahl-laptop.dnsalias.net +1 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): The fields slave_running, io_thd, and sql_thread are guarded by an associated run_lock. A read of these fields were not guarded inside terminate_slave_threads(), which caused an assertion to fire and potentially can cause a segmentation fault in some rare cases. This patch move the reading of the above variables to occur inside terminate_slave_thread() guarded by the associated run_lock.
[29 Aug 2007 13:12]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/33311 ChangeSet@1.2556, 2007-08-29 15:11:33+02:00, mats@kindahl-laptop.dnsalias.net +1 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): The fields slave_running, io_thd, and sql_thread are guarded by an associated run_lock. A read of these fields were not guarded inside terminate_slave_threads(), which caused an assertion to fire. This patch removes that assertion and reorganizes the code in terminate_slave_thread() slightly.
[29 Aug 2007 14:08]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/33318 ChangeSet@1.2571, 2007-08-29 16:06:59+02:00, mats@kindahl-laptop.dnsalias.net +2 -0 BUG#29968 (rpl_ndb_circular.test and rpl_ndb_log.test fail): Removing unguarded read of slave_running field from inside terminate_slave_threads(). This could cause premature exit in the event that the slave thread already were shutting down, but isn't finished yet. The fields slave_running, io_thd, and sql_thread are guarded by an associated run_lock. A read of these fields were not guarded inside terminate_slave_threads(), which caused an assertion to fire. The assertion was removed, and the code reorganized slightly.
[4 Sep 2007 17:11]
Bugs System
Pushed into 5.1.23-beta
[6 Sep 2007 11:59]
Mats Kindahl
The patch has been pushed into both 5.1.22 and 5.1.23-beta. The crash was caused by an assertion inside safe_mutex that will cause an abort() in the case a pthread_mutex_destroy() is called on a mutex that is locked. - For debug builds, it will cause a crash by assertion. - For non-debug builds using a POSIX.1 threads implementation, the behavior is undefined, so there is no way to tell what happens. On my machine, the call to pthread_mutex_destroy() returns EBUSY, and appears to not touch anything else (I can call pthread_mutex_unlock() on the mutex afterwards), so the behavior is benign (since the server is shutting down, all resources will be released anyway). - For Windows, the functions map to the DeleteCriticalSection and LeaveCriticalSection, which has the following text: "After a critical section object has been deleted, do not reference the object in any function that operates on critical sections (such as [...] LeaveCriticalSection) [...]. If you attempt to do so, memory corruption and other unexpected errors can occur."
[6 Sep 2007 12:18]
Jon Stephens
No changes visible to end users; closed without documenting.