Bug #56666 Race condition between the server main thread and the kill server thread
Submitted: 8 Sep 2010 21:24
Modified: 12 Nov 2013 2:19
Reporter: Marc ALFF
Status: Closed
Category: MySQL Server
Severity: S2 (Serious)
Version:
OS: Any
Assigned to: Assigned Account
CPU Architecture: Any

[8 Sep 2010 21:24] Marc ALFF
Description:
Found by analysis and code review.

During server shutdown, main() and kill_server_thread() execute in parallel.

The shutdown itself is orderly, with some synchronization between these two threads, until the kill server thread sets the variable "ready_to_exit" to 1, which is the event main() waits for.

After that point, however, both threads can continue to execute cleanup code with no further synchronization, which causes race conditions.

In particular, code like mysqld_exit() in 5.5, and similar code in 5.1 and earlier, attempts to destroy the same mutexes, etc.
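
The pattern described above, and one generic way to make the final cleanup safe, can be sketched as follows. This is an illustration only, not MySQL source code and not the fix that was eventually applied: ShutdownState, final_cleanup(), kill_server_sketch() and run_shutdown_demo() are hypothetical names, and std::call_once is an assumed technique chosen for the sketch. Only the "ready_to_exit" flag is taken from the report.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the server's shutdown state; not MySQL code.
struct ShutdownState {
    std::mutex lock;
    std::condition_variable cond;
    bool ready_to_exit = false;        // mirrors the flag named in the report
    std::once_flag cleanup_once;       // assumption: guards the final cleanup
    std::atomic<int> cleanup_runs{0};  // counts how often cleanup really ran
};

// Final cleanup that must run exactly once (for example, destroying
// global mutexes). Both threads may call it; call_once admits only one.
void final_cleanup(ShutdownState &s) {
    std::call_once(s.cleanup_once, [&s] { s.cleanup_runs.fetch_add(1); });
}

// Plays the role of the kill server thread.
void kill_server_sketch(ShutdownState &s) {
    {
        std::lock_guard<std::mutex> guard(s.lock);
        s.ready_to_exit = true;
    }
    s.cond.notify_all();
    final_cleanup(s);  // may overlap with the main thread's call
}

// Plays the role of main(): waits for the flag, then also cleans up.
// Returns the number of times the cleanup body actually executed.
int run_shutdown_demo() {
    ShutdownState s;
    std::thread killer(kill_server_sketch, std::ref(s));
    {
        std::unique_lock<std::mutex> guard(s.lock);
        s.cond.wait(guard, [&s] { return s.ready_to_exit; });
    }
    final_cleanup(s);  // unsynchronized with the killer thread from here on
    killer.join();
    assert(s.ready_to_exit);
    return s.cleanup_runs.load();
}
```

Without the std::once_flag, both threads would run the cleanup body after "ready_to_exit" is set, which is the double-destroy situation the report describes. Another option under the same assumptions is to let only one thread perform the cleanup and have the other simply wait for it to finish.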

That specific area of the code is not heavily stressed because, after all, servers in production are supposed to stay up for a long time and not be shut down frequently.
Stress on this area of the code (and the way the bug was found) comes from the MTR test suite, which starts and shuts down servers so many times that it increases the probability of exposing the bug.

This bug in the server shutdown is believed to be the root cause of some unexplained spurious failures in automated tests.

How to repeat:
Read the code
Follow the code after ready_to_exit=1

Suggested fix:
N/A
[8 Sep 2010 21:25] Marc ALFF
This is a possible root cause of bug#29650, which was never reproduced.
[8 Sep 2010 21:30] Marc ALFF
Found during the analysis of bug#56324.
[16 Nov 2010 8:39] Marc ALFF
See also bug#56760, which is caused by bug#56666.
[8 Dec 2010 9:20] Alexander Nozdrin
Bug#55740 and Bug#58707 have been marked as duplicates of this one.

Increasing priority because it's causing a lot of valgrind warnings.
[12 Nov 2013 2:19] Paul DuBois
Noted in 5.7.3 changelog.

At server shutdown, a race condition between the main thread and
the shutdown thread could cause server failure.