Bug #56324 | Race Condition while shutting down MySQL: "PSI_server" | ||
---|---|---|---|
Submitted: | 27 Aug 2010 11:06 | Modified: | 4 Nov 2010 1:19 |
Reporter: | Alfranio Tavares Correia Junior | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Performance Schema | Severity: | S3 (Non-critical) |
Version: | next-mr, next-mr-bugfixing | OS: | Any |
Assigned to: | Marc ALFF | CPU Architecture: | Any |
Tags: | mutex, performance schema, race condition |
[27 Aug 2010 11:06]
Alfranio Tavares Correia Junior
[27 Aug 2010 11:19]
Alfranio Tavares Correia Junior
See also http://tinyurl.com/2am8a7b
[27 Aug 2010 15:29]
Marc ALFF
Possibly related to bug#55105
[27 Aug 2010 15:57]
MySQL Verification Team
Verified on Fedora 13 X86_64: rpl.rpl_change_master 'row' w3 [ fail ] Found warnings/errors in server log file! Test ended at 2010-08-27 11:02:59 line Attempting backtrace. You can use the following information to find out ^ Found warnings in /home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/mysqld.2.err ok - saving '/home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/rpl.rpl_change_master-row/' to '/home/miguel/bzr/mr-bugfix/mysql-test/var/log/rpl.rpl_change_master-row/' - found 'core.28822' (0/5) Trying 'dbx' to get a backtrace Trying 'gdb' to get a backtrace Core generated by '/home/miguel/bzr/mr-bugfix/sql/mysqld' Output from gdb follows. The first stack trace is from the failing thread. The following stack traces are from all threads (so the failing one is duplicated). -------------------------- Missing separate debuginfo for Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/b7/82f01bb2e2b191a950d5202e5d809f4c3f4ac8 [New Thread 29005] [New Thread 28822] Core was generated by `/home/miguel/bzr/mr-bugfix/sql/mysqld --defaults-group-suffix=.2 --defaults-fil'. Program terminated with signal 11, Segmentation fault. #0 0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0 #0 0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0 #1 0x0000000000a40346 in my_write_core (sig=11) at /home/miguel/bzr/mr-bugfix/mysys/stacktrace.c:328 #2 0x000000000054b394 in handle_segfault (sig=11) at /home/miguel/bzr/mr-bugfix/sql/mysqld.cc:2501 #3 <signal handler called>
[27 Aug 2010 22:55]
Marc ALFF
Alfranio, with the script given, I also run into a different failure: ===== Run number 44 ==== Logging: ./mtr --parallel=auto --mem rpl_change_master 100827 17:38:05 [Note] Plugin 'FEDERATED' is disabled. MySQL Version 5.5.6 Checking supported features... - skipping ndbcluster, mysqld not compiled with ndbcluster - SSL connections supported - binaries are debug compiled Collecting tests... - adding combinations for rpl vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var Checking leftover processes... Removing old var directory... Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'... - symlinking 'var' to '/dev/shm/var_auto_py0V' Installing system database... Using parallel: 3 Using server port 43494 worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009 ============================================================================== TEST RESULT TIME (ms) ------------------------------------------------------------ worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039 worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029 rpl.rpl_change_master 'stmt' w3 [ pass ] 46 rpl.rpl_change_master 'row' w1 [ pass ] 63 rpl.rpl_change_master 'mix' w2 [ pass ] 60 ------------------------------------------------------------ The servers were restarted 0 times Spent 0.169 of 12 seconds executing testcases Completed: All 3 tests were successful. ===== Run number 45 ==== Logging: ./mtr --parallel=auto --mem rpl_change_master 100827 17:38:18 [Note] Plugin 'FEDERATED' is disabled. MySQL Version 5.5.6 Checking supported features... - skipping ndbcluster, mysqld not compiled with ndbcluster - SSL connections supported - binaries are debug compiled Collecting tests... - adding combinations for rpl vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var Checking leftover processes... Removing old var directory... Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'... - symlinking 'var' to '/dev/shm/var_auto_FNEg' Installing system database... Using parallel: 3 Using server port 54559 ============================================================================== TEST RESULT TIME (ms) ------------------------------------------------------------ worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009 worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039 worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029 rpl.rpl_change_master 'stmt' w1 [ pass ] 55 rpl.rpl_change_master 'mix' w2 [ pass ] 54 rpl.rpl_change_master 'row' w3 [ fail ] Test ended at 2010-08-27 17:38:25 CURRENT_TEST: rpl.rpl_change_master mysqltest: At line 83: query 'CHANGE MASTER TO relay_log_file='slave-relay-bin.000005', relay_log_pos=4' failed: 1380: Failed initializing relay log position: Could not find target log during relay log initialization =========== I agree there is a race condition related to PSI_server, but there might also be something else in the rpl.rpl_change_master test case.
[3 Sep 2010 17:31]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/117535 3195 Marc Alff 2010-09-03 Bug#56324 Race Condition while shutting down MySQL: "PSI_server" Before this fix, the server could crash during shutdown, due to race conditions, that occured when killing the server. In particular, the performance schema instrumentation handle, PSI_server, and the performance schema itself would be cleaned up too soon, causing race conditions with a running kill server thread. The specifics of the race condition found are that: the main thread executing "PSI_server= NULL" can cause crashes in other threads still running, which are executing "if (PSI_server != NULL) PSI_server->xxx()" as part of the performance schema instrumentation. While the bug was reported for the kill server thread, in theory the same crash could happen with the signal thread, as found by code analysis. This fix re works the code in mysqld_main() used during shutdown, and in particular: - the performance schema is now shut down _after_ the signal thread is completed, - the performance schema is now shut down _after_ the kill server thread is completed, This code has been tested with running 300 times in a row the following tests in parallel, which have been known to fail with race conditions in the past: - rpl_change_master - binlog_max_extension - events_restart - rpl_heartbeat_basic and no crash of test failure has been seen with the changed code.
[8 Sep 2010 20:59]
Marc ALFF
After more analysis, the previous patch is abandoned. I think it is too complicated and too risky. Looking for a simpler solution
[9 Sep 2010 21:34]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/117942 3204 Marc Alff 2010-09-09 Bug#56324 Race Condition while shutting down MySQL: "PSI_server" Before this fix, the server could crash during shutdown, due to race conditions, that occured when killing the server. In particular, the performance schema instrumentation handle, PSI_server, and the performance schema itself would be cleaned up too soon, causing race conditions with a running kill server thread. The specifics of the race condition found are that: the main thread executing "PSI_server= NULL" can cause crashes in other threads still running, which are executing "if (PSI_server != NULL) PSI_server->xxx()" as part of the performance schema instrumentation. While the bug was reported for the kill server thread, in theory the same crash could happen with the signal thread, as found by code analysis. The correct fix would be to only shutdown the performance schema and set PSI_server to NULL after every other thread is guaranteed to be completed, including the kill_server_thread. However, due to the existing mysqld server design, this is not the case. See in particular bug number 56666. The work around used to fix this race condition is to simply not perform the call to shutdown_performance_schema() when the server exits, and to keep the PSI_server pointer unchanged. This will cause memory leaks to be reported by tools like valgrind, but no memory leak actually happen because the process is about to exit(). As a result, the file mysql-test/valgrind.supp has been updated to filter out these false positive messages. This code has been tested with running in a loop the following tests in parallel, which have been known to fail with race conditions in the past: - rpl_change_master - binlog_max_extension - events_restart - rpl_heartbeat_basic and no crash of test failure has been seen with the changed code.
[13 Sep 2010 23:36]
Christopher Powers
Fix looks good.
[14 Sep 2010 21:57]
Marc ALFF
Pushed into: - mysql-5.5-bugfixing - mysql-trunk-bugfixing - mysql-next-mr-bugfixing
[2 Oct 2010 18:12]
Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alexander.nozdrin@oracle.com-20101002180948-852x1cuv7c6i85ea) (version source revid:alexander.nozdrin@oracle.com-20101002180857-an32jpuwzemsp4f2) (merge vers: 5.6.1-m4) (pib:21)
[2 Oct 2010 18:13]
Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[2 Oct 2010 18:16]
Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (version source revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (merge vers: 5.5.7-rc) (pib:21)
[4 Nov 2010 1:19]
Paul DuBois
Noted in 5.5.7, 5.6.1 changelogs. The server could crash during shutdown due to a race condition relating to Performance Schema cleanup.