| Bug #56324 | Race Condition while shutting down MySQL: "PSI_server" | ||
|---|---|---|---|
| Submitted: | 27 Aug 2010 11:06 | Modified: | 4 Nov 2010 1:19 |
| Reporter: | Alfranio Tavares Correia Junior | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Performance Schema | Severity: | S3 (Non-critical) |
| Version: | next-mr, next-mr-bugfixing | OS: | Any |
| Assigned to: | Marc ALFF | CPU Architecture: | Any |
| Tags: | mutex, performance schema, race condition | ||
[27 Aug 2010 11:19]
Alfranio Tavares Correia Junior
See also http://tinyurl.com/2am8a7b
[27 Aug 2010 15:29]
Marc ALFF
Possibly related to bug#55105
[27 Aug 2010 15:57]
MySQL Verification Team
Verified on Fedora 13 X86_64:
rpl.rpl_change_master 'row' w3 [ fail ] Found warnings/errors in server log file!
Test ended at 2010-08-27 11:02:59
line
Attempting backtrace. You can use the following information to find out
^ Found warnings in /home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/mysqld.2.err
ok
- saving '/home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/rpl.rpl_change_master-row/' to '/home/miguel/bzr/mr-bugfix/mysql-test/var/log/rpl.rpl_change_master-row/'
- found 'core.28822' (0/5)
Trying 'dbx' to get a backtrace
Trying 'gdb' to get a backtrace
Core generated by '/home/miguel/bzr/mr-bugfix/sql/mysqld'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
Missing separate debuginfo for
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/b7/82f01bb2e2b191a950d5202e5d809f4c3f4ac8
[New Thread 29005]
[New Thread 28822]
Core was generated by `/home/miguel/bzr/mr-bugfix/sql/mysqld --defaults-group-suffix=.2 --defaults-fil'.
Program terminated with signal 11, Segmentation fault.
#0 0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0
#0 0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0
#1 0x0000000000a40346 in my_write_core (sig=11) at /home/miguel/bzr/mr-bugfix/mysys/stacktrace.c:328
#2 0x000000000054b394 in handle_segfault (sig=11) at /home/miguel/bzr/mr-bugfix/sql/mysqld.cc:2501
#3 <signal handler called>
[27 Aug 2010 22:55]
Marc ALFF
Alfranio,
with the script given, I also run into a different failure:
===== Run number 44 ====
Logging: ./mtr --parallel=auto --mem rpl_change_master
100827 17:38:05 [Note] Plugin 'FEDERATED' is disabled.
MySQL Version 5.5.6
Checking supported features...
- skipping ndbcluster, mysqld not compiled with ndbcluster
- SSL connections supported
- binaries are debug compiled
Collecting tests...
- adding combinations for rpl
vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'...
- symlinking 'var' to '/dev/shm/var_auto_py0V'
Installing system database...
Using parallel: 3
Using server port 43494
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
==============================================================================
TEST RESULT TIME (ms)
------------------------------------------------------------
worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039
worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029
rpl.rpl_change_master 'stmt' w3 [ pass ] 46
rpl.rpl_change_master 'row' w1 [ pass ] 63
rpl.rpl_change_master 'mix' w2 [ pass ] 60
------------------------------------------------------------
The servers were restarted 0 times
Spent 0.169 of 12 seconds executing testcases
Completed: All 3 tests were successful.
===== Run number 45 ====
Logging: ./mtr --parallel=auto --mem rpl_change_master
100827 17:38:18 [Note] Plugin 'FEDERATED' is disabled.
MySQL Version 5.5.6
Checking supported features...
- skipping ndbcluster, mysqld not compiled with ndbcluster
- SSL connections supported
- binaries are debug compiled
Collecting tests...
- adding combinations for rpl
vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'...
- symlinking 'var' to '/dev/shm/var_auto_FNEg'
Installing system database...
Using parallel: 3
Using server port 54559
==============================================================================
TEST RESULT TIME (ms)
------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039
worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029
rpl.rpl_change_master 'stmt' w1 [ pass ] 55
rpl.rpl_change_master 'mix' w2 [ pass ] 54
rpl.rpl_change_master 'row' w3 [ fail ]
Test ended at 2010-08-27 17:38:25
CURRENT_TEST: rpl.rpl_change_master
mysqltest: At line 83: query 'CHANGE MASTER TO relay_log_file='slave-relay-bin.000005', relay_log_pos=4' failed: 1380: Failed initializing relay log position: Could not find target log during relay log initialization
===========
I agree there is a race condition related to PSI_server,
but there might also be something else in the rpl.rpl_change_master
test case.
[3 Sep 2010 17:31]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/117535 3195 Marc Alff 2010-09-03 Bug#56324 Race Condition while shutting down MySQL: "PSI_server" Before this fix, the server could crash during shutdown, due to race conditions, that occured when killing the server. In particular, the performance schema instrumentation handle, PSI_server, and the performance schema itself would be cleaned up too soon, causing race conditions with a running kill server thread. The specifics of the race condition found are that: the main thread executing "PSI_server= NULL" can cause crashes in other threads still running, which are executing "if (PSI_server != NULL) PSI_server->xxx()" as part of the performance schema instrumentation. While the bug was reported for the kill server thread, in theory the same crash could happen with the signal thread, as found by code analysis. This fix re works the code in mysqld_main() used during shutdown, and in particular: - the performance schema is now shut down _after_ the signal thread is completed, - the performance schema is now shut down _after_ the kill server thread is completed, This code has been tested with running 300 times in a row the following tests in parallel, which have been known to fail with race conditions in the past: - rpl_change_master - binlog_max_extension - events_restart - rpl_heartbeat_basic and no crash of test failure has been seen with the changed code.
[8 Sep 2010 20:59]
Marc ALFF
After more analysis, the previous patch is abandoned. I think it is too complicated and too risky. Looking for a simpler solution
[9 Sep 2010 21:34]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/117942 3204 Marc Alff 2010-09-09 Bug#56324 Race Condition while shutting down MySQL: "PSI_server" Before this fix, the server could crash during shutdown, due to race conditions, that occured when killing the server. In particular, the performance schema instrumentation handle, PSI_server, and the performance schema itself would be cleaned up too soon, causing race conditions with a running kill server thread. The specifics of the race condition found are that: the main thread executing "PSI_server= NULL" can cause crashes in other threads still running, which are executing "if (PSI_server != NULL) PSI_server->xxx()" as part of the performance schema instrumentation. While the bug was reported for the kill server thread, in theory the same crash could happen with the signal thread, as found by code analysis. The correct fix would be to only shutdown the performance schema and set PSI_server to NULL after every other thread is guaranteed to be completed, including the kill_server_thread. However, due to the existing mysqld server design, this is not the case. See in particular bug number 56666. The work around used to fix this race condition is to simply not perform the call to shutdown_performance_schema() when the server exits, and to keep the PSI_server pointer unchanged. This will cause memory leaks to be reported by tools like valgrind, but no memory leak actually happen because the process is about to exit(). As a result, the file mysql-test/valgrind.supp has been updated to filter out these false positive messages. This code has been tested with running in a loop the following tests in parallel, which have been known to fail with race conditions in the past: - rpl_change_master - binlog_max_extension - events_restart - rpl_heartbeat_basic and no crash of test failure has been seen with the changed code.
[13 Sep 2010 23:36]
Christopher Powers
Fix looks good.
[14 Sep 2010 21:57]
Marc ALFF
Pushed into: - mysql-5.5-bugfixing - mysql-trunk-bugfixing - mysql-next-mr-bugfixing
[2 Oct 2010 18:12]
Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alexander.nozdrin@oracle.com-20101002180948-852x1cuv7c6i85ea) (version source revid:alexander.nozdrin@oracle.com-20101002180857-an32jpuwzemsp4f2) (merge vers: 5.6.1-m4) (pib:21)
[2 Oct 2010 18:13]
Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[2 Oct 2010 18:16]
Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (version source revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (merge vers: 5.5.7-rc) (pib:21)
[4 Nov 2010 1:19]
Paul DuBois
Noted in 5.5.7, 5.6.1 changelogs. The server could crash during shutdown due to a race condition relating to Performance Schema cleanup.

Description: Mysql sporadically crashes due to a race condition: Thread-1 (Performance schema) executes if (PSI_server) ---> PSI_server->delete_current_thread(); <--- (mysys/my_thr_init.c:409) and crashes because PSI_server is NULL. Thread-2 (main) concurrently executes if (PSI_server) { PSI_server->delete_current_thread(); PSI_server= NULL; } shutdown_performance_schema(); #endif mysqld_exit(0); <--- (sql/mysqld:4791) Stack Trace: ------------ thread 2 (process 21288): #0 0x0000003c3da0a687 in pthread_cond_timedwait@@glibc_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000db2c86 in safe_cond_timedwait (cond=0x19bfa40, mp=0x19bf940, abstime=0x7fff31c3f970, file=0x1036c48 "/export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/include/mysql/psi/mysql_thread.h", line=1089) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/thr_mutex.c:278 #2 0x0000000000da4d47 in inline_mysql_cond_timedwait (that=0x19bfa40, mutex=0x19bf940, abstime=0x7fff31c3f970, src_file=0x1036b30 "/export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c", src_line=257) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/include/mysql/psi/mysql_thread.h:1089 #3 0x0000000000da46e6 in my_thread_global_end () at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c:256 #4 0x0000000000d9cc1e in my_end (infoflag=0) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_init.c:245 #5 0x000000000053016e in mysqld_exit (exit_code=0) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1392 #6 0x000000000053ac8c in mysqld_main (argc=54, argv=0x9f91558) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:4791 #7 0x0000000000525d45 in main (argc=8, argv=0x7fff31c3fe08) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/main.cc:24 thread 1 (process 21442): #0 0x0000003c3da0b122 in pthread_kill () from /lib64/libpthread.so.0 #1 0x0000000000da7ebe in my_write_core (sig=11) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/stacktrace.c:326 #2 0x0000000000532332 in handle_segfault (sig=11) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:2502 #3 <signal handler called> #4 0x0000000000da512f in my_thread_end () at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c:409 #5 0x0000000000530e89 in unireg_end () at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1359 #6 0x0000000000539885 in kill_server (sig_ptr=0x0) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1286 #7 0x00000000005398cd in kill_server_thread (arg=0x415f90d8) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1309 #8 0x0000000000d46c25 in pfs_spawn_thread (arg=0xa802f20) at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/storage/perfschema/pfs.cc:1061 #9 0x0000003c3da062f7 in start_thread () from /lib64/libpthread.so.0 #10 0x0000003c3ced1b6d in clone () from /lib64/libc.so.6 How to repeat: #!/bin/bash for i in `seq 1 300`; do ./mtr --parallel=auto --mem rpl_change_master 2>&1 | tee output.log res=`grep fail output.log` if [ "!$res" != "!" ]; then exit $? fi done