MySQL Bugs: #56324: Race Condition while shutting down MySQL: "PSI

Bug #56324	Race Condition while shutting down MySQL: "PSI_server"
Submitted:	27 Aug 2010 11:06	Modified:	4 Nov 2010 1:19
Reporter:	Alfranio Tavares Correia Junior	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Performance Schema	Severity:	S3 (Non-critical)
Version:	next-mr, next-mr-bugfixing	OS:	Any
Assigned to:	Marc ALFF	CPU Architecture:	Any
Tags:	mutex, performance schema, race condition

Description:
Mysql sporadically crashes due to a race condition:

Thread-1 (Performance schema) executes

   if (PSI_server)
     ---> PSI_server->delete_current_thread(); <--- (mysys/my_thr_init.c:409)

and crashes because PSI_server is NULL.

Thread-2 (main) concurrently executes

  if (PSI_server)
  {
    PSI_server->delete_current_thread();
    PSI_server= NULL; 
  }
  shutdown_performance_schema();
#endif
  mysqld_exit(0); <--- (sql/mysqld:4791)

Stack Trace:
------------

thread 2 (process 21288):
#0 0x0000003c3da0a687 in pthread_cond_timedwait@@glibc_2.3.2 ()
from /lib64/libpthread.so.0
#1 0x0000000000db2c86 in safe_cond_timedwait (cond=0x19bfa40, mp=0x19bf940,
abstime=0x7fff31c3f970,
file=0x1036c48 "/export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/include/mysql/psi/mysql_thread.h", line=1089)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/thr_mutex.c:278
#2 0x0000000000da4d47 in inline_mysql_cond_timedwait (that=0x19bfa40,
mutex=0x19bf940, abstime=0x7fff31c3f970,
src_file=0x1036b30 "/export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c", src_line=257)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/include/mysql/psi/mysql_thread.h:1089
#3 0x0000000000da46e6 in my_thread_global_end ()
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c:256
#4 0x0000000000d9cc1e in my_end (infoflag=0)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_init.c:245
#5 0x000000000053016e in mysqld_exit (exit_code=0)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1392
#6 0x000000000053ac8c in mysqld_main (argc=54, argv=0x9f91558)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:4791
#7 0x0000000000525d45 in main (argc=8, argv=0x7fff31c3fe08)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/main.cc:24
thread 1 (process 21442):
#0 0x0000003c3da0b122 in pthread_kill () from /lib64/libpthread.so.0
#1 0x0000000000da7ebe in my_write_core (sig=11)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/stacktrace.c:326
#2 0x0000000000532332 in handle_segfault (sig=11)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:2502
#3 <signal handler called>
#4 0x0000000000da512f in my_thread_end ()
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/mysys/my_thr_init.c:409
#5 0x0000000000530e89 in unireg_end ()
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1359
#6 0x0000000000539885 in kill_server (sig_ptr=0x0)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1286
#7 0x00000000005398cd in kill_server_thread (arg=0x415f90d8)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/sql/mysqld.cc:1309
#8 0x0000000000d46c25 in pfs_spawn_thread (arg=0xa802f20)
at /export/home3/pb2/build/sb_2-none-1281610976.58/mysql-next-mr-bugfixing-gcov/storage/perfschema/pfs.cc:1061
#9 0x0000003c3da062f7 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003c3ced1b6d in clone () from /lib64/libc.so.6

How to repeat:
#!/bin/bash

for i in `seq 1 300`; do
  ./mtr --parallel=auto --mem rpl_change_master 2>&1 | tee output.log
  res=`grep fail output.log`
  if [ "!$res" != "!" ]; then
    exit $?
  fi
done

See also http://tinyurl.com/2am8a7b

Possibly related to bug#55105

Verified on Fedora 13 X86_64:

rpl.rpl_change_master 'row'              w3 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2010-08-27 11:02:59
line
Attempting backtrace. You can use the following information to find out
^ Found warnings in /home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/mysqld.2.err
ok

 - saving '/home/miguel/bzr/mr-bugfix/mysql-test/var/3/log/rpl.rpl_change_master-row/' to '/home/miguel/bzr/mr-bugfix/mysql-test/var/log/rpl.rpl_change_master-row/'
 - found 'core.28822' (0/5)

Trying 'dbx' to get a backtrace

Trying 'gdb' to get a backtrace
Core generated by '/home/miguel/bzr/mr-bugfix/sql/mysqld'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/b7/82f01bb2e2b191a950d5202e5d809f4c3f4ac8
[New Thread 29005]
[New Thread 28822]
Core was generated by `/home/miguel/bzr/mr-bugfix/sql/mysqld --defaults-group-suffix=.2 --defaults-fil'.
Program terminated with signal 11, Segmentation fault.
#0  0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0
#0  0x000000334320c61c in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000000a40346 in my_write_core (sig=11) at /home/miguel/bzr/mr-bugfix/mysys/stacktrace.c:328
#2  0x000000000054b394 in handle_segfault (sig=11) at /home/miguel/bzr/mr-bugfix/sql/mysqld.cc:2501
#3  <signal handler called>

Alfranio,

with the script given, I also run into a different failure:

===== Run number 44 ====
Logging: ./mtr  --parallel=auto --mem rpl_change_master
100827 17:38:05 [Note] Plugin 'FEDERATED' is disabled. 
MySQL Version 5.5.6                                    
Checking supported features...                         
 - skipping ndbcluster, mysqld not compiled with ndbcluster
 - SSL connections supported                               
 - binaries are debug compiled                             
Collecting tests...                                        
 - adding combinations for rpl                             
vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var
Checking leftover processes...                                       
Removing old var directory...                                        
Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'...
 - symlinking 'var' to '/dev/shm/var_auto_py0V'                                          
Installing system database...                                                            
Using parallel: 3                                                                        
Using server port 43494                                                                  
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009                   

==============================================================================

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039
worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029
rpl.rpl_change_master 'stmt'             w3 [ pass ]     46           
rpl.rpl_change_master 'row'              w1 [ pass ]     63           
rpl.rpl_change_master 'mix'              w2 [ pass ]     60           
------------------------------------------------------------          
The servers were restarted 0 times                                    
Spent 0.169 of 12 seconds executing testcases                         

Completed: All 3 tests were successful.

===== Run number 45 ====
Logging: ./mtr  --parallel=auto --mem rpl_change_master
100827 17:38:18 [Note] Plugin 'FEDERATED' is disabled. 
MySQL Version 5.5.6                                    
Checking supported features...                         
 - skipping ndbcluster, mysqld not compiled with ndbcluster
 - SSL connections supported                               
 - binaries are debug compiled                             
Collecting tests...                                        
 - adding combinations for rpl                             
vardir: /home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var
Checking leftover processes...                                       
Removing old var directory...                                        
Creating var directory '/home/malff/BZR_TREE/mysql-5.5-bugfixing-56324/mysql-test/var'...
 - symlinking 'var' to '/dev/shm/var_auto_FNEg'                                          
Installing system database...                                                            
Using parallel: 3                                                                        
Using server port 54559                                                                  

==============================================================================

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 13030..13039
worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 13020..13029
rpl.rpl_change_master 'stmt'             w1 [ pass ]     55           
rpl.rpl_change_master 'mix'              w2 [ pass ]     54           
rpl.rpl_change_master 'row'              w3 [ fail ]                  
        Test ended at 2010-08-27 17:38:25                             

CURRENT_TEST: rpl.rpl_change_master
mysqltest: At line 83: query 'CHANGE MASTER TO relay_log_file='slave-relay-bin.000005', relay_log_pos=4' failed: 1380: Failed initializing relay log position: Could not find target log during relay log initialization                                                   

===========

I agree there is a race condition related to PSI_server,
but there might also be something else in the rpl.rpl_change_master
test case.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117535

3195 Marc Alff	2010-09-03
      Bug#56324 Race Condition while shutting down MySQL: "PSI_server"
      
      Before this fix, the server could crash during shutdown,
      due to race conditions, that occured when killing the server.
      
      In particular, the performance schema instrumentation handle,
      PSI_server, and the performance schema itself would be cleaned up
      too soon, causing race conditions with a running kill server thread.
      
      The specifics of the race condition found are that:
      the main thread executing "PSI_server= NULL" can cause crashes in
      other threads still running, which are executing
      "if (PSI_server != NULL) PSI_server->xxx()"
      as part of the performance schema instrumentation.
      
      While the bug was reported for the kill server thread,
      in theory the same crash could happen with the signal thread,
      as found by code analysis.
      
      This fix re works the code in mysqld_main() used during shutdown,
      and in particular:
      - the performance schema is now shut down _after_ the signal thread
      is completed,
      - the performance schema is now shut down _after_ the kill server thread
      is completed,
      
      This code has been tested with running 300 times in a row the following
      tests in parallel, which have been known to fail with race conditions
      in the past:
      - rpl_change_master
      - binlog_max_extension
      - events_restart
      - rpl_heartbeat_basic
      and no crash of test failure has been seen with the changed code.

After more analysis, the previous patch is abandoned.
I think it is too complicated and too risky.
Looking for a simpler solution

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117942

3204 Marc Alff	2010-09-09
      Bug#56324 Race Condition while shutting down MySQL: "PSI_server"
      
      Before this fix, the server could crash during shutdown,
      due to race conditions, that occured when killing the server.
      
      In particular, the performance schema instrumentation handle,
      PSI_server, and the performance schema itself would be cleaned up
      too soon, causing race conditions with a running kill server thread.
      
      The specifics of the race condition found are that:
      the main thread executing "PSI_server= NULL" can cause crashes in
      other threads still running, which are executing
      "if (PSI_server != NULL) PSI_server->xxx()"
      as part of the performance schema instrumentation.
      
      While the bug was reported for the kill server thread,
      in theory the same crash could happen with the signal thread,
      as found by code analysis.
      
      The correct fix would be to only shutdown the performance schema
      and set PSI_server to NULL after every other thread is guaranteed
      to be completed, including the kill_server_thread.
      
      However, due to the existing mysqld server design, this is not the case.
      See in particular bug number 56666.
      
      The work around used to fix this race condition is to simply not
      perform the call to shutdown_performance_schema() when the server exits,
      and to keep the PSI_server pointer unchanged.
      
      This will cause memory leaks to be reported by tools like valgrind,
      but no memory leak actually happen because the process is about to exit().
      
      As a result, the file mysql-test/valgrind.supp has been updated
      to filter out these false positive messages.
      
      This code has been tested with running in a loop the following
      tests in parallel, which have been known to fail with race conditions
      in the past:
      - rpl_change_master
      - binlog_max_extension
      - events_restart
      - rpl_heartbeat_basic
      and no crash of test failure has been seen with the changed code.

Fix looks good.

Pushed into:
- mysql-5.5-bugfixing
- mysql-trunk-bugfixing
- mysql-next-mr-bugfixing

Pushed into mysql-trunk 5.6.1-m4 (revid:alexander.nozdrin@oracle.com-20101002180948-852x1cuv7c6i85ea) (version source revid:alexander.nozdrin@oracle.com-20101002180857-an32jpuwzemsp4f2) (merge vers: 5.6.1-m4) (pib:21)

Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)

Pushed into mysql-5.5 5.5.7-rc (revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (version source revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (merge vers: 5.5.7-rc) (pib:21)

Noted in 5.5.7, 5.6.1 changelogs.

The server could crash during shutdown due to a race condition
relating to Performance Schema cleanup.