MySQL Bugs: #38240: crash in rotate_relay_log during 'flush logs' and 'kill' or shutdown of server

Bug #38240	crash in rotate_relay_log during 'flush logs' and 'kill' or shutdown of server
Submitted:	19 Jul 2008 15:49	Modified:	17 Mar 2010 14:00
Reporter:	Philip Stoev	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.0.66a-debug, 5.0.89, 5.1.26-debug, 5.1.30	OS:	Any
Assigned to:	Andrei Elkin	CPU Architecture:	Any
Tags:	active_mi

Description:
When executing a replication workload involving FLUSH LOGS, mysqld crashed as follows:

#0  0x00110416 in __kernel_vsyscall ()
#1  0x00581c78 in pthread_kill () from /lib/libpthread.so.0
#2  0x0857895b in my_write_core (sig=11) at stacktrace.c:310
#3  0x08245c75 in handle_segfault (sig=11) at mysqld.cc:2507
#4  <signal handler called>
#5  0x0857c9c1 in safe_mutex_lock (mp=0x11ac, try_lock=0 '\0', file=0x86c6f2b "slave.cc", line=4019) at thr_mutex.c:97
#6  0x083bfb11 in rotate_relay_log (mi=0x0) at slave.cc:4019
#7  0x08250963 in reload_acl_and_cache (thd=0xa6b9888, options=2, tables=0x0, write_to_binlog=0xb27d8ad7) at sql_parse.cc:6521
#8  0x0825c48b in mysql_execute_command (thd=0xa6b9888) at sql_parse.cc:3748
#9  0x0825f95c in mysql_parse (thd=0xa6b9888, inBuf=0xa6bb220 "FLUSH LOGS", length=10, found_semicolon=0xb27d925c) at sql_parse.cc:5650
#10 0x08260533 in dispatch_command (command=COM_QUERY, thd=0xa6b9888, packet=0xa6c2901 "FLUSH LOGS", packet_length=10) at sql_parse.cc:1137
#11 0x08261692 in do_command (thd=0xa6b9888) at sql_parse.cc:794
#12 0x0824ede6 in handle_one_connection (arg=0xa6b9888) at sql_connect.cc:1115
#13 0x0057d32f in start_thread () from /lib/libpthread.so.0
#14 0x0049a27e in clone () from /lib/libc.so.6

The crash is here:

92
93
94      int safe_mutex_lock(safe_mutex_t *mp, my_bool try_lock, const char *file, uint line)
95      {
96        int error;
97        if (!mp->file)
98        {
99          fprintf(stderr,
100                 "safe_mutex: Trying to lock unitialized mutex at %s, line %d\n",
101                 file, line);

(gdb) print mp
$1 = (safe_mutex_t *) 0x11ac
(gdb) print mp->file
Cannot access memory at address 0x11dc

(gdb) frame 6
#6  0x083bfb11 in rotate_relay_log (mi=0x0) at slave.cc:4019
4019      pthread_mutex_lock(&mi->run_lock);
(gdb) print mi
$4 = (Master_info *) 0x0

How to repeat:
If this happens again, a test case will be provided.

Philip, I got this crash on 5.0.66a.  

I ran FLUSH LOGS 20 times per second while doing inserts and then 'mysqladmin shutdown' killed the server. Looks like a race condition with mutex lock/free ?

mi was null in my case:

mysqld-debug.exe!rotate_relay_log(st_master_info * mi=0x00000000)  Line 5266
mysqld-debug.exe!reload_acl_and_cache  Line 7099
mysqld-debug.exe!mysql_execute_command Line 4271
mysqld-debug.exe!mysql_parse  Line 6178
mysqld-debug.exe!dispatch_command Line 1875
mysqld-debug.exe!do_command Line 1581
mysqld-debug.exe!handle_one_connection
mysqld-debug.exe!pthread_start

I meant mi is freed before the mutex is not needed anymore..

Shane, many thanks for reproducing this. Can you please attach a test case and then set this bug to Open/Verified? I got this crash only once, that is why the bug is in Analyzing.

Thank you!

Shane, I am transferring this bug back to you. Please attach the test case you have and then set it to "Open" or "Verified". Thank you.

to repeat:
---------------

setup a single debug mysql server to replicate to itself:

mysqld-debug  --console --skip-grant-tables --server-id=5 --log-bin --port=3306 --replicate-same-server-id  --slave-skip-errors=1050 --skip-innodb

change master to master_host='127.0.0.1', master_port=3306, master_user='root', master_password='';
start slave;

run the attached bug38240.c testcase.  while it runs, execute 'mysqladmin shutdown'.  if no crash happens, start the server and retry it. it's timing related.

testcase that will expose a few different assertions if run as described above.

Attachment: bug38240.c (text/plain), 5.81 KiB.

Shane Bester wrote:
> The following may very well all be part of the same overall bug:
>
> Bug#38240
> Bug#38715
> Bug#38716
>
> The above 3 all relate to various locking/mutex issues in the
> replication code during concurrent flush logs/stop/start
> slave/reset slave.
>
> A reason for different bug reports was slightly different crashes in
> each case.  I'm guessing the devs will eventually set them to
> duplicates and fix everything in one go.

still happens in 5.1.30 release build:

0000000078EF1254    ntdll.dll!RtlEnterCriticalSection()
000000014012E6EA    mysqld.exe!rotate_relay_log()[slave.cc:4025]
0000000140198EB5    mysqld.exe!reload_acl_and_cache()[sql_parse.cc:6668]
00000001401A03A9    mysqld.exe!mysql_execute_command()[sql_parse.cc:3859]
00000001401A2F06    mysqld.exe!mysql_parse()[sql_parse.cc:5791]
00000001401A3C1A    mysqld.exe!dispatch_command()[sql_parse.cc:1202]
00000001401A4CD7    mysqld.exe!do_command()[sql_parse.cc:857]
0000000140246327    mysqld.exe!handle_one_connection()[sql_connect.cc:1115]
00000001402B82C5    mysqld.exe!pthread_start()[my_winthread.c:85]
00000001403CAC37    mysqld.exe!_callthreadstart()[thread.c:295]
00000001403CAD05    mysqld.exe!_threadstart()[thread.c:275]
0000000077D6B69A    kernel32.dll!BaseThreadStart()
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 00000000033A68B0=flush logs
thd->thread_id=2
thd->killed=KILL_CONNECTION

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76404

2801 Andrei Elkin	2009-06-16
      Bug #38240  Crash in safe_mutex_lock () thr_mutex.c line 97 on rotate_relay_log ()
      
      The reason for the crash was rotate_relay_log (mi=0x0) did not verify the passed
      value of mi, but might have returned early seeing it as NULL.
      
      Fixed with deploying a check of the mi arg value.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76406

2802 Andrei Elkin	2009-06-16
      Bug #38240  Crash in safe_mutex_lock () thr_mutex.c line 97 on rotate_relay_log
      
      deploying an assert to guard the only possible mi == NULL case may be as
      a result of killing of running rotate_relay_log() thread

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76414

2801 Andrei Elkin	2009-06-16
      Bug #38240 Crash in safe_mutex_lock () thr_mutex.c line 97 on rotate_relay_log 
      
      The reason for the crash was rotate_relay_log (mi=0x0) did not verify the passed
      value of active_mi. 
      There are more cases where active_mi is supposed to be non-zero e.g
      change_master(), stop_slave(), and it's reasonable
      to protect from a similar crash all of them with common fixes.
      
      Fixed with spliting end_slave() on slave threads release and slave data
      clean-up parts (a new close_active_mi()). The new function is invoked
      at the very end of close_connections() so that all users of active_mi
      are proven to have left.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76430

2801 Andrei Elkin	2009-06-17
      Bug #38240 Crash in safe_mutex_lock () thr_mutex.c line 97 on rotate_relay_log 
            
      The reason for the crash was rotate_relay_log (mi=0x0) did not verify the passed
      value of active_mi. 
      There are more cases where active_mi is supposed to be non-zero e.g
      change_master(), stop_slave(), and it's reasonable
      to protect from a similar crash all of them with common fixes.
            
      Fixed with spliting end_slave() on slave threads release and slave data
      clean-up parts (a new close_active_mi()). The new function is invoked
      at the very end of close_connections() so that all users of active_mi
      are proven to have left.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/76882

2972 Andrei Elkin	2009-06-23
      Bug #38240 Crash in safe_mutex_lock () thr_mutex.c line 97 on rotate_relay_log 
                  
      The reason for the crash was rotate_relay_log (mi=0x0) did not verify
      the passed value of active_mi.  There are more cases where active_mi
      is supposed to be non-zero e.g change_master(), stop_slave(), and it's
      reasonable to protect from a similar crash all of them with common
      fixes.
                  
      Fixed with spliting end_slave() in slave threads release and slave
      data clean-up parts (a new close_active_mi()). The new function is
      invoked at the very end of close_connections() so that all users of
      active_mi are proven to have left.

Pushed into 5.1.37 (revid:joro@sun.com-20090708131116-kyz8iotbum8w9yic) (version source revid:aelkin@mysql.com-20090623091004-x0q3ig5ktrwf7skf) (merge vers: 5.1.37) (pib:11)

Pushed into 5.1.37 (revid:joro@sun.com-20090708131116-kyz8iotbum8w9yic) (version source revid:aelkin@mysql.com-20090623091004-x0q3ig5ktrwf7skf) (merge vers: 5.1.37) (pib:11)

Documented in the 5.1.37 changelog as follows:

        In certain cases, executing a great many FLUSH LOGS 
        statements during replication could cause the master to 
        crash.

Pushed into 5.4.4-alpha (revid:anozdrin@bk-internal.mysql.com-20090710111017-bnh2cau84ug1hvei) (version source revid:aelkin@mysql.com-20090623113739-wmdndanxrv5723gu) (merge vers: 5.4.4-alpha) (pib:11)

Pushed into 5.1.37-ndb-7.0.8 (revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (version source revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)

Pushed into 5.1.37-ndb-6.3.27 (revid:jonas@mysql.com-20090826105955-bkj027t47gfbamnc) (version source revid:jonas@mysql.com-20090826105955-bkj027t47gfbamnc) (merge vers: 5.1.37-ndb-6.3.27) (pib:11)

Pushed into 5.1.37-ndb-6.2.19 (revid:jonas@mysql.com-20090825194404-37rtosk049t9koc4) (version source revid:jonas@mysql.com-20090825194404-37rtosk049t9koc4) (merge vers: 5.1.37-ndb-6.2.19) (pib:11)

Pushed into 5.1.35-ndb-7.1.0 (revid:magnus.blaudd@sun.com-20090827163030-6o3kk6r2oua159hr) (version source revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)

Can this really crash the master? Looks like something that can really only affect the slave?

James, I think you are right, the reference to master in the changelog entry is probably a typo. It should only affect servers cofigured as slaves.

The changelog entry currently reads,

"Replication: Shutting down the server while executing FLUSH LOGS, CHANGE MASTER TO, or STOP SLAVE could sometimes cause mysqld to crash. (Bug#38240)"

It doesn't say anything about the master. Would you prefer I change "the server" and/or "mysqld" to "the slave"?

Thanks!

still affects 5.0.89, both release and debug binaries crash during shutdown:

ntdll.dll!RtlEnterCriticalSection()
mysqld-debug.exe!rotate_relay_log()[slave.cc:5345]
mysqld-debug.exe!reload_acl_and_cache()[sql_parse.cc:7394]
mysqld-debug.exe!mysql_execute_command()[sql_parse.cc:4521]
mysqld-debug.exe!mysql_parse()[sql_parse.cc:6459]
mysqld-debug.exe!dispatch_command()[sql_parse.cc:1959]
mysqld-debug.exe!do_command()[sql_parse.cc:1640]
mysqld-debug.exe!handle_one_connection()[sql_parse.cc:1233]
mysqld-debug.exe!pthread_start()[my_winthread.c:85]
mysqld-debug.exe!_threadstart()[thread.c:196]
kernel32.dll!FlsSetValue()
t some variables.
s may be invalid and cause the dump to abort...
t 053F37D8=flush logs

Jon, yes, please say "replication slave" or similar to make it unambiguous about what can be affect.

Jon, I comfirm James' suggestion. cheers, Andrei.

Modified changelog entry, which now reads:

        Shutting down the slave while executing FLUSH LOGS, CHANGE 
        MASTER TO, or STOP SLAVE could sometimes cause it to crash.

Set Need Merge status, because (a) it appears that this was supposed go into 5.0 as well, but never made it there, and (b) if this affects later versions, then it should be merged to them as well.

See also: BUG#50364.

Appears no further changelog entries should be required, so I'm closing.

Thanks!