MySQL Bugs: #38716: slave crashed after 'stop slave' during concurrent stop/start/reset slave

Bug #38716	slave crashed after 'stop slave' during concurrent stop/start/reset slave
Submitted:	11 Aug 2008 11:08	Modified:	13 May 2009 12:34
Reporter:	Shane Bester (Platinum Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	5.0.66a-debug, 5.1.26-debug	OS:	Any
Assigned to:	Andrei Elkin	CPU Architecture:	Any

Description:
related to bug #38715

while running stop/start slave and killing random connections on the slave, it crashed
like this:

mysqld-nt.exe!terminate_slave_thread(THD * thd=0x00000000 Line 710
mysqld-nt.exe!terminate_slave_threads  Line 658
mysqld-nt.exe!stop_slave               Line 915
mysqld-nt.exe!mysql_execute_command
mysqld-nt.exe!mysql_parse
mysqld-nt.exe!dispatch_command
mysqld-nt.exe!do_command
mysqld-nt.exe!handle_one_connection
mysqld-nt.exe!pthread_start

in terminate_slave_thread() we have
while (*slave_running)			// Should always be true
{
   DBUG_PRINT("loop", ("killing slave thread"));
   pthread_mutex_lock(&thd->LOCK_delete); <--- crash, thd is null

How to repeat:
have a replicating slave. in two threads execute:

stop slave
reset slave
start slave
kill <random id>

full stack trace from 5.0.66a

Attachment: bug38716_some_info.txt (text/plain), 1.17 KiB.

to repeat:
--------------
setup a self-replicating debug master:

mysqld-debug  --console --skip-grant-tables --server-id=5 --log-bin --port=3306 --replicate-same-server-id  --slave-skip-errors=1050 --skip-innodb --skip-name-resolve --slave-net-timeout=1 --net-read-timeout=1 --net-write-timeout=1

change master to master_host='127.0.0.1', master_port=3306, master_user='root', master_password='';
start slave;

run the attached bug38716.c against the server.  if it hangs, kill server and start again (although the hang could be considered part of the bug too)

testcase that will expose a few different assertions if run as described above.

Attachment: bug38716.c (text/plain), 7.43 KiB.

infinite hangs may also be seen instead of crashes.

Shane Bester wrote:
> The following may very well all be part of the same overall bug:
>
> Bug#38240
> Bug#38715
> Bug#38716
>
> The above 3 all relate to various locking/mutex issues in the
> replication code during concurrent flush logs/stop/start
> slave/reset slave.
>
> A reason for different bug reports was slightly different crashes in
> each case.  I'm guessing the devs will eventually set them to
> duplicates and fix everything in one go.

hi,
  Does Bug #31374 relate to this bug too?

Rancpine, thanks for the suggestion.

Indeed  Bug #31374 failure description fits to this bug.
bug38716.c triggers an assert in  safe_mutex_lock () called from
terminate_slave_thread().

The happens on the 6.0 server as well.

thd == 0 could not be caught with my tests on 6.0 version.
There have been two issues found:

1. terminate_slave_thread tried to acquire run_lock mutex for the 2nd
   time when  start_slave_threads() executed

    if (error)
      terminate_slave_threads(mi, thread_mask & SLAVE_IO, 1);

  Error must have caused by killing (bug38716.c).
  In this case the stack is close to the reported but it contains start_slave()

2. a deadlock reported as a separate Bug #44312

It's not clear if it was really thd == 0 in the original report, as
it's not explained a method to find out the value. If that was an
assert crash we should think on other reason. I examined 5.0.66 code
to find out that slave.cc:710 is
    IF_DBUG(int err= ) pthread_kill(thd->real_id, thr_client_alarm);
and it follows with
    DBUG_ASSERT(err != EINVAL);

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72381

2842 Andrei Elkin	2009-04-17
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      the issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from 
       or it will face an assert in terminate_slave_thread()
      because of the term_lock.
      No other crashes has been found using the regression test program.
      Another issues is a deadlock as described separately in Bug #44312.
      It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in reverse order.
      
      Fixed:
      terminate_slave_threads() does not request locking in start_slave_threads();
      start_slave() grabs run_lock:s for lesser time that avoids the deadlock
      with IO thread.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72790

2842 Andrei Elkin	2009-04-24
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      the issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from start_slave_threads(); that leads to an assertion on run_lock mutex.
      No other crashes has been found using the regression test program.
      Another issues is a deadlock as described separately in Bug #44312.
      It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in
      reverse order.
      
      Fixed:
      
      terminate_slave_threads() does not request locking in start_slave_threads();
      
      rotate_relay_log() does not acquire mi->run_lock which is safe. 
      The rli->inited is not guarded by this mutex, and locking of the mutex 
      in the function contradicts the safe pattern of locking with run_lock.

Notice, the reported stack occured with 5.0 version that has not got
yet bug#29968 fixes (unguarded access to slave_running status
variable, that is a matter of bug#38694 that is about to commit). 
Therefore 6.0 is much more stable to the stress testing
(the last patch sustained it about half an hour).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72858

2842 Andrei Elkin	2009-04-28
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      The issue in terminate_slave_threads() of bug#38716 was reproduced in a slighly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from slave_thread()->start_slave_threads(); 
      that leads to an assertion on run_lock mutex.
      OTH, init_slave()->start_slave_threads() path requires terminate_slave_threads() to
      be invoked with skip_lock == false.
      
      No other crashes has been found using the regression stress test program.
      Another issues is a deadlock as described separately in Bug #44312.
      It was caused by grabbing two mutexes by IO thread and STARTing SLAVE thread in
      reverse order.
      
      Fixed:
      
      terminate_slave_threads() does not request locking in start_slave_threads()
      when it's called from start_slave() and does it when it's called from init_slave();
      
      rotate_relay_log() does not acquire mi->run_lock which is safe. 
      The rli->inited is not guarded by this mutex, and locking of the mutex 
      in the function contradicts the safe pattern of locking with run_lock.

Pushed into 6.0.12-alpha (revid:alik@sun.com-20090513032549-rxa73jbxd1qv09xc) (version source revid:aelkin@mysql.com-20090427211821-4lne3342gva8ghzt) (merge vers: 6.0.11-alpha) (pib:6)

Documented bugfix in the 6.0.12 changelog as follows:

        Issuing concurrent STOP SLAVE, START SLAVE, and RESET SLAVE statements
        using different connections caused the replication slave to crash.

Also documented in the 5.6.0 changelog. See BUG#48463.

Moved 5.6.0 changelog entry to 5.5.3.