MySQL Bugs: #44467: init_slave() and end

Bug #44467	init_slave() and end_slave() race condition
Submitted:	24 Apr 2009 18:30	Modified:	13 Jul 2009 4:22
Reporter:	Andrei Elkin	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.0+	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	replication slave start-stop

Description:
The thread running init_slave() does not wait for the being started 
slave threads' signalling. It unlocks LOCK_active_mi to let 
the end_slave() running terminator to find the slave threads
running status as 0, i.e unchanged yet from the inititial value.
The terminator in that case goes straight to delete active_mi without any ado.
The slave threads will crash e.g at trying to lock run_lock that belong in
the deleted active_mi.

How to repeat:
There is no deterministic test in how to repeat. It can be simulated
as the following:

1. start a self-replicating server with gdb e.g
shell> gdb mysqld
(gdb) set args  --console --skip-grant-tables --server-id=5 --log-bin \
--port=3306 --replicate-same-server-id --skip-name-resolve \
 --slave-net-timeout=1 --net-read-timeout=1 --net-write-timeout=1

2. set br in init_slave()
3. set br in end_slave()
4. set br in handle_slave_sql()
5. run
6. finish init_slave()
   don't let the slave sql thread to proceed from the 4th br point
   forcing it to stay (e.g call sleep(1))
7. run mysqladmin shutdown
   to get to end_slave()
   finish end_slave()
8. release the slave sql thread from the 4th br point anchor.
   The slave sql should come to rli->run_lock and segfault/assert.

 

Suggested fix:
Alas it's not possible to call start_slave_threads() with 
the wait == 1 in init_slave() because of 

#4  <signal handler called>
#5  0x08216ecc in THD::enter_cond (this=0x0, cond=0x892be58, mutex=0x892bde8, 
    msg=0x8647e18 "Waiting for slave thread to start") at sql_class.h:1653
#6  0x0838cfee in start_slave_thread (h_func=0x83886b9 <handle_slave_io>, 
    start_lock=0x892bde8, cond_lock=0x892bde8, start_cond=0x892be58, 
    slave_running=0x892dae0, slave_run_id=0x892dae4, mi=0x892b068, 
    high_priority=true) at slave.cc:827
#7  0x0838d7c8 in start_slave_threads (need_slave_mutex=true, 
    wait_for_start=true, mi=0x892b068, 
    master_info_fname=0x85d5e70 "master.info", 
    slave_info_fname=0x85d5e7c "relay-log.info", thread_mask=3) at slave.cc:876
#8  0x0838e2ab in init_slave () at slave.cc:186

Adding a new slave_status value to be set in the init_slave() running
thread might be helpful.
The end_slave() running shutdown thread respects any non-zero
slave_status to wait till it gets reset.

When reporting, I acknowledge that it's pretty hard to hit this bug in
practice. A combination of events such as start server, its almost
instant shutdown and scheduling like in how-to-repeat is not that
probable.

See also bug #33864

I will set this to verified because Andrei already gave all necessary informations here.