Bug #44312 deadlock between IO thread and SLAVE START
Submitted: 16 Apr 2009 9:47
Modified: 13 May 2009 12:48
Reporter: Andrei Elkin
Status: Closed
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 5.0
OS: Any
Assigned to: Andrei Elkin
CPU Architecture: Any
Triage: Triaged: D1 (Critical)

[16 Apr 2009 9:47] Andrei Elkin
Description:
start slave;
stop slave sql_thread;
start slave;

can lead to a deadlock between the user thread and the IO thread.
Code analysis shows that the two threads acquire the same pair of
mutexes in opposite order.

The order for the IO thread is mi->data_lock, then mi->run_lock
(process_io_rotate() -> rotate_relay_log()),
and the reverse for the user thread (start_slave() -> init_info()).
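
The inversion described above can be illustrated with a small check over the
acquisition orders. This is only a sketch: the lock names and call paths come
from the report, while the thread labels and the helper functions
(held_before, find_inversions) are hypothetical and not part of the server
code.

```python
from itertools import combinations

# Lock acquisition order per thread, as described in the report.
# The thread labels are illustrative; the lock names follow the report.
ACQUISITION_ORDER = {
    "io_thread": ["mi->data_lock", "mi->run_lock"],    # process_io_rotate() -> rotate_relay_log()
    "user_thread": ["mi->run_lock", "mi->data_lock"],  # start_slave() -> init_info()
}


def held_before(order):
    """Return the set of (earlier, later) lock pairs implied by one acquisition order."""
    return set(combinations(order, 2))


def find_inversions(orders):
    """Return lock pairs that two different threads acquire in opposite order."""
    inversions = []
    for (name_a, order_a), (name_b, order_b) in combinations(orders.items(), 2):
        pairs_b = held_before(order_b)
        for first, second in held_before(order_a):
            if (second, first) in pairs_b:
                inversions.append((name_a, name_b, first, second))
    return inversions


if __name__ == "__main__":
    for a, b, first, second in find_inversions(ACQUISITION_ORDER):
        print(f"{a} takes {first} before {second}; {b} takes them in reverse")
```

Any pair reported by find_inversions is a candidate for a deadlock: each
thread can end up holding one mutex while waiting for the other.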

The deadlock can be detected by running the bug#38716 regression test
(provided that a patch fixing the assert reported in that bug is
applied).

How to repeat:
Build and run the bug38716 program against a server started and
configured per the bug#38716 instructions. To quote:

   setup a debug build master replicating to itself:

   mysqld-debug  --console --skip-grant-tables --server-id=5 --log-bin --port=3306
   --replicate-same-server-id  --slave-skip-errors=1050 --skip-innodb

   change master to master_host='127.0.0.1', master_port=3306, master_user='root',
   master_password='';
   start slave;

   then run the attached bug38716.c testcase against the server.

Suggested fix:
Analyze the ordering issue. As the IO thread is not supposed to
restart, it might be that acquiring one of the mutexes by the thread
executing START SLAVE is unnecessary.
[17 Apr 2009 10:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72381

2842 Andrei Elkin	2009-04-17
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      The issue in terminate_slave_threads() of bug#38716 was reproduced in a slightly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from 
       or it will face an assert in terminate_slave_thread()
      because of the term_lock.
      No other crashes have been found using the regression test program.
      Another issue is a deadlock, described separately in Bug #44312.
      It was caused by the IO thread and the thread executing START SLAVE
      grabbing two mutexes in reverse order.
      
      Fixed:
      terminate_slave_threads() does not request locking in start_slave_threads();
      start_slave() holds the run_locks for a shorter time, which avoids the deadlock
      with the IO thread.
[24 Apr 2009 16:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72790

2842 Andrei Elkin	2009-04-24
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      The issue in terminate_slave_threads() of bug#38716 was reproduced in a slightly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from start_slave_threads(); that leads to an assertion on the run_lock mutex.
      No other crashes have been found using the regression test program.
      Another issue is a deadlock, described separately in Bug #44312.
      It was caused by the IO thread and the thread executing START SLAVE
      grabbing two mutexes in reverse order.
      
      Fixed:
      
      terminate_slave_threads() does not request locking in start_slave_threads();
      
      rotate_relay_log() no longer acquires mi->run_lock, which is safe.
      rli->inited is not guarded by this mutex, and locking the mutex
      in the function contradicts the safe locking pattern for run_lock.
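
The effect of dropping run_lock from the rotate path can be sketched with two
threads. This is a minimal model under stated assumptions, not the server
code: run_lock and data_lock stand in for mi->run_lock and mi->data_lock, and
user_thread, io_thread and run_both are hypothetical names. Because the
IO-side path takes only one mutex, no circular wait between the two paths is
possible.

```python
import threading

run_lock = threading.Lock()   # stands in for mi->run_lock
data_lock = threading.Lock()  # stands in for mi->data_lock
completed = []


def user_thread(iterations):
    """Models the START SLAVE path: run_lock first, then data_lock."""
    for _ in range(iterations):
        with run_lock:
            with data_lock:
                pass  # init_info()-like work would happen here
    completed.append("user")


def io_thread(iterations):
    """Models the rotate path after the fix: data_lock only, never run_lock."""
    for _ in range(iterations):
        with data_lock:
            pass  # rotate_relay_log()-like work would happen here
    completed.append("io")


def run_both(iterations=1000, timeout=10.0):
    """Run both paths concurrently; return True if both finish within the timeout."""
    completed.clear()
    threads = [
        threading.Thread(target=user_thread, args=(iterations,), daemon=True),
        threading.Thread(target=io_thread, args=(iterations,), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    return sorted(completed) == ["io", "user"]


if __name__ == "__main__":
    print("both paths finished:", run_both())
```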
[27 Apr 2009 21:19] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72858

2842 Andrei Elkin	2009-04-28
      Bug #38716 slave crashed after 'stop slave' during concurrent stop/start/reset
      Bug #44312  deadlock between IO thread and SLAVE START
      
      The issue in terminate_slave_threads() of bug#38716 was reproduced in a slightly
      different form. terminate_slave_threads() should not acquire run_lock when it's
      called from slave_thread()->start_slave_threads();
      that leads to an assertion on the run_lock mutex.
      On the other hand, the init_slave()->start_slave_threads() path requires
      terminate_slave_threads() to be invoked with skip_lock == false.
      
      No other crashes have been found using the regression stress test program.
      Another issue is a deadlock, described separately in Bug #44312.
      It was caused by the IO thread and the thread executing START SLAVE
      grabbing two mutexes in reverse order.
      
      Fixed:
      
      terminate_slave_threads() does not request locking in start_slave_threads()
      when it's called from start_slave(), but does when it's called from init_slave();
      
      rotate_relay_log() no longer acquires mi->run_lock, which is safe.
      rli->inited is not guarded by this mutex, and locking the mutex
      in the function contradicts the safe locking pattern for run_lock.
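
The skip_lock distinction in this commit message can be sketched as follows.
The sketch assumes that the start_slave() path already holds run_lock when it
reaches the terminate call and that the init_slave() path does not; the
commit message implies this but does not show the code. All function names
below are hypothetical stand-ins, not the server's API.

```python
import threading

run_lock = threading.Lock()  # stands in for run_lock; non-reentrant, like a plain mutex


def terminate_threads(skip_lock):
    """Hypothetical analogue of terminate_slave_threads().

    skip_lock=True means the caller already holds run_lock, so the
    function must not try to take it again.
    """
    if skip_lock:
        # Mirrors an "owner must hold the mutex" style assertion.
        assert run_lock.locked(), "caller promised to hold run_lock"
        return "terminated (lock held by caller)"
    with run_lock:
        return "terminated (lock taken here)"


def start_from_user_command():
    """Assumed shape of the start_slave() path: run_lock is held around the call."""
    with run_lock:
        return terminate_threads(skip_lock=True)


def start_from_server_init():
    """Assumed shape of the init_slave() path: no lock is held by the caller."""
    return terminate_threads(skip_lock=False)


if __name__ == "__main__":
    print(start_from_user_command())
    print(start_from_server_init())
```

Passing the flag from the caller keeps a single terminate routine usable from
both paths, at the cost of relying on each caller to report its locking state
correctly.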
[13 May 2009 3:30] Bugs System
Pushed into 6.0.12-alpha (revid:alik@sun.com-20090513032549-rxa73jbxd1qv09xc) (version source revid:aelkin@mysql.com-20090427211821-4lne3342gva8ghzt) (merge vers: 6.0.11-alpha) (pib:6)
[13 May 2009 12:48] Jon Stephens
Documented bugfix in the 6.0.12 changelog as follows:

        Issuing the following statements, in the order shown, could
        cause a deadlock between the user thread and I/O thread:

   START SLAVE;
   STOP SLAVE SQL_THREAD;
   START SLAVE;