Bug #2921 Replication problem on mutex lock in mySQL-4.0.18
Submitted: 22 Feb 2004 14:11 Modified: 11 Mar 2004 7:27
Reporter: Dathan Pattishall
Status: Closed
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 4.0.18 OS: Linux (RedHat 7.3)
Assigned to: Michael Widenius

[22 Feb 2004 14:11] Dathan Pattishall
Description:
MySQL waits forever if two SLAVE START statements are issued at nearly the same time while the SQL thread is running, the I/O thread is not, and the master is down.

The message displayed is something like "IO thread waiting for mutex lock on slave".

To recover I had to do a kill -9 on the process.

Issuing a KILL [id] through MySQL does not work.

How to repeat:

See the description.

Suggested fix:

Add deadlock detection to the replication layer. Also allow mutex timeouts on a single object. Additionally, receiving a signal to shut down should release all locks held by replication threads and shut the server down cleanly.
[23 Feb 2004 10:39] Dathan Pattishall
mysql> show processlist;
+----------+-------------+------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------+
| Id       | User        | Host             | db         | Command | Time  | State                                                                 | Info             |
+----------+-------------+------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------+
|        2 | system user |                  | NULL       | Connect | 86957 | Has read all relay log; waiting for the I/O slave thread to update it | NULL             |
| 37105473 | mon         | 10.10.1.19:50788 | mysql      | Query   | 79839 | Waiting for slave thread to start                                     | SLAVE START      |
| 37105474 | mon         | 10.10.1.19:50789 | mysql      | Query   | 79839 | NULL                                                                  | SLAVE START      |
| 37105477 | system user |                  | NULL       | Connect | 79839 | Waiting for slave mutex on exit                                       | NULL             |
| 46076237 | root        | localhost        | NULL       | Sleep   | 254   |                                                                       | NULL             |
| 46096195 | root        | localhost        | NULL       | Query   | 0     | NULL                                                                  | show processlist |
+----------+-------------+------------------+------------+---------+-------+-----------------------------------------------------------------------+------------------+
[28 Feb 2004 15:45] Guilhem Bichot
Thanks for your very good bug report!

Comment for myself:
| 37105473 | mon         | 10.10.1.19:50788 | mysql      | Query   | 79839 | Waiting for slave thread to start | SLAVE START |
| 37105474 | mon         | 10.10.1.19:50789 | mysql      | Query   | 79839 | NULL                              | SLAVE START |

The first SLAVE START calls lock_slave_threads(), which locks mi->run_lock and then rli->run_lock. It then wants to start the I/O thread: it creates the thread, then waits for it to say "I have done all start steps, I'm ready". For this wait it goes into pthread_cond_wait(...,mi->run_lock), which releases mi->run_lock. So while it is waiting on the condition, it holds only rli->run_lock, not mi->run_lock (this is the problem: unlocking must of course be done in the reverse order of locking).
Then the second SLAVE START comes; it calls lock_slave_threads(), which successfully locks mi->run_lock, then blocks because rli->run_lock is held by the first.
When the first wakes up, pthread_cond_wait() tries to re-lock mi->run_lock, but it is held by the second, so it blocks. Deadlock.
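
The interleaving described above can be replayed in a small toy model. This is only a sketch for illustration: it does not use real pthread mutexes or any MySQL code, and the names `try_lock`, `unlock`, `replay_interleaving`, `FIRST` and `SECOND` are hypothetical. Each "lock" is an integer recording which caller holds it, and a failed `try_lock` stands for "this caller would block here".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical): a lock is an int holding its owner. */
enum { FREE = 0, FIRST = 1, SECOND = 2 };

/* Take the lock if it is free; false means the caller would block. */
static bool try_lock(int *lock, int who) {
    if (*lock != FREE)
        return false;
    *lock = who;
    return true;
}

static void unlock(int *lock) { *lock = FREE; }

/* Replays the steps from the comment above. Returns true when both
 * callers end up waiting for a lock that the other one holds. */
bool replay_interleaving(void) {
    int mi_run_lock = FREE, rli_run_lock = FREE;

    /* 1st SLAVE START: lock_slave_threads() takes both locks. */
    bool got_mi = try_lock(&mi_run_lock, FIRST);
    bool got_rli = try_lock(&rli_run_lock, FIRST);
    assert(got_mi && got_rli);
    (void)got_mi; (void)got_rli;

    /* 1st: pthread_cond_wait(..., mi->run_lock) releases mi->run_lock. */
    unlock(&mi_run_lock);

    /* 2nd SLAVE START: lock_slave_threads() gets mi, blocks on rli. */
    bool second_got_mi = try_lock(&mi_run_lock, SECOND);
    bool second_got_rli = try_lock(&rli_run_lock, SECOND);

    /* 1st wakes up: pthread_cond_wait() must re-lock mi->run_lock. */
    bool first_got_mi = try_lock(&mi_run_lock, FIRST);

    /* Deadlock: 2nd holds mi and wants rli; 1st holds rli and wants mi. */
    return second_got_mi && !second_got_rli && !first_got_mi;
}
```

Calling `replay_interleaving()` from a test harness returns true, which corresponds to the two SLAVE START entries stuck in the processlist.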

Now I just have to fix it :)
[9 Mar 2004 23:26] Michael Widenius
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

I fixed this by changing the code so that the SQL thread is started first. This ensures that the mutexes are unlocked in the right order.

Fix will be in 4.0.19 and 4.1.2
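
Why starting the SQL thread first helps can be sketched with a toy model. This is an illustration under assumptions, not the actual patch: it assumes that the wait for the SQL thread goes through pthread_cond_wait() on rli->run_lock (the lock taken last), and the names `try_lock`, `unlock`, `replay_sql_thread_first`, `FIRST` and `SECOND` are hypothetical. Each "lock" is an integer recording its holder, and a failed `try_lock` stands for "this caller would block here".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical): a lock is an int holding its owner. */
enum { FREE = 0, FIRST = 1, SECOND = 2 };

/* Take the lock if it is free; false means the caller would block. */
static bool try_lock(int *lock, int who) {
    if (*lock != FREE)
        return false;
    *lock = who;
    return true;
}

static void unlock(int *lock) { *lock = FREE; }

/* Replays two concurrent SLAVE STARTs when the SQL thread is started
 * first. Returns true when the first caller can continue. */
bool replay_sql_thread_first(void) {
    int mi_run_lock = FREE, rli_run_lock = FREE;

    /* 1st SLAVE START: takes mi->run_lock, then rli->run_lock. */
    bool got_mi = try_lock(&mi_run_lock, FIRST);
    bool got_rli = try_lock(&rli_run_lock, FIRST);
    assert(got_mi && got_rli);
    (void)got_mi; (void)got_rli;

    /* Assumed: the wait for the SQL thread releases rli->run_lock,
     * the lock taken last, so release order mirrors lock order. */
    unlock(&rli_run_lock);

    /* 2nd SLAVE START: blocks on mi->run_lock while holding nothing. */
    bool second_got_mi = try_lock(&mi_run_lock, SECOND);

    /* 1st wakes up and re-locks rli->run_lock without contention. */
    bool first_got_rli = try_lock(&rli_run_lock, FIRST);

    return !second_got_mi && first_got_rli;
}
```

In this model the second caller waits at its very first lock and holds nothing the first caller needs, so `replay_sql_thread_first()` returns true and the first caller proceeds.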
[11 Mar 2004 7:27] Guilhem Bichot
Fixed in 4.0 ChangeSet@1.1738.1.1, 2004-03-11 16:23:35+01:00, guilhem@mysql.com
(using LOCK_active_mi).