Bug #70307 Another deadlock on FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS
Submitted: 12 Sep 2013 0:33 Modified: 11 Dec 2014 9:27
Reporter: Yoshinori Matsunobu (OCA)
Status: Closed
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 5.6.13    OS: Any
Assigned to:    CPU Architecture: Any

[12 Sep 2013 0:33] Yoshinori Matsunobu
Description:
This is similar to bug#68460, but it is caused by a different reason, and it repeats on 5.6.13.

In the following scenario, a deadlock happens and cannot be recovered without an instance restart.

master:
 insert; (auto committed)

sql thread:
 finished executing insert statement, not started executing commit yet

slave client1:
 flush tables with read lock; -> done
 (holding global read lock)

sql thread:
 commit; -> blocked
 (holding rli->data_lock, waiting for MDL_COMMIT)

slave client2:
 show slave status; -> blocked
 (holding LOCK_active_mi and mi->data_lock, waiting for rli->data_lock)

slave client1:
 show slave status; -> blocked (deadlock)
 (waiting for LOCK_active_mi)

How to repeat:
The easiest way to reproduce the state "finished executing the insert statement, but not yet started executing commit" on the SQL thread is probably to add a sleep(10) at the beginning of Xid_log_event::do_apply_event().
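For illustration, a minimal sketch of such a debug-only change, assuming the 5.6 layout of sql/log_event.cc; the exact placement and the use of sleep() from <unistd.h> are assumptions, not part of the original report:

  #include <unistd.h>   /* for sleep(); debug builds only */

  int Xid_log_event::do_apply_event(Relay_log_info const *rli)
  {
    /* Debug-only: widen the window between applying the INSERT and
       starting the commit, so FLUSH TABLES WITH READ LOCK can be issued
       on the slave while the SQL thread sits between the two steps. */
    sleep(10);   /* 10 seconds */

    /* ... original function body continues unchanged ... */
  }

With this in place, the sequence above (FTWRL on client1, then SHOW SLAVE STATUS) can be typed by hand while the SQL thread is still inside the sleep.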
[12 Sep 2013 5:30] MySQL Verification Team
Similar to this:
http://bugs.mysql.com/bug.php?id=67879
[12 Sep 2013 19:39] MySQL Verification Team
Hello Yoshinori,

Thank you for the bug report.
I've observed a similar deadlock (but without adding sleep(10) etc. at the beginning of Xid_log_event::do_apply_event()).

Thanks,
Umesh
[12 Sep 2013 19:40] MySQL Verification Team
Testcase results..

Attachment: 70307.txt (text/plain), 43.14 KiB.

[1 Oct 2013 6:38] MySQL Verification Team
Is it a true deadlock? It looks like it waits for "lock_wait_timeout" to expire, since a thread is calling MDL_wait::timed_wait. If you kill the thread waiting for the MDL lock, does it unblock everything?
[20 Aug 2014 10:33] Sujatha Sivakumar
As mentioned above, the deadlock is caused in the following scenario.
The slave has executed FLUSH TABLES WITH READ LOCK on connection1, acquiring
the global read lock. The SQL thread is trying to execute a DML: it has
acquired rli->data_lock and is waiting for the MDL commit lock. Meanwhile,
on the same connection1, a SHOW SLAVE STATUS command is executed; it waits
for rli->data_lock and hangs.

To resolve the deadlock, one of them has to release its lock.
We cannot kill the client connection that is executing SHOW SLAVE STATUS:
since it is waiting on a mutex, it will not be able to detect the kill
command. So the only possibility is to kill the SQL thread.
Killing the SQL thread causes rli->data_lock to be released, which
resolves the deadlock.

Please note that killing and restarting the SQL thread while it is waiting
for the MDL lock will cause a transaction to be skipped on the slave. That
issue has been fixed as part of Bug#69873, and the fix will be available in
upcoming releases.
[11 Sep 2014 15:15] liu hickey
It can be simplified without client2 in that case, while the root cause is still the same.

Bug#68460 fixed the deadlock in TABLE mode for the replication (RPL) info repositories with MTS worker threads. For single-threaded replication mode, the deadlock still exists with exactly the same behavior, but the cause is different.

The HANG case is:

FTWRL, then SHOW SLAVE STATUS hangs, waiting for rli->data_lock

worker thread hangs, waiting for MDL_COMMIT while holding rli->data_lock!

Looking into the code, it's obviously buggy with --slave_parallel_workers=0:

 7431 int Xid_log_event::do_apply_event(Relay_log_info const *rli)
 7442   mysql_mutex_lock(&rli_ptr->data_lock);
 7495   error= do_commit(thd);
 7502   mysql_mutex_unlock(&rli_ptr->data_lock);

The SQL thread holds rli->data_lock while acquiring MDL_COMMIT during the commit stage, but FTWRL blocks any commit, and the subsequent SHOW SLAVE STATUS in turn tries to acquire rli->data_lock; that causes the deadlock on the slave.
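For illustration only, here is a tiny self-contained C++ model of this two-party wait cycle; grl_lock and rli_data_lock are placeholder names standing in for the global read lock / MDL_COMMIT and rli->data_lock, not MySQL symbols. Running it hangs by design, just like the slave does:

  // Toy model: "client1" plays FTWRL + SHOW SLAVE STATUS,
  // "sql_thread" plays Xid_log_event::do_apply_event().
  #include <chrono>
  #include <mutex>
  #include <thread>

  std::mutex grl_lock;       // stands in for the global read lock / MDL_COMMIT
  std::mutex rli_data_lock;  // stands in for rli->data_lock

  int main() {
    std::thread client1([] {
      std::lock_guard<std::mutex> ftwrl(grl_lock);            // FLUSH TABLES WITH READ LOCK
      std::this_thread::sleep_for(std::chrono::seconds(1));   // let the SQL thread reach the commit
      std::lock_guard<std::mutex> sss(rli_data_lock);         // SHOW SLAVE STATUS -> blocks forever
    });
    std::thread sql_thread([] {
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
      std::lock_guard<std::mutex> apply(rli_data_lock);       // holds rli->data_lock while applying Xid
      std::lock_guard<std::mutex> commit(grl_lock);           // commit needs MDL_COMMIT -> blocks forever
    });
    client1.join();
    sql_thread.join();
    return 0;
  }

Breaking the cycle requires either not holding rli->data_lock across the commit (the eventual fix) or forcing one party to give up its lock (killing the SQL thread, as noted above).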

<from bug68460>
Backtrace:

Thread 27 (Thread 0x7f7198240700 (LWP 18877)):
#0  0x00000038f0c0b7cb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000662250 in MDL_wait::timed_wait(MDL_context_owner*, timespec*, bool, PSI_stage_info_v1 const*) ()
    at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:1188
#2  0x00000000006640c2 in MDL_context::acquire_lock(MDL_request*, unsigned long) () at /u01/osconn/dev/alisql/sql/mdl.cc:2271
#3  0x00000000005b2fdf in ha_commit_trans(THD*, bool, bool) () at /u01/osconn/dev/alisql/sql/handler.cc:1474
#4  0x000000000078ce09 in trans_commit(THD*) ()
#5  0x000000000089111d in Xid_log_event::do_apply_event(Relay_log_info const*) () at /u01/osconn/dev/alisql/sql/log_event.cc:7344
#6  0x0000000000896bee in Log_event::apply_event(Relay_log_info*) () at /u01/osconn/dev/alisql/sql/log_event.cc:3031
#7  0x00000000008ceed9 in apply_event_and_update_pos(Log_event**, THD*, Relay_log_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3502
#8  0x00000000008d1e45 in handle_slave_sql () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3894
#9  0x0000000000b521c0 in pfs_spawn_thread ()
#10 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000038f08e68ed in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f71981be700 (LWP 18892)):
#0  0x00000038f0c0e074 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038f0c0937b in _L_lock_1023 () from /lib64/libpthread.so.0
#2  0x00000038f0c092fe in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000008c40dd in inline_mysql_mutex_lock.constprop.83 () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:673
#4  0x00000000008cda84 in show_slave_status(THD*, Master_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:2683
#5  0x00000000006fa63d in mysql_execute_command(THD*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:3195
#6  0x00000000006fd498 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:6729
#7  0x00000000006ff7c0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:1634
#8  0x00000000006c391d in do_handle_one_connection(THD*) ()
#9  0x00000000006c3969 in handle_one_connection ()
#10 0x0000000000b521c0 in pfs_spawn_thread ()
#11 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f08e68ed in clone () from /lib64/libc.so.6
[7 Oct 2014 18:27] Inaam Rana
I am not sure how we can set this to 'won't fix'. The problem still exists in 5.6.21. It doesn't need MTS to trigger the hang, and once we are hung there is no easy way out of it. The thread doing the timed wait on the MDL is the replication SQL thread. We can't set lock_wait_timeout for background threads; for this thread the timeout is set to the default, which I think is one year.

I suggest we reopen this issue and revisit the design that leads to the deadlock.
[3 Nov 2014 18:01] monty solomon
We experience the slave hang frequently now -- already twice today. Please fix the problem in 5.6.
[11 Dec 2014 4:06] Flavian C
This is a serious, valid issue. Why the "Won't fix" status for it?
[11 Dec 2014 9:27] Sujatha Sivakumar
This deadlock issue is fixed as part of Oracle bug 19843808.
[19 Dec 2014 3:03] monty solomon
We experience multiple failures due to this bug every day in version 5.6.

The fix in Oracle bug 19843808 is only for version 5.7.

This bug should be reopened as severity S1 since there is no available workaround for 5.6.

Please backport the fix to version 5.6.

Thanks.
[19 Dec 2014 4:58] Sujatha Sivakumar
Oracle bug 19843808 is fixed in 5.6 as well. The fix will be available in an upcoming 5.6 release.
[3 Feb 2015 4:48] Ben Krug
The fix for this bug is incorporated into 5.6.23, now available.
[12 Feb 2015 12:37] Laurynas Biveinis
$ git show -s f42e9f8
commit f42e9f8fd2309e85f25ff0471eb5f7e397311b2a
Author: s.sujatha <sujatha.sivakumar@oracle.com>
Date:   Mon Nov 17 14:47:39 2014 +0530

    Bug#19843808: DEADLOCK ON FLUSH TABLES WITH READ LOCK + SHOW
    SLAVE STATUS
    
    Problem:
    
    If a client thread on a slave does FLUSH TABLES WITH READ
    LOCK; then the master does some updates, and SHOW SLAVE STATUS in
    the same client will be blocked.
    
    Analysis:
    
    Execute FLUSH TABLES WITH READ LOCK on slave and at the same
    time execute a DML on the master. Then the DML should be
    made to stop at a state "Waiting for commit lock". This
    state means that sql thread is holding rli->data_lock and
    waiting for MDL_COMMIT lock. Now in the same client session
    where FLUSH TABLES WITH READ LOCK was executed issue SHOW
    SLAVE STATUS command. This command will be blocked waiting
    for rli->data_lock, causing a deadlock. Once this happens
    it will not be possible to release the global read lock, as
    the "UNLOCK TABLES" command has to be issued in the same client
    where the global read lock was acquired. This causes the
    deadlock.
    
    Fix:
    
    Existing code holds the rli->data_lock for the whole
    duration of commit operation. Instead of holding the lock
    for entire commit duration the code has been restructured in
    such a way that the lock is held only during the period when
    rli object is being updated.
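
For illustration, a rough sketch of the restructuring the commit message describes, assuming the 5.6 shape of Xid_log_event::do_apply_event(); the saved fields and exact placement are assumptions, not the verbatim patch:

  int Xid_log_event::do_apply_event(Relay_log_info const *rli)
  {
    int error= 0;
    /* Hold rli->data_lock only long enough to read the positions the
       commit will need, then drop it. */
    mysql_mutex_lock(&rli_ptr->data_lock);
    /* ... copy relay-log/master positions into local variables ... */
    mysql_mutex_unlock(&rli_ptr->data_lock);

    /* Commit without holding rli->data_lock: even if FTWRL blocks
       MDL_COMMIT here, SHOW SLAVE STATUS can still take the lock. */
    error= do_commit(thd);

    /* Re-acquire the lock only while the rli object itself is updated. */
    mysql_mutex_lock(&rli_ptr->data_lock);
    /* ... advance the rli positions saved above ... */
    mysql_mutex_unlock(&rli_ptr->data_lock);
    return error;
  }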