Bug #70307 | Another deadlock on FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS | |
---|---|---|---
Submitted: | 12 Sep 2013 0:33 | Modified: | 11 Dec 2014 9:27
Reporter: | Yoshinori Matsunobu (OCA) | Status: | Closed
Category: | MySQL Server: Replication | Severity: | S2 (Serious)
Version: | 5.6.13 | OS: | Any
Assigned to: | | CPU Architecture: | Any
[12 Sep 2013 0:33]
Yoshinori Matsunobu
[12 Sep 2013 5:30]
MySQL Verification Team
Similar to this: http://bugs.mysql.com/bug.php?id=67879
[12 Sep 2013 19:39]
MySQL Verification Team
Hello Yoshinori, Thank you for the bug report. I've observed a similar deadlock (but without the sleep(10) etc. at the beginning of Xid_log_event::do_apply_event()). Thanks, Umesh
[12 Sep 2013 19:40]
MySQL Verification Team
Testcase results:
Attachment: 70307.txt (text/plain), 43.14 KiB.
[1 Oct 2013 6:38]
MySQL Verification Team
Is it a true deadlock? It looks like it waits for "lock_wait_timeout" to expire, since a thread is calling MDL_wait::timed_wait. If you kill the thread waiting for the MDL lock, does it unblock everything?
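For reference, a minimal way to try this from a separate client (a sketch; the thread Id 123 is a placeholder):

```sql
-- Find the applier thread stuck on the MDL wait; per the analysis
-- below it shows up with State = 'Waiting for commit lock'.
SHOW PROCESSLIST;

-- Kill it to see whether everything unblocks (123 is an illustrative Id).
KILL 123;
```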
[20 Aug 2014 10:33]
Sujatha Sivakumar
As mentioned above, the deadlock occurs in the following scenario. The slave has executed FLUSH TABLES WITH READ LOCK (taking the global read lock) on connection1. The SQL thread is trying to apply a DML: it has acquired rli->data_lock and is waiting for the MDL commit lock. Meanwhile, on the same connection1, a SHOW SLAVE STATUS command is executed; it waits for rli->data_lock and hangs.

To resolve the deadlock, one of the two has to release its lock. We cannot kill the client connection executing SHOW SLAVE STATUS: because it is waiting on a mutex, it will not be able to detect the kill command. So the only possibility is to kill the SQL thread. Killing the SQL thread releases rli->data_lock and resolves the deadlock. Please note that killing and restarting the SQL thread while it is waiting for the MDL lock will cause a transaction to be skipped on the slave. That issue has been fixed as part of Bug#69873, and the fix will be available in upcoming releases.
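As an illustration of that sequence, a minimal sketch (assuming a running master/slave pair; the table test.t1 and the thread Id are placeholders):

```sql
-- Connection1, on the slave: take the global read lock.
FLUSH TABLES WITH READ LOCK;

-- On the master: commit any DML; it is replicated to the slave.
INSERT INTO test.t1 VALUES (1);
-- The slave SQL thread now holds rli->data_lock and waits for the
-- MDL commit lock (state 'Waiting for commit lock').

-- Connection1 again, the same session that holds the read lock:
SHOW SLAVE STATUS;   -- blocks on rli->data_lock: the deadlock

-- Recovery, from a second slave connection: kill the SQL thread
-- (42 stands in for its Id from SHOW PROCESSLIST). Note this may
-- skip a transaction on the slave, per the comment above.
KILL 42;
```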
[11 Sep 2014 15:15]
liu hickey
It can be simplified without client2 in that case, while the root cause is still the same. Bug#68460 fixed the deadlock in TABLE mode for RPL info with MTS worker threads; for single-threaded replication mode the deadlock still exists with exactly the same behavior, but the cause is different. The hang case is:

- FTWRL, then SHOW SLAVE STATUS hangs, waiting for rli->data_lock
- the applier thread hangs, waiting for MDL_COMMIT but holding rli->data_lock!

Looking into the code, it's obviously buggy with --slave_parallel_workers=0:

```
7431 int Xid_log_event::do_apply_event(Relay_log_info const *rli)
7442     mysql_mutex_lock(&rli_ptr->data_lock);
7495     error= do_commit(thd);
7502     mysql_mutex_unlock(&rli_ptr->data_lock);
```

The slave SQL thread holds rli->data_lock while acquiring MDL_COMMIT during the commit stage, but FTWRL + SHOW SLAVE STATUS blocks any commit while acquiring rli->data_lock. That causes the deadlock on the slave.

<from bug68460> Backtrace:

```
Thread 27 (Thread 0x7f7198240700 (LWP 18877)):
#0  0x00000038f0c0b7cb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000662250 in MDL_wait::timed_wait(MDL_context_owner*, timespec*, bool, PSI_stage_info_v1 const*) () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:1188
#2  0x00000000006640c2 in MDL_context::acquire_lock(MDL_request*, unsigned long) () at /u01/osconn/dev/alisql/sql/mdl.cc:2271
#3  0x00000000005b2fdf in ha_commit_trans(THD*, bool, bool) () at /u01/osconn/dev/alisql/sql/handler.cc:1474
#4  0x000000000078ce09 in trans_commit(THD*) ()
#5  0x000000000089111d in Xid_log_event::do_apply_event(Relay_log_info const*) () at /u01/osconn/dev/alisql/sql/log_event.cc:7344
#6  0x0000000000896bee in Log_event::apply_event(Relay_log_info*) () at /u01/osconn/dev/alisql/sql/log_event.cc:3031
#7  0x00000000008ceed9 in apply_event_and_update_pos(Log_event**, THD*, Relay_log_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3502
#8  0x00000000008d1e45 in handle_slave_sql () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3894
#9  0x0000000000b521c0 in pfs_spawn_thread ()
#10 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000038f08e68ed in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f71981be700 (LWP 18892)):
#0  0x00000038f0c0e074 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038f0c0937b in _L_lock_1023 () from /lib64/libpthread.so.0
#2  0x00000038f0c092fe in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000008c40dd in inline_mysql_mutex_lock.constprop.83 () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:673
#4  0x00000000008cda84 in show_slave_status(THD*, Master_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:2683
#5  0x00000000006fa63d in mysql_execute_command(THD*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:3195
#6  0x00000000006fd498 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:6729
#7  0x00000000006ff7c0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:1634
#8  0x00000000006c391d in do_handle_one_connection(THD*) ()
#9  0x00000000006c3969 in handle_one_connection ()
#10 0x0000000000b521c0 in pfs_spawn_thread ()
#11 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f08e68ed in clone () from /lib64/libc.so.6
```
[7 Oct 2014 18:27]
Inaam Rana
I am not sure how we can set this to 'Won't fix'. The problem still exists in 5.6.21. It doesn't need MTS to trigger the hang, and once we are hung there is no easy way out of it. The thread doing the timed wait on the MDL lock is the replication SQL thread, and we can't set lock_wait_timeout for background threads; for this thread the timeout is set to the default, which I think is one year. I suggest we reopen this issue and revisit the design that leads to the deadlock.
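For reference, the server default backs up the "one year" figure, and checking it is straightforward:

```sql
-- The documented default of lock_wait_timeout is 31536000 seconds,
-- i.e. 365 days, so the SQL thread's MDL wait effectively never
-- times out on its own.
SHOW GLOBAL VARIABLES LIKE 'lock_wait_timeout';
```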
[3 Nov 2014 18:01]
monty solomon
We experience the slave hang frequently now -- already twice today. Please fix the problem in 5.6.
[11 Dec 2014 4:06]
Flavian C
This is a serious, valid issue. Why the "Won't fix" status for it?
[11 Dec 2014 9:27]
Sujatha Sivakumar
This deadlock issue is fixed as part of Oracle bug 19843808.
[19 Dec 2014 3:03]
monty solomon
We experience multiple failures due to this bug every day in version 5.6. The fix in Oracle bug 19843808 is only for version 5.7. This bug should be reopened as severity S1 since there is no available workaround for 5.6. Please backport the fix to version 5.6. Thanks.
[19 Dec 2014 4:58]
Sujatha Sivakumar
Oracle bug 19843808 is fixed in 5.6 as well. The fix will be available in an upcoming 5.6 release.
[3 Feb 2015 4:48]
Ben Krug
The fix for this bug is incorporated into 5.6.23, now available.
[12 Feb 2015 12:37]
Laurynas Biveinis
$ git show -s f42e9f8

```
commit f42e9f8fd2309e85f25ff0471eb5f7e397311b2a
Author: s.sujatha <sujatha.sivakumar@oracle.com>
Date:   Mon Nov 17 14:47:39 2014 +0530

    Bug#19843808: DEADLOCK ON FLUSH TABLES WITH READ LOCK +
                  SHOW SLAVE STATUS

    Problem:
    If a client thread on a slave does FLUSH TABLES WITH READ LOCK
    and the master then does some updates, SHOW SLAVE STATUS in the
    same client will be blocked.

    Analysis:
    Execute FLUSH TABLES WITH READ LOCK on the slave and at the same
    time execute a DML on the master. The DML stops at the state
    "Waiting for commit lock": the SQL thread is holding
    rli->data_lock and waiting for the MDL_COMMIT lock. Now, in the
    same client session where FLUSH TABLES WITH READ LOCK was
    executed, issue the SHOW SLAVE STATUS command. This command
    blocks waiting for rli->data_lock, causing a deadlock. Once this
    happens it is not possible to release the global read lock,
    because the UNLOCK TABLES command has to be issued in the same
    client where the global read lock was acquired.

    Fix:
    The existing code holds rli->data_lock for the whole duration of
    the commit operation. Instead of holding the lock for the entire
    commit, the code has been restructured so that the lock is held
    only while the rli object is being updated.
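```

A quick way to confirm the behavior change after upgrading (a sketch; assumes a slave whose applier is stopped at "Waiting for commit lock"):

```sql
-- Before the fix this sequence deadlocked; with the restructured
-- locking, rli->data_lock is released during the commit itself, so
-- SHOW SLAVE STATUS no longer waits for the commit to finish.
FLUSH TABLES WITH READ LOCK;
SHOW SLAVE STATUS;   -- now returns instead of blocking
UNLOCK TABLES;
```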