| Bug #89331 | Changing lock/release order in group commit causes a deadlock | ||
|---|---|---|---|
| Submitted: | 20 Jan 2018 0:36 | Modified: | 14 Feb 2018 17:02 |
| Reporter: | Aliaksei Sandryhaila | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
| Version: | 8.0.3 | OS: | Red Hat |
| Assigned to: | Venkatesh Duggirala | CPU Architecture: | Any |
[20 Jan 2018 0:36]
Aliaksei Sandryhaila
[22 Jan 2018 6:17]
MySQL Verification Team
Hello Aliaksei, Thank you for the report and feedback. Could you please confirm if the given patch in the bug report is same as that in https://github.com/facebook/mysql-5.6/commit/c2b8ced ? 5.6 link has some changes done in "Stage_manager" which is not seen in the patch provided in "How to repeat". Could you please provide 8.0 patch? Thank you! Thanks, Umesh
[22 Jan 2018 13:22]
MySQL Verification Team
rpl_nogtid suit test results after applying patch on top of 8.0.3 - 020025d9f277fd3283b8c75024f590857f58fbbd
Attachment: 89331_build.results (application/octet-stream, text), 40.51 KiB.
[22 Jan 2018 18:20]
Aliaksei Sandryhaila
Umesh, Thank you for looking into this. I wonder if the failure is due to our testing setup. I will look into this again and get back to you asap.
[23 Jan 2018 5:10]
MySQL Verification Team
Thank you, Aliaksei. We will wait for further details. Regards, Umesh
[24 Jan 2018 20:05]
Aliaksei Sandryhaila
Minimal patch to reproduce the reported issue
Attachment: patch.diff (application/octet-stream, text), 2.39 KiB.
[24 Jan 2018 20:11]
Aliaksei Sandryhaila
Hi Umesh, I still can reproduce the issue. I've uploaded the updated minimal patch to reproduce the issue (see patch.diff). The complete list of steps is: 1. Clone the mysql repo from github, checkout branch 8.0 (8.0.3 RC, last commit: 020025d9) 2. mkdir bld; cd bld; cmake .. -DWITH_DEBUG=1; make 3. ./mysql-test-run --suite=rpl_nogtid # rpl_nogtid.rpl_semi_sync_deadlock passes 4. cd ..; patch -p1 < patch.diff; cd bld 5. ./mysql-test-run --suite=rpl_nogtid # Now rpl_nogtid.rpl_semi_sync_deadlock fails (timeout after 900 seconds due to a deadlock) Can you apply the new patch and see if you can reproduce the issue? Thank you.
[24 Jan 2018 20:13]
Aliaksei Sandryhaila
Missed one step in the reproduction steps above: ./mysql-test-run should be executed from bld/mysql-test directory.
[25 Jan 2018 4:47]
MySQL Verification Team
Thank you for the detailed steps, Aliaksei. I'll reattempt today and get back to you if there is any issue. Regards, Umesh
[25 Jan 2018 9:33]
MySQL Verification Team
Thank you, Aliaksei. With the latest reduced patch I'm seeing the issue finally. Regards, Umesh
[25 Jan 2018 10:14]
MySQL Verification Team
test results
Attachment: 89331_latest.results (application/octet-stream, text), 417.68 KiB.
[25 Jan 2018 10:24]
MySQL Verification Team
Hi Aliaksei,
I observed that test case failing even when ran individually as against initially you confirmed in the report. Could you please confirm on this?
[umshastr@hod03]/export/umesh/server/source/GitSource/mysql-8.0.3/bld/mysql-test:
[umshastr@hod03]/export/umesh/server/source/GitSource/mysql-8.0.3/bld/mysql-test: ./mtr --big-test rpl_nogtid.rpl_semi_sync_deadlock
Logging: /export/umesh/server/source/GitSource/mysql-8.0.3/mysql-test/mysql-test-run.pl --big-test rpl_nogtid.rpl_semi_sync_deadlock
MySQL Version 8.0.3
Checking supported features...
- SSL connections supported
- binaries are debug compiled
Collecting tests...
- adding combinations for rpl_nogtid
Checking leftover processes...
Removing old var directory...
Creating var directory '/export/umesh/server/source/GitSource/mysql-8.0.3/bld/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[1] Trying to dump core for [mysqltest - pid: 27973, winpid: 27973]
worker[1] Trying to dump core for [mysqld.1 - pid: 27912, winpid: 27912]
worker[1] Trying to dump core for [mysqld.2 - pid: 27914, winpid: 27914]
rpl_nogtid.rpl_semi_sync_deadlock 'mix' [ fail ] timeout after 900 seconds
Test ended at 2018-01-25 11:10:45
Thanks,
Umesh
[25 Jan 2018 19:08]
Aliaksei Sandryhaila
Hi Umesh, Yes, I have confirmed that the rpl_nogtid.rpl_semi_sync_deadlock is failing after patch.diff is applied, even when it is run individually. I guess this is a good news since it makes the failure deterministic. We would like to understand why this patch leads to a deadlock. Thank you.
[29 Jan 2018 15:36]
Venkatesh Duggirala
Post by Developer:
==================
Hello Aliaksei
Please find analysis on why the test script is hanging after applying your patch on mysql-8.0.3
Background:
===========
1) Test script has SET GLOBAL debug= '+d,dump_thread_before_read_event';
So dump thread, in Binlog_sender::read_event(), will wait for signal "continue" before sending any events to Slave.
See code below:
DBUG_EXECUTE_IF("dump_thread_before_read_event",
{
const char act[]= "now wait_for signal.continue no_clear_event";
DBUG_ASSERT(!debug_sync_set_action(m_thd,
STRING_WITH_LEN(act)));
};);
2) Problem is happening when rpl_semi_sync_master_wait_point=AFTER_SYNC (default value). AFTER_SYNC means that semi synchronous replication waits just after the binary log file is flushed, but before the engine commits, and so guarantees that no other sessions can see the data before replicated to slave.
3) Test script has SET GLOBAL debug= 'd,semi_sync_3-way_deadlock'; and
SET DEBUG_SYNC= "before_rotate_binlog SIGNAL signal.continue".
So FLUSH LOGS command after acquiring lock_log, it supposed to send signal "continue". (so that dump thread will continue its work).
See code below in MYSQL_BIN_LOG::new_file_impl:
if (need_lock_log)
mysql_mutex_lock(&LOCK_log);
else
mysql_mutex_assert_owner(&LOCK_log);
DBUG_EXECUTE_IF("semi_sync_3-way_deadlock",
DEBUG_SYNC(current_thd, "before_rotate_binlog"););
Analysis:
=========
Problem can be reproduced (explained) just with dump thread + 4 connections (even though test script has 10 connections + dump thread).
Dump thread: Waiting for signal "continue" before sending any events to Slave (see point-1 & 3 in background section)
Connection 1: executed insert into t1 values (1)
=> entered into BGC::commit_stage *after acquiring Lock_commit*
=> waiting for acknowledgement from slave before completing the
commit (see point-2 in background section).
But slave cannot send acknowledgement as
it did not yet receive the events from dump thread.
Connection 2: executed insert into t2 values (2)
=> finished BGC::sync_stage
=> change_stage to BGC::commit_stage
=> Because of the patch holding 'Lock_sync'
and waiting for "Lock_commit"
( which is held by connection 1)
Connection-3: executed insert into t3 values (3)
=> finished BGC::flush_stage
=> change_stage to BGC::sync_stage.
=> Because of the patch, holding "Lock_log"
and waiting for "Lock_sync"
(which is held by connection 2).
Connection-4: executed FLUSH LOGS
=> waiting for Lock_log
=> can send "signal.continue" only after
acquiring Lock_log
(which is held by connection-3)
(see point-3 in background section)
As you can see dump thread is waiting for signal that should come from connection-4's FLUSH LOGs. This forms full cyclic deadlock.
> connection 1 is waiting for dump thread
> connection 2 is waiting for connection 1
> connection 3 is waiting for connection 2
> connection 4 is waiting for connection 3
> dump_thread is waiting for connection 4.
Fix: The test script should be adjust according to your lock reorder patch.
Hope the analysis clear. Please let us know if you have any questions on the same.
Disclaimer: Please note that we have *only* analyzed why the patch is causing the rpl_semi_sync_deadlock.test to be hang. We have not analyzed whether
the patch has any side effects in other areas of replication code.
Regards,
Venkatesh.
[2 Feb 2018 22:26]
Aliaksei Sandryhaila
Hi Venkatesh, Thank you for the in-depth analysis of the deadlock! Following your analysis, I've experimented with the test and noticed that the deadlock can be reproduced even with 3 connections (by setting $n= 4) in mysql-test/extra/rpl_tests/rpl_semi_sync_deadlock.test. However, your deadlock scenario requires 4 connections. Can you please explain why the deadlock occurs with 3 connections? Thank you.
[3 Feb 2018 2:40]
Venkatesh Duggirala
Post by Developer ================= Hello Aliaksei If you talk in terms of $n, even in my analysis $n=4 only. Please see that in my analysis, I have '3' insert queries. And in the '4' th connection 'FLUSH LOGS' is executed which is needed to explain the analysis. That's why I have mentioned that we need '4' connections. Please observe that in the test script $n is to create connection1...connection$(n-1) and execute $(n-1) insert queries + there will be one connection 'master1' that executes 'FLUSH LOGS'. So when you are setting $n=4 , they are '3' insert connections + flush logs query from "master1" which is exactly same as my analysis. Hope I am clear, please let me know if you have any other questions on my analysis. Regards, Venkatesh.
[3 Feb 2018 5:14]
Aliaksei Sandryhaila
Venkatesh, Thank you for the clarification! Indeed, I missed that we create one more connection to flush logs. We will investigate how to adjust our patch and/or the tests to make them pass.
[5 Feb 2018 5:45]
MySQL Verification Team
Hello Aliaksei Since this turned out to be an expected behavior(!bg), do you agree to close the bug for now? Please confirm! Thanks, Umesh
[5 Feb 2018 19:02]
Aliaksei Sandryhaila
Modified patch to avoid deadlock
Attachment: modified_patch.txt (text/plain), 2.97 KiB.
[5 Feb 2018 19:10]
Aliaksei Sandryhaila
Hi Venkatesh,
I've slightly modified our patch to avoid the cyclic deadlock (see modified_patch.txt).Specifically, I modified sql/binlog.cc to execute line
DBUG_EXECUTE_IF("semi_sync_3-way_deadlock", DEBUG_SYNC(current_thd, "before_rotate_binlog"););
in new_file_impl() to execute before acquiring LOCK_log.
This modification should have resolved the cyclic deadlock because it should unblock the dump thread. However, the deadlock still occurs. Can you please clarify why this happens?
Thank you.
[13 Feb 2018 9:04]
Venkatesh Duggirala
Post by Developer:
==================
Hello Aliaksei
For your patch to work, the correct place to put DEBUG_SYNC point is
inside MYSQL_BIN_LOG::rotate_and_purge function and *not the place
you have specified in* modified_patch.diff.
Below is the diff that solves the problem. But please note that *to me*
it looks like this change is going to change the main intention
of the test script purpose. Please analyze the same before
you do the changes
The test script is written to find out if there is a deadlock occurs
when a FLUSH LOGs hold LOCK_log and wait for sessions to commit.
You can see from the commit log of mysql-test/extra/rpl_tests/rpl_semi_sync_deadlock.test,
the purpose of the test.
Copy pasting the text below for quick reference
=======================================================================================
* Deadlock on semi-synchronous replication
WL#6355 introduces a feature that semi-synchronous waits for ACK from
slave when the transaction's events is written to disc but before the
transaction is committed.
It introduced a three way deadlock.
SESSION 1(Rotate) SESSION 2(Trx Commit) DUMP THREAD
--------- --------- -----------
1 Flush Stage
Signal update
2 Rotate(hold LOCK_log)
WAITING for Session2 to commit
3 Sync Stage WAITING for Session1
to release LOCK_log
4 semisync
WAITING for ACK
Session1 had to hold LOCK_log to block other sessions appending new events.
It also blocked dump threads reading events.
=======================================================================================
In any case, please find the below diff that shows where exactly you have to put
"before_rotate_binlog" sync point.
@@ -7026,6 +7035,8 @@ int MYSQL_BIN_LOG::rotate_and_purge(THD *thd, bool force_rotate) {
ha_binlog_wait(thd);
DBUG_ASSERT(!is_relay_log);
+ DBUG_EXECUTE_IF("semi_sync_3-way_deadlock",
+ DEBUG_SYNC(current_thd, "before_rotate_binlog"););
mysql_mutex_lock(&LOCK_log);
Please let me know if you have any more questions on the same.
Regards,
Venkatesh.
[14 Feb 2018 17:01]
Aliaksei Sandryhaila
Venkatesh, Thank you for the final clarification!
[14 Feb 2018 17:02]
Aliaksei Sandryhaila
The reason for the deadlock has been verified.
