Bug #59147 | rpl_circular_for_4_hosts fails sporadically | |
---|---|---|---
Submitted: | 24 Dec 2010 7:22 | Modified: | 9 Feb 2011 20:59
Reporter: | Alexander Nozdrin | Status: | Closed
Category: | Tests: Replication | Severity: | S7 (Test Cases)
Version: | 5.6.1, 5.5, mysql-trunk, 5.1 | OS: | Any
Assigned to: | Luis Soares | CPU Architecture: | Any
Tags: | pb2, test failure | |
[24 Dec 2010 7:22]
Alexander Nozdrin
[12 Jan 2011 11:39]
Joerg Bruehe
This failure happened quite widely in the release build of 5.6.1-m5.
[18 Jan 2011 5:54]
Anitha Gopi
Omer, this is failing occasionally in PB2 and hence needs to be investigated. See http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=&test_ou... Changing the status to Verified.
[25 Jan 2011 14:50]
Alfranio Tavares Correia Junior
The same problem happens in 5.1: http://pb2.norway.sun.com/?template=mysql_show_test_failure&search=yes&push_id=1926484&tes...
[25 Jan 2011 15:30]
Bjørn Munch
It may be that the fix for Bug #59279 (forcing a server restart before this test) would also resolve this. If so, it should now stop happening on trunk. But I don't see it in trunk for quite a number of pushes anyway...
[28 Jan 2011 1:41]
Luis Soares
ANALYSIS
========
The failing part of the test case corresponds to the part of the test in which four rows are inserted like this:

- A inserts a=13
- B inserts a=14
- C inserts a=15
- D inserts a=16

The wrong result shows:

- Master A missing rows: B,a=13 and C,a=14
- Master B missing rows: C,a=15
- Master C not missing rows
- Master D missing rows: B,a=13, C,a=14 and A,a=15

Apparently, the sync operations between the servers are wrong (this happens sporadically) and go something like this:

1. D->A (that's why both Master A and D have row a=16)
2. A->B (that's why both Master A and B have rows a=13, a=16)
3. B->C (that's why both Master B and C have rows a=13, a=14)
4. C->? (where are rows a=13, a=14, a=15 at D?)

Early conclusions:

C1. Some rows are missing from some servers:
    - C ends up with all expected rows
    - B misses C's row
    - A misses B's and C's rows
    - D misses all other rows except its own

C2. Per C1, it seems that either:
    1. the circular replication path is broken at sync time
    2. the sync order is wrong (shifted)
    3. one sync is missing

C3. Per C1 and C2, this scenario could happen if server D's IO thread is not running at the time of sync (so the circular replication is broken, see C2.1).

Looking at the test case we find that, before the call to rpl_sync.inc, the connection C->D is actually re-established. For that, the slave is stopped on both servers, the new topology is set, and the slave(s) are restarted. We can also find that the test issues the command START SLAVE and proceeds immediately with the rest of the test commands (including a call to rpl_sync.inc). This means that there is no waiting for the slave threads to be completely started.

At this point in the test case, we can also verify that rpl_only_running_threads is set to 1 (it is set just a little bit earlier in the test case). This variable makes rpl_sync.inc bypass the syncing of any slave thread that does not show Running='Yes' in SHOW SLAVE STATUS. As a consequence, we may very well be facing a sync failure due to an unfortunate combination of events:

1. rpl_only_running_threads=1 => silently bypasses syncing of a slave thread that is not fully started
2. START SLAVE command => does not wait for the slave threads to be fully started before continuing with the rest of the test

#1 and #2 potentially allow a hole in the replication chain to go unnoticed (except for the wrong/short result in the end).

To validate this theory, I have checked what's in mysqld.log for server 'D' and found out that the instructions corresponding to the ones inside rpl_sync.inc (those that check whether the slave threads are running) are the following:

2 Query INSERT INTO t1(b,c) VALUES('D',4)
2 Query SHOW SLAVE STATUS
2 Query SHOW SLAVE STATUS
2 Query SELECT IF('Connecting' = 'Yes', 1, '')
2 Query SELECT IF('Yes' = 'Yes', 1, '')
2 Query SHOW SLAVE STATUS

Note that the Slave_IO_Running property can have three states:

1. MYSQL_SLAVE_NOT_RUN => shows 'No'
2. MYSQL_SLAVE_RUN_NOT_CONNECT => shows 'Connecting' (for 5.1.46+ and 5.5)
3. MYSQL_SLAVE_RUN_CONNECT => shows 'Yes'

The logged SELECT IF('Connecting' = 'Yes', 1, '') shows that the IO thread was still in MYSQL_SLAVE_RUN_NOT_CONNECT when rpl_sync.inc checked it, so its sync was silently skipped. All in all, this validates the above theory, explaining both:

- the wrong result, and
- why the failure is sporadic.
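For illustration, here is a minimal mysqltest sketch of the racy pattern described above. The connection name and port variable are hypothetical, not taken from the actual test file; only rpl_sync.inc and $rpl_only_running_threads are as named in the analysis.

    # Hypothetical sketch of the racy topology re-configuration step.
    # Connection name and $SERVER_MYPORT_C are illustrative only.
    --connection server_d
    STOP SLAVE;
    --eval CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=$SERVER_MYPORT_C
    # START SLAVE returns as soon as the threads are launched, not when
    # Slave_IO_Running / Slave_SQL_Running report 'Yes' in SHOW SLAVE STATUS.
    START SLAVE;
    # With $rpl_only_running_threads set to 1 earlier in the test, a thread
    # still in state 'Connecting' is silently skipped by the sync below.
    --source include/rpl_sync.inc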
[28 Jan 2011 1:53]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/129825

3561 Luis Soares 2011-01-28
BUG#59147: rpl_circular_for_4_hosts fails sporadically

There is one part of the test case that needs to break and re-establish the circular topology. For this, the test stops the slave threads on a couple of servers and restarts them with START SLAVE. However, no check is done on the status of the IO or SQL threads before proceeding with the subsequent commands. Because rpl_only_running_threads is set to 1, this can lead to silently not syncing all slave threads as expected, ultimately resulting in unexpected results (and consequently in a failing test run).

We fix this by replacing the START SLAVE instructions with calls to --source include/start_slave.inc, which will wait for the slave threads to be running (show 'Yes' in the Slave_IO_Running and Slave_SQL_Running fields of SHOW SLAVE STATUS) before proceeding.
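In terms of test script changes, the fix amounts to this substitution (a sketch based on the commit message above; the surrounding test code is not reproduced here):

    # before (racy): START SLAVE returns before the threads are confirmed running
    START SLAVE;

    # after (fix): start the slave and wait until both Slave_IO_Running and
    # Slave_SQL_Running show 'Yes' in SHOW SLAVE STATUS before proceeding
    --source include/start_slave.inc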
[28 Jan 2011 1:54]
Luis Soares
Test regression introduced by patch for BUG#49978.
[28 Jan 2011 9:11]
Sven Sandberg
Luís, excellent analysis of this bug. I think we should fix $only_running_threads to be less race-prone, and make it sync the IO thread even if it is "Connecting". Replace:

--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' = 'Yes', 1, '')`

by:

--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' != 'No', 1, '')`

Then, we also need to fix START SLAVE so that we can't have Slave_IO_Running=No after START SLAVE. We already have proper synchronization with the SQL thread so that Slave_SQL_Running=Yes after START SLAVE. To fix the IO thread, just set slave_running=MYSQL_RUN_NOT_CONNECT *before* broadcasting the condition that the IO thread has started. (While we are there, I'd suggest we change mi->slave_running to an enum and assign symbolic values to it, not 0 and 1.)

I think there have been several test bugs over the last years because of this missing sync, so it would be great to have this fixed.
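In context, the suggested change to rpl_sync.inc would look roughly as follows; the line that fetches the status field is a paraphrase for illustration, not a verbatim quote from the include file:

    # Fetch the IO thread state; it can be 'No', 'Connecting' or 'Yes'
    # (paraphrased; the actual rpl_sync.inc line may differ).
    --let $_rpl_slave_io_running= query_get_value(SHOW SLAVE STATUS, Slave_IO_Running, 1)
    # current check: only 'Yes' counts as running, so a thread that is
    # still 'Connecting' is silently skipped
    # --let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' = 'Yes', 1, '')`
    # suggested check: anything other than 'No' counts as running
    --let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' != 'No', 1, '')`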
[3 Feb 2011 15:56]
Sven Sandberg
I was wrong in the last paragraph of my previous comment: Slave_IO_Running is already set to Connecting atomically when the slave starts. So nothing needs to be done in the server source; this is a test-only bug. I think Luís's initial patch should be pushed (to remove races from rpl_circular_for_4_hosts), and we should also update rpl_sync.inc by replacing:

--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' = 'Yes', 1, '')`

by:

--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' != 'No', 1, '')`
[3 Feb 2011 15:59]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/130334

3572 Luis Soares 2011-02-03
BUG#59147: rpl_circular_for_4_hosts fails sporadically

There is one part of the test case that needs to break and re-establish the circular topology. For this, the test stops the slave threads on a couple of servers and restarts them with START SLAVE. However, no check is done on the status of the IO or SQL threads before proceeding with the subsequent commands. Because rpl_only_running_threads is set to 1, this can lead to silently not syncing all slave threads as expected, ultimately resulting in unexpected results (and consequently in a failing test run).

We fix this by replacing the START SLAVE instructions with calls to --source include/start_slave.inc, which will wait for the slave threads to be running (show 'Yes' in the Slave_IO_Running and Slave_SQL_Running fields of SHOW SLAVE STATUS) before proceeding.
[3 Feb 2011 16:10]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:

http://lists.mysql.com/commits/130335

3572 Luis Soares 2011-02-03
BUG#59147: rpl_circular_for_4_hosts fails sporadically

There is one part of the test case that needs to break and re-establish the circular topology. For this, the test stops the slave threads on a couple of servers and restarts them with START SLAVE. However, no check is done on the status of the IO or SQL threads before proceeding with the subsequent commands. Because rpl_only_running_threads is set to 1, this can lead to silently not syncing all slave threads as expected, ultimately resulting in unexpected results (and consequently in a failing test run).

We fix this by replacing the START SLAVE instructions with calls to --source include/start_slave.inc, which will wait for the slave threads to be running (show 'Yes' in the Slave_IO_Running and Slave_SQL_Running fields of SHOW SLAVE STATUS) before proceeding.

Additionally, we change rpl_sync.inc to make the IO thread report that it is running when its running status is any other than 'No'.
[4 Feb 2011 14:59]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/130426 3576 Luis Soares 2011-02-04 [merge] BUG#59147: Automerged bzr bundle from bug report into mysql-5.1.
[4 Feb 2011 15:01]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/130428 3301 Luis Soares 2011-02-04 [merge] BUG#59147: automerge mysql-5.1 into mysql-5.5.
[4 Feb 2011 15:02]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/130429 3595 Luis Soares 2011-02-04 [merge] BUG#59147: automerge mysql-5.5 into mysql-trunk.
[4 Feb 2011 15:03]
Bugs System
Pushed into mysql-5.1 5.1.56 (revid:luis.soares@oracle.com-20110204145751-34t52v63kxu7fqpx) (version source revid:luis.soares@oracle.com-20110204145751-34t52v63kxu7fqpx) (merge vers: 5.1.56) (pib:24)
[4 Feb 2011 15:03]
Bugs System
Pushed into mysql-trunk 5.6.2 (revid:luis.soares@oracle.com-20110204150014-a8qb9s0cd9xtpiud) (version source revid:luis.soares@oracle.com-20110204150014-a8qb9s0cd9xtpiud) (merge vers: 5.6.2) (pib:24)
[4 Feb 2011 15:03]
Bugs System
Pushed into mysql-5.5 5.5.10 (revid:luis.soares@oracle.com-20110204145904-2u4neak21owtv38p) (version source revid:luis.soares@oracle.com-20110204145904-2u4neak21owtv38p) (merge vers: 5.5.10) (pib:24)
[4 Feb 2011 15:40]
Luis Soares
Queued in mysql-5.1 and merged to mysql-5.5 and mysql-trunk.
[5 Feb 2011 9:14]
Jon Stephens
Test changes only, closed without further action.
[9 Feb 2011 20:59]
Jon Stephens
Set back to Closed; see my previous comment.