MySQL Bugs: #59147: rpl_circular_for_4

Bug #59147	rpl_circular_for_4_hosts fails sporadically
Submitted:	24 Dec 2010 7:22	Modified:	9 Feb 2011 20:59
Reporter:	Alexander Nozdrin	Email Updates:
Status:	Closed	Impact on me:	None
Category:	Tests: Replication	Severity:	S7 (Test Cases)
Version:	5.6.1, 5.5, mysql-trunk, 5.1	OS:	Any
Assigned to:	Luis Soares	CPU Architecture:	Any
Tags:	pb2, test failure

Description:
rpl_circular_for_4_hosts fails sporadically with the symptoms as follows:
@@ -171,14 +171,11 @@
 SELECT 'Master A',a,b FROM t1 WHERE c = 4 ORDER BY a,b;
 Master A	a	b
 Master A	13	A
-Master A	14	B
-Master A	15	C
 Master A	16	D
 SELECT 'Master B',a,b FROM t1 WHERE c = 4 ORDER BY a,b;
 Master B	a	b
 Master B	13	A
 Master B	14	B
-Master B	15	C
 Master B	16	D
 SELECT 'Master C',a,b FROM t1 WHERE c = 4 ORDER BY a,b;
 Master C	a	b
@@ -188,9 +185,6 @@
 Master C	16	D
 SELECT 'Master D',a,b FROM t1 WHERE c = 4 ORDER BY a,b;
 Master D	a	b
-Master D	13	A
-Master D	14	B
-Master D	15	C
 Master D	16	D

How to repeat:
XRef: http://pb2.norway.sun.com/?template=mysql_show_test_failure&test_failure_id=&test_output_i...

Happened quite widespread in the release build of 5.6.1-m5.

Omer,
This is failing ocassionally in PB2 and hence needs to be investigated. See http://pb2.norway.sun.com/web.py?template=mysql_show_test_failure&test_failure_id=&test_ou...

Changing the status to verified

The same problem happens in 5.1:

http://pb2.norway.sun.com/?template=mysql_show_test_failure&search=yes&push_id=1926484&tes...

It may be that the fix for Bug #59279 (forcing a server restart before this test) would also resolve this. If so, it should now stop happening on trunk. But I don't see it in trunk for quite a number of pushes anyway...

ANALYSIS
========

The failing part of the test case correspond to the part of the test
in which the four rows are inserted like this:

 - A inserts a=13
 - B inserts a=14
 - C inserts a=15
 - D inserts a=16

The wrong result shows:

  - Master A missing rows: B,a=13 and C,a=14
  - Master B missing rows: C,a=15
  - Master C not missing rows
  - Master D missing rows: B,a=13, C,a=14 and A,a=15

Apparently, the syncs operations between the servers is wrong (this
happens sporadically) and goes something like this:

1. D->A (that's why both Master A and D have row a=16)
2. A->B (that's why both Master A and B have rows a=13, a=16)
3. B->C (that's why both Master B and C have rows a=13, a=14)
4. C->? (where are rows a=13, a=14, a=15 at D?)

Early Conclusions:

  C1. Some rows are missing from some servers
     - C ends up with all expected rows
     - B misses C row
     - A misses B and C row
     - D misses all other rows except its own
  
  C2. Per C1, it seems that either: 
      1. the circular replication path is broken at sync time
      2. the sync order is wrong (shifted)
      3. one sync is missing

  C3. Per C1 and C2, this scenario could happen if server D IO thread
      is not running at the time of sync (so the circular replication
      is broken, see: C2.1)

Looking at the test case we find that before the call to rpl_sync.inc,
the connection C->D is actually re-established. For that the slave is
stopped in both servers, the new topology is set and the slave(s) are
restarted. 

We can also find, that the test issues the command START SLAVE and
proceeds immediately with the rest of the test commands (including a
call to rpl_sync.inc). This means that there is no waiting for the
slave threads to be completely started.

At this point in the test case, we can also verify that
rpl_only_running_threads is set to 1 (it is set just a little bit
earlier in the test case). This variable makes rpl_sync.inc to bypass
the syncing of slave thread that does not show Running= 'Yes' in SHOW
SLAVE STATUS. 

As a consequence, we may very well be facing a sync failure due to an
unfortunate set of events:

  1. rpl_only_running_threads=1 => bypasses silently syncing of slave
                                   thread that is not fully started

  2. START SLAVE command => is not waiting for the slave threads to be
                            fully started before continuing with the
                            rest of the test

  #1 and #2 potentially make a hole in the replication chain to go
  unnoticed (except for the wrong/short result in the end).

To validate this theory, I have checked what's in mysqld.log for
server 'D' and found out that the instructions corresponding to the
ones inside rpl_sync.inc (that check if slave threads are running) are
the following:

  2 Query     INSERT INTO t1(b,c) VALUES('D',4)
  2 Query     SHOW SLAVE STATUS
  2 Query     SHOW SLAVE STATUS
  2 Query     SELECT IF('Connecting' = 'Yes', 1, '')
  2 Query     SELECT IF('Yes' = 'Yes', 1, '')
  2 Query     SHOW SLAVE STATUS

Given that the Slave_IO_Running property can have three states:

  1. MYSQL_SLAVE_NOT_RUN => Shows 'No'
  2. MYSQL_SLAVE_RUN_NOT_CONNECT => Shows 'Connecting' (for 5.1.46+ and 5.5)
  3. MYSQL_SLAVE_RUN_CONNECT => Shows 'Yes'

All in all, this validates the above theory, explaining both:

  - the wrong result and
  - why the failure is sporadic.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/129825

3561 Luis Soares	2011-01-28
      BUG#59147: rpl_circular_for_4_hosts fails sporadically
      
      There is one part of the test case that needs to break
      and re-establish the circular topology. For this the test
      stops the slave threads on a couple of servers and restarts
      them with START SLAVE. However, no check is done on the
      status of the IO or SQL threads before proceeding with
      the subsequent commands.
      
      Because rpl_only_running_threads is set to 1 this can lead
      to silently not syncing all slave threads as expected, 
      ultimately resulting in unexpected results (and consequently
      on a failing test run).
      
      We fix this by replacing the START SLAVE instructions with
      calls to --source include/start_slave.inc, which will wait
      for the slave threads to be running (show 'Yes' in 
      Slave_IO|SQL_Running fields of SHOW SLAVE STATUS) before 
      proceeding.

Test regression introduced by patch for BUG#49978.

Luís, excellent analysis of this bug.

I think we should fix $only_running_threads to be less race-prone, and make it sync the IO thread even if it is "Connecting":

replace:
--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' = 'Yes', 1, '')`
by:
--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' != 'No', 1, '')`

Then, we also need to fix START SLAVE so that we can't have Slave_IO_Running=No after START SLAVE. We already have proper synchronization with the SQL thread so that Slave_SQL_Running=Yes after START SLAVE. To fix the IO thread, just set slave_running=MYSQL_RUN_NOT_CONNECT *before* broadcasting the condition that the IO thread has started. (While we are there, I'd suggest we change mi->slave_running to an enum and assign symbolic values to it, not 0 and 1.) I think there have been test several bugs the last years because of this missing sync, so would be great to have this fixed.

I was wrong in the last paragraph of last comment: Slave_IO_Running is already set to Connecting atomically when slave starts. So nothing needs to be done in the server source, this is a test-only bug.

I think Luís initial patch should be pushed (to remove races from rpl_circular_for_4_hosts), and also we should update rpl_sync.inc by replacing:
--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' = 'Yes', 1, '')`
by:
--let $_rpl_slave_io_running= `SELECT IF('$_rpl_slave_io_running' != 'No', 1, '')`

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/130334

3572 Luis Soares	2011-02-03
      BUG#59147: rpl_circular_for_4_hosts fails sporadically
            
      There is one part of the test case that needs to break
      and re-establish the circular topology. For this the test
      stops the slave threads on a couple of servers and restarts
      them with START SLAVE. However, no check is done on the
      status of the IO or SQL threads before proceeding with
      the subsequent commands.
            
      Because rpl_only_running_threads is set to 1 this can lead
      to silently not syncing all slave threads as expected, 
      ultimately resulting in unexpected results (and consequently
      on a failing test run).
            
      We fix this by replacing the START SLAVE instructions with
      calls to --source include/start_slave.inc, which will wait
      for the slave threads to be running (show 'Yes' in 
      Slave_IO|SQL_Running fields of SHOW SLAVE STATUS) before 
      proceeding.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/130335

3572 Luis Soares	2011-02-03
      BUG#59147: rpl_circular_for_4_hosts fails sporadically
            
      There is one part of the test case that needs to break
      and re-establish the circular topology. For this the test
      stops the slave threads on a couple of servers and restarts
      them with START SLAVE. However, no check is done on the
      status of the IO or SQL threads before proceeding with
      the subsequent commands.
            
      Because rpl_only_running_threads is set to 1 this can lead
      to silently not syncing all slave threads as expected, 
      ultimately resulting in unexpected results (and consequently
      on a failing test run).
            
      We fix this by replacing the START SLAVE instructions with
      calls to --source include/start_slave.inc, which will wait
      for the slave threads to be running (show 'Yes' in 
      Slave_IO|SQL_Running fields of SHOW SLAVE STATUS) before 
      proceeding. Additionally, we change rpl_sync.inc to make the
      IO thread report that it is running when its running status
      is any other than 'No'.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/130426

3576 Luis Soares	2011-02-04 [merge]
      BUG#59147: Automerged bzr bundle form bug report into mysql-5.1 .

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/130428

3301 Luis Soares	2011-02-04 [merge]
      BUG#59147: automerge mysql-5.1 into mysql-5.5.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/130429

3595 Luis Soares	2011-02-04 [merge]
      BUG#59147: automerge mysql-5.5 into mysql-trunk.

Pushed into mysql-5.1 5.1.56 (revid:luis.soares@oracle.com-20110204145751-34t52v63kxu7fqpx) (version source revid:luis.soares@oracle.com-20110204145751-34t52v63kxu7fqpx) (merge vers: 5.1.56) (pib:24)

Pushed into mysql-trunk 5.6.2 (revid:luis.soares@oracle.com-20110204150014-a8qb9s0cd9xtpiud) (version source revid:luis.soares@oracle.com-20110204150014-a8qb9s0cd9xtpiud) (merge vers: 5.6.2) (pib:24)

Pushed into mysql-5.5 5.5.10 (revid:luis.soares@oracle.com-20110204145904-2u4neak21owtv38p) (version source revid:luis.soares@oracle.com-20110204145904-2u4neak21owtv38p) (merge vers: 5.5.10) (pib:24)

Queued in mysql-5.1 and merged to mysql-5.5 and mysql-trunk.

Test changes only, closed without further action.

Set back to Closed; see my previous comment.