MySQL Bugs: #12429: Replication tests fail: "Slave_IO

Bug #12429	Replication tests fail: "Slave_IO_Running" (?) differs
Submitted:	8 Aug 2005 11:24	Modified:	26 Apr 2006 15:39
Reporter:	Joerg Bruehe	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.0.11	OS:	various Unix
Assigned to:	Matthias Leich	CPU Architecture:	Any

Description:
Build of 5.0.11, based on ChangeSet
  1.1950.1.1 05/08/06 03:10:35 kent@mysql.com +1 -0
  make_win_src_distribution.sh:
    Added missing backslashes

Test failure:
rpl_relayrotate                [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 80: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_relayrotate.result
--- r/rpl_relayrotate.reject
***************
*** 18,22
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #       #       master-bin.000001
    Yes     Yes                                                     0               0       735186  #       None
0       No                                              #                                                                     drop table t1;
--- 18,22 
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #       #       master-bin.000001
    No      Yes                                                     0               0       735186  #       None
0       No                                              #
  drop table t1;
-------------------------------------------------------

Occurred on platform / product / in test run:
bsd53-5.0-cluster.log   normal
bsd53-5.0-max.log   normal + ps-protocol
bsd53-5.0-pro.log   normal + ps-protocol
bsd53-5.0-standard.log   normal
buildqnx2-5.0-debug.log   normal + ps-protocol
buildqnx2-5.0-pro.log   normal + ps-protocol
buildqnx2-5.0-standard.log   normal + ps-protocol                                                                           cane-linuxthreads-5.0-debug.log   normal
etpglb0-5.0-debug.log   normal + ps-protocol 
etpglb0-5.0-max.log   normal + ps-protocol 
etpglb0-5.0-standard.log   normal + ps-protocol 
hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-max.log   normal
hammer-glibc23-5.0-pro.log   normal + ps-protocol 
hammer-glibc23-5.0-standard.log   normal + ps-protocol 
ita2-rhas21-5.0-debug.log   normal
ita2-rhas21-5.0-pro.log   normal + ps-protocol
ita2-rhas21-5.0-standard.log   normal + ps-protocol
sol10-x86-5.0-max.log   ps-protocol
sol10-x86-5.0-pro.log   ps-protocol
sol9x86-5.0-cluster.log   normal

How to repeat:
Build + test

rpl_until                      [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 88: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776     slave-relay-bin.000004  #       master-bin.000001       Yes     No                                                      0               0       608     #       Relay   slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
--- 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776     slave-relay-bin.000004  #       master-bin.000001       No      No                                                      0               0       608     #       Relay   slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
-------------------------------------------------------

Occurred:
etpglb0-5.0-debug.log   normal
etpglb0-5.0-max.log   normal
etpglb0-5.0-max.log   ps-protocol
etpglb0-5.0-standard.log   normal
etpglb0-5.0-standard.log   ps-protocol
sol10-sparc-a-64bit-5.0-cluster.log   normal
sol10-sparc-a-64bit-5.0-max.log   normal
sol10-sparc-a-64bit-5.0-standard.log   normal
sunfire100b-5.0-pro.log   ps-protocol

Further occurrence:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....

Occurred:
rhas3-x86-glibc23-5.0-cluster.log   normal
rhas3-x86-glibc23-5.0-debug.log   normal
rhas3-x86-glibc23-5.0-pro.log   normal
rhas3-x86-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
intelxeon3-5.0-debug.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 55,61
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
  show slave status;
.....

intelxeon3-5.0-cluster.log   normal
intelxeon3-5.0-max.log   normal
intelxeon3-5.0-pro.log   normal
intelxeon3-5.0-pro.log   ps-protocol
intelxeon3-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
***************
*** 49,55
  2
  show slave status;
.....

hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-debug.log   ps-protocol
hammer-glibc23-5.0-max.log   ps-protocol
hammer-glibc23-5.0-pro.log   ps-protocol
hammer-glibc23-5.0-standard.log   ps-protocol
Same, but also lines 31-37:
hammer-glibc23-5.0-cluster.log   normal
hammer-glibc23-5.0-debug.log   normal
hammer-glibc23-5.0-pro.log   normal
hammer-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....
***************
*** 55,61                                                                                                                     start slave until master_log_file='master-bin.000001', master_log_pos=776;                                                  show slave status;
.....

rhas3-x86-glibc23-5.0-max.log   normal

In a new build (without "yassl"), it reoccurred, but in different test runs ("normal" vs. "PS protocol"), so it seems not exactly reproducible. Race condition?

All replication tests that have a sleep in the test with a fixed number of seconds sooner or later will have a similar problem. In order to fix them, we need to add some extra features to mysqltest.c and the server itself. For example, we have no way of knowing exactly when the slave I/O or SQL thread would hit some kind of an error that we are trying to simulate. At this time, all we can do is sleep and hope the problem happened soon enough. But once in a while you hit a peculiar scheduler or thread implementation where it does not happen as soon as you expected. So in short, to avoid those problem, additional features are required in the test suite and in the server.

Needs a new feature in mysqltest to be fixed.

There are three testcases which sometimes suffer from the fact that
a "SHOW SLAVE STATUS" comes too early
   rpl_relayrotate : line 61
   rpl_until           : line 53
   rpl_deadlock    : line 107
If this is the case we get for the show result column Slave_IO_Running
the value "No" instead of the expected "Yes".

My current experience is, that an extreme fast IO system increases
the probability of getting this problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/32379

There may be a mix of different issues in this bug report. I see that at least for rpl_relayrotate and rpl_until, the difference between the expected result and the observed result is only in "Slave_IO_Running". It's understandable that one cannot be sure if it will be Yes or No, as the synchronization done in the tests, MASTER_POS_WAIT() or START SLAVE UNTIL, applies only to the slave SQL thread. So this difference can simply be avoided by replacing column 11 by "#" in SHOW SLAVE STATUS. Note that this solution is NOT a hack, not a simple problem masking. It is a real reasoning.
In the test we don't care if the slave I/O thread is running, we don't need it, only the slave SQL thread is of interest. So while it's ok to implement a solution which waits until the slave I/O thread has the "Yes" state (see Matthias' patch), it was not strictly neededin order to fix this "bug".

Even more, we just don't need SHOW SLAVE STATUS in rpl_relayrotate. The fact that the previous SELECT returns 8000 is all that's important. So one can remove SHOW SLAVE STATUS from the test and re-enable this test.

Snip of an answer of Guilhem given to me:

> You wrote that at least for rpl_relayrotate and rpl_until,
> the difference between the expected result and the observed
> result is only in "Slave_IO_Running".
> 
> This is also valid for rpl_deadlock 
> which failed with mysql-5.1.1-alpha builds 05/08/25
> on  "cane" (i386-unknown-freebsd4.7) debug normal
> 
> *** r/rpl_deadlock.result     Thu Aug 25 17:17:09 2005
> 09924: --- r/rpl_deadlock.reject      Sat Aug 27 00:03:56 2005
> 09925: ***************
> 09926: *** 83,87 ****
> 09927:   22
> 09928:   show slave status;
> 09929:   Slave_IO_State       Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09930: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       18911   #       #       master-bin.000001       Yes     Yes                                                     0               0       18911   #       None            0       No                                              #
> 09931:   drop table t1,t2,t3,t4;
> 09932: --- 83,87 ----
> 09933:   22
> 09934:   show slave status;
> 09935:   Slave_IO_State       Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09936: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       18911   #       #       master-bin.000001       No      Yes                                                     0               0       18911   #       None            0       No                                              #
> 09937:   drop table t1,t2,t3,t4;
> 09938: -------------------------------------------------------
> 
> If 
> - the value for "Slave_IO_Running" does not matter

right

> and
> - it can be expected that the other values do not change
>   between executing the previous command 
>      select * from t2;   and
>   the show slave status

right again. The values which are "random" are:
1) slave_io_running
2) relay_log_file
3) relay_log_pos
4) relay_log_space

1) you know why
2)3)4) is because depending on timing, the slave may disconnect and
reconnect, and every time it reconnects the relay log grows by circa
48 bytes. So 3) (=offset where the slave SQL thread is in the relay
log), 2) (=name of that relay log, if relay log has grown this may
trigger a relay log rotation creating a new relay log), 4) (relay log
grows so occupied space grows too).

2) usually is not random (so it is not replaced by # in all results),
but in theory it is.

> we could also mangle the Slave_IO_Running column with
> replace_result.
> 
> What do you think ?

I think this would be a good idea.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33039

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33214

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33215

These patches of today affect test "rpl_until":
1) The original change is backported to 4.1, and
2) it is extended to 2 other places in that test (the status is checked 3 times).

The bug is fixed, but the test cases are still disabled in 4.1.

I discovered today the same instability in MySQL 4.1
rpl_deadlock.test line 85.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/4867

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/5074

Got approval for bug fix via email

Fix pushed to 4.1.19

No changelog entry needed