Bug #12429 Replication tests fail: "Slave_IO_Running" (?) differs
Submitted: 8 Aug 2005 11:24 Modified: 26 Apr 2006 15:39
Reporter: Joerg Bruehe Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.0.11 OS:various Unix
Assigned to: Matthias Leich CPU Architecture:Any

[8 Aug 2005 11:24] Joerg Bruehe
Description:
Build of 5.0.11, based on ChangeSet
  1.1950.1.1 05/08/06 03:10:35 kent@mysql.com +1 -0
  make_win_src_distribution.sh:
    Added missing backslashes

Test failure:
rpl_relayrotate                [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 80: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_relayrotate.result
--- r/rpl_relayrotate.reject
***************
*** 18,22
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #       #       master-bin.000001
    Yes     Yes                                                     0               0       735186  #       None
0       No                                              #                                                                     drop table t1;
--- 18,22 
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #       #       master-bin.000001
    No      Yes                                                     0               0       735186  #       None
0       No                                              #
  drop table t1;
-------------------------------------------------------

Occurred on platform / product / in test run:
bsd53-5.0-cluster.log   normal
bsd53-5.0-max.log   normal + ps-protocol
bsd53-5.0-pro.log   normal + ps-protocol
bsd53-5.0-standard.log   normal
buildqnx2-5.0-debug.log   normal + ps-protocol
buildqnx2-5.0-pro.log   normal + ps-protocol
buildqnx2-5.0-standard.log   normal + ps-protocol                                                                           cane-linuxthreads-5.0-debug.log   normal
etpglb0-5.0-debug.log   normal + ps-protocol 
etpglb0-5.0-max.log   normal + ps-protocol 
etpglb0-5.0-standard.log   normal + ps-protocol 
hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-max.log   normal
hammer-glibc23-5.0-pro.log   normal + ps-protocol 
hammer-glibc23-5.0-standard.log   normal + ps-protocol 
ita2-rhas21-5.0-debug.log   normal
ita2-rhas21-5.0-pro.log   normal + ps-protocol
ita2-rhas21-5.0-standard.log   normal + ps-protocol
sol10-x86-5.0-max.log   ps-protocol
sol10-x86-5.0-pro.log   ps-protocol
sol9x86-5.0-cluster.log   normal

How to repeat:
Build + test
[8 Aug 2005 13:58] Joerg Bruehe
rpl_until                      [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 88: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776     slave-relay-bin.000004  #       master-bin.000001       Yes     No                                                      0               0       608     #       Relay   slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
--- 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776     slave-relay-bin.000004  #       master-bin.000001       No      No                                                      0               0       608     #       Relay   slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
-------------------------------------------------------

Occurred:
etpglb0-5.0-debug.log   normal
etpglb0-5.0-max.log   normal
etpglb0-5.0-max.log   ps-protocol
etpglb0-5.0-standard.log   normal
etpglb0-5.0-standard.log   ps-protocol
sol10-sparc-a-64bit-5.0-cluster.log   normal
sol10-sparc-a-64bit-5.0-max.log   normal
sol10-sparc-a-64bit-5.0-standard.log   normal
sunfire100b-5.0-pro.log   ps-protocol
[8 Aug 2005 14:10] Joerg Bruehe
Further occurrence:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....

Occurred:
rhas3-x86-glibc23-5.0-cluster.log   normal
rhas3-x86-glibc23-5.0-debug.log   normal
rhas3-x86-glibc23-5.0-pro.log   normal
rhas3-x86-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
intelxeon3-5.0-debug.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 55,61
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
  show slave status;
.....

intelxeon3-5.0-cluster.log   normal
intelxeon3-5.0-max.log   normal
intelxeon3-5.0-pro.log   normal
intelxeon3-5.0-pro.log   ps-protocol
intelxeon3-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
***************
*** 49,55
  2
  show slave status;
.....

hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-debug.log   ps-protocol
hammer-glibc23-5.0-max.log   ps-protocol
hammer-glibc23-5.0-pro.log   ps-protocol
hammer-glibc23-5.0-standard.log   ps-protocol
Same, but also lines 31-37:
hammer-glibc23-5.0-cluster.log   normal
hammer-glibc23-5.0-debug.log   normal
hammer-glibc23-5.0-pro.log   normal
hammer-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....
***************
*** 55,61                                                                                                                     start slave until master_log_file='master-bin.000001', master_log_pos=776;                                                  show slave status;
.....

rhas3-x86-glibc23-5.0-max.log   normal
[10 Aug 2005 10:11] Joerg Bruehe
In a new build (without "yassl"), it reoccurred, but in different test runs ("normal" vs. "PS protocol"), so it seems not exactly reproducible. Race condition?
[27 Aug 2005 20:53] Alexander Pachev
All replication tests that have a sleep in the test with a fixed number of seconds sooner or later will have a similar problem. In order to fix them, we need to add some extra features to mysqltest.c and the server itself. For example, we have no way of knowing exactly when the slave I/O or SQL thread would hit some kind of an error that we are trying to simulate. At this time, all we can do is sleep and hope the problem happened soon enough. But once in a while you hit a peculiar scheduler or thread implementation where it does not happen as soon as you expected. So in short, to avoid those problem, additional features are required in the test suite and in the server.
[25 Sep 2005 2:31] Alexander Pachev
Needs a new feature in mysqltest to be fixed.
[16 Nov 2005 17:23] Matthias Leich
There are three testcases which sometimes suffer from the fact that
a "SHOW SLAVE STATUS" comes too early
   rpl_relayrotate : line 61
   rpl_until           : line 53
   rpl_deadlock    : line 107
If this is the case we get for the show result column Slave_IO_Running
the value "No" instead of the expected "Yes".

My current experience is, that an extreme fast IO system increases
the probability of getting this problem.
[17 Nov 2005 20:29] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/32379
[3 Dec 2005 14:56] Guilhem Bichot
There may be a mix of different issues in this bug report. I see that at least for rpl_relayrotate and rpl_until, the difference between the expected result and the observed result is only in "Slave_IO_Running". It's understandable that one cannot be sure if it will be Yes or No, as the synchronization done in the tests, MASTER_POS_WAIT() or START SLAVE UNTIL, applies only to the slave SQL thread. So this difference can simply be avoided by replacing column 11 by "#" in SHOW SLAVE STATUS. Note that this solution is NOT a hack, not a simple problem masking. It is a real reasoning.
In the test we don't care if the slave I/O thread is running, we don't need it, only the slave SQL thread is of interest. So while it's ok to implement a solution which waits until the slave I/O thread has the "Yes" state (see Matthias' patch), it was not strictly neededin order to fix this "bug".
[3 Dec 2005 15:07] Guilhem Bichot
Even more, we just don't need SHOW SLAVE STATUS in rpl_relayrotate. The fact that the previous SELECT returns 8000 is all that's important. So one can remove SHOW SLAVE STATUS from the test and re-enable this test.
[5 Dec 2005 16:04] Matthias Leich
Snip of an answer of Guilhem given to me:

> You wrote that at least for rpl_relayrotate and rpl_until,
> the difference between the expected result and the observed
> result is only in "Slave_IO_Running".
> 
> This is also valid for rpl_deadlock 
> which failed with mysql-5.1.1-alpha builds 05/08/25
> on  "cane" (i386-unknown-freebsd4.7) debug normal
> 
> *** r/rpl_deadlock.result     Thu Aug 25 17:17:09 2005
> 09924: --- r/rpl_deadlock.reject      Sat Aug 27 00:03:56 2005
> 09925: ***************
> 09926: *** 83,87 ****
> 09927:   22
> 09928:   show slave status;
> 09929:   Slave_IO_State       Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09930: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       18911   #       #       master-bin.000001       Yes     Yes                                                     0               0       18911   #       None            0       No                                              #
> 09931:   drop table t1,t2,t3,t4;
> 09932: --- 83,87 ----
> 09933:   22
> 09934:   show slave status;
> 09935:   Slave_IO_State       Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09936: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       18911   #       #       master-bin.000001       No      Yes                                                     0               0       18911   #       None            0       No                                              #
> 09937:   drop table t1,t2,t3,t4;
> 09938: -------------------------------------------------------
> 
> If 
> - the value for "Slave_IO_Running" does not matter

right

> and
> - it can be expected that the other values do not change
>   between executing the previous command 
>      select * from t2;   and
>   the show slave status

right again. The values which are "random" are:
1) slave_io_running
2) relay_log_file
3) relay_log_pos
4) relay_log_space

1) you know why
2)3)4) is because depending on timing, the slave may disconnect and
reconnect, and every time it reconnects the relay log grows by circa
48 bytes. So 3) (=offset where the slave SQL thread is in the relay
log), 2) (=name of that relay log, if relay log has grown this may
trigger a relay log rotation creating a new relay log), 4) (relay log
grows so occupied space grows too).

2) usually is not random (so it is not replaced by # in all results),
but in theory it is.

> we could also mangle the Slave_IO_Running column with
> replace_result.
> 
> What do you think ?

I think this would be a good idea.
[5 Dec 2005 17:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33039
[12 Dec 2005 16:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33214
[12 Dec 2005 17:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33215
[12 Dec 2005 18:17] Joerg Bruehe
These patches of today affect test "rpl_until":
1) The original change is backported to 4.1, and
2) it is extended to 2 other places in that test (the status is checked 3 times).
[10 Apr 2006 12:55] Lars Thalmann
The bug is fixed, but the test cases are still disabled in 4.1.
[11 Apr 2006 19:40] Matthias Leich
I discovered today the same instability in MySQL 4.1
rpl_deadlock.test line 85.
[12 Apr 2006 15:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/4867
[18 Apr 2006 10:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/5074
[26 Apr 2006 15:39] Matthias Leich
Got approval for bug fix via email

Fix pushed to 4.1.19

No changelog entry needed