Bug #12429 Replication tests fail: "Slave_IO_Running" (?) differs
Submitted: 8 Aug 2005 13:24 Modified: 26 Apr 2006 17:39
Reporter: Joerg Bruehe
Status: Closed
Category:Server: Replication Severity:S3 (Non-critical)
Version:5.0.11 OS:various Unix
Assigned to: Matthias Leich Target Version:

[8 Aug 2005 13:24] Joerg Bruehe
Description:
Build of 5.0.11, based on ChangeSet
  1.1950.1.1 05/08/06 03:10:35 kent@mysql.com +1 -0
  make_win_src_distribution.sh:
    Added missing backslashes

Test failure:
rpl_relayrotate                [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 80: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_relayrotate.result
--- r/rpl_relayrotate.reject
***************
*** 18,22
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher
    Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #
      #       master-bin.000001
    Yes     Yes                                                     0               0    
  735186  #       None
0       No                                              #                                
                                    drop table t1;
--- 18,22 
  8000
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos
    Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running       
Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table
    Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos    
Relay_Log_Space Until_Condition Until_Log_File  Until_Log_Pos   Master_SSL_Allowed     
Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert Master_SSL_Cipher      
Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       735186  #
      #       master-bin.000001
    No      Yes                                                     0               0    
  735186  #       None
0       No                                              #
  drop table t1;
-------------------------------------------------------

Occurred on platform / product / in test run:
bsd53-5.0-cluster.log   normal
bsd53-5.0-max.log   normal + ps-protocol
bsd53-5.0-pro.log   normal + ps-protocol
bsd53-5.0-standard.log   normal
buildqnx2-5.0-debug.log   normal + ps-protocol
buildqnx2-5.0-pro.log   normal + ps-protocol
buildqnx2-5.0-standard.log   normal + ps-protocol                                        
                                  cane-linuxthreads-5.0-debug.log   normal
etpglb0-5.0-debug.log   normal + ps-protocol 
etpglb0-5.0-max.log   normal + ps-protocol 
etpglb0-5.0-standard.log   normal + ps-protocol 
hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-max.log   normal
hammer-glibc23-5.0-pro.log   normal + ps-protocol 
hammer-glibc23-5.0-standard.log   normal + ps-protocol 
ita2-rhas21-5.0-debug.log   normal
ita2-rhas21-5.0-pro.log   normal + ps-protocol
ita2-rhas21-5.0-standard.log   normal + ps-protocol
sol10-x86-5.0-max.log   ps-protocol
sol10-x86-5.0-pro.log   ps-protocol
sol9x86-5.0-cluster.log   normal

How to repeat:
Build + test
[8 Aug 2005 15:58] Joerg Bruehe
rpl_until                      [ fail ]

Errors are (from PATH/mysqltest-time) :
At line 88: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776    
slave-relay-bin.000004  #       master-bin.000001       Yes     No                       
                              0               0       608     #       Relay  
slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
--- 49,55
  2
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry  
Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001       776    
slave-relay-bin.000004  #       master-bin.000001       No      No                       
                              0               0       608     #       Relay  
slave-relay-bin.000004  746     No                                              #
  start slave;
  stop slave;
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
-------------------------------------------------------

Occurred:
etpglb0-5.0-debug.log   normal
etpglb0-5.0-max.log   normal
etpglb0-5.0-max.log   ps-protocol
etpglb0-5.0-standard.log   normal
etpglb0-5.0-standard.log   ps-protocol
sol10-sparc-a-64bit-5.0-cluster.log   normal
sol10-sparc-a-64bit-5.0-max.log   normal
sol10-sparc-a-64bit-5.0-standard.log   normal
sunfire100b-5.0-pro.log   ps-protocol
[8 Aug 2005 16:10] Joerg Bruehe
Further occurrence:
-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....

Occurred:
rhas3-x86-glibc23-5.0-cluster.log   normal
rhas3-x86-glibc23-5.0-debug.log   normal
rhas3-x86-glibc23-5.0-pro.log   normal
rhas3-x86-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
intelxeon3-5.0-debug.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 55,61
  start slave until master_log_file='master-bin.000001', master_log_pos=776;
  show slave status;
.....

intelxeon3-5.0-cluster.log   normal
intelxeon3-5.0-max.log   normal
intelxeon3-5.0-pro.log   normal
intelxeon3-5.0-pro.log   ps-protocol
intelxeon3-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 41,47
  4
  show slave status;
.....
***************
*** 49,55
  2
  show slave status;
.....

hammer-glibc23-5.0-cluster.log   ps-protocol
hammer-glibc23-5.0-debug.log   ps-protocol
hammer-glibc23-5.0-max.log   ps-protocol
hammer-glibc23-5.0-pro.log   ps-protocol
hammer-glibc23-5.0-standard.log   ps-protocol
Same, but also lines 31-37:
hammer-glibc23-5.0-cluster.log   normal
hammer-glibc23-5.0-debug.log   normal
hammer-glibc23-5.0-pro.log   normal
hammer-glibc23-5.0-standard.log   normal

-------------------------------------------------------
*** r/rpl_until.result
--- r/rpl_until.reject
***************
*** 31,37
  4
  show slave status;
.....
***************
*** 55,61                                                                                
                                    start slave until master_log_file='master-bin.000001',
master_log_pos=776;                                                  show slave status;
.....

rhas3-x86-glibc23-5.0-max.log   normal
[10 Aug 2005 12:11] Joerg Bruehe
In a new build (without "yassl"), it reoccurred, but in different test runs ("normal" vs.
"PS protocol"), so it seems not exactly reproducible. Race condition?
[27 Aug 2005 22:53] Alexander Pachev
All replication tests that have a sleep in the test with a fixed number of seconds sooner
or later will have a similar problem. In order to fix them, we need to add some extra
features to mysqltest.c and the server itself. For example, we have no way of knowing
exactly when the slave I/O or SQL thread would hit some kind of an error that we are
trying to simulate. At this time, all we can do is sleep and hope the problem happened
soon enough. But once in a while you hit a peculiar scheduler or thread implementation
where it does not happen as soon as you expected. So in short, to avoid those problem,
additional features are required in the test suite and in the server.
[25 Sep 2005 4:31] Alexander Pachev
Needs a new feature in mysqltest to be fixed.
[16 Nov 2005 18:23] Matthias Leich
There are three testcases which sometimes suffer from the fact that
a "SHOW SLAVE STATUS" comes too early
   rpl_relayrotate : line 61
   rpl_until           : line 53
   rpl_deadlock    : line 107
If this is the case we get for the show result column Slave_IO_Running
the value "No" instead of the expected "Yes".

My current experience is, that an extreme fast IO system increases
the probability of getting this problem.
[17 Nov 2005 21:29] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/32379
[3 Dec 2005 15:56] Guilhem Bichot
There may be a mix of different issues in this bug report. I see that at least for
rpl_relayrotate and rpl_until, the difference between the expected result and the
observed result is only in "Slave_IO_Running". It's understandable that one cannot be
sure if it will be Yes or No, as the synchronization done in the tests, MASTER_POS_WAIT()
or START SLAVE UNTIL, applies only to the slave SQL thread. So this difference can simply
be avoided by replacing column 11 by "#" in SHOW SLAVE STATUS. Note that this solution is
NOT a hack, not a simple problem masking. It is a real reasoning.
In the test we don't care if the slave I/O thread is running, we don't need it, only the
slave SQL thread is of interest. So while it's ok to implement a solution which waits
until the slave I/O thread has the "Yes" state (see Matthias' patch), it was not strictly
neededin order to fix this "bug".
[3 Dec 2005 16:07] Guilhem Bichot
Even more, we just don't need SHOW SLAVE STATUS in rpl_relayrotate. The fact that the
previous SELECT returns 8000 is all that's important. So one can remove SHOW SLAVE STATUS
from the test and re-enable this test.
[5 Dec 2005 17:04] Matthias Leich
Snip of an answer of Guilhem given to me:

> You wrote that at least for rpl_relayrotate and rpl_until,
> the difference between the expected result and the observed
> result is only in "Slave_IO_Running".
> 
> This is also valid for rpl_deadlock 
> which failed with mysql-5.1.1-alpha builds 05/08/25
> on  "cane" (i386-unknown-freebsd4.7) debug normal
> 
> *** r/rpl_deadlock.result     Thu Aug 25 17:17:09 2005
> 09924: --- r/rpl_deadlock.reject      Sat Aug 27 00:03:56 2005
> 09925: ***************
> 09926: *** 83,87 ****
> 09927:   22
> 09928:   show slave status;
> 09929:   Slave_IO_State       Master_Host     Master_User     Master_Port    
Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09930: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001      
18911   #       #       master-bin.000001       Yes     Yes                              
                      0               0       18911   #       None            0       No  
                                           #
> 09931:   drop table t1,t2,t3,t4;
> 09932: --- 83,87 ----
> 09933:   22
> 09934:   show slave status;
> 09935:   Slave_IO_State       Master_Host     Master_User     Master_Port    
Connect_Retry   Master_Log_File Read_Master_Log_Pos     Relay_Log_File  Relay_Log_Pos  
Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicate_Do_DB
Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table 
Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table     Last_Errno      Last_Error     
Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition Until_Log_File 
Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path     
Master_SSL_Cert Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
> 09936: ! #    127.0.0.1       root    MASTER_MYPORT   1       master-bin.000001      
18911   #       #       master-bin.000001       No      Yes                              
                      0               0       18911   #       None            0       No  
                                           #
> 09937:   drop table t1,t2,t3,t4;
> 09938: -------------------------------------------------------
> 
> If 
> - the value for "Slave_IO_Running" does not matter

right

> and
> - it can be expected that the other values do not change
>   between executing the previous command 
>      select * from t2;   and
>   the show slave status

right again. The values which are "random" are:
1) slave_io_running
2) relay_log_file
3) relay_log_pos
4) relay_log_space

1) you know why
2)3)4) is because depending on timing, the slave may disconnect and
reconnect, and every time it reconnects the relay log grows by circa
48 bytes. So 3) (=offset where the slave SQL thread is in the relay
log), 2) (=name of that relay log, if relay log has grown this may
trigger a relay log rotation creating a new relay log), 4) (relay log
grows so occupied space grows too).

2) usually is not random (so it is not replaced by # in all results),
but in theory it is.

> we could also mangle the Slave_IO_Running column with
> replace_result.
> 
> What do you think ?

I think this would be a good idea.
[5 Dec 2005 18:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33039
[12 Dec 2005 17:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33214
[12 Dec 2005 18:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/33215
[12 Dec 2005 19:17] Joerg Bruehe
These patches of today affect test "rpl_until":
1) The original change is backported to 4.1, and
2) it is extended to 2 other places in that test (the status is checked 3 times).
[10 Apr 2006 14:55] Lars Thalmann
The bug is fixed, but the test cases are still disabled in 4.1.
[11 Apr 2006 21:40] Matthias Leich
I discovered today the same instability in MySQL 4.1
rpl_deadlock.test line 85.
[12 Apr 2006 17:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/4867
[18 Apr 2006 12:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/5074
[26 Apr 2006 17:39] Matthias Leich
Got approval for bug fix via email

Fix pushed to 4.1.19

No changelog entry needed