Bug #28144 "Slave_IO_Running" differs in replication tests
Submitted: 27 Apr 2007 19:00 Modified: 27 Jun 2007 20:04
Reporter: Joerg Bruehe Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.0.40 OS:Other (Sol10x86 + Linux)
Assigned to: Ramil Kalimullin CPU Architecture:Any

[27 Apr 2007 19:00] Joerg Bruehe
Description:
Occurred in the 5.0.40 release build:

-------------------------------------------------------
*** r/rpl_ssl.result
--- r/rpl_ssl.reject
***************
*** 68,74
  Relay_Log_File        #
  Relay_Log_Pos #
  Relay_Master_Log_File master-bin.000001
! Slave_IO_Running      Yes
  Slave_SQL_Running     Yes
  Replicate_Do_DB
  Replicate_Ignore_DB
--- 68,74
  Relay_Log_File        #
  Relay_Log_Pos #
  Relay_Master_Log_File master-bin.000001
! Slave_IO_Running      No
  Slave_SQL_Running     Yes
  Replicate_Do_DB
  Replicate_Ignore_DB
-------------------------------------------------------
Happened once only:
sol10-x86-b-5.0-enterprise-gpl.log   ps

More difficult to read:
-------------------------------------------------------
*** r/rpl_log_pos.result
--- r/rpl_log_pos.reject
***************
*** 21,27
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_
Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicat
e_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_I
gnore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition
Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert
Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       73      #       #       master-bin.00000
1       No      Yes                                                     0               0       73      #       None
        0       No                                              #
  stop slave;
  change master to master_log_pos=173;
  start slave;
--- 21,27
  start slave;
  show slave status;
  Slave_IO_State        Master_Host     Master_User     Master_Port     Connect_Retry   Master_Log_File Read_Master_Log_
Pos     Relay_Log_File  Relay_Log_Pos   Relay_Master_Log_File   Slave_IO_Running        Slave_SQL_Running       Replicat
e_Do_DB Replicate_Ignore_DB     Replicate_Do_Table      Replicate_Ignore_Table  Replicate_Wild_Do_Table Replicate_Wild_I
gnore_Table     Last_Errno      Last_Error      Skip_Counter    Exec_Master_Log_Pos     Relay_Log_Space Until_Condition
Until_Log_File  Until_Log_Pos   Master_SSL_Allowed      Master_SSL_CA_File      Master_SSL_CA_Path      Master_SSL_Cert
Master_SSL_Cipher       Master_SSL_Key  Seconds_Behind_Master
! #     127.0.0.1       root    MASTER_PORT     1       master-bin.000001       73      #       #       master-bin.00000
1       Yes     Yes                                                     0               0       73      #       None
        0       No                                              #
  stop slave;
  change master to master_log_pos=173;
  start slave;
-------------------------------------------------------
Also happened only once:
production-icc-glibc23-5.0-enterprise-gpl.log   normal

How to repeat:
Detected when running the test suite.
[11 May 2007 16:42] Jonathan Miller
Are there any errors in the slave error logs?

I can not see why it would be failing in rpl_ssl, but rpl_log_pos is poorly done and probably should be rewritten.

Other issues is that I do not have a Solaris to test with. 

/Jeb
[12 May 2007 15:15] Joerg Bruehe
When we do a release build, we run a series (once for each configuration) of
   configure ; make ; test
actions, and for each such action we copy back the generated package and the log.
The tests are run using the "--force" option, so they continue after a failure.

When we analyze the log and extract the failure information (if any), the next configuration is already being built and tested - maybe even more.
So any trace files have likely been overwritten by the continuing test,
they are already gone when we analyze,
and even the archive of this test build has been deleted in the three weeks that passed since the run.

As regards the machine: It will be possible to repeat the test on that build host, the tools will support parallel runs.

From my build experience, "Slave IO running" is a field that every now and then produces a different result - it is mentioned in several different bug reports.

So IMO either we find a better way to synchronize master and slave for such a report, to guarantee stable results,
or we use the "--replace-result" mechanism to ignore this field (provided its value is not critical ti judge the test success).
[14 May 2007 19:46] BJ Dierkes
I have the same issue on Redhat Enterprise Linux ES 3 - x86_64

rpl_ssl                        [ fail ]

Errors are (from /var/rpebuildfarm/build/mysql-5.0.41-1.rs/BUILD/mysql-5.0.41/mysql-test/var/log/mysqltest-time) :
mysqltest: Result length mismatch
(the last lines may be the most important ones)
Below are the diffs between actual and expected results:
-------------------------------------------------------
*** r/rpl_ssl.result    Wed May  2 18:10:47 2007
--- r/rpl_ssl.reject    Mon May 14 22:29:40 2007
***************
*** 68,74 ****
  Relay_Log_File        #
  Relay_Log_Pos #
  Relay_Master_Log_File master-bin.000001
! Slave_IO_Running      Yes
  Slave_SQL_Running     Yes
  Replicate_Do_DB
  Replicate_Ignore_DB
--- 68,74 ----
  Relay_Log_File        #
  Relay_Log_Pos #
  Relay_Master_Log_File master-bin.000001
! Slave_IO_Running      No
  Slave_SQL_Running     Yes
  Replicate_Do_DB
  Replicate_Ignore_DB
-------------------------------------------------------
Please follow the instructions outlined at
http://www.mysql.com/doc/en/Reporting_mysqltest_bugs.html
to find the reason to this problem and how to report this.

Result from queries before failure can be found in r/rpl_ssl.log

Aborting: rpl_ssl failed in ps-protocol mode. To continue, re-run with '--force'.
Stopping All Servers
make: *** [test-ps] Error 1
[30 May 2007 14:40] Lars Thalmann
See BUG#28772.
[8 Jun 2007 12:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28395

ChangeSet@1.2529, 2007-06-08 17:51:29+05:00, ramil@mysql.com +5 -0
  Fix for bug #28144: "Slave_IO_Running" differs in replication tests
  
  Problem: Slave_IO_Running values depend on circumstances while they are not important, actually.
  Fix: replace them with #
[8 Jun 2007 13:58] Jonathan Miller
Hi,

Please note I disagree with proposed patch. We need to understand why it differs, not mask it! Once we understand why it differs then we could patch it correctly.

Just my $.02
/Jeb
[8 Jun 2007 14:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28403

ChangeSet@1.2529, 2007-06-08 19:01:39+05:00, ramil@mysql.com +5 -0
  Fix for bug #28144: "Slave_IO_Running" differs in replication tests
  
  Problem: Slave_IO_Running values depend on circumstances while they are not important, actually.
  Fix: replace them with #, remove SHOW SLAVE STATUS if we don't really need it.
[13 Jun 2007 8:40] Ramil Kalimullin
Jeb,

the reason why it differs is clear: we don't wait for an explicit (desirable?)
state of the slave, just use some incoherent combinations of delays etc. instead,
so we get some uncertain slave's parameters (e.g Slave_IO_Running) sometimes as
them may depend on external factors such as machine/network load, etc.

Please also see emails from Magnus and Mats.
[13 Jun 2007 19:17] Jonathan Miller
Relay_Log_Pos can be nondeterministic (I do agree that not all items always come out the same), but Slave_IO_State is not nondeterministic, it is very deterministic, and if it has stopped when we don't expect it then there is an issue. And if it does not stop when expected there is an issue.

The slave IO Thread is deterministic just as is sql thread. the should run and stop on all systems just them same. If not, please show me in the docs where we say that they act differently on different platforms.

My suggestion would be to figure out what the state of it should be during that test. If different on some platforms then it is not expected and should be investigated as why it behaves the way it does.

I disagree with masking a deterministic field because the value is not what was expect and is not exactly what we are testing for. 

In addition, we do a lot of things in testing maybe one thing and we should expect that all these things we use to act the same when they are deterministic, other wise it becomes a nightmare for QA and for customer that are trying to use the product. 

Best wishes,
[15 Jun 2007 5:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28834

ChangeSet@1.2529, 2007-06-15 10:47:10+05:00, ramil@mysql.com +8 -0
  Fix for bug #28144: "Slave_IO_Running" differs in replication tests
  
  Problem: show slave status may return different Slave_IO_Running values running some tests.
  Fix: wait for a certain slave state if needed to get tests more predictable.
[15 Jun 2007 11:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28858

ChangeSet@1.2536, 2007-06-15 16:09:28+05:00, ramil@mysql.com +5 -0
  Fix for bug #28144: "Slave_IO_Running" differs in replication tests
  
  Problem: show slave status may return different Slave_IO_Running values running some tests.
  Fix: wait for a certain slave state if needed to get tests more predictable.
[22 Jun 2007 18:07] Bugs System
Pushed into 5.1.20-beta
[22 Jun 2007 18:09] Bugs System
Pushed into 5.0.46
[27 Jun 2007 20:04] Paul DuBois
Test case changes. No changelog entry needed.