Bug #12429 | Replication tests fail: "Slave_IO_Running" (?) differs | ||
---|---|---|---|
Submitted: | 8 Aug 2005 11:24 | Modified: | 26 Apr 2006 15:39 |
Reporter: | Joerg Bruehe | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.0.11 | OS: | various Unix |
Assigned to: | Matthias Leich | CPU Architecture: | Any |
[8 Aug 2005 11:24]
Joerg Bruehe
[8 Aug 2005 13:58]
Joerg Bruehe
rpl_until [ fail ] Errors are (from PATH/mysqltest-time) : At line 88: Result length mismatch (the last lines may be the most important ones) Below are the diffs between actual and expected results: ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 49,55 2 show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_MYPORT 1 master-bin.000001 776 slave-relay-bin.000004 # master-bin.000001 Yes No 0 0 608 # Relay slave-relay-bin.000004 746 No # start slave; stop slave; start slave until master_log_file='master-bin.000001', master_log_pos=776; --- 49,55 2 show slave status; Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master ! # 127.0.0.1 root MASTER_MYPORT 1 master-bin.000001 776 slave-relay-bin.000004 # master-bin.000001 No No 0 0 608 # Relay slave-relay-bin.000004 746 No # start slave; stop slave; start slave until master_log_file='master-bin.000001', master_log_pos=776; ------------------------------------------------------- Occurred: etpglb0-5.0-debug.log normal etpglb0-5.0-max.log normal etpglb0-5.0-max.log ps-protocol etpglb0-5.0-standard.log normal etpglb0-5.0-standard.log ps-protocol sol10-sparc-a-64bit-5.0-cluster.log normal sol10-sparc-a-64bit-5.0-max.log normal sol10-sparc-a-64bit-5.0-standard.log normal sunfire100b-5.0-pro.log ps-protocol
[8 Aug 2005 14:10]
Joerg Bruehe
Further occurrence: ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 31,37 4 show slave status; ..... Occurred: rhas3-x86-glibc23-5.0-cluster.log normal rhas3-x86-glibc23-5.0-debug.log normal rhas3-x86-glibc23-5.0-pro.log normal rhas3-x86-glibc23-5.0-standard.log normal ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 41,47 4 show slave status; ..... intelxeon3-5.0-debug.log normal ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 55,61 start slave until master_log_file='master-bin.000001', master_log_pos=776; show slave status; ..... intelxeon3-5.0-cluster.log normal intelxeon3-5.0-max.log normal intelxeon3-5.0-pro.log normal intelxeon3-5.0-pro.log ps-protocol intelxeon3-5.0-standard.log normal ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 41,47 4 show slave status; ..... *************** *** 49,55 2 show slave status; ..... hammer-glibc23-5.0-cluster.log ps-protocol hammer-glibc23-5.0-debug.log ps-protocol hammer-glibc23-5.0-max.log ps-protocol hammer-glibc23-5.0-pro.log ps-protocol hammer-glibc23-5.0-standard.log ps-protocol Same, but also lines 31-37: hammer-glibc23-5.0-cluster.log normal hammer-glibc23-5.0-debug.log normal hammer-glibc23-5.0-pro.log normal hammer-glibc23-5.0-standard.log normal ------------------------------------------------------- *** r/rpl_until.result --- r/rpl_until.reject *************** *** 31,37 4 show slave status; ..... *************** *** 55,61 start slave until master_log_file='master-bin.000001', master_log_pos=776; show slave status; ..... rhas3-x86-glibc23-5.0-max.log normal
[10 Aug 2005 10:11]
Joerg Bruehe
In a new build (without "yassl"), it reoccurred, but in different test runs ("normal" vs. "PS protocol"), so it seems not exactly reproducible. Race condition?
[27 Aug 2005 20:53]
Alexander Pachev
All replication tests that have a sleep in the test with a fixed number of seconds sooner or later will have a similar problem. In order to fix them, we need to add some extra features to mysqltest.c and the server itself. For example, we have no way of knowing exactly when the slave I/O or SQL thread would hit some kind of an error that we are trying to simulate. At this time, all we can do is sleep and hope the problem happened soon enough. But once in a while you hit a peculiar scheduler or thread implementation where it does not happen as soon as you expected. So in short, to avoid those problem, additional features are required in the test suite and in the server.
[25 Sep 2005 2:31]
Alexander Pachev
Needs a new feature in mysqltest to be fixed.
[16 Nov 2005 17:23]
Matthias Leich
There are three testcases which sometimes suffer from the fact that a "SHOW SLAVE STATUS" comes too early rpl_relayrotate : line 61 rpl_until : line 53 rpl_deadlock : line 107 If this is the case we get for the show result column Slave_IO_Running the value "No" instead of the expected "Yes". My current experience is, that an extreme fast IO system increases the probability of getting this problem.
[17 Nov 2005 20:29]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/32379
[3 Dec 2005 14:56]
Guilhem Bichot
There may be a mix of different issues in this bug report. I see that at least for rpl_relayrotate and rpl_until, the difference between the expected result and the observed result is only in "Slave_IO_Running". It's understandable that one cannot be sure if it will be Yes or No, as the synchronization done in the tests, MASTER_POS_WAIT() or START SLAVE UNTIL, applies only to the slave SQL thread. So this difference can simply be avoided by replacing column 11 by "#" in SHOW SLAVE STATUS. Note that this solution is NOT a hack, not a simple problem masking. It is a real reasoning. In the test we don't care if the slave I/O thread is running, we don't need it, only the slave SQL thread is of interest. So while it's ok to implement a solution which waits until the slave I/O thread has the "Yes" state (see Matthias' patch), it was not strictly neededin order to fix this "bug".
[3 Dec 2005 15:07]
Guilhem Bichot
Even more, we just don't need SHOW SLAVE STATUS in rpl_relayrotate. The fact that the previous SELECT returns 8000 is all that's important. So one can remove SHOW SLAVE STATUS from the test and re-enable this test.
[5 Dec 2005 16:04]
Matthias Leich
Snip of an answer of Guilhem given to me: > You wrote that at least for rpl_relayrotate and rpl_until, > the difference between the expected result and the observed > result is only in "Slave_IO_Running". > > This is also valid for rpl_deadlock > which failed with mysql-5.1.1-alpha builds 05/08/25 > on "cane" (i386-unknown-freebsd4.7) debug normal > > *** r/rpl_deadlock.result Thu Aug 25 17:17:09 2005 > 09924: --- r/rpl_deadlock.reject Sat Aug 27 00:03:56 2005 > 09925: *************** > 09926: *** 83,87 **** > 09927: 22 > 09928: show slave status; > 09929: Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master > 09930: ! # 127.0.0.1 root MASTER_MYPORT 1 master-bin.000001 18911 # # master-bin.000001 Yes Yes 0 0 18911 # None 0 No # > 09931: drop table t1,t2,t3,t4; > 09932: --- 83,87 ---- > 09933: 22 > 09934: show slave status; > 09935: Slave_IO_State Master_Host Master_User Master_Port Connect_Retry Master_Log_File Read_Master_Log_Pos Relay_Log_File Relay_Log_Pos Relay_Master_Log_File Slave_IO_Running Slave_SQL_Running Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno Last_Error Skip_Counter Exec_Master_Log_Pos Relay_Log_Space Until_Condition Until_Log_File Until_Log_Pos Master_SSL_Allowed Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master > 09936: ! # 127.0.0.1 root MASTER_MYPORT 1 master-bin.000001 18911 # # master-bin.000001 No Yes 0 0 18911 # None 0 No # > 09937: drop table t1,t2,t3,t4; > 09938: ------------------------------------------------------- > > If > - the value for "Slave_IO_Running" does not matter right > and > - it can be expected that the other values do not change > between executing the previous command > select * from t2; and > the show slave status right again. The values which are "random" are: 1) slave_io_running 2) relay_log_file 3) relay_log_pos 4) relay_log_space 1) you know why 2)3)4) is because depending on timing, the slave may disconnect and reconnect, and every time it reconnects the relay log grows by circa 48 bytes. So 3) (=offset where the slave SQL thread is in the relay log), 2) (=name of that relay log, if relay log has grown this may trigger a relay log rotation creating a new relay log), 4) (relay log grows so occupied space grows too). 2) usually is not random (so it is not replaced by # in all results), but in theory it is. > we could also mangle the Slave_IO_Running column with > replace_result. > > What do you think ? I think this would be a good idea.
[5 Dec 2005 17:16]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/33039
[12 Dec 2005 16:52]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/33214
[12 Dec 2005 17:02]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/33215
[12 Dec 2005 18:17]
Joerg Bruehe
These patches of today affect test "rpl_until": 1) The original change is backported to 4.1, and 2) it is extended to 2 other places in that test (the status is checked 3 times).
[10 Apr 2006 12:55]
Lars Thalmann
The bug is fixed, but the test cases are still disabled in 4.1.
[11 Apr 2006 19:40]
Matthias Leich
I discovered today the same instability in MySQL 4.1 rpl_deadlock.test line 85.
[12 Apr 2006 15:00]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/4867
[18 Apr 2006 10:57]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/5074
[26 Apr 2006 15:39]
Matthias Leich
Got approval for bug fix via email Fix pushed to 4.1.19 No changelog entry needed