Bug #37718 | rpl.rpl_stm_mystery22 fails sporadically on pushbuild | ||
---|---|---|---|
Submitted: | 28 Jun 2008 11:48 | Modified: | 30 Mar 2009 12:24 |
Reporter: | Sven Sandberg | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | Tests: Replication | Severity: | S7 (Test Cases) |
Version: | 5.1 | OS: | Any |
Assigned to: | Serge Kozlov | CPU Architecture: | Any |
Tags: | pushbuild, rpl.rpl_stm_mystery22, sporadic, start slave, test failure, wait_for_slave_to_start |
[28 Jun 2008 11:48]
Sven Sandberg
[28 Jun 2008 11:55]
Sven Sandberg
BTW, if nobody objects, I'll rename the test to rpl_deadlock and write proper comments.
[3 Jul 2008 15:34]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/48967 2682 Sven Sandberg 2008-07-03 BUG#37718: rpl.rpl_stm_mystery22 failed on pushbuild rpl_stm_mystery22 failed because 'START SLAVE' was not followed by 'source include/wait_for_slave_to_start.inc'. Fixed by adding include/start_slave.inc and sourcing that file. Also commented the test and removed unnecessary code. Also, the test contains no mystery and is not related to the number 22, so it has been renamed to rpl_conflicts.
[14 Jul 2008 11:49]
Sven Sandberg
xref shows many failures: http://tinyurl.com/6mmu3v
[2 Dec 2008 19:55]
Sven Sandberg
See also BUG#41186
[1 Jan 2009 18:56]
Sven Sandberg
The formulation of the error message has changed in rpl trees, and line numbers have changed, hence the xref above does not work any more. Also, the xref search is sensitive to the combination under which the test was run. The following xrefs list all failures: http://tinyurl.com/9mx42b (rpl.rpl_stm_mystery22) http://tinyurl.com/a7zshw (rpl.rpl_stm_mystery22 'stmt') http://tinyurl.com/9owjq2 (rpl.rpl_stm_mystery22 'row')
[2 Jan 2009 17:06]
Sven Sandberg
==== Background ==== When the slave SQL thread receives an error, it first sets Slave_SQL_Error (by calling rli->report()), then does some cleanup, and then sets Slave_SQL_Running=No (by setting 'rli->slave_running = 0'). Hence, there is a short window of time when Slave_SQL_Error != '' and Slave_SQL_Running = Yes. ==== The failure ==== The test case does: (1) produce error in SQL thread (2) source include/wait_for_slave_sql_error.inc; (3) remove condition for error in SQL thread (4) disable_warnings; START SLAVE; enable_warnings; (5) source include/wait_for_slave_sql_to_start.inc; (6) sync_with_master; The failure happens as follows: After (1), the SQL thread sets Last_SQL_Error. The test proceeds to (2) and (3) without reaching the point where the SQL thread sets Slave_SQL_Running=No. When (4) is reached, it looks like the slave is still running (though it is actually just about to stop), hence the START SLAVE command fails. No warning is printed since warnings are disabled. Then two things can happen: - The SQL thread stops between (4) and (5). Then (5) fails. This is BUG#41186. - The SQL thread stops between (5) and (6), or during (6). Then (6) fails. This is the present bug. ==== Solution ==== A quick fix is to fix the tests. I think the most robust fix is to make wait_for_slave_sql_error.inc wait until the slave has stopped completely. The test should also do "START SLAVE SQL_THREAD" instead of "disable_warnings; START SLAVE; enable_warnings". (Warnings were disabled because the IO thread is running, so the IO thread will generate a warning. It is better to just start the SQL thread and not suppress any warnings.) This may also be a server bug. We may consider changing the server so that it guarantees: (1) "Last_SQL_Error != ''" implies "Slave_SQL_Running = No" and also: (2) "Last_IO_Error != ''" implies "Slave_IO_Running = No" (If this is not a server bug, it should at least be documented.)
[5 Jan 2009 15:35]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62487 2709 Sven Sandberg 2009-01-05 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: make wait_for_slave_sql_error wait for the slave to stop. Updated wait_for_slave_sql_error_and_skip accordingly. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign.
[7 Jan 2009 8:05]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62569 2709 Sven Sandberg 2009-01-07 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Remove parameter $slave_keep_connection by $master_connection.
[8 Jan 2009 10:36]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62673 2709 Sven Sandberg 2009-01-08 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Replace parameter $slave_keep_connection by $master_connection.
[8 Jan 2009 10:40]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62674 2709 Sven Sandberg 2009-01-08 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Replace parameter $slave_keep_connection by $master_connection.
[8 Jan 2009 16:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62712 2709 Sven Sandberg 2009-01-08 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Replace parameter $slave_keep_connection by $master_connection.
[8 Jan 2009 17:02]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62721 2709 Sven Sandberg 2009-01-08 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Replace parameter $slave_keep_connection by $master_connection.
[9 Jan 2009 14:10]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62836 2712 Sven Sandberg 2009-01-09 BUG#37718: rpl.rpl_stm_mystery22 fails sporadically on pushbuild Problem 1: The test waits for an error in the slave sql thread, then resolves the error and issues 'start slave'. However, there is a gap between when the error is reported and the slave sql thread stops. If this gap was long, the slave would still be running when 'start slave' happened, so 'start slave' would fail and cause a test failure. Fix 1: Made wait_for_slave_sql_error wait for the slave to stop instead of wait for error in the IO thread. After stopping, the error code is verified. If the error code is wrong, debug info is printed. To print debug info, the debug printing code in wait_for_slave_param.inc was moved out to a new file, show_rpl_debug_info.inc. Problem 2: rpl_stm_mystery22 is a horrible name, the comments in the file didn't explain anything useful, the test was generally hard to follow, and the test was essentially duplicated between rpl_stm_mystery22 and rpl_row_mystery22. Fix 2: The test is about conflicts in the slave SQL thread, hence I renamed the tests to rpl_{stm,row}_conflicts. Refactored the test so that the work is done in extra/rpl_tests/rpl_conflicts.inc, and rpl.rpl_{row,stm}_conflicts merely sets some variables and then sourced extra/rpl_tests/rpl_conflicts.inc. The tests have been rewritten and comments added. Problem 3: When calling wait_for_slave_sql_error.inc, you always want to verify that the sql thread stops because of the expected error and not because of some other error. Currently, wait_for_slave_sql_error.inc allows the caller to omit the error code, in which case all error codes are accepted. Fix 3: Made wait_for_slave_sql_error.inc fail if no error code is given. Updated rpl_filter_tables_not_exist accordingly. Problem 4: rpl_filter_tables_not_exist had a typo, the dollar sign was missing in a 'let' statement. Fix 4: Added dollar sign. Problem 5: When replicating from other servers than the one named 'master', the wait_for_slave_* macros were unable to print debug info on the master. Fix 5: Replace parameter $slave_keep_connection by $master_connection.
[9 Jan 2009 15:25]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/62844 2782 Sven Sandberg 2009-01-09 [merge] merged BUG#37718 from 5.1-rpl to 6.0-rpl
[30 Jan 2009 13:30]
Bugs System
Pushed into 6.0.10-alpha (revid:luis.soares@sun.com-20090129165607-wiskabxm948yx463) (version source revid:luis.soares@sun.com-20090129163120-e2ntks4wgpqde6zt) (merge vers: 6.0.10-alpha) (pib:6)
[30 Jan 2009 15:10]
Bugs System
Pushed into 5.1.32 (revid:luis.soares@sun.com-20090129165946-d6jnnfqfokuzr09y) (version source revid:sven@mysql.com-20090109141231-85894um6szphnbxy) (merge vers: 5.1.31) (pib:6)
[17 Feb 2009 14:58]
Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:tomas.ulin@sun.com-20090217131017-6u8qz1edkjfiobef) (version source revid:tomas.ulin@sun.com-20090203133556-9rclp06ol19bmzs4) (merge vers: 5.1.32-ndb-6.3.22) (pib:6)
[17 Feb 2009 16:46]
Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:tomas.ulin@sun.com-20090217134419-5ha6xg4dpedrbmau) (version source revid:tomas.ulin@sun.com-20090203133556-9rclp06ol19bmzs4) (merge vers: 5.1.32-ndb-6.3.22) (pib:6)
[17 Feb 2009 18:22]
Bugs System
Pushed into 5.1.32-ndb-6.2.17 (revid:tomas.ulin@sun.com-20090217134216-5699eq74ws4oxa0j) (version source revid:tomas.ulin@sun.com-20090201210519-vehobc4sy3g9s38e) (merge vers: 5.1.32-ndb-6.2.17) (pib:6)
[30 Mar 2009 12:24]
Jon Stephens
Testing failures only; no end-user changes to document. No further further action required. Closed.