Bug #97734 | Document the correct method to stop slaving with MTS without a warning or error | ||
---|---|---|---|
Submitted: | 21 Nov 2019 14:19 | Modified: | 5 Mar 2020 7:08 |
Reporter: | Buchan Milne | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Documentation | Severity: | S3 (Non-critical) |
Version: | 5.7.27 | OS: | Linux |
Assigned to: | CPU Architecture: | x86 (x86_64) |
[21 Nov 2019 14:19]
Buchan Milne
[21 Nov 2019 17:55]
MySQL Verification Team
Thanks for your request. The documentation can always be improved, so thanks for your suggestions. With regards to "how to stop without having even a warning" - this is not always possible. all best
[18 Jan 2020 21:28]
Jean-François Gagné
Hi all, Using slave_preserve_commit_order = ON is probably a good workaround to avoid the warning in the error log. Cheers, JFG
[5 Mar 2020 6:36]
Buchan Milne
Graph of impact of slave_preserve_commit_order=1 on catchup-rate and slave_sql_warnings
Attachment: catchup_rate_vs_slave_sql_warnings_after_preserve_commit_order.png (image/png, text), 96.35 KiB.
[5 Mar 2020 7:04]
Buchan Milne
> Using slave_preserve_commit_order = ON is probably a good workaround to avoid the warning in the error log. On one of the servers where I have seen this the most (a delayed slave, which measures catch-up rate by stopping slaving 1 minute, starting slaving until it catches up, repeating), I set slave_preserve_commit_order about 2 weeks ago. This doesn't remove the warnings, but does reduce the frequency. See the attached graphs. I saw an occurrence while running the script that does this with debugging enabled: 2020-03-05T06:53:08.601433Z 14597349 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0 2020-03-05T06:53:08.601709Z 14597349 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db.018376' position 72638354 Here are some debug logs from the script we use (logged in UTC+2): [2020-03-05 08:53:06,571] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0 [2020-03-05 08:53:06,574] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:06,575] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS [2020-03-05 08:53:06,578] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:06,578] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10 [2020-03-05 08:53:06,581] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:06,584] [DEBUG] delayed_slave_command._cycle_slave_delay:238 - Slave_SQL_Running is still Yes, waiting up to 4 times for 0.500000 s for it to change [2020-03-05 08:53:07,087] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:07,591] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:08,095] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:08,598] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:08,599] [DEBUG] delayed_slave_command._cycle_slave_delay:245 - Gave up waiting for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD [2020-03-05 08:53:08,604] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: No, Last_SQL_Error: , Until_Condition: None [2020-03-05 08:53:08,604] [DEBUG] delayed_slave_command._cycle_slave_delay:254 - SQL Thread has stopped, waiting 60 seconds [2020-03-05 08:54:08,622] [DEBUG] delayed_slave_command._cycle_slave_delay:257 - Starting slave with command: START SLAVE SQL_THREAD [2020-03-05 08:54:17,173] [DEBUG] delayed_slave_command._cycle_slave_delay:282 - Caught up 60s in 8551ms (rate: 7.016723 replicated seconds per second, factor: 0.142517) Caught up 60s in 8551ms (rate: 7.016723 replicated seconds per second, factor: 0.142517) The script stops slaving by running the commands as follows: SET GLOBAL slave_parallel_workers=0 START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS SET GLOBAL slave_parallel_workers=10 (wait until Slave_SQL_Running!=Yes or 2s have passed, in which case run STOP SLAVE SQL_THREAD) (wait 60s) START SLAVE SQL_THREAD I can't see an example if Slave_SQL_Running being No before we run STOP SLAVE SQL_THREAD. (We could change it around to: SET GLOBAL slave_parallel_workers=0 START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS (wait until Slave_SQL_Running!=Yes or 2s have passed, in which case run STOP SLAVE SQL_THREAD) (sleep 60s) SET GLOBAL slave_parallel_workers=10 START SLAVE SQL_THREAD However, if the script dies or is restarted in the 60s it is sleeping, we would lose the current value of slave_parallel_workers, which is why we haven't done it that way ) Am I misunderstanding the documentation/behaviour/intention here?
[5 Mar 2020 7:08]
Buchan Milne
I just noticed one where we caught the other error we see as Last_SQL_Error in the scripts logs: [2020-03-05 09:05:32,253] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0 [2020-03-05 09:05:32,256] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:32,256] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS [2020-03-05 09:05:32,259] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:32,259] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10 [2020-03-05 09:05:32,262] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:32,265] [DEBUG] delayed_slave_command._cycle_slave_delay:238 - Slave_SQL_Running is still Yes, waiting up to 4 times for 0.500000 s for it to change [2020-03-05 09:05:32,768] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:33,272] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:33,775] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:34,279] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None [2020-03-05 09:05:34,279] [DEBUG] delayed_slave_command._cycle_slave_delay:245 - Gave up waiting for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD [2020-03-05 09:05:34,283] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: No, Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details)., Until_Condition: None [2020-03-05 09:05:34,284] [INFO ] delayed_slave_command._cycle_slave_delay:250 - 'STOP SLAVE' resulted in SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). [2020-03-05 09:05:34,284] [DEBUG] delayed_slave_command._cycle_slave_delay:254 - SQL Thread has stopped, waiting 60 seconds [2020-03-05 09:06:34,307] [DEBUG] delayed_slave_command._cycle_slave_delay:257 - Starting slave with command: START SLAVE SQL_THREAD