MySQL Bugs: #97734: Document the correct method to stop slaving with MTS without a warning or error

Bug #97734	Document the correct method to stop slaving with MTS without a warning or error
Submitted:	21 Nov 2019 14:19	Modified:	5 Mar 2020 7:08
Reporter:	Buchan Milne	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	5.7.27	OS:	Linux
Assigned to:		CPU Architecture:	x86 (x86_64)

Description:
We are running 5.7.27 with MTS. We have 'delayed-slaves' which measure slaving rate, where we `STOP SLAVE SQL_THREAD`, wait for a period, `START SLAVE SQL_THREAD`, and measure the time it takes for Seconds_Behind_Master to return to 0.

We also have log monitoring for Warning or ERROR lines in our mysqld logs, also specifically on these hosts, to identify bugs in our continously-deploying applications.

Since we enabled parallel replication, we have tried a few changes to our delayed-slave script to avoid these errors and warnings, but we have not been able to eliminate them. 

The documentation indicates the problems with using `STOP SLAVE` while @@slave_parallel_workers > 1, but there doesn't seem to be a documented way to stop slaving without triggering one of the following warnings or errors:

[Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0

[ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

As far as I understand the documentation, from a slave running with @@slave_parallel_workers>1, I should:
1. SET GLOBAL @@slave_parallel_workers=0
2. START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2.1 Wait until Until_Condition becomes None or Slave_SQL_Running becomes No?
3. STOP SLAVE?

However:
* I never see Until_Condition become SQL_AFTER_MTS_GAPS as indicated in https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html

* I never see Slave_SQL_Running change to No after START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS, contrary to the documentation for START SLAVE which says:

https://dev.mysql.com/doc/refman/5.7/en/start-slave.html
> Use the START SLAVE UNTIL SQL_AFTER_MTS_GAPS statement to cause a multithreaded slave's worker threads to only run until no more gaps are found in the relay log, and then to stop.

Does this mean that this is an equivalent to STOP SLAVE, or to START SLAVE, or neither?

* No matter what I do, even if I have run (1) and (2), (3) still often logs either the warning above, or the error above.

How to repeat:
Setup a slaving relationship with MTS ( @@slave_parallel_workers = 4 or similar) with log_warnings=1

Try and mechanmism to stop slaving, without logging an eror:

* SET GLOBAL @@slave_parallel_workers=0; START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS;

Slaving never stops, ( Slave_SQL_Running remains 'Yes' and Seconds_Behind_Master remains 0).

Until_Condition never becomes SQL_AFTER_MTS_GAPS

* SET GLOBAL @@slave_parallel_workers=0; START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS; STOP SLAVE SQL_THREAD

New log entries are logged:

[Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0

Here are some of the logs from my current script:

2019-11-21 16:00:39,631 [DEBUG] delayed_slave_command._cycle_slave_delay:248 - Starting slave with command: START SLAVE SQL_THREAD
2019-11-21 16:01:09,590 [DEBUG] delayed_slave_command._cycle_slave_delay:273 - Caught up 60s in 29958ms (rate: 2.002804 replicated seconds per second, factor: 0.4993)
Caught up 60s in 29958ms (rate: 2.002804 replicated seconds per second, factor: 0.4993)
2019-11-21 16:01:09,593 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2019-11-21 16:01:09,599 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,599 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:09,600 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
2019-11-21 16:01:09,603 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,603 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:10,107 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:10,611 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:11,117 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:11,623 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:12,129 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:12,635 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:13,141 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:13,648 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:14,154 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:14,660 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:15,166 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:15,167 [DEBUG] delayed_slave_command._cycle_slave_delay:237 - Waited too long for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
2019-11-21 16:01:15,173 [DEBUG] delayed_slave_command._cycle_slave_delay:240 - Slave_SQL_Running: No, Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details)., Until_Condition: None
2019-11-21 16:01:15,174 [DEBUG] delayed_slave_command._cycle_slave_delay:244 - SQL Thread has stopped, waiting 60 seconds

mysqld log:

2019-11-21T14:01:15.167916Z 9696269 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

(for convenience, my script checks @@slave_parallel_workers, and if it is greater than 0, saves the current value, sets it to 0, tries to stop slaving correctly, and then sets it back so that when START SLAVE runs it has been set back correctly. If changing this to only set @@slave_parallel_workers>0 just before START SLAVE would make a difference, I will do it, but it is a but more work, and the documentation seems to indicate this should not make a difference)

Another case:

2019-11-21 16:12:15,884 [DEBUG] delayed_slave_command._cycle_slave_delay:273 - Caught up 60s in 16815ms (rate: 3.568243 replicated seconds per second, factor:
 0.28025)
Caught up 60s in 16815ms (rate: 3.568243 replicated seconds per second, factor: 0.28025)
2019-11-21 16:12:15,887 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2019-11-21 16:12:15,893 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,894 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:15,894 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
2019-11-21 16:12:15,897 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,897 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:16,400 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:16,904 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:17,408 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:17,911 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:18,415 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:18,919 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:19,422 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:19,926 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:20,430 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:20,933 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,437 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,437 [DEBUG] delayed_slave_command._cycle_slave_delay:237 - Waited too long for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
2019-11-21 16:12:21,442 [DEBUG] delayed_slave_command._cycle_slave_delay:240 - Slave_SQL_Running: No, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,442 [DEBUG] delayed_slave_command._cycle_slave_delay:244 - SQL Thread has stopped, waiting 60 seconds

mysqld logged this at the time ( OS time is UTC+2):

2019-11-21T14:12:21.438567Z 9709979 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2019-11-21T14:12:21.439416Z 9709979 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'cmdb.002705' position 444969641.

Suggested fix:
Update the documentation to reflect how to stop an MTS slave without generating either an error or warning message, or convert this bug to a software bug.

Thanks for your request. The documentation can always be improved, so thanks for your suggestions. With regards to "how to stop without having even a warning" - this is not always possible.

all best

Hi all,

Using slave_preserve_commit_order = ON is probably a good workaround to avoid the warning in the error log.

Cheers, JFG

Graph of impact of slave_preserve_commit_order=1 on catchup-rate and slave_sql_warnings

Attachment: catchup_rate_vs_slave_sql_warnings_after_preserve_commit_order.png (image/png, text), 96.35 KiB.

> Using slave_preserve_commit_order = ON is probably a good workaround to avoid the warning in the error log.

On one of the servers where I have seen this the most (a delayed slave, which measures catch-up rate by stopping slaving 1 minute, starting slaving until it catches up, repeating), I set slave_preserve_commit_order about 2 weeks ago.

This doesn't remove the warnings, but does reduce the frequency. See the attached graphs.

I saw an occurrence while running the script that does this with debugging enabled:

2020-03-05T06:53:08.601433Z 14597349 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2020-03-05T06:53:08.601709Z 14597349 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db.018376' position 72638354

Here are some debug logs from the script we use (logged in UTC+2):

[2020-03-05 08:53:06,571] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
[2020-03-05 08:53:06,574] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:06,575] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
[2020-03-05 08:53:06,578] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:06,578] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
[2020-03-05 08:53:06,581] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:06,584] [DEBUG] delayed_slave_command._cycle_slave_delay:238 - Slave_SQL_Running is still Yes, waiting up to 4 times for 0.500000 s for it to change
[2020-03-05 08:53:07,087] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:07,591] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:08,095] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:08,598] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:08,599] [DEBUG] delayed_slave_command._cycle_slave_delay:245 - Gave up waiting for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
[2020-03-05 08:53:08,604] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: No, Last_SQL_Error: , Until_Condition: None
[2020-03-05 08:53:08,604] [DEBUG] delayed_slave_command._cycle_slave_delay:254 - SQL Thread has stopped, waiting 60 seconds
[2020-03-05 08:54:08,622] [DEBUG] delayed_slave_command._cycle_slave_delay:257 - Starting slave with command: START SLAVE SQL_THREAD
[2020-03-05 08:54:17,173] [DEBUG] delayed_slave_command._cycle_slave_delay:282 - Caught up 60s in 8551ms (rate: 7.016723 replicated seconds per second, factor: 0.142517)
Caught up 60s in 8551ms (rate: 7.016723 replicated seconds per second, factor: 0.142517)

The script stops slaving by running the commands as follows:

SET GLOBAL slave_parallel_workers=0
START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
SET GLOBAL slave_parallel_workers=10
(wait until Slave_SQL_Running!=Yes or 2s have passed, in which case run STOP SLAVE SQL_THREAD)
(wait 60s)
START SLAVE SQL_THREAD

I can't see an example if Slave_SQL_Running being No before we run STOP SLAVE SQL_THREAD.

(We could change it around to:

SET GLOBAL slave_parallel_workers=0
START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
(wait until Slave_SQL_Running!=Yes or 2s have passed, in which case run STOP SLAVE SQL_THREAD)
(sleep 60s)
SET GLOBAL slave_parallel_workers=10
START SLAVE SQL_THREAD

However, if the script dies or is restarted in the 60s it is sleeping, we would lose the current value of slave_parallel_workers, which is why we haven't done it that way
)

Am I misunderstanding the documentation/behaviour/intention here?

I just noticed one where we caught the other error we see as Last_SQL_Error in the scripts logs:

[2020-03-05 09:05:32,253] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
[2020-03-05 09:05:32,256] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:32,256] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
[2020-03-05 09:05:32,259] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:32,259] [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
[2020-03-05 09:05:32,262] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:32,265] [DEBUG] delayed_slave_command._cycle_slave_delay:238 - Slave_SQL_Running is still Yes, waiting up to 4 times for 0.500000 s for it to change
[2020-03-05 09:05:32,768] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:33,272] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:33,775] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:34,279] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
[2020-03-05 09:05:34,279] [DEBUG] delayed_slave_command._cycle_slave_delay:245 - Gave up waiting for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
[2020-03-05 09:05:34,283] [DEBUG] delayed_slave_command._log_slave_status:171 - Slave_SQL_Running: No, Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details)., Until_Condition: None
[2020-03-05 09:05:34,284] [INFO ] delayed_slave_command._cycle_slave_delay:250 - 'STOP SLAVE' resulted in SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
[2020-03-05 09:05:34,284] [DEBUG] delayed_slave_command._cycle_slave_delay:254 - SQL Thread has stopped, waiting 60 seconds
[2020-03-05 09:06:34,307] [DEBUG] delayed_slave_command._cycle_slave_delay:257 - Starting slave with command: START SLAVE SQL_THREAD