Bug #97734 Document the correct method to stop slaving with MTS without a warning or error
Submitted: 21 Nov 2019 14:19 Modified: 21 Nov 2019 17:55
Reporter: Buchan Milne Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:5.7.27 OS:Linux
Assigned to: CPU Architecture:x86 (x86_64)

[21 Nov 2019 14:19] Buchan Milne
Description:
We are running 5.7.27 with MTS. We have 'delayed-slaves' which measure slaving rate, where we `STOP SLAVE SQL_THREAD`, wait for a period, `START SLAVE SQL_THREAD`, and measure the time it takes for Seconds_Behind_Master to return to 0.

We also have log monitoring for Warning or ERROR lines in our mysqld logs, also specifically on these hosts, to identify bugs in our continously-deploying applications.

Since we enabled parallel replication, we have tried a few changes to our delayed-slave script to avoid these errors and warnings, but we have not been able to eliminate them. 

The documentation indicates the problems with using `STOP SLAVE` while @@slave_parallel_workers > 1, but there doesn't seem to be a documented way to stop slaving without triggering one of the following warnings or errors:

[Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0

[ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

As far as I understand the documentation, from a slave running with @@slave_parallel_workers>1, I should:
1. SET GLOBAL @@slave_parallel_workers=0
2. START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2.1 Wait until Until_Condition becomes None or Slave_SQL_Running becomes No?
3. STOP SLAVE?

However:
* I never see Until_Condition become SQL_AFTER_MTS_GAPS as indicated in https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html

* I never see Slave_SQL_Running change to No after START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS, contrary to the documentation for START SLAVE which says:

https://dev.mysql.com/doc/refman/5.7/en/start-slave.html
> Use the START SLAVE UNTIL SQL_AFTER_MTS_GAPS statement to cause a multithreaded slave's worker threads to only run until no more gaps are found in the relay log, and then to stop.

Does this mean that this is an equivalent to STOP SLAVE, or to START SLAVE, or neither?

* No matter what I do, even if I have run (1) and (2), (3) still often logs either the warning above, or the error above.

How to repeat:
Setup a slaving relationship with MTS ( @@slave_parallel_workers = 4 or similar) with log_warnings=1

Try and mechanmism to stop slaving, without logging an eror:

* SET GLOBAL @@slave_parallel_workers=0; START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS;

Slaving never stops, ( Slave_SQL_Running remains 'Yes' and Seconds_Behind_Master remains 0).

Until_Condition never becomes SQL_AFTER_MTS_GAPS

* SET GLOBAL @@slave_parallel_workers=0; START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS; STOP SLAVE SQL_THREAD

New log entries are logged:

[Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0

Here are some of the logs from my current script:

2019-11-21 16:00:39,631 [DEBUG] delayed_slave_command._cycle_slave_delay:248 - Starting slave with command: START SLAVE SQL_THREAD
2019-11-21 16:01:09,590 [DEBUG] delayed_slave_command._cycle_slave_delay:273 - Caught up 60s in 29958ms (rate: 2.002804 replicated seconds per second, factor: 0.4993)
Caught up 60s in 29958ms (rate: 2.002804 replicated seconds per second, factor: 0.4993)
2019-11-21 16:01:09,593 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:09,596 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2019-11-21 16:01:09,599 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,599 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:09,600 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
2019-11-21 16:01:09,603 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:09,603 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:01:10,107 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:10,611 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:11,117 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:11,623 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:12,129 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:12,635 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:13,141 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:13,648 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:14,154 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:14,660 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:15,166 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:01:15,167 [DEBUG] delayed_slave_command._cycle_slave_delay:237 - Waited too long for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
2019-11-21 16:01:15,173 [DEBUG] delayed_slave_command._cycle_slave_delay:240 - Slave_SQL_Running: No, Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details)., Until_Condition: None
2019-11-21 16:01:15,174 [DEBUG] delayed_slave_command._cycle_slave_delay:244 - SQL Thread has stopped, waiting 60 seconds

mysqld log:

2019-11-21T14:01:15.167916Z 9696269 [ERROR] Slave SQL for channel '': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

(for convenience, my script checks @@slave_parallel_workers, and if it is greater than 0, saves the current value, sets it to 0, tries to stop slaving correctly, and then sets it back so that when START SLAVE runs it has been set back correctly. If changing this to only set @@slave_parallel_workers>0 just before START SLAVE would make a difference, I will do it, but it is a but more work, and the documentation seems to indicate this should not make a difference)

Another case:

2019-11-21 16:12:15,884 [DEBUG] delayed_slave_command._cycle_slave_delay:273 - Caught up 60s in 16815ms (rate: 3.568243 replicated seconds per second, factor:
 0.28025)
Caught up 60s in 16815ms (rate: 3.568243 replicated seconds per second, factor: 0.28025)
2019-11-21 16:12:15,887 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=0
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:15,890 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): START SLAVE SQL_THREAD UNTIL SQL_AFTER_MTS_GAPS
2019-11-21 16:12:15,893 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,894 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:15,894 [DEBUG] delayed_slave_command._cycle_slave_delay:224 - Stopping slave with command(s): SET GLOBAL slave_parallel_workers=10
2019-11-21 16:12:15,897 [DEBUG] delayed_slave_command._cycle_slave_delay:227 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:15,897 [DEBUG] delayed_slave_command._cycle_slave_delay:228 - Waiting for SQL thread to stop
2019-11-21 16:12:16,400 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:16,904 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:17,408 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:17,911 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:18,415 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:18,919 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:19,422 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:19,926 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:20,430 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:20,933 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,437 [DEBUG] delayed_slave_command._cycle_slave_delay:234 - Slave_SQL_Running: Yes, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,437 [DEBUG] delayed_slave_command._cycle_slave_delay:237 - Waited too long for Slave_SQL_Running!=Yes, stopping slaving with STOP SLAVE SQL_THREAD
2019-11-21 16:12:21,442 [DEBUG] delayed_slave_command._cycle_slave_delay:240 - Slave_SQL_Running: No, Last_SQL_Error: , Until_Condition: None
2019-11-21 16:12:21,442 [DEBUG] delayed_slave_command._cycle_slave_delay:244 - SQL Thread has stopped, waiting 60 seconds

mysqld logged this at the time ( OS time is UTC+2):

2019-11-21T14:12:21.438567Z 9709979 [Warning] Slave SQL for channel '': Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2019-11-21T14:12:21.439416Z 9709979 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'cmdb.002705' position 444969641.

Suggested fix:
Update the documentation to reflect how to stop an MTS slave without generating either an error or warning message, or convert this bug to a software bug.
[21 Nov 2019 17:55] Bogdan Kecman
Thanks for your request. The documentation can always be improved, so thanks for your suggestions. With regards to "how to stop without having even a warning" - this is not always possible.

all best
[18 Jan 21:28] Jean-François Gagné
Hi all,

Using slave_preserve_commit_order = ON is probably a good workaround to avoid the warning in the error log.

Cheers, JFG