MySQL Bugs: #86171: master_pos_wait incompatible with MTS

Bug #86171	master_pos_wait incompatible with MTS
Submitted:	4 May 2017 1:53	Modified:	2 Nov 2017 13:49
Reporter:	Trey Raymond	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	replication

Description:
https://dev.mysql.com/doc/refman/5.6/en/miscellaneous-functions.html#function_master-pos-w...

This function is not compatible with MTS, as it only looks for the synced position, vs the actual executed position.
It should use a live source for data if MTS is enabled - table mysql.slave_worker_info, or equivalent in memory.

observed on 5.6.34, have not yet tested on a 5.7 setup (spoiled on gtids with all recent systems)

How to repeat:
make sure the slave's set up so checkpointing isn't rapid, you want to be able to see the issue.  set slave_checkpoint_group to a sensible but significant value (32K is good, ~60K is the max practical with the current implementation), and slave_checkpoint_period arbitrarily high.  start a slave with nonzero slave_parallel_workers.
apply a light write load to the master, just enough to be visibly advancing, but not enough to trigger checkpointing quickly.

now, check the checkpoint pos and the actual executed pos (master_log_pos and checkpoint_master_log_pos in slave_worker_info, checkpoint is equivalent to what's shown in show slave status or slave_relay_log_info) - then call master_pos_wait for a value in between them.

expected result would be to return immediately with result 0, as it was called for a position that was already executed.  However, it will actually hang until timeout or a checkpoint, whichever comes first.

Suggested fix:
Use the actual exec values from slave_worker_info for this function, when such data is available (with non-mts, current implementation works)

this needs polish, but here's a pretty basic way to get current position that supports mts and non mts:

(
select master_log_name,master_log_pos
from mysql.slave_worker_info w
join (
	select @tid:=@tid+1 as tid,t.PROCESSLIST_TIME
	from performance_schema.threads t
		join (select @tid:=0) tmp
	where t.NAME='thread/sql/slave_worker'
	order by THREAD_ID asc
	) tmp on tmp.tid=w.Id
where tmp.PROCESSLIST_TIME=0
order by master_log_name asc,master_log_pos asc
limit 1
)
union all
(
select master_log_name,master_log_pos
from slave_worker_info
order by master_log_name desc,master_log_pos desc
limit 1
)
union all
(
select master_log_name,master_log_pos
from slave_relay_log_info
)
limit 1

Hi,

If I understand you correctly you are reporting that when you exec master_pos_wait() it's waiting till the checkpoint and not returning "during" checkpoint as soon as event is read?

The manual clearly states "...It blocks until the slave has read and applied all updates up to the specified position...". The update is fully "applied" only after the checkpoint so it's actually behaving as expected.

Or am I misunderstanding your report?

kind regards
Bogdan

It's quite a stretch to say ""The update is fully "applied" only after the checkpoint"" - every thread on the db will consider an update applies after it commits, with no regard to checkpointing.

This isn't a very easy fix, I know.  It's not trivial for a thread to check every worker's execution since the last checkpoint, as it's not ordered or distributed in a predictable way.

Problem is, the function's not very useful without a fix (well, not useful with MTS).  I don't know how much it's used industry wide - we don't depend on it, but others might.
I've seen switchover scripts and such that will call this after setting one host to read_only to wait for the new master to sync.  In this case, it could wait until the max value of slave_checkpoint_period, as stopped writes wouldn't trigger one based on _group, and that might be an enormous amount of time.  Anyone using a script like that should be aware of this.

If fixing the function is impractical, then the documentation should at least be clear on this, specifying that it will wait until there's a checkpoint at or past the specified position, vs just waiting until it's been applied which is often much shorter.

If you do decide to fix, logic would be like:
Check all worker threads that have been active since the last checkpoint, and see if they have all processed up to/including the specified log position.  this data is tracked internally, just needs logic to use it.

Checking just the thread which processed the specified position isn't good enough, as others could be further behind.

Hi,

I do agree documentation here might be sketchy so that should be fixed for sure hence marking the bug as verified. As for the switch over, well those scripts will still work, they will just have to wait few more seconds/minutes longer then they would really have to - but on the other side, before checkpoint it is "unsafe" really to assume anything hence imo this behavior is, for production, and this case of switchover scripting, better and safer, then reporting "as soon" and not "after checkpoint" as it is now.

In any case I agree that at least documentation should be clear on what's going on.

many thanks
Bogdan

Yeah, it's not a good way to do switchovers, and doesn't affect me personally - this issue came up on freenode a while back when someone was looking for help, using some mysql switchover tool that used this method.  it's not great, but people are using it, maybe they'll stop if it's more clearly documented.

Posted by developer:
 
Thanks for the information. Added for 5.6, 5.7, and 8.0 in
https://dev.mysql.com/doc/refman/5.6/en/miscellaneous-functions.html#function_master-pos-w...

On a multi-threaded slave, the function waits until expiry of the limit set by the slave_checkpoint_group or slave_checkpoint_period system variable, when the checkpoint operation is called to update the status of the slave. Depending on the setting for the system variables, the function might therefore return some time after the specified position was reached.