Bug #99006 GTID Synchronisation speed when doing START SLAVE
Submitted: 20 Mar 2020 11:45 Modified: 14 Apr 2020 15:29
Reporter: Simon Mudd (OCA)
Status: Verified
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: 5.7 and 8.0 OS: Any
Assigned to: CPU Architecture: Any
Tags: GTID, replication, slow, sync

[20 Mar 2020 11:45] Simon Mudd
Since moving from file/position to GTIDs, replication management is much simpler. However, under some circumstances which I'm not quite sure of, after restarting a connection on a slave to its upstream master, the time from doing this until you see changes being applied on the slave can be very long and seems excessive.
I would expect it to be quite quick, and usually it happens within a few seconds.

What is confusing is:
* There is no logging of this synchronisation process, so it's often not clear what is going on: nothing on the master, nothing on the slave. All you see is that replication delay keeps increasing.
* I suspect that "scanning the binlogs" to find the right place to start from may take longer in some cases than necessary. The exact process used to find the right binlog to start pulling data from is not documented, so improving the documentation, and improving the process itself if it turns out to be inefficient under certain circumstances, would help a lot.
* This would increase transparency and make it possible to distinguish between connecting to the master, finding the first binlogs to send, and the point when the first entries are actually sent.
* This is often noticed when a significant number (10+) of slaves reconnect at the same time, perhaps being moved from one master or intermediate master to another.

In my case I usually have binlog sizes configured to be 100MB maximum. This came from when I was using pseudo-GTID for orchestrator but there's been no change on the apps. This means that generally binlog sizes are "small" so scanning a binlog should not take long.

This may be related to a large transaction in the binlog which needs to be scanned and passed over. Again, it's not 100% clear.

Why is this important?
* lack of transparency of what is happening / perceived speed is slower than expected / state is not clear as replication does not appear to be progressing
* if you monitor the replication delay, e.g. by injecting changes periodically on a master, you'd expect that after doing START SLAVE the replication delay should start to decrease immediately and with that you know that replication is working.
* it was suggested to me that I should not look at SHOW SLAVE STATUS\G output as that is old and does not have specific details of what the I/O or SQL threads are doing. I've not had time, when I've seen this, to double-check the performance_schema tables to see if more information is visible there, but when I see this again I'll try to do that.
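For reference, the performance_schema replication tables (present in both 5.7 and 8.0) do break out per-thread state that SHOW SLAVE STATUS does not. A hedged sketch of the kind of check meant here (column names as documented in the 5.7/8.0 manuals; exact output depends on version and topology):

```sql
-- Is the I/O thread connected, and what has it received so far?
SELECT CHANNEL_NAME, SERVICE_STATE, RECEIVED_TRANSACTION_SET
  FROM performance_schema.replication_connection_status;

-- Is the applier (SQL thread) side running?
SELECT CHANNEL_NAME, SERVICE_STATE
  FROM performance_schema.replication_applier_status;
```

If RECEIVED_TRANSACTION_SET is not growing while SERVICE_STATE is ON, that would point at the master-side scan rather than the applier.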

An example I saw from a few days ago showed this:

root@myhost [(none)]> start slave;
Query OK, 0 rows affected (3 min 17.74 sec)

That's a long time to start up. What's going on? In other cases I've seen START SLAVE return immediately and still seen this issue.

Connecting a slave that has just been cloned, and may be several hours or days behind the master, does mean that binlogs need to be scanned to find the right place to send data from. If a large number of binlogs are involved then this process may be slow. I don't believe there's any GTID info in the binlog.index file or anywhere else, so the master must scan the files to find the right place to start from. Is this done sequentially backwards, or sequentially from the first binlog? Potentially it could do a binary search between the first and last binlogs to find the right one to scan. This process could be improved for the case where the slave is some way behind the master.
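To illustrate the binary-search idea only (this is not MySQL's documented lookup algorithm): each binlog file starts with a Previous_gtids_log_event describing everything in earlier files, and because that watermark only grows, the file containing the first transaction the slave still needs could be located with O(log n) header reads instead of a linear scan. A toy model, with GTID sets reduced to a single monotone integer:

```python
import bisect

# Hypothetical model: each binlog file's Previous_gtids_log_event is
# reduced to one integer -- the highest transaction number already
# contained in earlier files. Real GTID sets are richer; a monotone
# integer stands in for them in this sketch.
binlog_prev_gtids = [0, 1000, 2000, 3000, 4000]  # binlog.000001 .. .000005


def first_binlog_to_send(slave_executed: int) -> int:
    """Index of the first binlog file that may contain transactions
    the slave has not yet executed.

    A backwards linear scan reads O(n) file headers; since the
    watermarks are sorted, bisect finds the same file in O(log n).
    """
    # Rightmost file whose previous-GTID watermark is <= what the
    # slave has executed: that file may still hold unexecuted events.
    return bisect.bisect_right(binlog_prev_gtids, slave_executed) - 1


print(first_binlog_to_send(2500))  # 2, i.e. binlog.000003
```

Whether the real server scans forwards, backwards, or already does something like this is exactly the undocumented part this report asks about.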

I can provide more information if needed but probably may have to do that privately.

This can be confusing as replication doesn't seem to be working, then after a minute or two things seem fine. So nothing "broken" but just a long pause before changes seem to reach the slave and start to be applied.

How to repeat:
Repeating this is not that easy as sometimes replication restarts quite quickly. I've noticed this on several occasions but am not sure of the exact circumstances under which it happens.

Suggested fix:
* Improved documentation of the process of connecting and what to look for to see if the i/o thread is pulling down data, and the SQL thread is applying it. Maybe a new sys view?
* better visibility once connected, to see if data has been received from the master. I suspect nothing is sent and the i/o thread is just waiting, but I believe you can't see the time the last "packet" or binlog event was sent. It might be good to show that, or include some counters?
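A side note on the counter suggestion: MySQL 8.0 (though not 5.7, as far as I know) already adds per-transaction queueing timestamps to performance_schema.replication_connection_status, which partially covers "when was the last event received". Column names as in the 8.0 manual; treat the exact set as version-dependent:

```sql
-- When did the I/O thread last hear from the master, and when was
-- the most recent transaction queued to the relay log?
SELECT LAST_HEARTBEAT_TIMESTAMP,
       LAST_QUEUED_TRANSACTION,
       LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP,
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP
  FROM performance_schema.replication_connection_status;
```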
[23 Mar 2020 14:06] MySQL Verification Team
Hi Simon,

Just to check one detail before I go further.

You are talking about
 - executing "start slave" and waiting a long time for it to finish (like the 3 minutes you showed), but after start slave is "executed" the replication starts immediately and you can see your changes propagate?

or you are talking about
 - "start slave" finished (in whatever time) and then too much time passes until the replication actually starts moving and your changes start propagating?

or the third option where
 - start slave takes a while to complete
 - and after start slave finishes you don't immediately see replication catching up but it is stalled for a while

thanks and good health
[25 Mar 2020 17:20] Simon Mudd

To be clear I've experienced both issues:
* start slave taking a while to start. It's not clear why
* start slave starting immediately and the i/o thread being shown as connected, but then not seeing any "progress" of transactions on the slave. This usually resolves itself after a while (a few minutes).

Sorry for not being clear. This doesn't happen all the time and I'm aware that the description is rather vague.
[25 Mar 2020 17:45] MySQL Verification Team
Hi Simon,

All clear, that's why we have 2way communication :D to make things clearer.

The first issue (start slave taking long to start) is IMO acceptable as long as we properly show what's going on. So some additional logging will have to happen here. As for the second issue, this might be a problem. 

At the moment I'm not able to reproduce either, but we are working on this (the whole team, not just me). Will update the report as soon as I have something. 

all best