Bug #104677 Old joiner blocked for a long time in failure recovery (local recovery phase)
Submitted: 20 Aug 2021 12:04 Modified: 23 Sep 2021 6:41
Reporter: Steven Curry
Status: No Feedback
Category: MySQL Server: Group Replication
Severity: S3 (Non-critical)
Version: 8.0.23
OS: Debian
Assigned to: MySQL Verification Team
CPU Architecture: Any

[20 Aug 2021 12:04] Steven Curry
Description:
I simulated failure recovery in single-primary mode, and the local recovery phase took a lot of time.

While the secondary container was restarting (which took about 5 seconds), I made sure that only mysql-router performed transactions in the group. Yet after the secondary container rejoined the group, it spent 10-20 minutes in the local recovery phase, while the global recovery phase completed in a few seconds.
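To tell the two phases apart I watch the standard performance_schema tables; a minimal sketch of the queries I use:

  -- Local recovery replays the joiner's own relay log on the
  -- group_replication_applier channel; global (distributed) recovery then
  -- fetches missing transactions over the group_replication_recovery channel.
  SELECT CHANNEL_NAME, SERVICE_STATE
    FROM performance_schema.replication_applier_status;

  -- The joiner shows MEMBER_STATE = 'RECOVERING' until both phases finish.
  SELECT MEMBER_HOST, MEMBER_STATE
    FROM performance_schema.replication_group_members;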

I have asked many people; they have encountered the same situation, but nobody knows what causes it.

I tried to find the cause along the following lines:
1. It may be the size of the relay log file
2. It may be that the gtid_executed set is missing transactions
3. It may be the relay log position recorded for the applier channel's workers (Relay_log_pos)
4. It may be the view ID

For the first case, I filled the relay log (a file of about 50 MB) with many transactions at once and then restarted the secondary container. I expected recovery to be very slow, but it actually completed in a few seconds. However, if the relay log fills up slowly, for example when only mysql-router has executed transactions for several days (again about 50 MB of relay log), recovery is particularly slow.
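To measure the relay log size I read Relay_Log_Space from the applier channel status, roughly like this (SHOW REPLICA STATUS is available on 8.0.23; SHOW SLAVE STATUS is the older spelling):

  -- Relay_Log_Space = combined size of all relay log files for the channel
  SHOW REPLICA STATUS FOR CHANNEL 'group_replication_applier'\G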

For the second case, I checked the gtid_executed sets in the group before restarting the secondary container. The recovering node's set differed from the other members' by only 1-2 transactions, yet recovery was still very slow.
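I compared the sets with GTID_SUBTRACT; a sketch, where the two quoted sets are placeholders for the values read from each member:

  -- On each member:
  SELECT @@GLOBAL.gtid_executed;

  -- Transactions the joiner is missing (substitute the real sets):
  SELECT GTID_SUBTRACT('<gtid_executed of a group member>',
                       '<gtid_executed of the joiner>') AS missing_gtids;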

For the third case, I checked the records in the slave_worker_info table before restarting the secondary container. Relay_log_pos recorded a fairly large value, which I understood to mean that recovery should resume from that position, but in practice it still starts over from position 0.
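For reference, the query I ran (the table lives in the mysql schema):

  SELECT Channel_name, Relay_log_name, Relay_log_pos
    FROM mysql.slave_worker_info
   WHERE Channel_name = 'group_replication_applier';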

For the fourth case, after forcing the group's view ID to change, I restarted the secondary container, but recovery was still very slow.
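The view ID before and after the restart can be read from performance_schema:

  SELECT VIEW_ID, MEMBER_ID
    FROM performance_schema.replication_group_member_stats;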

How to repeat:
The way to reproduce is very simple. Only mysql-router executes transactions for several days (until the relay log is about 50 MB), then restart one of the secondary containers. The error log shows the joiner blocked in the local recovery phase (group_replication_applier channel) for 5-20 minutes, while global recovery then takes only a few seconds.
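While it is blocked you can watch the applier workers grind through the relay log; a sketch:

  SELECT WORKER_ID, SERVICE_STATE, APPLYING_TRANSACTION
    FROM performance_schema.replication_applier_status_by_worker
   WHERE CHANNEL_NAME = 'group_replication_applier';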
[23 Aug 2021 6:41] MySQL Verification Team
Hi,

First, can you please try this with the latest release, 8.0.26?

Can you share your config, especially the value of group_replication_consistency?
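For example, the output of:

  SHOW GLOBAL VARIABLES LIKE 'group_replication%';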

> The way to reproduce is very simple. 
> Only mysql-router executes transactions for several days

There is nothing "simple" in a test that lasts "several days".
Can you share more about this test case?
 - mysql-router does not execute anything itself; it passes queries from your application / client to the MySQL server. How exactly did you configure mysql-router, what type of queries are you executing, and how many of them? Is the amount of data read/changed important, or is it important to "let it stew for a few days"?
 - What script are you running?

> blocked in the local recovery phase (group_replication_applier channel) for 5-20 minutes, while global recovery then takes only a few seconds

This is fairly common in a multi-primary setup, but in a single-primary setup it is not something I can reproduce. Are you sure you are not running a multi-primary setup?
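You can confirm the topology with something like:

  SELECT @@GLOBAL.group_replication_single_primary_mode;

  SELECT MEMBER_HOST, MEMBER_ROLE
    FROM performance_schema.replication_group_members;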

thanks
[24 Sep 2021 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".