MySQL Bugs: #87474: Slave channels start before group replication starts preventing joins

Bug #87474	Slave channels start before group replication starts preventing joins
Submitted:	18 Aug 2017 11:07	Modified:	26 Feb 2018 13:29
Reporter:	Pedro Gomes
Status:	Closed
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	5.7.20	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Let's look at mysqld.cc, at the main server initialization method:
int mysqld_main(int argc, char **argv)

Note the following (line numbers are approximate):
line 4913
init_slave(); /* Ignoring errors while configuring replication. */
and at line 5032
(void) RUN_HOOK(server_state, before_handle_connection, (NULL));

-------------------
An aside: this hook is where group replication starts.
Only at this point can it access the SQL API to set the read mode on start.
-------------------

What this means is that slave channels start before group replication starts.
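
To make the ordering concrete, here is a minimal sketch that only records the order of the two steps. The `*_sketch` functions and `StartupLog` are hypothetical stand-ins invented for illustration; they are not the server's code and do not reproduce what init_slave() or the hook actually do.

```cpp
#include <string>
#include <vector>

// Hypothetical event log: records the order in which startup steps run.
struct StartupLog {
  std::vector<std::string> events;
};

// Stand-in for init_slave(): per the report, channels start here.
void init_slave_sketch(StartupLog &log) {
  log.events.push_back("channels started");
}

// Stand-in for the before_handle_connection hook: per the report,
// this is where group replication starts.
void before_handle_connection_sketch(StartupLog &log) {
  log.events.push_back("group replication started");
}

// Mirrors the ordering described for mysqld_main(): channels first,
// group replication second.
StartupLog mysqld_main_sketch() {
  StartupLog log;
  init_slave_sketch(log);
  before_handle_connection_sketch(log);
  return log;
}
```

Any transaction a channel applies between the two steps is applied before the member has joined the group.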

In single primary mode:
SP1- If the member is the bootstrap server, then it doesn't matter, all the applied data becomes recovery data for the group.
SP2- If the member is not the bootstrap server, then the start will fail, but here the bug is arguable because you can either:
A) Start the channels first and then GR fails (current)
B) Start GR first and then channels will fail at start
So for single primary mode it is arguable that this is not a bug.

In multi primary mode:
MP1- Again, if the member is the bootstrap server then no issue exists, as in SP1.
MP2- If the member tries to rejoin a running group, it is possible that the extra received transactions (logged as being local) will prevent the join.

So MP2 is the case to look at in this bug.
The user scenario would be a member of the group that is receiving data from an external source.
If this member restarts, it is possible that it won't be able to join the group because the started channel applied some local transactions before it joined the group.
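
As a rough illustration of why the extra transactions matter, the sketch below models the join check as a subset test: the member can join only if the group already has every transaction the member has executed. This is a simplified model, assuming transactions can be represented as plain string labels; `GtidSet`, `can_join`, and the labels used with them are hypothetical names, not the plugin's API.

```cpp
#include <algorithm>
#include <set>
#include <string>

// Simplified model: one string label per executed transaction.
using GtidSet = std::set<std::string>;

// The member may join only if every transaction it has executed is
// already known to the group (member set is a subset of the group set).
bool can_join(const GtidSet &member_executed, const GtidSet &group_executed) {
  return std::includes(group_executed.begin(), group_executed.end(),
                       member_executed.begin(), member_executed.end());
}
```

In this model, a member whose set equals the group's set can join, while a member that applied even one extra transaction from its channel before joining cannot.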

If channels started later, or if we somehow delayed the application of data, then the server could be restarted without issue.

How to repeat:
I don't have a reproducible case; for now this is a theoretical case.
You can complain about it when I return from vacation

Suggested fix:
Somehow delay the start of the SQL threads in case the plugin is present and instructed to run?

There is still the question of whether the fix for MP2 will also lead to SP2 case B, and if that is desirable.
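
One way to picture the suggested delay is a gate that queues channel start requests until the member is online. This is only a sketch under that assumption, not the fix that was shipped; `ChannelStartGate` and its methods are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical gate: start requests made before the member is online are
// queued, then run in order once the member reports online.
class ChannelStartGate {
 public:
  // Request a channel start; it runs immediately only if already online.
  void request_start(std::function<void()> start_channel) {
    assert(start_channel);  // an empty callback would be a caller bug
    if (online_) {
      start_channel();
    } else {
      pending_.push_back(std::move(start_channel));
    }
  }

  // Called when the member becomes online; runs the queued starts.
  void member_online() {
    online_ = true;
    for (auto &start : pending_) start();
    pending_.clear();
  }

 private:
  bool online_ = false;
  std::vector<std::function<void()>> pending_;
};
```

With a gate like this, a channel configured on a restarting member would not apply anything until the join has completed. Whether a start should then fail on a non-primary member (SP2 case B) is a separate decision that the sketch does not address.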

Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 8.0.11 changelog:
In a multi-primary group, when a member was also configured with an asynchronous replication channel, there was a possibility that the asynchronous channel could start before Group Replication started. This could result in the asynchronous channel processing transactions before the member became an online member of the group, causing issues when members tried to join the group. The fix ensures that asynchronous channels on group members do not start until the member has become online.

Reclosing.