MySQL Bugs: #113838: Async connection failover prevents failed Group Replication member rejoin

Bug #113838	Async connection failover prevents failed Group Replication member rejoin
Submitted:	31 Jan 2024 19:24	Modified:	14 Feb 2024 0:27
Reporter:	Matthew Boehm
Status:	Closed
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	8.0+	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	asynchronous, group replication, Replication

Description:
When async replication is configured between two Group Replication clusters, a failed member of the replica cluster cannot rejoin its group. On mysqld restart, the async channel starts automatically and reconnects before Group Replication starts, and GR refuses to start on a secondary member while an async channel is running.

As a result, a GR member cannot auto-rejoin the group after a temporary failure; a DBA must intervene manually.

Ref: https://dev.mysql.com/doc/refman/8.0/en/replication-asynchronous-connection-failover-repli...

How to repeat:
Consider the following two Group Replication clusters, both in single-primary mode. All members are configured to start GR on boot.

  GR1: node1 (P), node2 (S), node3 (S) (01010101-aaaa-bbbb-cccc-dddddddddddd)
  GR2: node4 (P), node5 (S), node6 (S) (02020202-aaaa-bbbb-cccc-dddddddddddd)

-- Create the repl user in GR1
node1> CREATE USER 'grharepl'@'%' IDENTIFIED BY 'repl1234#';
node1> GRANT REPLICATION SLAVE ON *.* TO 'grharepl'@'%';
node1> GRANT SELECT ON performance_schema.* TO 'grharepl'@'%';

-- Create the HA replication channel on all members of the DR group (GR2)
node4/5/6> CHANGE REPLICATION SOURCE TO SOURCE_HOST='127.0.0.1', SOURCE_USER='grharepl', SOURCE_PASSWORD='repl1234#', SOURCE_AUTO_POSITION=1, SOURCE_SSL=1, SOURCE_RETRY_COUNT=3, SOURCE_CONNECT_RETRY=30 FOR CHANNEL 'gr1HA';

-- On node4, configure async connection failover for managed group
-- node1 is the current Primary and running on port 24536
--
-- Ref: https://dev.mysql.com/doc/refman/8.0/en/replication-functions-async-failover.html#function...

node4> CHANGE REPLICATION SOURCE TO SOURCE_CONNECTION_AUTO_FAILOVER=1 FOR CHANNEL 'gr1HA';

node4> SELECT asynchronous_connection_failover_add_managed('gr1HA', 'GroupReplication', '01010101-aaaa-bbbb-cccc-dddddddddddd', '127.0.0.1', 24536, '', 80, 60);
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_managed('gr1HA', 'GroupReplication', '01010101-aaaa-bbbb-cccc-dddddddddddd', '127.0.0.1', 24536, '', 80, 60) |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| The UDF asynchronous_connection_failover_add_managed() executed successfully.                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

-- Start the HA channel and verify replication from GR1 -> GR2
node4> START REPLICA FOR CHANNEL 'gr1HA';
node1> INSERT INTO test.table VALUES (1);
node5> SELECT * FROM test.table;
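
Before simulating the outage, it may help to confirm that the managed failover configuration was actually registered. A minimal sketch, assuming it is run on node4 with an account that can read performance_schema; SELECT * is used so the example does not depend on exact column names:

```sql
-- Managed group registered by asynchronous_connection_failover_add_managed()
SELECT * FROM performance_schema.replication_asynchronous_connection_failover_managed;

-- Candidate sources the channel can fail over to (expected to list GR1 members)
SELECT * FROM performance_schema.replication_asynchronous_connection_failover;

-- Current receiver state of each channel, including 'gr1HA'
SELECT CHANNEL_NAME, SERVICE_STATE
  FROM performance_schema.replication_connection_status;
```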

-- Shut down mysql on node4, simulating an outage
node4# systemctl stop mysql

-- This causes a leader election on GR2 and re-establishes the gr1HA async channel on the new PRIMARY of GR2.

-- Some moments later, node4 restarts, but it is unable to rejoin the group:

2024-01-31T18:57:17.920452Z 7 [System] [MY-014002] [Repl] Replica receiver thread for channel 'gr1ha': connected to source 'grharepl@127.0.0.1:24537' with server_uuid=00024537-2222-2222-2222-222222222222, server_id=200. Starting GTID-based replication.

2024-01-31T18:58:19.933650Z 43 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2024-01-31T18:58:19.934596Z 43 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2024-01-31T18:58:19.938972Z 43 [ERROR] [MY-011638] [Repl] Plugin group_replication reported: 'Can't start group replication on secondary member with single-primary mode while asynchronous replication channels are running.'
2024-01-31T18:58:19.939036Z 43 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2024-01-31T18:58:19.940409Z 43 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
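
The state reported in the log can also be checked from SQL. The following is only a sketch, assuming it is run on node4 right after the failed start; it should show the async channel as running while node4 is not an ONLINE member of the group:

```sql
-- Receiver (I/O) thread state per channel; 'gr1HA' is expected to be ON here
SELECT CHANNEL_NAME, SERVICE_STATE
  FROM performance_schema.replication_connection_status;

-- Applier state per channel
SELECT CHANNEL_NAME, SERVICE_STATE
  FROM performance_schema.replication_applier_status;

-- Group membership as seen from node4
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;
```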

-- The "Asynchronous Connection Failover for Replicas" has now prevented the node from rejoining the cluster of which it was a previous member.

-- The async channel must now be manually stopped on node4, and GR started manually.
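
The manual intervention described above can be sketched as follows, assuming it is run on node4 by an account with the privileges needed to control replication and Group Replication:

```sql
-- On node4: stop the auto-started async channel that blocks GR
STOP REPLICA FOR CHANNEL 'gr1HA';

-- With no async channel running, the member can rejoin its group
START GROUP_REPLICATION;

-- Confirm that node4 reaches ONLINE
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;
```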

Suggested fix:
Two potential solutions:

 1) Quick fix: set skip_replica_start=1 on all GR members. This prevents node4 from auto-starting the gr1HA channel on restart, which then allows GR to rejoin successfully. This should be documented, and MySQL should enforce it when the async failover feature is set up.

 2) During startup of the GR plugin, have the plugin stop all async channels and write warnings to the error log indicating this. This would also allow the member to rejoin automatically.
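
As an illustration of option 1, the setting could be persisted from SQL instead of editing the option file. This is a sketch under assumptions: skip_replica_start is not dynamic, so it is assumed here that the server version accepts SET PERSIST_ONLY for it (older 8.0 releases use the name skip_slave_start), and the value only takes effect at the next restart:

```sql
-- On every GR member: persist the setting for the next restart
SET PERSIST_ONLY skip_replica_start = ON;

-- After the restart, verify that replication channels are not auto-started
SELECT @@global.skip_replica_start;
```

With this in place, the gr1HA channel has to be started explicitly, or by the failover mechanism, on whichever member is the primary.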

Hi Matthew,

I am not 100% sure this is a bug, and I do not agree that it is S2 (I have dropped it to S3). I can reproduce the problem, so I am verifying the report and passing it down to the GR team to see what they have to say.

Fixed in 8.0 and later versions of the Manual, in mysqldoc rev 77836.

Closed.