Bug #108573: using 'manualStartOnBoot': true in create_replica_cluster causes issues
Submitted: 21 Sep 2022 17:08    Modified: 2 Mar 2023 15:47
Reporter: Jay Janssen
Status: Can't repeat    Impact on me: None
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet    Severity: S3 (Non-critical)
Version: 8.0.30    OS: Any
Assigned to:    CPU Architecture: Any

[21 Sep 2022 17:08] Jay Janssen
Description:
I have a script that takes 3 nodes available for a replica cluster and:
1) runs create_replica_cluster on the first
2) runs add_instance to join the remaining 2 to the first.

I have started experimenting with manualStartOnBoot, which has to be set when clusters are created (either dba.create_cluster or cs.create_replica_cluster). I am creating my replica cluster like this:

    create_opts={
      "recoveryMethod": "clone",
      "interactive": False,
      "timeout": 172800, # wait for the new instance to catch up
      "manualStartOnBoot": True, # Workaround for https://bugs.mysql.com/bug.php?id=108339
    }
    seed_clusterset.create_replica_cluster(args.standalone, args.name, create_opts)
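
For completeness, `seed_clusterset` above comes from something like this (a minimal sketch; the connection URI is illustrative):

    # Sketch: obtain the ClusterSet handle used above (URI is illustrative).
    # shell and dba are MySQL Shell Python-mode globals.
    shell.connect("clusteradmin@10.160.132.135:3306")  # a PRIMARY cluster member
    seed_clusterset = dba.get_cluster_set()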

The create_replica_cluster call works fine:

Adding a new replica cluster jaytest-development-001-usw2 to the InnoDB ClusterSet
The standalone instance 10.168.138.178 will be used to initialize the replica cluster
Setting up replica 'jaytest-development-001-usw2' of cluster 'jaytest-development-001-use1' at instance '10.168.138.178:3306'.

A new InnoDB Cluster will be created on instance '10.168.138.178:3306'.

Validating instance configuration at 10.168.138.178:3306...

This instance reports its own address as 10.168.138.178:3306

Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using '10.168.138.178:3306'. Use the localAddress option to override.

* Checking transaction state of the instance...

NOTE: The target instance '10.168.138.178:3306' has not been pre-provisioned (GTID set is empty). The Shell is unable to decide whether replication can completely recover its state.

Clone based recovery selected through the recoveryMethod option

Waiting for clone process of the new member to complete. Press ^C to abort the operation.
* Waiting for clone to finish...
NOTE: 10.168.138.178:3306 is being cloned from 10.160.132.135:3306
** Stage DROP DATA: Completed
** Clone Transfer
    FILE COPY  ############################################################  100%  Completed
    PAGE COPY  ############################################################  100%  Completed
    REDO COPY  ############################################################  100%  In Progress

NOTE: 10.168.138.178:3306 is shutting down...

* Waiting for server restart... ready
* 10.168.138.178:3306 has restarted, waiting for clone to finish...
** Stage RESTART: Completed
* Clone process has finished: 75.76 MB transferred in 2 sec (37.88 MB/s)

Creating InnoDB Cluster 'jaytest-development-001-usw2' on '10.168.138.178:3306'...

Adding Seed Instance...
Cluster successfully created. Use Cluster.add_instance() to add MySQL instances.
At least 3 instances are needed for the cluster to be able to withstand up to
one server failure.

* Configuring ClusterSet managed replication channel...
** Changing replication source of 10.168.138.178:3306 to 10.160.132.135:3306

* Waiting for instance '10.168.138.178:3306' to synchronize with PRIMARY Cluster...
** Transactions replicated  ############################################################  100%

* Updating topology

Replica Cluster 'jaytest-development-001-usw2' successfully created on ClusterSet 'jaytest-development-001'.

But when I add my remaining instances, they get stuck in 'group_replication is stopped' even though add_instance reports no error.
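
The join step boils down to a call like this for each remaining node (a sketch; the log below confirms recoveryMethod was passed, the rest is assumed):

    # Sketch of step 2: join a remaining node to the new replica cluster.
    # Assumes the Shell session is connected to a member of that cluster.
    replica_cluster = dba.get_cluster(args.name)
    replica_cluster.add_instance("10.168.139.49:3306",
                                 {"recoveryMethod": "clone"})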

Adding instance 10.168.139.49 to cluster jaytest-development-001-usw2...
WARNING: Using a password on the command line interface can be insecure.
NOTE: A GTID set check of the MySQL instance at '10.168.139.49:3306' determined that it is missing transactions that were purged from all cluster members.
NOTE: The target instance '10.168.139.49:3306' has not been pre-provisioned (GTID set is empty). The Shell is unable to determine whether the instance has pre-existing data that would be overwritten with clone based recovery.

Clone based recovery selected through the recoveryMethod option

Validating instance configuration at 10.168.139.49:3306...

This instance reports its own address as 10.168.139.49:3306

Instance configuration is suitable.
NOTE: Group Replication will communicate with other members using '10.168.139.49:3306'. Use the localAddress option to override.

A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.

Adding instance to the cluster...

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...

* Configuring ClusterSet managed replication channel...
** Changing replication source of 10.168.139.49:3306 to 10.160.132.135:3306

Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.168.139.49:3306 is being cloned from 10.168.138.178:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 75.75 MB transferred in about 1 second (~75.75 MB/s)

The instance '10.168.139.49:3306' was successfully added to the cluster.

But I get this in the cluster status:

{
    "clusterName": "jaytest-development-001-usw2",
    "clusterRole": "REPLICA",
    "clusterSetReplicationStatus": "OK",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "10.168.138.178:3306",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE_PARTIAL",
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
        "topology": {
            "10.168.138.178:3306": {
                "address": "10.168.138.178:3306",
                "memberRole": "PRIMARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLagFromImmediateSource": "",
                "replicationLagFromOriginalSource": "",
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.30"
            },
            "10.168.139.218:3306": {
                "address": "10.168.139.218:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.30"
            },
            "10.168.139.49:3306": {
                "address": "10.168.139.49:3306",
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ],
                "memberRole": "SECONDARY",
                "memberState": "OFFLINE",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.30"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "domainName": "jaytest-development-001",
    "groupInformationSourceMember": "10.168.138.178:3306",
    "metadataServer": "10.160.132.135:3306"

This is bizarre, because when I do the same thing with a regular cluster (using create_cluster), I don't see this issue and the nodes join normally.

Further, I can't get group replication started on nodes in this state:

 MySQL  localhost:33060+ ssl  SQL > start group_replication;
ERROR: 3092: The server is not configured properly to be an active member of the group. Please see more details on error log.

I'll attach the error log from that instance.
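
To compare against a healthy member, the Group Replication settings on a node in this state can be dumped from the Shell's Python mode (a minimal sketch; assumes `session` is connected to the failing instance):

    # Dump all group_replication settings on the failing node so they
    # can be diffed against a healthy member (session is the Shell global).
    res = session.run_sql("SHOW GLOBAL VARIABLES LIKE 'group_replication%'")
    for row in res.fetch_all():
        print("{0} = {1}".format(row[0], row[1]))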

How to repeat:
Use 'manualStartOnBoot': True with create_replica_cluster. If I remove it, I have no issues.
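
Condensed, the repro looks like this (a sketch; hosts and names are taken from the logs above):

    # Condensed repro sketch: create the replica cluster with
    # manualStartOnBoot, then add a second instance to it.
    cs = dba.get_cluster_set()
    rc = cs.create_replica_cluster("10.168.138.178:3306",
                                   "jaytest-development-001-usw2",
                                   {"recoveryMethod": "clone",
                                    "manualStartOnBoot": True})
    rc.add_instance("10.168.139.49:3306", {"recoveryMethod": "clone"})
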
[21 Sep 2022 17:10] Jay Janssen
log from the instance with issues

Attachment: bad.log (application/octet-stream, text), 8.67 KiB.

[29 Sep 2022 17:37] MySQL Verification Team
Hi Jay,

Thank you for the report
[2 Mar 2023 15:47] Miguel Araujo
Hi Jay,

I'm unable to reproduce the bug. I tried using the option in the Primary Cluster and Replica Clusters and combining it with other options such as 'recoveryMethod', but the outcome is always the same: the operations are successful.

Can you please share the result of `show global variables` on all instances?
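
For reference, one way to capture that from every node via the Shell's Python mode (a sketch; the host list and account are illustrative):

    # Sketch: save SHOW GLOBAL VARIABLES from each instance to a file.
    hosts = ["10.160.132.135", "10.168.138.178",
             "10.168.139.218", "10.168.139.49"]
    for host in hosts:
        s = shell.connect("clusteradmin@{0}:3306".format(host))  # prompts for the password
        with open("globals-{0}.txt".format(host), "w") as f:
            for row in s.run_sql("SHOW GLOBAL VARIABLES").fetch_all():
                f.write("{0} = {1}\n".format(row[0], row[1]))
        s.close()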