Bug #108475 addInstance exits with "Not connected" after clone completes
Submitted: 13 Sep 2022 15:12  Modified: 29 Sep 2022 16:28
Reporter: Jay Janssen
Status: Verified  Impact on me: None
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet  Severity: S2 (Serious)
Version: 8.0.30  OS: Any
Assigned to:  CPU Architecture: Any

[13 Sep 2022 15:12] Jay Janssen
Description:
I am seeing an issue where addInstance sometimes exits with an "ERROR: Not connected" message right after a large-ish clone finishes.

When this happens, the instance has joined the cluster, but its metadata is not updated and I have to rescan the cluster, so addInstance is evidently exiting before it performs that step.

Here's a log output of it happening:

Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.162.255.161:3306 is being cloned from 10.162.254.204:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed

NOTE: 10.162.255.161:3306 is shutting down...

* Waiting for server restart... ready
* 10.162.255.161:3306 has restarted, waiting for clone to finish...
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 278.96 GB transferred in 23 min 36 sec (197.01 MB/s)

State recovery already finished for '10.162.255.161:3306'

ERROR: Not connected

How to repeat:
I tried reproducing this with `--log-sql=on --verbose=2` in mysqlsh, but the few times I tried, I could not. It may be that these options change the behavior somehow.

That said, I've seen several new cluster nodes exit this way, so I'm continuing to try to reproduce and investigate; this is an important problem to solve for my environment.
[13 Sep 2022 15:57] Jay Janssen
Checking further, this *might* be a manifestation of "dba.restartWaitTimeout" not being set properly on the shell, so maybe a false alarm for this bug.

I was able to reproduce the same basic behavior, though this time the log was slightly different:

```
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.162.254.54:3306 is being cloned from 10.162.255.161:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed

NOTE: 10.162.254.54:3306 is shutting down...

* Waiting for server restart... timeout
WARNING: Clone process appears to have finished and tried to restart the MySQL server, but it has not yet started back up.

Please make sure the MySQL server at '10.162.254.54:3306' is restarted and call <Cluster>.rescan() to complete the process. To increase the timeout, change shell.options["dba.restartWaitTimeout"].
ERROR: Timeout waiting for server to restart
```
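The timeout hint in the log above can be applied before calling addInstance. A minimal sketch from a mysqlsh (JS mode) session; the 1200-second value is an arbitrary example sized for a clone in the 20-30 minute range, not a recommendation:

```js
// Raise the restart wait timeout for the current session only:
shell.options["dba.restartWaitTimeout"] = 1200;

// Or persist it across mysqlsh sessions:
shell.options.setPersist("dba.restartWaitTimeout", 1200);
```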

Is it possible the shell will give either error for the same core issue?
[14 Sep 2022 13:46] Jay Janssen
Update: I corrected my issue with dba.restartWaitTimeout not being set properly, but I'm still seeing the "ERROR: Not connected":

* Waiting for server restart... ready
* 10.162.255.177:3306 has restarted, waiting for clone to finish...
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 297.46 GB transferred in 26 min 14 sec (188.98 MB/s)
Incremental state recovery is now in progress.

* Waiting for distributed recovery to finish...
NOTE: '10.162.255.177:3306' is being recovered from '10.162.254.180:3306'

* Distributed recovery has finished

ERROR: Not connected
[14 Sep 2022 13:57] Jay Janssen
This cluster is under a decent amount of load when this happens. I am not seeing the issue in a smaller cluster with proportionally less load and a shorter clone time.
[14 Sep 2022 14:01] Jay Janssen
Upgrading severity, as this is disrupting my instance automation.

I can't easily recover from this on the instance itself, because I have to rescan the cluster from another instance to get the instance metadata updated.
[21 Sep 2022 14:22] Jay Janssen
Just had an instance of what I think is the same bug, with a slightly different error message:

* Clone process has finished: 61.45 GB transferred in 11 min 22 sec (90.11 MB/s)

Incremental state recovery is now in progress.

* Waiting for distributed recovery to finish...
NOTE: '10.170.254.76:3306' is being recovered from '10.170.255.37:3306'
* Distributed recovery has finished

ERROR: Lost connection to MySQL server during query

Doing a cluster rescan and re-adding the instance resolved the problem.
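For anyone hitting the same state, the manual recovery described above looks roughly like this from a mysqlsh (JS mode) session connected to a healthy member; this is a sketch, and the exact option values depend on your cluster:

```js
// Pick up the instance that joined the group but never had its
// metadata written, without being prompted for each one:
var cluster = dba.getCluster();
cluster.rescan({addInstances: "auto"});
cluster.status();  // verify the instance now shows up as a member
```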
[23 Sep 2022 14:45] Jay Janssen
Another complaint here: despite add-instance printing an error message, the command line integration (i.e., `mysqlsh cluster add-instance ...`) returns 0 rather than a non-zero error code.

If any command hits any kind of error/exception, the command line integration (and the equivalent Python and JS functions) should return proper error codes and/or throw exceptions.
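Until that's fixed, automation has to detect the failure itself. One possible workaround sketch: since the exit status can't be trusted, wrap the call and also scan the output for an `ERROR:` line (the `run_and_check` helper and the example addresses below are hypothetical, not part of mysqlsh):

```shell
# run_and_check CMD...: run a command, echo its output, and fail if either
# the exit status is non-zero or the output contains an "ERROR:" line.
# Works around the add-instance CLI returning 0 after printing an error.
run_and_check() {
  out=$("$@" 2>&1)
  status=$?
  printf '%s\n' "$out"
  if [ "$status" -ne 0 ] || printf '%s\n' "$out" | grep -q 'ERROR:'; then
    return 1
  fi
}

# Intended use (placeholder addresses; requires a live cluster):
# run_and_check mysqlsh root@10.162.254.204:3306 -- cluster add-instance \
#     root@10.162.255.161:3306 --recoveryMethod=clone
```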
[29 Sep 2022 16:28] MySQL Verification Team
Hi Jay,

Thanks for the report. I'm not able to reproduce this issue, as it looks like it requires a specific load, but I'll see what the shell team can do with it.