Bug #108475 addInstance exits with "Not connected" after clone completes
Submitted: 13 Sep 2022 15:12  Modified: 29 Sep 2022 16:28
Reporter: Jay Janssen
Status: Verified  Impact on me: None
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet  Severity: S2 (Serious)
Version: 8.0.30  OS: Any
Assigned to:  CPU Architecture: Any

[13 Sep 2022 15:12] Jay Janssen
Description:
I am seeing an issue where addInstance sometimes exits with an "ERROR: Not connected" message right after a large-ish clone finishes.

When this happens, the instance has joined the cluster, but its metadata is not updated and I have to rescan the cluster, so addInstance is evidently exiting before it performs that step.

Here's a log output of it happening:

Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.162.255.161:3306 is being cloned from 10.162.254.204:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed

NOTE: 10.162.255.161:3306 is shutting down...

* Waiting for server restart... ready
* 10.162.255.161:3306 has restarted, waiting for clone to finish...
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 278.96 GB transferred in 23 min 36 sec (197.01 MB/s)

State recovery already finished for '10.162.255.161:3306'

ERROR: Not connected

How to repeat:
I tried reproducing this with `--log-sql=on --verbose=2` in mysqlsh, but the few times I tried, I could not. It may be that these options change the behavior somehow.

That said, I've seen several new cluster nodes exit this way, so I'm continuing to try to reproduce and investigate; this is an important problem to solve for my environment.
[13 Sep 2022 15:57] Jay Janssen
Checking further, this *might* be a manifestation of "dba.restartWaitTimeout" not being set properly on the shell, so maybe a false alarm for this bug.

I was able to reproduce the same basic behavior, though this time the log was slightly different:

```
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.162.254.54:3306 is being cloned from 10.162.255.161:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed

NOTE: 10.162.254.54:3306 is shutting down...

* Waiting for server restart... timeout
WARNING: Clone process appears to have finished and tried to restart the MySQL server, but it has not yet started back up.

Please make sure the MySQL server at '10.162.254.54:3306' is restarted and call <Cluster>.rescan() to complete the process. To increase the timeout, change shell.options["dba.restartWaitTimeout"].
ERROR: Timeout waiting for server to restart
```
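The timeout hint in the log above can be applied before calling addInstance. A minimal sketch from a mysqlsh (JS mode) session; the 1200-second value is an arbitrary example sized for a clone in the 20-30 minute range, not a recommendation:

```js
// Raise the restart wait timeout for the current session only:
shell.options["dba.restartWaitTimeout"] = 1200;

// Or persist it across mysqlsh sessions:
shell.options.setPersist("dba.restartWaitTimeout", 1200);
```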

Is it possible the shell will give either error for the same core issue?
[14 Sep 2022 13:46] Jay Janssen
Update: I corrected my issue with dba.restartWaitTimeout not being set properly, but I'm still seeing the "ERROR: Not connected":

* Waiting for server restart... ready
* 10.162.255.177:3306 has restarted, waiting for clone to finish...
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 297.46 GB transferred in 26 min 14 sec (188.98 MB/s)
Incremental state recovery is now in progress.

* Waiting for distributed recovery to finish...
NOTE: '10.162.255.177:3306' is being recovered from '10.162.254.180:3306'

* Distributed recovery has finished

ERROR: Not connected
[14 Sep 2022 13:57] Jay Janssen
This cluster is under a decent amount of load when this happens. I am not seeing the issue in a smaller cluster with proportionally less load and a shorter clone time.
[14 Sep 2022 14:01] Jay Janssen
Upgrading severity, as this is disrupting my instance automation.

I can't easily recover from this on the instance itself, because I have to rescan the cluster from another instance to get the instance metadata updated.
[21 Sep 2022 14:22] Jay Janssen
Just had an instance of what I think is the same bug, with a slightly different error message:

* Clone process has finished: 61.45 GB transferred in 11 min 22 sec (90.11 MB/s)

Incremental state recovery is now in progress.

* Waiting for distributed recovery to finish...
NOTE: '10.170.254.76:3306' is being recovered from '10.170.255.37:3306'
* Distributed recovery has finished

ERROR: Lost connection to MySQL server during query

Doing a cluster rescan and re-adding the instance resolved the problem.
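For anyone hitting the same state, the manual recovery described above looks roughly like this from a mysqlsh (JS mode) session connected to a healthy member; this is a sketch, and the exact option values depend on your cluster:

```js
// Pick up the instance that joined the group but never had its
// metadata written, without being prompted for each one:
var cluster = dba.getCluster();
cluster.rescan({addInstances: "auto"});
cluster.status();  // verify the instance now shows up as a member
```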
[23 Sep 2022 14:45] Jay Janssen
Another complaint here: despite add-instance printing an error message, the command line integration (i.e., `mysqlsh cluster add-instance ...`) returns 0 rather than a non-zero error code.

If any command hits any kind of error/exception, the command line integration (and the equivalent Python and JS functions) should return proper error codes and/or throw exceptions.
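Until that's fixed, automation has to detect the failure itself. One possible workaround sketch: since the exit status can't be trusted, wrap the call and also scan the output for an `ERROR:` line (the `run_and_check` helper and the example addresses below are hypothetical, not part of mysqlsh):

```shell
# run_and_check CMD...: run a command, echo its output, and fail if either
# the exit status is non-zero or the output contains an "ERROR:" line.
# Works around the add-instance CLI returning 0 after printing an error.
run_and_check() {
  out=$("$@" 2>&1)
  status=$?
  printf '%s\n' "$out"
  if [ "$status" -ne 0 ] || printf '%s\n' "$out" | grep -q 'ERROR:'; then
    return 1
  fi
}

# Intended use (placeholder addresses; requires a live cluster):
# run_and_check mysqlsh root@10.162.254.204:3306 -- cluster add-instance \
#     root@10.162.255.161:3306 --recoveryMethod=clone
```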
[29 Sep 2022 16:28] MySQL Verification Team
Hi Jay,

Thanks for the report. I'm not able to reproduce this issue, as it looks like it requires a specific load, but I'll see what the shell team can do with it.