Bug #108065 clusterset.rejoinCluster() hangs
Submitted: 3 Aug 2022 15:46
Modified: 2 Sep 2022 16:49
Reporter: Jay Janssen
Status: Closed
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet
Severity: S3 (Non-critical)
Version: 8.0.30
OS: Any
Assigned to:
CPU Architecture: Any

[3 Aug 2022 15:46] Jay Janssen
Description:
I didn't have this issue with 8.0.29.

I have a clusterset and kill -9 all the mysqld nodes on the primary cluster. I then run a forcePrimaryCluster failover (it works, though it sometimes hits bug #108064). I then restart the killed nodes and execute dba.rebootClusterFromCompleteOutage(), which works.

I then get a clusterset object from my new primary cluster and attempt to rejoin the recovered cluster, but it hangs on "* Reconciling internally generated GTIDs".

I've waited 5-10 minutes so far before pressing Ctrl-C. Maybe I'm being impatient, but I don't see anything happening in the server logs.

How to repeat:
This seems repeatable.  I did this a lot on 8.0.29 and didn't have the issue.

* 2x3 node clusters in a cluster set.
* Sysbench load on the primary cluster via router

1. kill -9 `pidof mysqld` on every node in the primary cluster (1)
2. forcePrimaryCluster failover to the other side (2)
3. Restart nodes in cluster 1, issue dba.rebootClusterFromCompleteOutage() successfully
4. Using clusterset handle from cluster 2, try to rejoin the cluster
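
Steps 2 to 4 can be sketched as a function for MySQL Shell's JS mode. This is only an illustration of the call sequence, not a tested reproduction script: the function name, the option names, and the connection URIs are hypothetical, and steps 1 and 3a (killing and restarting mysqld) have to happen outside the Shell. `shell` and `dba` are the Shell's own globals, passed in as arguments here so the sequence can also be exercised with stand-ins.

```javascript
// Hypothetical helper: runs steps 2-4 of the reproduction in order.
// opts.survivingUri  - a member of the cluster that takes over (cluster 2)
// opts.recoveredUri  - a restarted member of the old primary cluster (cluster 1)
// opts.survivingName - name of cluster 2, e.g. "jay-test2-west"
// opts.recoveredName - name of cluster 1, e.g. "jay-test2-east"
function failoverAndRejoin(shell, dba, opts) {
  // Step 2: connect to the surviving cluster and force it to become primary.
  shell.connect(opts.survivingUri);
  dba.getClusterSet().forcePrimaryCluster(opts.survivingName);

  // Step 3: with the killed mysqld processes restarted (done outside the
  // Shell), reboot the old primary cluster from one of its members.
  shell.connect(opts.recoveredUri);
  dba.rebootClusterFromCompleteOutage();

  // Step 4: using a clusterset handle from cluster 2, rejoin cluster 1.
  // On 8.0.30 under write load, this is the call reported to hang.
  shell.connect(opts.survivingUri);
  const cs = dba.getClusterSet();
  cs.rejoinCluster(opts.recoveredName);
  return cs.status();
}

// Example call inside mysqlsh (the account name "admin" is an assumption):
// failoverAndRejoin(shell, dba, {
//   survivingUri: "admin@10.170.1.106:3306",
//   recoveredUri: "admin@10.162.0.219:3306",
//   survivingName: "jay-test2-west",
//   recoveredName: "jay-test2-east",
// });
```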

 MySQL  10.162.0.219:33060+ ssl  JS > dba.rebootClusterFromCompleteOutage()
NOTE: Instance 10.170.1.106:3306 has more recent metadata than 10.162.0.219:3306 (generation 2 vs 1), which suggests jay-test2-east has been invalidated
NOTE: Cluster jay-test2-east appears to have been invalidated, reconnecting to 10.170.1.106:3306.
Restoring the cluster 'jay-test2-east' from complete outage...

The instance '10.162.0.229:3306' was part of the cluster configuration but the Cluster is invalidated. Please rejoin the instance after the Cluster is rejoined to the ClusterSet
The instance '10.162.0.248:3306' was part of the cluster configuration but the Cluster is invalidated. Please rejoin the instance after the Cluster is rejoined to the ClusterSet
Validating instance configuration at 10.162.0.219:3306...

This instance reports its own address as 10.162.0.219:3306

Instance configuration is suitable.
* Waiting for seed instance to become ONLINE...
10.162.0.219:3306 was restored.
NOTE: Instance 10.170.1.106:3306 has more recent metadata than 10.162.0.219:3306 (generation 2 vs 1), which suggests jay-test2-east has been invalidated
NOTE: Cluster jay-test2-east appears to have been invalidated, reconnecting to 10.170.1.106:3306.
The cluster was successfully rebooted.

<Cluster:jay-test2-east>
 MySQL  10.162.0.219:33060+ ssl  JS > cs.status()
{
    "clusters": {
        "jay-test2-east": {
            "clusterErrors": [
                "WARNING: Replication channel from the Primary Cluster is missing",
                "WARNING: Cluster was invalidated and must be either removed from the ClusterSet or rejoined"
            ],
            "clusterRole": "REPLICA",
            "clusterSetReplication": {},
            "clusterSetReplicationStatus": "MISSING",
            "globalStatus": "INVALIDATED",
            "status": "INVALIDATED",
            "statusText": "Cluster was invalidated by the ClusterSet it belongs to."
        },
        "jay-test2-west": {
            "clusterRole": "PRIMARY",
            "globalStatus": "OK",
            "primary": "10.170.1.106:3306"
        }
    },
    "domainName": "jay-test2-global",
    "globalPrimaryInstance": "10.170.1.106:3306",
    "primaryCluster": "jay-test2-west",
    "status": "AVAILABLE",
    "statusText": "Primary Cluster available, there are issues with a Replica cluster."
}
 MySQL  10.162.0.219:33060+ ssl  JS > cs.rejoinCluster("jay-test2-east")
Rejoining cluster 'jay-test2-east' to the clusterset
NOTE: Cluster 'jay-test2-east' is invalidated
* Reconciling internally generated GTIDs

^^ Hangs here
[3 Aug 2022 18:35] Jay Janssen
I encountered a similar issue when just trying to do a clusterset switchover:

 MySQL  10.162.0.219:33060+ ssl  JS > cs.status()
{
    "clusters": {
        "jay-test2-east": {
            "clusterRole": "PRIMARY",
            "globalStatus": "OK",
            "primary": "10.162.0.219:3306"
        },
        "jay-test2-west": {
            "clusterRole": "REPLICA",
            "clusterSetReplicationStatus": "OK",
            "globalStatus": "OK"
        }
    },
    "domainName": "jay-test2",
    "globalPrimaryInstance": "10.162.0.219:3306",
    "primaryCluster": "jay-test2-east",
    "status": "HEALTHY",
    "statusText": "All Clusters available."
}
 MySQL  10.162.0.219:33060+ ssl  JS > cs.setPrimaryCluster("jay-test2-west")
Switching the primary cluster of the clusterset to 'jay-test2-west'
* Verifying clusterset status
** Checking cluster jay-test2-west
  Cluster 'jay-test2-west' is available
** Checking cluster jay-test2-east
  Cluster 'jay-test2-east' is available

* Reconciling internally generated GTIDs
[2 Sep 2022 16:49] Edward Gilmore
Posted by developer:
 
Added the following note to the MySQL Shell 8.0.31 release notes:

        ClusterSet commands which perform transaction set consistency
        checking, such as rejoinCluster and
        setPrimaryCluster, became unresponsive during
        view change log event reconciliations if the write load was
        high. This occurred because reconciliation included the entire
        transaction backlog instead of just the view change log events.
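
To illustrate the difference the note describes, here is a minimal sketch that narrows a GTID set down to the entries belonging to one UUID. It is not the Shell's actual implementation. It assumes that view change log events are written under a dedicated UUID (for example, the one configured through group_replication_view_change_uuid), so they can be separated from ordinary transactions by UUID alone. The function names and the UUIDs in the usage comment are made up for the example.

```javascript
// Parse a GTID set such as "uuid1:1-5:8,uuid2:1-100" into
// { uuid: [[start, end], ...] }. Single GTIDs become [n, n].
function parseGtidSet(text) {
  const result = {};
  for (const part of text.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    const [uuid, ...intervals] = trimmed.split(":");
    const key = uuid.toLowerCase();
    result[key] = result[key] || [];
    for (const interval of intervals) {
      const [start, end] = interval.split("-").map(Number);
      result[key].push([start, end === undefined ? start : end]);
    }
  }
  return result;
}

// Keep only the intervals recorded under the view change UUID.
// Returns "" when the set has no entries for that UUID.
function viewChangeGtids(gtidSetText, viewChangeUuid) {
  const key = viewChangeUuid.toLowerCase();
  const intervals = parseGtidSet(gtidSetText)[key] || [];
  if (intervals.length === 0) return "";
  const ranges = intervals.map(([s, e]) => (s === e ? `${s}` : `${s}-${e}`));
  return key + ":" + ranges.join(":");
}

// Count the transactions covered by a GTID set.
function countTransactions(gtidSetText) {
  let total = 0;
  for (const intervals of Object.values(parseGtidSet(gtidSetText))) {
    for (const [s, e] of intervals) total += e - s + 1;
  }
  return total;
}

// Usage with made-up UUIDs: a large write backlog plus four view changes.
// const backlog =
//   "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-250000," +
//   "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb:1-3:7";
// const vc = viewChangeGtids(backlog, "bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb");
// countTransactions(backlog) covers 250004 transactions;
// countTransactions(vc) covers only 4.
```

Under a heavy write load the unfiltered set grows with every transaction, while the filtered set grows only with membership changes, which is consistent with the hang appearing only when the write load was high.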