MySQL Bugs: #108064: forcePrimaryCluster failing, no override

Bug #108064	forcePrimaryCluster failing, no override
Submitted:	3 Aug 2022 15:06	Modified:	22 Aug 2022 18:23
Reporter:	Jay Janssen
Status:	Closed
Category:	Shell AdminAPI InnoDB Cluster / ReplicaSet	Severity:	S1 (Critical)
Version:	8.0.30	OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
I have a clusterset where I've simulated a failure of the primary cluster by killing the mysqld processes.

 MySQL  10.170.1.106:33060+ ssl  JS > cs.status()
{
    "clusters": {
        "jay-test2-east": {
            "clusterErrors": [
                "ERROR: Could not connect to any ONLINE members but there are unreachable instances that could still be ONLINE."
            ],
            "clusterRole": "PRIMARY",
            "clusterSetReplicationStatus": "UNKNOWN",
            "globalStatus": "UNKNOWN",
            "primary": null,
            "status": "UNREACHABLE",
            "statusText": "Could not connect to any ONLINE members"
        },
        "jay-test2-west": {
            "clusterErrors": [
                "WARNING: Replication from the Primary Cluster not in expected state"
            ],
            "clusterRole": "REPLICA",
            "clusterSetReplicationStatus": "ERROR",
            "globalStatus": "NOT_OK",
            "status": "OK",
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure."
        }
    },
    "domainName": "jay-test2-global",
    "globalPrimaryInstance": null,
    "primaryCluster": "jay-test2-east",
    "status": "UNAVAILABLE",
    "statusText": "Primary Cluster is not reachable from the Shell, assuming it to be unavailable."
}

Now when I try to forcePrimaryCluster to the remaining cluster, I get this error:

 MySQL  10.170.1.106:33060+ ssl  JS > cs.forcePrimaryCluster("jay-test2-west")
Failing-over primary cluster of the clusterset to 'jay-test2-west'
* Verifying primary cluster status
None of the instances of the PRIMARY cluster 'jay-test2-east' could be reached.
* Verifying clusterset status
** Checking cluster jay-test2-west
  Cluster 'jay-test2-west' is available
** Checking whether target cluster has the most recent GTID set
NOTE: Cluster jay-test2-west has a more up-to-date GTID set
The following GTIDs are missing from the target cluster:
ERROR: The selected target cluster is not the most up-to-date cluster available for failover.
ClusterSet.forcePrimaryCluster: Target cluster is behind other candidates (MYSQLSH 51311)

In essence, the Shell refuses to promote jay-test2-west because jay-test2-west itself has the more up-to-date GTID set. forcePrimaryCluster has no option to override this check, so now I am stuck.

How to repeat:
I haven't confirmed it's repeatable.

In hindsight, this is S1. I'm attempting to reproduce it.

Hi Jay,

Can you please reproduce the issue with the logging set to debug level and share the relevant log entries?

Either start the Shell with:

$ ./bin/mysqlsh --log-level=8 --dba-log-sql=2

or, if the Shell is already running, set the following options:

shell.options["dba.logSql"]=2
shell.options["logLevel"]=8

Thanks.

I was able to reproduce it by making sure the applier was still working through a backlog of transactions at the time of the failover.

A workaround is to wait for the applier queue to empty before failing over.
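
The workaround could be scripted roughly as below. This is a sketch, not an AdminAPI feature: waitForApplierDrain is a hypothetical helper, the channel name 'clusterset_replication' is an assumption about how the ClusterSet channel is named, and the query only checks that every transaction already received on that channel appears in gtid_executed. The query and sleep functions are passed in so the same logic can run outside the Shell.

```javascript
// Sketch only: waitForApplierDrain is a hypothetical helper, not an AdminAPI call.
// Assumption: the ClusterSet replication channel is named 'clusterset_replication'.
const DRAINED_SQL =
  "SELECT GTID_SUBSET(received_transaction_set, @@global.gtid_executed) " +
  "FROM performance_schema.replication_connection_status " +
  "WHERE channel_name = 'clusterset_replication'";

// queryOne(sql): runs sql on the cluster to be promoted and returns the
//                first column of the first row.
// sleep(seconds): blocks for the given number of seconds.
// Returns true once the backlog is applied, false if timeoutSec elapses first.
function waitForApplierDrain(queryOne, sleep, timeoutSec = 300, pollSec = 1) {
  let waited = 0;
  while (true) {
    // GTID_SUBSET returns 1 when every received GTID is already executed.
    if (Number(queryOne(DRAINED_SQL)) === 1) return true;
    if (waited >= timeoutSec) return false;
    sleep(pollSec);
    waited += pollSec;
  }
}

// Possible use inside mysqlsh (JS mode), with the global session connected
// to the primary of the cluster being promoted:
//   const ok = waitForApplierDrain(
//     sql => session.runSql(sql).fetchOne()[0], os.sleep);
//   if (ok) cs.forcePrimaryCluster("jay-test2-west");
```

If the channel row is missing, the wrapper in the usage comment would throw on fetchOne(), so a real script would need to handle that case.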

Posted by developer:
 
Added the following note to the MySQL Shell 8.0.31 release notes:

Cluster failover could fail under high load because the cluster being promoted was compared with itself in checks to confirm the promoted cluster was the most up-to-date. This comparison failed because the applier was catching up and the GTID_EXECUTED comparison resulted in two different values.

As of this release, the check for most up-to-date cluster does not include the promoted cluster.
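
To illustrate the note above, here is a simplified model of the check. It is not the Shell's actual code: GTID sets are reduced to plain JS Sets of strings, clustersAhead is a hypothetical function, and the assumption that the target's GTID set was read twice, moments apart, is only one plausible reading of "two different values".

```javascript
// Illustration only: GTID sets are simplified to JS Sets of "uuid:n" strings,
// and these functions are hypothetical, not MySQL Shell code.
function missingFrom(targetGtids, otherGtids) {
  return [...otherGtids].filter(g => !targetGtids.has(g));
}

// Returns the names of candidates that hold GTIDs the target lacks.
function clustersAhead(targetName, targetGtids, candidates, excludeTarget) {
  return Object.keys(candidates).filter(name => {
    if (excludeTarget && name === targetName) return false;
    return missingFrom(targetGtids, candidates[name]).length > 0;
  });
}

// Two reads of jay-test2-west taken moments apart while its applier catches up.
const firstRead = new Set(["uuid:1", "uuid:2"]);
const secondRead = new Set(["uuid:1", "uuid:2", "uuid:3"]);
const candidates = { "jay-test2-west": secondRead };

// Target compared with itself: it appears to be behind itself.
const before = clustersAhead("jay-test2-west", firstRead, candidates, false);
// Target excluded from the candidate list: nothing is ahead of it.
const after = clustersAhead("jay-test2-west", firstRead, candidates, true);
console.log(before, after);
```

With the target included, the model reports jay-test2-west as ahead of itself, which matches the error in the report; with the target excluded, no cluster is ahead and the failover could proceed.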

Thanks to Jay Janssen for reporting this issue.