Bug #108064 forcePrimaryCluster failing, no override
Submitted: 3 Aug 2022 15:06 Modified: 22 Aug 2022 18:23
Reporter: Jay Janssen
Status: Closed
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet Severity: S1 (Critical)
Version: 8.0.30 OS: Any
Assigned to: MySQL Verification Team CPU Architecture: Any

[3 Aug 2022 15:06] Jay Janssen
Description:
I have a clusterset where I've simulated a failure of the primary cluster by killing the mysqld processes.

 MySQL  10.170.1.106:33060+ ssl  JS > cs.status()
{
    "clusters": {
        "jay-test2-east": {
            "clusterErrors": [
                "ERROR: Could not connect to any ONLINE members but there are unreachable instances that could still be ONLINE."
            ],
            "clusterRole": "PRIMARY",
            "clusterSetReplicationStatus": "UNKNOWN",
            "globalStatus": "UNKNOWN",
            "primary": null,
            "status": "UNREACHABLE",
            "statusText": "Could not connect to any ONLINE members"
        },
        "jay-test2-west": {
            "clusterErrors": [
                "WARNING: Replication from the Primary Cluster not in expected state"
            ],
            "clusterRole": "REPLICA",
            "clusterSetReplicationStatus": "ERROR",
            "globalStatus": "NOT_OK",
            "status": "OK",
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure."
        }
    },
    "domainName": "jay-test2-global",
    "globalPrimaryInstance": null,
    "primaryCluster": "jay-test2-east",
    "status": "UNAVAILABLE",
    "statusText": "Primary Cluster is not reachable from the Shell, assuming it to be unavailable."
}

Now when I try to run forcePrimaryCluster() against the remaining cluster, I get this error:

 MySQL  10.170.1.106:33060+ ssl  JS > cs.forcePrimaryCluster("jay-test2-west")
Failing-over primary cluster of the clusterset to 'jay-test2-west'
* Verifying primary cluster status
None of the instances of the PRIMARY cluster 'jay-test2-east' could be reached.
* Verifying clusterset status
** Checking cluster jay-test2-west
  Cluster 'jay-test2-west' is available
** Checking whether target cluster has the most recent GTID set
NOTE: Cluster jay-test2-west has a more up-to-date GTID set
The following GTIDs are missing from the target cluster:
ERROR: The selected target cluster is not the most up-to-date cluster available for failover.
ClusterSet.forcePrimaryCluster: Target cluster is behind other candidates (MYSQLSH 51311)

In essence: you can't force jay-test2-west to primary because the check reports that jay-test2-west itself has more transactions. forcePrimaryCluster has no force option to override this, so now I am stuck.

How to repeat:
I haven't confirmed it's repeatable.
[3 Aug 2022 15:22] Jay Janssen
In hindsight, this is S1. I'm attempting to reproduce it.
[3 Aug 2022 16:05] Miguel Araujo
Hi Jay,

Can you please reproduce the issue with the logging set to debug level and share the relevant log entries?

Either start shell with $ ./bin/mysqlsh --log-level=8 --dba-log-sql=2

or, do the following when shell is already running:

shell.options["dba.logSql"]=2
shell.options["logLevel"]=8

Thanks.
[4 Aug 2022 17:44] Alfredo Kojima
I was able to reproduce this by ensuring the applier is still applying a backlog of transactions at the time of the failover.

A workaround is to wait for the applier queue to empty before failing over.
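
One way to check that the applier queue is empty before the failover is sketched below for MySQL Shell's JS mode. This is only a sketch under assumptions, not the Shell's own logic: it assumes the session is connected to the primary member of the cluster being promoted, that the ClusterSet replication channel is named 'clusterset_replication', and that the received GTID set is still reported after the source became unreachable. The helper name applierBacklog is hypothetical.

```javascript
// Hypothetical helper: returns the GTIDs that were received on the ClusterSet
// replication channel but are not yet in gtid_executed on this instance.
// An empty string means there is nothing left for the applier to apply.
// Assumes `session` exposes runSql(), as MySQL Shell sessions do.
function applierBacklog(session) {
  var result = session.runSql(
    "SELECT GTID_SUBTRACT(RECEIVED_TRANSACTION_SET, @@GLOBAL.gtid_executed) " +
    "FROM performance_schema.replication_connection_status " +
    "WHERE CHANNEL_NAME = 'clusterset_replication'"); // assumed channel name
  var row = result.fetchOne();
  // No row (channel not configured) is treated here as "no backlog".
  return row ? String(row[0] || "").trim() : "";
}
```

In the Shell, the failover would then be run only once applierBacklog(session) returns an empty string, for example by re-checking it every few seconds before calling cs.forcePrimaryCluster("jay-test2-west").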
[22 Aug 2022 18:23] Edward Gilmore
Posted by developer:
 
Added the following note to the MySQL Shell 8.0.31 release notes:

Cluster failover could fail under high load because the cluster being promoted was compared with itself in checks to confirm the promoted cluster was the most up-to-date. This comparison failed because the applier was catching up and the GTID_EXECUTED comparison resulted in two different values.

As of this release, the check for most up-to-date cluster does not include the promoted cluster.
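
The effect described in this note can be illustrated with a simplified model. The sketch below is not the Shell's actual code: GTID sets are modeled as plain JS Sets of transaction ids, and the function names and the skipTarget flag are hypothetical. It only shows why reading the same cluster twice while its applier is catching up makes the cluster appear to be behind itself, and why leaving the promoted cluster out of the comparison avoids that.

```javascript
// Simplified model: a GTID set is a Set of transaction ids (strings).
// Returns the ids present in `other` that are missing from `target`.
function missingFrom(target, other) {
  return Array.from(other).filter(function (id) { return !target.has(id); });
}

// Hypothetical check: names of clusters holding transactions the target lacks.
// skipTarget = true models the behavior described for 8.0.31.
function clustersAhead(targetName, targetGtids, snapshots, skipTarget) {
  return snapshots
    .filter(function (s) { return !(skipTarget && s.name === targetName); })
    .filter(function (s) { return missingFrom(targetGtids, s.gtids).length > 0; })
    .map(function (s) { return s.name; });
}

// The target is read once, then read again as a "candidate" a moment later,
// after its applier has advanced by one transaction.
var first = new Set(["u:1", "u:2"]);
var later = new Set(["u:1", "u:2", "u:3"]);
var snapshots = [{ name: "jay-test2-west", gtids: later }];

// Without skipping, the target is flagged as behind its own later snapshot.
console.log(clustersAhead("jay-test2-west", first, snapshots, false));
// With the target excluded, no other cluster is ahead and the check passes.
console.log(clustersAhead("jay-test2-west", first, snapshots, true));
```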

Thanks to Jay Janssen for reporting this issue.