Description:
We invoke 'dba.get_replica_set().set_primary_instance()' to switch the primary instance and get the following error log:
---
*.*.*.199:13050 will be promoted to PRIMARY of 'set_2041992056301'.
The current PRIMARY is *.*.*.204:1****.
* Connecting to replicaset instances
** Connecting to *.*.*.204:1****
** Connecting to *.*.*.199:1****
** Connecting to *.*.*.205:1****
** Connecting to *.*.*.204:1****
** Connecting to *.*.*.199:1****
** Connecting to *.*.*.205:1****
* Performing validation checks
** Checking async replication topology...
** Checking transaction state of the instance...
* Synchronizing transaction backlog at *.*.*.199:6301
* Updating metadata
* Acquiring locks in replicaset instances
** Pre-synchronizing SECONDARIES
** Acquiring global lock at PRIMARY
** Acquiring global lock at SECONDARIES
ERROR: *.*.*.199:6301: GTID sync failed: MYSQLSH 51157: Timeout waiting for replica to synchronize
ERROR: An error occurred while preparing replicaset instances for a PRIMARY switch: Timeout waiting for replica to synchronize
NOTE: Reverting metadata changes
ERROR: *.*.*.204:6301: Error updating cluster metadata: MySQL Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
WARNING: Failed to revert metadata changes on the PRIMARY: Metadata cannot be updated: MySQL Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
Traceback (most recent call last):
File "<string>", line 1, in <module>
SystemError: MYSQLSH (51157): ReplicaSet.set_primary_instance: Timeout waiting for replica to synchronize
err:exit status 1
---
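For reference, the traceback indicates the call was issued from MySQL Shell's Python mode; a minimal equivalent is sketched below (account and addresses are placeholders):
---
# Minimal sketch of the failing call (account/addresses are illustrative).
shell.connect("admin@*.*.*.204:6301")
rs = dba.get_replica_set()
rs.set_primary_instance("*.*.*.199:6301")  # fails with MYSQLSH 51157 under load
---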
We then invoke 'dba.get_replica_set().status({"extended":2})' and get the information below:
---
{
"metadataVersion": "2.0.0",
"replicaSet": {
"name": "set_2041992056301",
"primary": "*.*.*.199:****",
"status": "UNAVAILABLE",
"statusText": "PRIMARY instance is not available, but there is at least one SECONDARY that could be force-promoted.",
"topology": {
"*.*.*.199:****": {
"address": "*.*.*.199:6301",
"fenceSysVars": [],
"fenced": false,
"instanceErrors": ["ERROR: Instance is a PRIMARY but is replicating from another instance."],
"instanceRole": "PRIMARY",
"mode": "R/W",
"replication": {
"applierQueuedTransactionSet": "832a2224-72ee-11ed-8368-fa163ee19b6f:143091-143954",
"applierQueuedTransactionSetSize": 864,
"applierState": "ON",
"applierStatus": "APPLYING",
"applierThreadState": "",
"applierWorkerThreads": 16,
"coordinatorState": "ON",
"options": {
"connectRetry": 60,
"delay": 0,
"heartbeatPeriod": 30,
"retryCount": 86400
},
"receiverStatus": "ON",
"receiverThreadState": "",
"receiverTimeSinceLastMessage": "00:00:00.201497",
"replicationLag": "00:00:53.105207",
"source": "*.*.*.204:****"
},
"serverUuid": "8dcef8b2-72ee-11ed-baaa-fa163e4f622d",
"status": "ERROR"
},
"*.*.*.204:****": {
"address": "*.*.*.204:6301",
"fenceSysVars": ["read_only", "super_read_only"],
"fenced": true,
"instanceErrors": ["ERROR: 1 errant transaction(s) detected. Topology changes will not be possible until the instance is removed from the replicaset to have the inconsistency repaired.", "ERROR: Replication source channel is not configured, should be *.*.*.199:****."],
"instanceRole": "SECONDARY",
"mode": "R/O",
"replication": {
"applierStatus": null,
"expectedSource": "*.*.*.199:****",
"receiverStatus": null
},
"serverUuid": "832a2224-72ee-11ed-8368-fa163ee19b6f",
"status": "ERROR",
"transactionSetConsistencyStatus": "INCONSISTENT",
"transactionSetConsistencyStatusText": "There are 864 transactions that were executed in this instance that did not originate from the PRIMARY.",
"transactionSetErrantGtidSet": "832a2224-72ee-11ed-8368-fa163ee19b6f:143091-143954"
},
"*.*.*.205:****": {
"address": "*.*.*.205:6301",
"fenceSysVars": ["read_only", "super_read_only"],
"fenced": true,
"instanceErrors": ["ERROR: 1 errant transaction(s) detected. Topology changes will not be possible until the instance is removed from the replicaset to have the inconsistency repaired.", "ERROR: Replication source misconfigured. Expected *.*.*.199:**** but is *.*.*.204:6301."],
"instanceRole": "SECONDARY",
"mode": "R/O",
"replication": {
"applierQueuedTransactionSet": "",
"applierQueuedTransactionSetSize": 0,
"applierState": "ON",
"applierStatus": "APPLIED_ALL",
"applierThreadState": "",
"applierWorkerThreads": 16,
"coordinatorState": "ON",
"expectedSource": "*.*.*.199:****",
"options": {
"connectRetry": 60,
"delay": 0,
"heartbeatPeriod": 30,
"retryCount": 86400
},
"receiverStatus": "ON",
"receiverThreadState": "",
"receiverTimeSinceLastMessage": "00:00:00.211088",
"replicationLag": null,
"source": "*.*.*.204:****"
},
"serverUuid": "82496ed7-72ee-11ed-813c-fa163e7ab701",
"status": "ERROR",
"transactionSetConsistencyStatus": "INCONSISTENT",
"transactionSetConsistencyStatusText": "There are 864 transactions that were executed in this instance that did not originate from the PRIMARY.",
"transactionSetErrantGtidSet": "832a2224-72ee-11ed-8368-fa163ee19b6f:143091-143954"
}
},
"type": "ASYNC"
}
}
---
It looks like the metadata incorrectly shows *.*.*.199 as the primary, while *.*.*.204 is the actual one.
The message:
---
NOTE: Reverting metadata changes
ERROR: *.*.*.204:6301: Error updating cluster metadata: MySQL Error 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
---
indicates that the switch procedure failed to revert the metadata changes. Moreover, the failure occurred after "** Acquiring global lock at SECONDARIES" was printed.
We then inspected the code and found that the cause is that super_read_only is not reverted to 0 on the original PRIMARY before the metadata changes are rolled back.
---
void Global_locks::sync_and_lock_all(
const std::list<std::shared_ptr<Instance>> &instances,
const std::string &master_uuid, uint32_t gtid_sync_timeout, bool dry_run) {
...
// Set SRO so that any attempts to do new writes at the PRIMARY error out
// instead of blocking.
if (!dry_run) m_master->execute("SET global super_read_only=1");
if (!dry_run) m_master->execute("FLUSH BINARY LOGS");
// 3 - synchronize and lock all slaves (in parallel)
// Sync slaves to master once again and acquire lock
console->print_info("** Acquiring global lock at SECONDARIES");
// BUG#30344595: switching primary fails under client load
// After locking the master and setting it to read-only, we must get the
// updated (final) GTID_EXECUTED set, since new transactions might have been
// committed during the pre-sync step, otherwise some trx might fail to be
// applied on the secondaries (lost).
master_gtid_set = mysqlshdk::mysql::get_executed_gtid_set(*m_master);
for (const auto &inst : instances) {
if (m_master->get_uuid() != inst->get_uuid()) {
try {
wait_pending_master_transactions(master_gtid_set, inst.get(),
gtid_sync_timeout);
} catch (const shcore::Exception &e) {
console->print_error(shcore::str_format("%s: GTID sync failed: %s",
inst->descr().c_str(),
e.format().c_str()));
throw;
}
...
}
}
...
}
---
How to repeat:
It can happen when the write load on the PRIMARY is high enough that the GTID synchronization during the switch times out.
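A rough sketch, in MySQL Shell's Python mode, of one way to drive such a load while requesting the switch (schema, account, and addresses are placeholders):
---
# Run in a second shell to generate sustained write load on the current PRIMARY.
load = shell.open_session("app@*.*.*.204:6301")
load.run_sql("CREATE SCHEMA IF NOT EXISTS repro")
load.run_sql("CREATE TABLE IF NOT EXISTS repro.t"
             " (id INT AUTO_INCREMENT PRIMARY KEY, payload TEXT)")
for _ in range(200000):
    load.run_sql("INSERT INTO repro.t (payload) VALUES (REPEAT('x', 4096))")
# While this loop runs, issue set_primary_instance() from another shell (as
# shown above); with enough backlog the GTID sync under the global lock times
# out and the metadata revert then hits MySQL error 1290.
---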
Suggested fix:
Revert super_read_only to 0 on the original PRIMARY before reverting the metadata changes.
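In other words, the revert path should undo the fencing applied in sync_and_lock_all() before it rewrites the metadata on the old PRIMARY. For illustration only, the manual equivalent of that missing step is (account and address are placeholders):
---
# Illustration of the missing step: clear super_read_only on the old PRIMARY
# so the "Reverting metadata changes" step no longer fails with error 1290.
old_primary = shell.open_session("admin@*.*.*.204:6301")
old_primary.run_sql("SET GLOBAL super_read_only = 0")
---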