Description:
In the Group Replication multi-primary architecture, a rule of state machine replication (SMR) is violated: CT_CERTIFICATION_MESSAGE messages are not placed into the applier queue, so their processing order relative to transactions is not uniform across nodes. Below is the function handle_certifier_message, which calls handle_certifier_data directly and therefore processes CT_CERTIFICATION_MESSAGE messages prematurely.
void Plugin_gcs_events_handler::handle_certifier_message(
    const Gcs_message &message) const {
  if (this->applier_module == nullptr) {
    LogPluginErr(ERROR_LEVEL,
                 ER_GRP_RPL_MISSING_GRP_RPL_APPLIER); /* purecov: inspected */
    return;                                           /* purecov: inspected */
  }

  Certifier_interface *certifier =
      this->applier_module->get_certification_handler()->get_certifier();

  const unsigned char *payload_data = nullptr;
  size_t payload_size = 0;
  Plugin_gcs_message::get_first_payload_item_raw_data(
      message.get_message_data().get_payload(), &payload_data, &payload_size);

  if (certifier->handle_certifier_data(payload_data,
                                       static_cast<ulong>(payload_size),
                                       message.get_origin())) {
    LogPluginErr(
        ERROR_LEVEL,
        ER_GRP_RPL_CERTIFIER_MSSG_PROCESS_ERROR); /* purecov: inspected */
  }
}
Handling CT_CERTIFICATION_MESSAGE messages prematurely can cause the certification database, on which each node's optimistic concurrency control (OCC) relies, to differ between nodes. Nodes may then reach different certification decisions for the same transaction, which can eventually lead to data inconsistency. The problem is not easy to detect, but it is relatively straightforward to reproduce under specific conditions.
How to repeat:
To reproduce: in a Group Replication multi-primary setup, distribute the write load evenly across all MySQL nodes using a load balancer (such as LVS). Given sufficient write conflicts, inconsistencies in the final state of the replicated state machine can be reproduced.
Suggested fix:
Based on extensive testing, placing certification messages into the applier queue, so that they are processed in the same order as transactions on every node, can eliminate the data inconsistency problem described above.
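
The ordering issue and the suggested fix can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions, not the actual MySQL certifier: the Event and Replica types, the run function, and the simplified conflict rule (a transaction fails certification if the certification database holds a newer version of the row than the one it read) are all hypothetical. The model assumes that transactions are certified when the applier drains its queue, while a certification message purges old entries from the certification database.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Toy model only: every type and function here is hypothetical and is not
// part of the MySQL Group Replication code base.
struct Event {
  enum Kind { TRANSACTION, CERTIFICATION_MESSAGE };
  Kind kind;
  std::string key;   // TRANSACTION: the row being written
  int64_t snapshot;  // TRANSACTION: version the transaction read
  int64_t stable;    // CERTIFICATION_MESSAGE: purge versions <= stable
};

class Replica {
 public:
  explicit Replica(bool queue_cert_messages)
      : queue_cert_messages_(queue_cert_messages) {}

  // Called by the group communication layer in total delivery order.
  void deliver(const Event &e) {
    if (e.kind == Event::CERTIFICATION_MESSAGE && !queue_cert_messages_) {
      purge(e.stable);  // out-of-band: bypasses the applier queue
      return;
    }
    queue_.push_back(e);
  }

  // Applier: processes everything currently queued, in queue order.
  void drain() {
    while (!queue_.empty()) {
      Event e = queue_.front();
      queue_.pop_front();
      if (e.kind == Event::CERTIFICATION_MESSAGE) {
        purge(e.stable);
      } else {
        certify(e);
      }
    }
  }

  const std::vector<bool> &outcomes() const { return outcomes_; }

 private:
  // Simplified conflict rule: pass if no newer version of the row is known.
  void certify(const Event &e) {
    auto it = cert_db_.find(e.key);
    bool ok = it == cert_db_.end() || it->second <= e.snapshot;
    if (ok) cert_db_[e.key] = ++version_;
    outcomes_.push_back(ok);
  }

  // Remove certification entries whose version is <= stable.
  void purge(int64_t stable) {
    for (auto it = cert_db_.begin(); it != cert_db_.end();) {
      if (it->second <= stable) {
        it = cert_db_.erase(it);
      } else {
        ++it;
      }
    }
  }

  bool queue_cert_messages_;
  std::deque<Event> queue_;
  std::map<std::string, int64_t> cert_db_;
  std::vector<bool> outcomes_;
  int64_t version_ = 0;
};

// Feeds the same totally ordered stream to one replica and returns the
// certification outcome (true = committed) of each transaction.
std::vector<bool> run(bool queue_cert_messages, bool applier_lags) {
  const std::vector<Event> stream = {
      {Event::TRANSACTION, "a", 0, 0},
      {Event::CERTIFICATION_MESSAGE, "", 0, 1},
      {Event::TRANSACTION, "a", 0, 0},
  };
  Replica r(queue_cert_messages);
  for (const Event &e : stream) {
    r.deliver(e);
    if (!applier_lags) r.drain();  // a fast applier keeps up with delivery
  }
  r.drain();
  return r.outcomes();
}
```

In this model, run(false, false) and run(false, true) return different outcomes for the second transaction: when the certification message bypasses the queue, the result depends on how far the applier lags behind delivery, so two nodes receiving the same message stream can disagree. With run(true, ...), the certification message is ordered with the transactions in the queue, and a fast and a lagging applier reach the same decisions.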