Description:
Our production environment runs an MGR (MySQL Group Replication) cluster. After we used kill -9 to terminate the primary node, the primary role automatically failed over to a secondary node. We then found that the data on the new primary node was several days behind the original primary node, so production data was lost.
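One way to quantify how far behind each member was (a diagnostic sketch added for illustration, not part of the original incident data; the angle-bracket placeholders must be filled in with the actual GTID sets) is to compare executed GTID sets across members:

```sql
-- Run on each member: the executed GTID set shows what has been applied locally.
SELECT @@GLOBAL.gtid_executed;

-- Then, on any node, list the transactions the lagging member is missing
-- by subtracting its set from the original primary's set.
SELECT GTID_SUBTRACT('<gtid_executed of original primary>',
                     '<gtid_executed of new primary>') AS missing_gtids;
```

A non-empty missing_gtids result on the newly elected primary would confirm that transactions applied on the old primary were never applied there.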
On secondary node 1 (the new primary node), the multi-threaded slave statistics log stopped after 2023-02-05T07:09:20:
2023-02-05T07:04:56.989601+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 130; events assigned = 13022989313; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:06:56.064381+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 120; events assigned = 13022995457; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:09:05.478416+08:00 147114699 [Note] [MY-010914] [Server] Aborted connection 147114699 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:09:20.008315+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 144; events assigned = 13023003649; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:19:05.519661+08:00 147118336 [Note] [MY-010914] [Server] Aborted connection 147118336 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:29:05.399290+08:00 147121974 [Note] [MY-010914] [Server] Aborted connection 147121974 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:39:05.494746+08:00 147125606 [Note] [MY-010914] [Server] Aborted connection 147125606 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:49:05.494036+08:00 147129245 [Note] [MY-010914] [Server] Aborted connection 147129245 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
On secondary node 2, the multi-threaded slave statistics log continued normally after 2023-02-05T07:09:20:
2023-02-05T07:09:22.756199+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 139; events assigned = 11292869633; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972107467000 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:11:31.496568+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 129; events assigned = 11292878849; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972162627700 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:13:35.272902+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 124; events assigned = 11292886017; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972187508300 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:15:41.971013+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 126; events assigned = 11292895233; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972216145500 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:17:50.761959+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 129; events assigned = 11292901377; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972216145500 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
Once secondary node 1 stopped producing multi-threaded slave statistics output, its applier must have been in an abnormal state, and the node should not have been eligible for election as the new primary.
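Until the root cause is understood, the applier backlog and worker state can be watched from performance_schema (a monitoring sketch only; the tables and columns are standard MySQL 8.0, but thresholds and alerting are up to the operator):

```sql
-- Per-member count of certified transactions not yet applied locally;
-- a value that keeps growing on one member suggests its applier has stalled.
SELECT MEMBER_ID, COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
FROM performance_schema.replication_group_member_stats;

-- Applier worker threads for the group_replication_applier channel;
-- SERVICE_STATE should be ON, and LAST_APPLIED_TRANSACTION should advance.
SELECT WORKER_ID, SERVICE_STATE, LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker
WHERE CHANNEL_NAME = 'group_replication_applier';
```

Checking these on every member before and during a failover would show whether a candidate primary is still applying the group's backlog.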
I think we encountered a bug that we haven't seen before.
How to repeat:
.