Bug #110219 MGR Cluster chose a wrong primary node
Submitted: 27 Feb 2023 6:58 Modified: 27 Mar 2023 17:20
Reporter: LeYuan Zhong (OCA) Email Updates:
Status: No Feedback Impact on me: None
Category: MySQL Server Severity: S3 (Non-critical)
Version: 8.0.26 OS: Any
Assigned to: Assigned Account CPU Architecture: Any
Tags: mgr

[27 Feb 2023 6:58] LeYuan Zhong
Description:
Our production environment runs an MGR cluster. After we used kill -9 to shut down the primary node, the group automatically promoted a secondary node to primary. We then found that the data on the new primary node was several days behind the original primary, so production data was lost.
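
For reference, the gap between members can be checked with queries such as the following (a sketch only, run on each member; performance_schema table and column names as in MySQL 8.0):

-- Who is in the group and what role does each member hold?
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;

-- Transactions actually committed on this member versus those received from the group.
SELECT @@GLOBAL.gtid_executed AS executed_on_this_member;
SELECT RECEIVED_TRANSACTION_SET
  FROM performance_schema.replication_connection_status
 WHERE CHANNEL_NAME = 'group_replication_applier';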

On secondary node 1 (the new primary node), the multi-threaded slave statistics log stopped appearing after 2023-02-05T07:09:20:

2023-02-05T07:04:56.989601+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 130; events assigned = 13022989313; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:06:56.064381+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 120; events assigned = 13022995457; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:09:05.478416+08:00 147114699 [Note] [MY-010914] [Server] Aborted connection 147114699 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:09:20.008315+08:00 350 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 144; events assigned = 13023003649; worker queues filled over overrun level = 73007; waited due a Worker queue full = 3995; waited due the total size = 0; waited at clock conflicts = 162975357023200 waited (count) when Workers occupied = 53123496 waited when Workers occupied = 29743815931000
2023-02-05T07:19:05.519661+08:00 147118336 [Note] [MY-010914] [Server] Aborted connection 147118336 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:29:05.399290+08:00 147121974 [Note] [MY-010914] [Server] Aborted connection 147121974 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:39:05.494746+08:00 147125606 [Note] [MY-010914] [Server] Aborted connection 147125606 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).
2023-02-05T07:49:05.494036+08:00 147129245 [Note] [MY-010914] [Server] Aborted connection 147129245 to db: 'unconnected' user: 'dbmon' host: '10.0.30.24' (Got an error reading communication packets).

On secondary node 2, the multi-threaded slave statistics log continued normally after 2023-02-05T07:09:20:

2023-02-05T07:09:22.756199+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 139; events assigned = 11292869633; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972107467000 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:11:31.496568+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 129; events assigned = 11292878849; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972162627700 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:13:35.272902+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 124; events assigned = 11292886017; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972187508300 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:15:41.971013+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 126; events assigned = 11292895233; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972216145500 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500
2023-02-05T07:17:50.761959+08:00 296 [Note] [MY-010559] [Repl] Multi-threaded slave statistics for channel 'group_replication_applier': seconds elapsed = 129; events assigned = 11292901377; worker queues filled over overrun level = 797276; waited due a Worker queue full = 7116; waited due the total size = 0; waited at clock conflicts = 184972216145500 waited (count) when Workers occupied = 50391879 waited when Workers occupied = 49440118449500

Once secondary node 1 stopped producing multi-threaded slave statistics log output, it must have been in an abnormal state and should not have been selected as the new primary node.
I think we have encountered a bug that we have not seen before.
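
A sketch only: if the applier backlog of each member were inspected at election time, the abnormal member would stand out (column names assumed as in MySQL 8.0 performance_schema):

-- A large or stalled remote-applier queue on a member means it has not yet
-- applied transactions the group already certified; such a member should
-- not be promoted to primary.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_APPLIED
  FROM performance_schema.replication_group_member_stats;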

How to repeat:
.
[27 Feb 2023 17:20] MySQL Verification Team
Hi,

I cannot reproduce this. Can you give us more details about your setup: how many machines, their configuration, and how everything was set up?

Thanks
[28 Mar 2023 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".