Bug #105576 | exitStateAction not called after autoRejoinTries has finished | ||
---|---|---|---|
Submitted: | 15 Nov 2021 13:06 | Modified: | 23 Nov 2021 16:02 |
Reporter: | IGG t | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | Shell AdminAPI InnoDB Cluster / ReplicaSet | Severity: | S2 (Serious) |
Version: | 8.0.27 | OS: | Any |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[15 Nov 2021 13:06]
IGG t
[17 Nov 2021 11:08]
MySQL Verification Team
Hi, I am not able to reproduce this. The use of "ifdown" may be the problem: it can interfere with failure detection, because it is not a failure; you asked for the interface to go down. If you want to test a network outage, don't use ifdown; instead, disconnect the network from the VM on the host system, so that the VM sees it as a disconnected cable. Testing this with 8.0.27, the node without network properly goes to offline mode and drops all connections from users without the CONNECTION_ADMIN privilege. As for the behavior, you can connect to an offline MySQL server depending on your user privileges (CONNECTION_ADMIN or SUPER). Kind regards
[19 Nov 2021 13:36]
IGG t
The `ifdown` was simply a way to simulate the loss of connection to the cluster. I assumed that if n3 could no longer see n1 and n2 for whatever reason (maybe the interface on the server has failed or been shut down by mistake, maybe a cable has been unplugged, etc.) then it would activate the exitStateAction. But this doesn't seem to be the case. It just seems to sit there in read_only mode.

If I connect to n3 on the hypervisor, what I am seeing is this:

Normal operation:

mysql> SELECT now(), member_host, member_state, @@global.super_read_only, @@global.offline_mode FROM performance_schema.replication_group_members order by member_host;
+---------------------+----------------+--------------+--------------------------+-----------------------+
| now()               | member_host    | member_state | @@global.super_read_only | @@global.offline_mode |
+---------------------+----------------+--------------+--------------------------+-----------------------+
| 2021-11-19 12:19:15 | n1             | ONLINE       |                        1 |                     0 |
| 2021-11-19 12:19:15 | n2             | ONLINE       |                        1 |                     0 |
| 2021-11-19 12:19:15 | n3             | ONLINE       |                        1 |                     0 |
+---------------------+----------------+--------------+--------------------------+-----------------------+
3 rows in set (0.00 sec)

Remove the network interface from the VM (completely deleted it).
After the 5 seconds have passed:

mysql> SELECT now(), member_host, member_state, @@global.super_read_only, @@global.offline_mode FROM performance_schema.replication_group_members order by member_host;
+---------------------+----------------+--------------+--------------------------+-----------------------+
| now()               | member_host    | member_state | @@global.super_read_only | @@global.offline_mode |
+---------------------+----------------+--------------+--------------------------+-----------------------+
| 2021-11-19 12:19:46 | n1             | UNREACHABLE  |                        1 |                     0 |
| 2021-11-19 12:19:46 | n2             | UNREACHABLE  |                        1 |                     0 |
| 2021-11-19 12:19:46 | n3             | ONLINE       |                        1 |                     0 |
+---------------------+----------------+--------------+--------------------------+-----------------------+
3 rows in set (0.00 sec)

60 seconds later it still looks the same, with offline_mode set to 0. 1 hour later, still no change. I can still connect to MySQL with a non-admin user and read the stale data.

It is only "after" I re-create the network interfaces that it suddenly goes into offline_mode = 1.

But in a real world scenario, this could happen overnight (for example), and the server is then sat in read_only mode, serving up stale data, for many hours before we fix the problem. This doesn't seem right to me.
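The output above (n3 ONLINE, n1 and n2 UNREACHABLE, no further change) is consistent with a member that has lost contact with the majority but has not yet left the group. As a minimal sketch, the Group Replication variables that govern this could be inspected on n3 as shown below. Note that `group_replication_unreachable_majority_timeout` is not mentioned anywhere in this thread; it is included here as an assumption about what may be relevant, because its documented default of 0 means a member in a minority waits indefinitely. The value 30 is purely illustrative, and the thread does not confirm whether this setting changes the reported behaviour.

```sql
-- Run on n3. These are documented Group Replication system variables;
-- the thread does not show their values on the reporter's cluster.
SELECT @@global.group_replication_exit_state_action            AS exit_state_action,
       @@global.group_replication_autorejoin_tries             AS autorejoin_tries,
       @@global.group_replication_unreachable_majority_timeout AS unreachable_majority_timeout;

-- Assumption: 30 seconds is an arbitrary example value, not a recommendation.
-- With a non-zero value, a member that cannot reach a majority for that long
-- leaves the group instead of waiting forever, which is the point at which
-- auto-rejoin attempts and then the exit state action can come into play.
SET PERSIST group_replication_unreachable_majority_timeout = 30;
```

If the first query returns 0 for the timeout, that would at least match the "1 hour later, still no change" observation.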
[23 Nov 2021 16:02]
IGG t
On top of that, if I set "autoRejoinTries" = 0, then after I have re-added the network interfaces and n1 and n2 can see n3 again, it reports its state as:

"instanceErrors": [ "ERROR: group_replication has stopped with an error." ],
"status": (Missing)

At this point it finally goes into offline_mode (having been serving up stale data to anyone connecting directly to the database for several hours).

So I run:

STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

and n3 rejoins the cluster.

If I look at my routers, they can now see all three nodes and resume trying to send data to n3. Except that n3 is now in OFFLINE_MODE. So in this case, with 2 RO nodes, 50% of my (non-admin) connections are failing due to "ERROR 3032 (HY000): The server is currently in offline mode"
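For the last step, where n3 has rejoined but still rejects non-admin connections, a manual workaround could be sketched as follows. This assumes n3 shows as ONLINE again and that the session user has sufficient administrative privileges to change global system variables; it is not confirmed in this thread as the intended procedure.

```sql
-- Run on n3 after START GROUP_REPLICATION has succeeded.
-- Confirm that the member is back in the group before reopening it to clients.
SELECT member_host, member_state
  FROM performance_schema.replication_group_members
 ORDER BY member_host;

-- Check whether the exit state action left the server in offline mode.
SELECT @@global.offline_mode;

-- Manual step (assumption: nothing else resets this after the rejoin):
-- allow non-admin connections again.
SET GLOBAL offline_mode = OFF;
```

Clearing `offline_mode` before n3 is ONLINE would expose clients to the same stale reads described earlier, so the first query is worth checking before the final statement.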