Bug #105576 exitStateAction not called after autoRejoinTries has finished
Submitted: 15 Nov 2021 13:06  Modified: 23 Nov 2021 16:02
Reporter: IGG t
Status: Can't repeat
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet  Severity: S2 (Serious)
Version: 8.0.27  OS: Any
Assigned to: MySQL Verification Team  CPU Architecture: Any

[15 Nov 2021 13:06] IGG t
Description:
I have a three-node cluster (n1, n2, n3).
It is set up with the following options on each node:

 - "autoRejoinTries", 1
 - "exitStateAction", "OFFLINE_MODE"
 - "group_replication_member_expel_timeout" = 30

n1 is the Primary.

My understanding from the documentation is that, should a node (n3) disappear (e.g. the network between n3 and the other nodes goes down), the following should happen:

1. 5 seconds pass.
2. A 'suspicion' is raised, and the missing node is marked as "unreachable" by n1 and n2.
3. 30 seconds pass (group_replication_member_expel_timeout).
4. n3 is expelled from the cluster by n1 and n2.
5. n3 tries to re-join the cluster once (autoRejoinTries) and fails.
6. exitStateAction is called by n3.
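
To watch this play out, the same sort of query can be run on the healthy side (n1/n2), where n3 should first show as UNREACHABLE and then drop out of the member list once it is expelled:

SELECT now(), member_host, member_state
FROM performance_schema.replication_group_members
ORDER BY member_host;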

However, in my testing n3 never goes into offline_mode, no matter how long I leave it, meaning clients are free to connect and read increasingly stale data.

Am I misunderstanding this? It seems bad practice to leave the node in a state where it could serve up stale data for hours after the network has broken, until someone notices and manually sets the node to offline mode or shuts it down.

How to repeat:
Set up a three-node cluster.

Confirm all nodes are connected.

On n3, turn off the network interface:

ifdown eth0

Connect to n3 by some other means (mine is a VM, so connect via the hypervisor).

Run:
mysqlsh

\sql SELECT member_host, member_state, @@global.super_read_only, @@global.offline_mode FROM performance_schema.replication_group_members;

After 30 seconds I would expect @@global.offline_mode to become 1.

Suggested fix:
After group_replication_member_expel_timeout has elapsed and autoRejoinTries has been exhausted, exitStateAction should be applied to the node(s) that no longer have quorum.
[17 Nov 2021 11:08] MySQL Verification Team
Hi,

I am not able to reproduce this. The "ifdown" might be the issue: it can interfere with failure detection, because taking the interface down on request is not a failure. If you want to test a network outage, don't use ifdown; disconnect the network from the VM on the host system instead, so that the VM sees it as a disconnected cable. Testing this with 8.0.27, the node without network properly goes into offline mode and drops all connections from users without the CONNECTION_ADMIN privilege.

As for the behavior, you can still connect to a MySQL server in offline mode if your user has the CONNECTION_ADMIN privilege (or SUPER).
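
For example, with a pair of test accounts (hypothetical names) you can see the difference once offline_mode is ON:

CREATE USER 'app_user'@'%' IDENTIFIED BY 'app_pass';  -- no admin privileges
CREATE USER 'dba_user'@'%' IDENTIFIED BY 'dba_pass';
GRANT CONNECTION_ADMIN ON *.* TO 'dba_user'@'%';
-- new connections from 'app_user' are refused with ERROR 3032 (HY000),
-- while 'dba_user' can still connect and administer the server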

kind regards
[19 Nov 2021 13:36] IGG t
The `ifdown` was simply a way to simulate the loss of connection to the cluster. 

I assumed that if n3 could no longer see n1, n2 for whatever reason (maybe the interface on the server has failed or been shut down by mistake, maybe a cable has been unplugged, etc.), then it would activate the exitStateAction. But this doesn't seem to be the case. It just seems to sit there in read_only mode.

If I connect to n3 on the hypervisor, what I see is this:

Normal operation:

mysql> SELECT now(), member_host, member_state, @@global.super_read_only, @@global.offline_mode FROM performance_schema.replication_group_members order by member_host;
+---------------------+----------------+--------------+--------------------------+-----------------------+
| now()               | member_host    | member_state | @@global.super_read_only | @@global.offline_mode |
+---------------------+----------------+--------------+--------------------------+-----------------------+
| 2021-11-19 12:19:15 | n1             | ONLINE       |                        1 |                     0 |
| 2021-11-19 12:19:15 | n2             | ONLINE       |                        1 |                     0 |
| 2021-11-19 12:19:15 | n3             | ONLINE       |                        1 |                     0 |
+---------------------+----------------+--------------+--------------------------+-----------------------+
3 rows in set (0.00 sec)

Then I remove the network interface from the VM (completely deleting it).

After the 5 seconds have passed:

mysql> SELECT now(), member_host, member_state, @@global.super_read_only, @@global.offline_mode FROM performance_schema.replication_group_members order by member_host;
+---------------------+----------------+--------------+--------------------------+-----------------------+
| now()               | member_host    | member_state | @@global.super_read_only | @@global.offline_mode |
+---------------------+----------------+--------------+--------------------------+-----------------------+
| 2021-11-19 12:19:46 | n1             | UNREACHABLE  |                        1 |                     0 |
| 2021-11-19 12:19:46 | n2             | UNREACHABLE  |                        1 |                     0 |
| 2021-11-19 12:19:46 | n3             | ONLINE       |                        1 |                     0 |
+---------------------+----------------+--------------+--------------------------+-----------------------+
3 rows in set (0.00 sec)

60 seconds later it still looks the same, with offline_mode set to 0.
1 hour later, still no change. I can still connect to MySQL with a non-admin user and read the stale data.

It is only "after" I re-create the network interfaces that it suddenly goes into offline_mode = 1.

But in a real-world scenario this could happen overnight, for example, and the server would then sit in read_only mode, serving up stale data, for many hours before we fix the problem. This doesn't seem right to me.
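
For now the only workaround I can see is an external check that asks the stuck node whether it still sees a majority, something like this sketch (my own monitoring query, not anything from the AdminAPI):

SELECT IF(SUM(member_state = 'ONLINE') > COUNT(*) / 2,
          'HAS_QUORUM', 'NO_QUORUM') AS quorum_status
FROM performance_schema.replication_group_members;

On n3 above this returns NO_QUORUM (only 1 of the 3 listed members is ONLINE), yet the server keeps accepting reads regardless.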
[23 Nov 2021 16:02] IGG t
On top of that:

If I set "autoRejoinTries" = 0, then after I have re-added the network interfaces and n1, n2 can see n3 again, it reports its state as:

"instanceErrors": [
                    "ERROR: group_replication has stopped with an error."
                ],
"status": (Missing)

At this point it finally goes into offline_mode (having been serving up stale data to anyone connecting directly to the database for several hours).

So I run:

STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

and n3 rejoins the cluster.

If I look at my routers, they can now see all three nodes and resume trying to send data to n3. Except that n3 is now in OFFLINE_MODE.

So in this case, with 2 RO nodes, 50% of my (non-admin) connections fail with "ERROR 3032 (HY000): The server is currently in offline mode".
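
As far as I can tell, the flag then has to be cleared manually on n3 before those connections succeed again:

SET GLOBAL offline_mode = OFF;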