MySQL Bugs: #102515: Group replication member remains ONLINE even when it gets (far) behind

Bug #102515	Group replication member remains ONLINE even when it gets (far) behind
Submitted:	7 Feb 2021 11:03	Modified:	9 Feb 2021 16:03
Reporter:	Matthias Crauwels (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.23	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
We've had an issue with a 3 node InnoDB Group Replication cluster were we had one node reaching the filesystem limit for the binary log volume (no space left on device). This node remained "ONLINE" in the cluster.

Subsequently we had the other SECONDARY node experience a mysql failure it became unreachable and was removed from the group. 

At this time the only "active" server in the group was the PRIMARY. I would assume at this time for it to loose quorum and stop accepting writes as the writes are no longer distributed to any other member.

Additionally, during this time the binary logs got rotated and purged to a situation where the PRIMARY no longer had the binary logs that node2 would require for incremental recovery, but because node1 was still considered "ONLINE" instead of doing a re-clone, node2 connected to node1 (which is not applying any transactions and thus not generating any binary logs) for an incremental recovery causing this node2 to be stuck in RECOVERING status.

During all this time, writes were still happening on the PRIMARY which had no way of (reliably) getting its commits certified. 

How to repeat:
Start of with a 3 node healhty cluster.

```
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 14c7f5b3-67d2-11eb-9e96-08002799f743 | node0       |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | 89170bb4-67d3-11eb-91eb-08002799f743 | node2       |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | fffa336e-67d2-11eb-a791-08002799f743 | node1       |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
```

Have the partition holding binary logs run full

```
node1:~ # dd if=/dev/zero of=/mysql/bin/junk
dd: writing to '/mysql/bin/junk': No space left on device
1815769+0 records in
1815768+0 records out
929673216 bytes (930 MB, 887 MiB) copied, 10.3299 s, 90.0 MB/s
```
MySQL is aware that the is the current situation

```
2021-02-07T10:07:54.879371Z 14 [ERROR] [MY-000035] [Server] Disk is full writing '/mysql/bin/node1-bin.000002' (OS errno 28 - No space left on device). Waiting for someone to free space... Retry in 60 secs. Message reprinted in 600 secs.
```

MySQL can no longer commit any transactions on this node as it would need them to sync to the binlog with sync_binlog = 1. I would have expected the binlog_error_action to be triggered in this case. However I can understand that "No space left on device" would not immediately crash the instance.

```
mysql> SELECT @@hostname, @@log_bin, @@sync_binlog, @@binlog_error_action;
+------------+-----------+---------------+-----------------------+
| @@hostname | @@log_bin | @@sync_binlog | @@binlog_error_action |
+------------+-----------+---------------+-----------------------+
| node1      |         1 |             1 | ABORT_SERVER          |
+------------+-----------+---------------+-----------------------+
1 row in set (0.00 sec)
```

During this time the cluster has at least 2 out of 3 operational nodes so I expect cluster operations to be fine, which they aware

```
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 14c7f5b3-67d2-11eb-9e96-08002799f743 | node0       |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | 89170bb4-67d3-11eb-91eb-08002799f743 | node2       |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | fffa336e-67d2-11eb-a791-08002799f743 | node1       |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
```

While node1 is getting more and more behind in applying transactions, the secondary node2 crashes

```
node2:~ # kill -9 `pidof mysqld` && systemctl stop mysql
```

The group notices that node2 is unreachable and it removes the node from the group.

```
2021-02-07T10:20:29.130938Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address node2:3306 has become unreachable.'
2021-02-07T10:20:45.956837Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: node2:3306'
2021-02-07T10:20:45.957419Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to node0:3306, node1:3306 on view 16126893147346366:12.'
```

At this time we have
- node0 - PRIMARY - ONLINE - processing transactions
- node1 - SECONDARY - ONLINE - not processing transactions
- node2 - DOWN 

At this time I would have assumed that the group would have lost quorum. Node0 is proposing transactions but there is no node available to (reliably) certify these transactions.

Additionally during this situation on the PRIMARY binlogs are getting rotated and eventually purged (either manually using PURGE BINARY LOGS or automatically by the binlog_expire_logs_seconds)
For reproduceability I will manually purge my binary logs.

```
MySQL  node0:33060+ ssl  JS > \sql FLUSH BINARY LOGS;
Query OK, 0 rows affected (0.0983 sec)
MySQL  node0:33060+ ssl  JS > \sql PURGE BINARY LOGS TO 'node0-bin.000002';
Query OK, 0 rows affected (0.0234 sec)
```

At this point the PRIMARY no longer has the binary logs that node2 will need when rejoining.

We restart mysql on node2
```
node2:~ # systemctl start mysql
```

error log output:
```
2021-02-07T10:23:03.792160Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.23) starting as process 28032
...
2021-02-07T10:23:06.220509Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
...
2021-02-07T10:23:07.236538Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.23'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - GPL.
...
2021-02-07T10:23:10.078442Z 2 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address node0:3306.'
2021-02-07T10:23:11.079997Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2021-02-07T10:23:11.080484Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to node0:3306, node2:3306, node1:3306 on view 16126893147346366:13.'
...
2021-02-07T10:23:22.075506Z 22 [System] [MY-010562] [Repl] Slave I/O thread for channel 'group_replication_recovery': connected to master 'mysql_innodb_cluster_15973@node1:3306',replication started in log 'FIRST' at position 4
```

node0 does not have the binary logs that node2 requires so it will try node1 which is unable to provide these, the node will remain in "RECOVERING" state.
All this time the cluster just continues to write transactions as if there were no issues, which seems like a problem to me.

Suggested fix:
Either execute the `binlog_error_action` when not able to write binary logs, in this case it would crash the instance and cause it to be OFFLINE in the cluster.

Other option might be to introduce a member_state like "NOT_REPLICATING" to avoid this node from appearing healthy. And it not being eligible to become a donor.

Hi,

I did manage to verify this behavior but you should use a proper monitoring system to know when your HDD is getting low on free space. We offer MySQL Enterprise Monitor but any system monitoring software will monitor space too. It's really much easier to prevent this problem than solve it :(

Low disk space can (and will) create a whole set of issues depending on what disk is full, this is something you have to prevent.

All best
Bogdan

Maybe you could try our version:
https://github.com/session-replay-tools/MySQL