MySQL Bugs: #106032: MySQL uses up ram/swap after short member disconnect within innodb cluster

Bug #106032	MySQL uses up ram/swap after short member disconnect within innodb cluster
Submitted:	4 Jan 2022 14:17	Modified:	20 Jan 2022 16:30
Reporter:	Christian Glaser	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	8.0.27 Community Server	OS:	Debian (11.2)
Assigned to:		CPU Architecture:	x86 (HyperV)

Description:
We have set up an 3 node innodb cluster with mysql 8.0.27.

1 node primary
2 nodes secondary RO

> cluster.status()
{
    "clusterName": "testCluster",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "db01-002.########:3306",
        "ssl": "REQUIRED",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
        "topology": {
            "db01-002.########:3306": {
                "address": "db01-002.########:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.27"
            },
            "db02-002.########:3306": {
                "address": "db02-002.########:3306",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.27"
            },
            "db03-002.########:3306": {
                "address": "db03-002.########:3306",
                "memberRole": "SECONDARY",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.27"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "db01-002.########:3306"
}

The 3 nodes are virtualized - as a managed service: 
we have root access to the 3 virtual machines, but no access to the hypervisor/networking/etc.

It's a minimal setup, no clients accessing the databases on the cluster yet.

The problem: we guess that there are short network disconnects/lags and one of the nodes loses contact to the other ones. We cant see anythin obvious with our monitoring and ping statistics. The node not seeing one or two other nodes lasts only some seconds and it reconnects after this event and everything looks good - cluster state etc. Except for ram/swap usage. Whenever this occurs, ram and swap start to grow till exhausted.

log of event:
2021-12-31T23:28:59.677783+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:29:13.362366+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.909802+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916173+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916746+01:00 0 [ERROR] [MY-011496] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked for the next 15 seconds. Unless contact with the majority is restored, after this time the member will error out and leave the group. It is possible to use group_replication_force_members to force a new group membership.'
2021-12-31T23:31:44.927991+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db03-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.942143+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.961775+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.961813+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 is reachable again.'
2021-12-31T23:31:44.967994+01:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2021-12-31T23:31:51.825487+01:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2021-12-31T23:31:51.827340+01:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2021-12-31T23:31:51.854404+01:00 149104 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2021-12-31T23:32:21.868086+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to exit!'
2021-12-31T23:32:21.872116+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member has failed to gracefully leave the group.'
2021-12-31T23:32:51.891498+01:00 149104 [Warning] [MY-013659] [Repl] Plugin group_replication reported: 'Timeout while waiting for a view change event during the leave step before a auto-rejoin attempt.'
2021-12-31T23:35:00.127143+01:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2021-12-31T23:35:00.760316+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.763983+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.981936+01:00 149119 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 429560479, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2021-12-31T23:35:07.702481+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:35:07.910540+01:00 149104 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2021-12-31T23:35:08.923772+01:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2021-12-31T23:35:08.937563+01:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to db01-002.########:3306, db02-002.########:3306, db03-002.########:3306 on view 16400802118655702:17.'

PS: We can rule out the OS, as we had the same issue on CentOS 8 also using mysql 8.0.27 Community Edition.

How to repeat:
Short network disconnects between clients to trigger Auto-rejoin procedure

Hi,

I cannot reproduce this but I will verify the bug report so that our replication team can look at it. Expect more questions about this setup.

What you can do is configure GR to be more resilient to these disconnects

https://dev.mysql.com/doc/refman/8.0/en/group-replication-responses-failure-expel.html
https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic...

Hi,

The member expel timeout is indeed a valid configuration option in this case. We came upon it in our initial debugging for this problem and set it from 5 seconds to 10 seconds. Although the memory behavior did not change after that. 

10 days ago I stumbled upon it again and changed the group_replication_member_expel_timeout to 30 seconds. Since then, no more memory troubles have come up! 

It seems that the memory troubles are triggered when this value times out, a member gets expelled and then tries to join back. This rejoin works without any problems, but then the memory usage goes up only on this one node, that got expelled.

BTW our setting for group_replication_exit_state_action is "READ_ONLY".

We are happy to help with more troubleshooting.

All the best!

Hi Christian,

Thanks for confirmation. In 8.0.13 IIRC we implemented those parameters as some cloud providers (azure was the one we got most reports about I think) had this network stability issue. TBH I expected those network stability issues on cloud providers to be solved by now but .. life.. 

I did not manage to replicate this memory allocation increase you reported but I think our GR team will know more, they will probbly come back with more questions. In the meantime you have a workaround (increasing the timeouts).

all best