Bug #116532  Group Replication cluster deployed across a GCP region regularly gets stuck
Submitted: 3 Nov 2024 3:07    Modified: 28 Nov 2024 14:35
Reporter: Andryan Gouw        Email Updates:
Status: Verified              Impact on me: None
Category: MySQL Server: Group Replication    Severity: S1 (Critical)
Version: 8.0.40               OS: Ubuntu (24.04)
Assigned to:                  CPU Architecture: x86 (x86-64)

[3 Nov 2024 3:07] Andryan Gouw
Description:
We have a MySQL Group Replication cluster deployed across a regional Google Cloud Platform (GCP) setup. The cluster consists of a single primary node with three replicas, running in Group Replication mode. The cluster has been experiencing recurring issues where it becomes unresponsive and appears “stuck.” The issue results in queries hanging indefinitely, specifically during certain events related to table flushes and transactions. These events ultimately lead to a complete cluster stall, affecting availability and requiring manual intervention to recover.

Observed Behavior:

•	Transactions initiated on the cluster begin to hang indefinitely.
•	SHOW FULL PROCESSLIST displays a significant number of threads in states such as “waiting for handler commit” or “Waiting for table flush” (see the diagnostic query sketch after this list).
•	System user processes seem to invoke FLUSH TABLES, causing the cluster to enter an unresponsive state.
•	Attempts to kill blocking processes (e.g., FLUSH TABLES) lead to Group Replication members dropping out of the cluster or getting into an error state, with the message: ERROR 3796 (HY000): The option group_replication_consistency cannot be used on the current member state.
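For reference, the sketch below is the kind of query I run to see what everything is waiting on. It is illustrative only and uses the performance_schema.processlist table (available from 8.0.22) instead of SHOW FULL PROCESSLIST; the idea is simply to list the longest-running non-idle sessions to spot the FLUSH TABLES statement or long-running query behind the "Waiting for table flush" pile-up.

-- Illustrative only: oldest active sessions first
SELECT id, user, command, time, state, LEFT(info, 100) AS query
FROM performance_schema.processlist
WHERE command <> 'Sleep'
ORDER BY time DESC
LIMIT 20;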

Expected Behavior:

The cluster should be able to handle concurrent transactions smoothly, even during heavy load or table flush operations. Transactions should either proceed or fail gracefully, without rendering the entire cluster unresponsive. MySQL Shell and the performance_schema.replication_group_members table should indicate the issue and shun/disconnect problematic nodes, if any.

Environment:

•	MySQL Version: 8.0.40
•	Deployment Platform: Google Cloud Platform (GCP)
•	Cluster Setup: Multi-zone regional deployment in GCP, single primary node with three replicas (one replica hosted on AWS in the same geographical region, connected over VPN)
•	Replication Method: MySQL Group Replication with GTID-based consistency
•	Group Replication Consistency: Set to AFTER for post-commit consistency (see the checks after this list).
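For completeness, this is how the consistency level and the membership view are checked on each node (standard queries, nothing specific to our setup):

-- Group-wide consistency level as seen by this member
SELECT @@GLOBAL.group_replication_consistency;

-- Membership view as reported by this member
SELECT member_host, member_port, member_state, member_role
FROM performance_schema.replication_group_members;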

Additional Details:

•	innodb_lock_wait_timeout is set to 50 seconds, but transactions are not timing out as expected.
•	transaction_isolation is set to REPEATABLE-READ.
•	A mysqldump --single-transaction job runs on one of the read replicas every 4 hours, but I am not sure whether that is the cause.
•	Attempts to re-add nodes to the cluster fail due to extra GTID transactions, requiring manual intervention (see the errant-GTID check after this list).
•	The performance_schema.replication_group_members table and MySQL Shell always indicate that the cluster is healthy and fully ONLINE despite the stuck state.
•	I always have to pass -A to the mysql CLI to get in and stop Group Replication on all nodes; otherwise the client itself gets stuck.
•	Any query against any table in an application database hangs, but tables in the performance_schema and mysql schemas remain accessible throughout the stuck phase.
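Regarding the extra GTID transactions, the check below is roughly what I use to compare a replica's executed GTID set against the primary's. It is a sketch only; the GTID set shown is a placeholder that has to be replaced with the value copied from the primary.

-- On the primary:
SELECT @@GLOBAL.gtid_executed;

-- On a replica, with the primary's set pasted in (placeholder value shown):
SET @primary_gtid_executed = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1-1000';
SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed, @primary_gtid_executed) AS errant_transactions;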

Logs & Diagnostics:

•	Relevant logs from SHOW ENGINE INNODB STATUS\G show multiple active transactions in the “PREPARED” state, often waiting for a commit.
•	SHOW PROCESSLIST includes numerous “system user” entries in states like “waiting for handler commit” and “Waiting for table flush.”
•	Killing problematic transactions or processes on the primary node, such as those involving FLUSH TABLES, leads to Group Replication inconsistency errors, and extra GTID transactions appear on the read replicas but not on the primary node.

Reproduction Frequency:

•	The cluster gets stuck approximately once every couple days, while the load is low.

Impact:

This issue results in extended downtime and data availability concerns, since the cluster cannot fulfill its HA promises. Manual intervention (stopping Group Replication on all read replicas) is frequently required to recover from the stalled state, which diminishes the benefits of an automated Group Replication setup.

Suggested Investigation:

•	Investigate the interaction between FLUSH TABLES commands (system-initiated) and active transactions in Group Replication mode.
•	Assess whether changes in Group Replication consistency settings (e.g., AFTER) are leading to unintended flushes and cluster stalls.
•	Review the handling of system-user invoked commands like FLUSH TABLES within the context of Group Replication and suggest alternatives or improvements to reduce such risks.

How to repeat:
1.	Deploy a MySQL Group Replication cluster with a single primary and three replicas across multiple zones within a GCP region communicating through the internal IP addresses.
2.	Set group_replication_consistency to AFTER to achieve consistency after commit.
3.	Execute a sequence of transactions, including frequent updates and inserts, across various nodes in the cluster (a workload sketch follows this list).
4.	Use mysqldump or similar operations that create a load or may trigger FLUSH TABLES implicitly.
5.	Observe the cluster over time. The issue typically occurs during peak load, when multiple transactions are in progress.
6.	Nothing is recorded in error log and general log.
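A rough sketch of the write load used in step 3 is below. The schema and table names are illustrative only, not our real application schema; several clients run the transaction block in a loop against the primary while the step-4 dump runs from cron on a replica.

-- Illustrative schema for the synthetic load (not the production schema)
CREATE DATABASE IF NOT EXISTS loadtest;
CREATE TABLE IF NOT EXISTS loadtest.t1 (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255),
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- One iteration of the load; clients repeat this continuously
START TRANSACTION;
INSERT INTO loadtest.t1 (payload) VALUES (REPEAT('x', 200));
UPDATE loadtest.t1 SET payload = REPEAT('y', 200) ORDER BY id DESC LIMIT 10;
COMMIT;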

Suggested fix:
I haven't tested downgrading, because this cluster used to run 8.0.39 and the issue started happening when I added a third replica (fourth node) running 8.0.40. This cluster previously ran an 8.0.35 primary and 8.0.39 replicas for over a week, and no issue was observed then. It was only after introducing 8.0.40 that it started to have this issue, and upgrading the whole cluster to 8.0.40 did not solve the problem. I am concerned about potential data corruption if I downgrade.
[4 Nov 2024 8:14] MySQL Verification Team
We're sorry, but the bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug.

For details on getting support for MySQL products see http://www.mysql.com/support/
You can also check our forums (free) at http://forums.mysql.com/

Thank you for your interest in MySQL.
[4 Nov 2024 9:15] Andryan Gouw
What?? Which part of this is not a bug report? Did you even read the whole report before jumping to a premature conclusion? I clearly stated that I had InnoDB Cluster Group Replication working fine previously, and that it suddenly started locking up at random times after 8.0.40 was introduced into the cluster. How is this a support request??
[4 Nov 2024 9:33] MySQL Verification Team
Hi,

Let me try to give you some more feedback as you did put some effort into this report.

You wrote that:

> The cluster gets stuck approximately once every couple days, while the load is low.

And then:

> The issue typically occurs during peak load, when multiple transactions are in progress.

This is inconsistent. Is the issue happening while the load is low, or when the load is at its peak?

But much more important:

> How to repeat:
> 1.      Deploy a MySQL Group Replication cluster with a single primary and
> three replicas across multiple zones within a GCP region communicating
...

We have tens of thousands of deployments like this without any issue, so this is not a way we can reproduce the problem.

> 6.      Nothing is recorded in error log and general log.

and on top of this, nothing is logged.

The way to solve the problem is, first, to double-check your configuration and the rest of the setup, and second, to investigate the problem live when it happens. This is something you can do with our MySQL Support team; it is not something we can do in the bug system, as here we deal with reproducible cases. On top of this, you are using third-party infrastructure.

We have already had huge issues with one cloud provider whose strange network problems prevented replication from working properly; they ended up fixing that issue years later, and none of that can be influenced by us. In order to accept a bug report, I need to be able to clearly reproduce the bug on a tested, 100% working environment.

Hope this helps. If you can provide us with a repeatable test case that does not require us to set something up on a system we do not control, I will be happy to reopen and retest the issue.

Thank you for using MySQL.
[4 Nov 2024 13:44] Andryan Gouw
If I pay for support and it turns out to be a real bug, will you refund me?

I had an InnoDB Cluster that WORKED fine, and now it locks up every few days; instead of offering assistance on how best to analyze the issue, you actually picked on my bug report? Wow, very professional indeed.

If I knew how to reproduce it, I would have posted it here! The fact that it is intermittent is exactly why I have shared it here. If you deal with reproducible bugs only, then state that on the front page of this bug report portal.

I will definitely escalate this exchange to the management.
[4 Nov 2024 16:09] MySQL Verification Team
I cannot say that your problem is not caused by some underlying bug in our system. We certainly have bugs, and we are finding and fixing them non-stop. The issue is that your setup is essentially the same as what I have here in testing, and it works without problems when I apply everything you said. The only difference is GCP: I use bare metal and a real network, not GCP.

Group Replication is rather sensitive to a "not so good" network, and as I mentioned, we have seen a lot of network issues with some cloud providers (not GCP). https://dev.mysql.com/doc/refman/8.4/en/group-replication-performance.html (or, in your case, https://dev.mysql.com/doc/refman/8.0/en/group-replication-performance.html) deals with these issues.

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-40.html shows the changes in 8.0.40, and none of them should cause the issues you are experiencing, so the upgrade might not be the cause of the problem at all.

> I will definitely escalate this exchange to the management.

Already done: it was escalated for a double check immediately when the bug was set to "not a bug", and the Group Replication team was informed as well. While the report lacks the data we would need to reproduce it, we will double-check the code and the changes and investigate whether there is something we can do.

Thanks

p.s. 

> offering assistance on how to best analyze the issue

This is what the MySQL Support team does. What I can suggest is:
 [1] Set up a similar configuration locally, run a similar load, and see if you can reproduce the problem. You say you see it constantly, so it should be easy. (I did this and did not reproduce the problem.)
 [2] Set up a similar configuration on Oracle Cloud (MySQL as a service), get a managed Group Replication setup, and push your load to it. IIRC Oracle gives a lot of free setups, so you can test this for free on OCI and see if you can reproduce the problem there. (I did this as well and did not reproduce the problem, but maybe there is something in your load that triggers some bug; I can't say.)
[4 Nov 2024 16:58] Andryan Gouw
There were no indications of network issues; these nodes reside within a single region across 3 AZs with sub-10 ms latency between one another.

I got a full processlist dump from the primary node when the issue happened (see attached file). I was able to log in by passing -A to the mysql CLI. It appeared that a system-generated FLUSH TABLES statement had caused the deadlock. Manually killing the PID of the FLUSH TABLES session solved the problem, but the primary node was left with Group Replication in a broken state, and the read replicas carried on with extra GTID transactions, which I had to re-apply to the primary node using the mysqlbinlog tool.
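For the record, the sketch below is roughly how I located and killed the offending session; the thread id shown is an example only and differs every time.

-- Find the FLUSH TABLES session and anything stuck behind it
SELECT id, user, time, state, LEFT(info, 100) AS query
FROM performance_schema.processlist
WHERE info LIKE 'FLUSH%' OR state = 'Waiting for table flush'
ORDER BY time DESC;

-- Kill the blocking session (example id)
KILL 12345;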
[4 Nov 2024 17:00] MySQL Verification Team
Hi,

Please attach the config files from all nodes, as well as the global status and variables from all nodes.

Thanks
[4 Nov 2024 18:28] Andryan Gouw
I have recently applied SET GLOBAL settings on all nodes, hoping they would address the issue (statements sketched after this list):
- lock_wait_timeout = 50
- wait_timeout = 300
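These were applied dynamically, roughly as below (SET PERSIST would be the variant that also survives a restart):

SET GLOBAL lock_wait_timeout = 50;  -- metadata lock wait timeout, in seconds
SET GLOBAL wait_timeout = 300;      -- idle timeout for non-interactive connections, in seconds
-- or, to keep the values across restarts:
-- SET PERSIST lock_wait_timeout = 50;
-- SET PERSIST wait_timeout = 300;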

I still have the general.log file that includes the Nov 2nd outage if needed.
[5 Nov 2024 10:38] Andryan Gouw
We changed the mysqldump cron job from every 4 hours to every 6 hours, and so far we have not observed any FLUSH TABLES in the general log (see the check below). Does this mean the lock-up on the primary node happens because running mysqldump --single-transaction on a replica triggers FLUSH TABLES on the primary node?
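For what it is worth, the check below is how I look for FLUSH statements around the dump schedule. It assumes log_output includes TABLE, so that the general log is written to mysql.general_log; in our case the log goes to a file, so the equivalent is a text search of that file.

-- Only meaningful when log_output includes TABLE
SELECT event_time, user_host, LEFT(argument, 100) AS statement_text
FROM mysql.general_log
WHERE command_type = 'Query'
  AND argument LIKE 'FLUSH%'
ORDER BY event_time DESC
LIMIT 50;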
[8 Nov 2024 14:35] Andryan Gouw
I have provided the requested information and more details on what was observed (FLUSH TABLES being issued on the primary node while mysqldump --single-transaction is invoked on a replica).
[12 Nov 2024 4:49] Andryan Gouw
After reducing the frequency of mysqldump from every 4 hours to every 6 hours, we had not seen any issue for 9 days (since Nov 2), but today it happened again (complete logs captured on the primary node in processlist-241112.txt), and stopping Group Replication on the replica nodes again worked to restore service on the primary node. However, minutes after restarting Group Replication on one of the replica nodes, it locked up again; this time it was not caused by any hanging FLUSH TABLES session, but the sessions were instead stuck on "Waiting for Binlog Group Commit ticket" (please see processlist-241112-2.txt). I had to issue "stop group_replication" again to fix it.

Logs from primary node:
2024-11-02T12:15:47.011469Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-02T12:15:47.012596Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:12.'
2024-11-02T12:16:35.884822Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-02T12:16:35.884940Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306 on view 17302504105241960:13.'
2024-11-02T12:18:30.840484Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
2024-11-02T12:18:30.840641Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306 on view 17302504105241960:14.'
2024-11-02T12:32:55.879253Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:15.'
2024-11-02T12:33:05.105162Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.8:3306 was declared online within the replication group.'
2024-11-02T12:52:13.343198Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 172.31.16.11:33061. Please check the connection status to this member'
2024-11-02T12:52:14.638909Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:16.'
2024-11-02T12:52:32.238422Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 172.31.16.11:3306 was declared online within the replication group.'
2024-11-02T12:52:49.935401Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:17.'
2024-11-02T12:52:57.833069Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.16:3306 was declared online within the replication group.'
2024-11-05T00:26:13.478198Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-05T00:26:13.479140Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:18.'
2024-11-05T00:27:06.451271Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-05T00:27:06.451375Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306 on view 17302504105241960:19.'
2024-11-05T00:33:27.310928Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
2024-11-05T00:33:27.311019Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306 on view 17302504105241960:20.'
2024-11-05T03:34:52.384944Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:21.'
2024-11-05T03:35:37.796939Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:22.'
2024-11-05T03:35:53.663842Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 172.31.16.11:3306 was declared online within the replication group.'
2024-11-05T03:36:35.930009Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.8:3306 was declared online within the replication group.'
2024-11-05T03:36:42.520587Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 10.184.0.16:33061. Please check the connection status to this member'
2024-11-05T03:36:44.052681Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:23.'
2024-11-05T03:37:24.057000Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.16:3306 was declared online within the replication group.'
2024-11-12T01:22:39.056530Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-12T01:22:39.059716Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:24.'
2024-11-12T01:23:11.364123Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:25.'
2024-11-12T02:35:36.359010Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 172.31.16.11:3306 has become unreachable.'
2024-11-12T02:35:39.045430Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 172.31.16.11:33061. Please check the connection status to this member'
2024-11-12T02:35:39.045934Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 172.31.16.11:3306 is reachable again.'
2024-11-12T02:36:48.030058Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-12T02:36:48.030177Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:26.'
2024-11-12T02:38:17.985554Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
[12 Nov 2024 4:58] Andryan Gouw
Strangely enough, whenever the second issue happens (it has happened three times now), SHOW FULL PROCESSLIST returns a list of stuck write operations in the state "Waiting for Binlog Group Commit ticket". If I try inserting a new row into a table using the mysql CLI, the statement hangs waiting for the insert to succeed, but if I just send Ctrl-C, the query somehow proceeds and returns: Query OK, 1 row affected (1.60 sec). After that, the other sessions' queries simply complete after previously stalling forever (SHOW FULL PROCESSLIST is empty at this stage). I managed to repeat this 3 times (the probe is sketched below).
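For clarity, the probe mentioned above is nothing more than the following; the schema and table names are placeholders for any small write against an application table.

-- Placeholder names; any simple write behaves the same while the cluster is stuck.
INSERT INTO app_db.t1 (payload) VALUES ('probe');
-- The statement hangs in "Waiting for Binlog Group Commit ticket".
-- Pressing Ctrl-C in the mysql client (which interrupts the running statement
-- with KILL QUERY) is somehow enough for this insert to return "Query OK" and
-- for the previously stalled sessions to complete.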
[12 Nov 2024 5:01] Andryan Gouw
Sorry for the confusion, I had the wrong date: it should have been "since Nov 5" and "6 days" instead of "9 days". The frequency has clearly been reduced with the less frequent mysqldump runs, but now there seems to be a new problem, as I am not able to restart Group Replication at all (even with just one replica) without causing the primary node to lock up.