Bug #116532 | Group Replication cluster deployed across a GCP region regularly got stuck | |
---|---|---|---|
Submitted: | 3 Nov 2024 3:07 | Modified: | 28 Nov 2024 14:35 |
Reporter: | Andryan Gouw | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S1 (Critical) |
Version: | 8.0.40 | OS: | Ubuntu (24.04) |
Assigned to: | | CPU Architecture: | x86 (x86-64)
[3 Nov 2024 3:07]
Andryan Gouw
[4 Nov 2024 8:14]
MySQL Verification Team
We're sorry, but the bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug. For details on getting support for MySQL products, see http://www.mysql.com/support/. You can also check our forums (free) at http://forums.mysql.com/. Thank you for your interest in MySQL.
[4 Nov 2024 9:15]
Andryan Gouw
What?? Which part of this is not a bug report? Did you even read the whole report before jumping to a premature conclusion? I clearly stated that I had InnoDB Cluster Group Replication working fine previously, and suddenly it locked up randomly after 8.0.40 was introduced into the cluster. How is this a support request??
[4 Nov 2024 9:33]
MySQL Verification Team
Hi, let me try to give you some more feedback, as you did put some effort into this report.

You wrote that:
> The cluster gets stuck approximately once every couple days, while the load is low.
And then:
> The issue typically occurs during peak load, when multiple transactions are in progress.
This is inconsistent. Is the issue happening while the load is low or while it is at its peak?

But much more important:
> How to repeat:
> 1. Deploy a MySQL Group Replication cluster with a single primary and
> three replicas across multiple zones within a GCP region communicating ...
We have tens of thousands of deployments like this without any issue, so this is not a way we can reproduce the problem.

> 6. Nothing is recorded in error log and general log.
And on top of this, nothing is logged.

The way to solve the problem is to first double-check your configuration and the rest of the setup, and second to jump on the problem when it happens. This is something you can do with our MySQL Support team, not something we can do in the bug system, as here we deal with reproducible cases. On top of this, you are using third-party infrastructure. We already had huge issues with one cloud provider that had weird network issues that prevented proper replication from working; they ended up fixing that issue years later, and none of that can be influenced by us.

In order to accept a bug report, I need to be able to clearly reproduce the bug on a tested, 100% working environment. Hope this helps. If you can provide us with a repeatable test case that does not require us to set up something on a system we do not control, I will be happy to reopen and retest the issue. Thank you for using MySQL.
[4 Nov 2024 13:44]
Andryan Gouw
If I pay for support and it turns out to be a real bug, will you refund me? I had an InnoDB Cluster that WORKED fine; now it locks up every few days, and instead of offering assistance on how to best analyze the issue, you actually picked on my bug report? Wow, very professional indeed. If I knew how to reproduce it I would have posted it here! The fact that it is intermittent is why I have shared it here. If you only deal with reproducible bugs, then say so on the front page of this bug report portal. I will definitely escalate this exchange to the management.
[4 Nov 2024 16:09]
MySQL Verification Team
I cannot say that your problem is not caused by some underlying bug in our system. We certainly have bugs, and we are finding and fixing them non-stop. The issue is that your setup is rather similar to what I have here in testing, and it works without problems when I apply everything you said. The only difference is GCP: I use bare metal and a real network, not GCP. Group Replication is rather sensitive to "not so good" networks, and as I mentioned, we have seen a lot of network issues with some cloud providers (not GCP). https://dev.mysql.com/doc/refman/8.4/en/group-replication-performance.html (or in your case https://dev.mysql.com/doc/refman/8.0/en/group-replication-performance.html ) deals with these issues.

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-40.html shows the changes in 8.0.40, and none of them should cause the issues you are experiencing, so the upgrade might not be the cause of the problem at all.

> I will definitely escalate this exchange to the management.
Already done, immediately when the bug was set as "not a bug", to double-check; the Group Replication team was also informed so they can double-check too. While the report lacks the data for us to reproduce the issue, we will double-check the code and changes and investigate whether there is something we can do. Thanks

p.s.
> offering assistance on how to best analyze the issue
This is what the MySQL Support team does. What I can suggest is:
[1] Set up a similar deployment locally, run a similar load, and see if you can reproduce the problem. You say you see it non-stop, so it should be easy. (I did this and I did not reproduce the problem.)
[2] Set up a similar deployment with Oracle Cloud - MySQL as a service, get a managed group replication setup, and push your load to it. IIRC Oracle gives a lot of free setups, so you can test this for free on OCI and see if you can reproduce the problem there. (I did this also and I did not reproduce the problem, but maybe there is something in your load that triggers some bug; I can't say.)
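To give an idea of the kind of settings the performance page linked above deals with, this is roughly what tolerance tuning looks like when the network between zones is occasionally flaky. The values below are only illustrative placeholders; pick them according to the documentation, and set them on every member:

-- Illustrative values only; see the group-replication-performance page linked above.
-- Give a suspected member more time before it is expelled from the group.
SET GLOBAL group_replication_member_expel_timeout = 30;
-- Let an expelled member try to rejoin the group automatically.
SET GLOBAL group_replication_autorejoin_tries = 3;
-- Compress replication messages above this size (bytes) to reduce network pressure.
SET GLOBAL group_replication_compression_threshold = 1000000;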
[4 Nov 2024 16:58]
Andryan Gouw
There were no indications of network issues; these nodes reside within a region with 3 AZs and sub-10ms latency between each other. I got a full processlist dump from the primary node when the issue happened (see attached file). I was able to log in by passing -A to the mysql CLI. It appeared that a system-generated FLUSH TABLES had caused the deadlock. Manually killing the ID of the FLUSH TABLES session solved the problem, but the primary node was left with Group Replication in a broken state, and the read replicas carried on with more GTID transactions, which I had to re-import into the primary node using the mysqlbinlog tool.
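For reference, the sequence I used was roughly the following (1234 is just a placeholder for whatever session id the stuck FLUSH TABLES shows in the processlist):

-- Connected with `mysql -A` so the client skips the table/column rehash,
-- which would otherwise block behind the stuck FLUSH TABLES.
-- Locate the stuck session and everything waiting behind it:
SELECT id, user, time, state, info
FROM information_schema.processlist
WHERE info LIKE 'FLUSH%' OR state LIKE 'Waiting for table flush%';

-- Kill the offending FLUSH TABLES session (1234 is a placeholder id):
KILL 1234;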
[4 Nov 2024 17:00]
MySQL Verification Team
Hi, please attach the config files from all nodes, and also the global status and variables from all nodes. Thanks
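For example, something like this on each node, with the output saved to files and attached, would be enough:

-- Run on every node and attach the output:
SHOW GLOBAL VARIABLES;
SHOW GLOBAL STATUS;
-- The current group membership view from each member is also useful:
SELECT * FROM performance_schema.replication_group_members;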
[4 Nov 2024 18:28]
Andryan Gouw
I have recently configured the following with SET GLOBAL on all nodes, hoping it would address the issue:
- lock_wait_timeout = 50
- wait_timeout = 300
I still have the general.log file that includes the Nov 2nd outage if needed.
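Concretely, what I ran on each node was roughly:

-- Applied on every node; values as listed above.
SET GLOBAL lock_wait_timeout = 50;  -- give up metadata-lock waits after 50 seconds
SET GLOBAL wait_timeout = 300;      -- drop idle non-interactive connections after 5 minutes
-- SET GLOBAL only affects new sessions and does not survive a restart;
-- the same values would also have to go into my.cnf to persist.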
[5 Nov 2024 10:38]
Andryan Gouw
We changed the mysqldump cronjob to every 6 hours from previously every 4 hours, and so far we have not observed any FLUSH TABLES in the general_log. Does this mean the lock-up on the primary node happens because running mysqldump --single-transaction on a replica triggers FLUSH TABLES on the primary node?
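One way I intend to verify this is to check whether the dump actually writes a FLUSH statement into the replica's binary log during the dump window ('binlog.000123' below is only a placeholder; SHOW BINARY LOGS gives the real file names):

-- On the replica where mysqldump runs:
SHOW BINARY LOGS;
-- Scan the Info column of the file covering the dump window for FLUSH statements:
SHOW BINLOG EVENTS IN 'binlog.000123' LIMIT 200;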
[8 Nov 2024 14:35]
Andryan Gouw
I have provided the requested information and more details on what was observed (FLUSH TABLES being issued on the primary node while mysqldump --single-transaction is invoked on a replica).
[12 Nov 2024 4:49]
Andryan Gouw
After reducing the frequency of mysqldump from every 4 hours to every 6 hours, we had not seen any issue for 9 days (since Nov 2), but today it happened again (complete logs captured on the primary node in processlist-241112.txt), and stopping group replication on the replica nodes worked well (again) to restore services on the primary node. However, minutes after restarting group replication on one of the replica nodes, it locked up again, but this time it was not caused by any hanging FLUSH TABLES session; instead the sessions were stuck at "Waiting for Binlog Group Commit ticket" (please see processlist-241112-2.txt). I had to issue "stop group_replication" again to fix it. (A rough sketch of the stop/restart sequence follows the log excerpt below.)

Logs from the primary node:

2024-11-02T12:15:47.011469Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-02T12:15:47.012596Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:12.'
2024-11-02T12:16:35.884822Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-02T12:16:35.884940Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306 on view 17302504105241960:13.'
2024-11-02T12:18:30.840484Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
2024-11-02T12:18:30.840641Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306 on view 17302504105241960:14.'
2024-11-02T12:32:55.879253Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:15.'
2024-11-02T12:33:05.105162Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.8:3306 was declared online within the replication group.'
2024-11-02T12:52:13.343198Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 172.31.16.11:33061. Please check the connection status to this member'
2024-11-02T12:52:14.638909Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:16.'
2024-11-02T12:52:32.238422Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 172.31.16.11:3306 was declared online within the replication group.'
2024-11-02T12:52:49.935401Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:17.'
2024-11-02T12:52:57.833069Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.16:3306 was declared online within the replication group.'
2024-11-05T00:26:13.478198Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-05T00:26:13.479140Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:18.'
2024-11-05T00:27:06.451271Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-05T00:27:06.451375Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306 on view 17302504105241960:19.'
2024-11-05T00:33:27.310928Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
2024-11-05T00:33:27.311019Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306 on view 17302504105241960:20.'
2024-11-05T03:34:52.384944Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:21.'
2024-11-05T03:35:37.796939Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:22.'
2024-11-05T03:35:53.663842Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 172.31.16.11:3306 was declared online within the replication group.'
2024-11-05T03:36:35.930009Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.8:3306 was declared online within the replication group.'
2024-11-05T03:36:42.520587Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 10.184.0.16:33061. Please check the connection status to this member'
2024-11-05T03:36:44.052681Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:23.'
2024-11-05T03:37:24.057000Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.184.0.16:3306 was declared online within the replication group.'
2024-11-12T01:22:39.056530Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.8:3306'
2024-11-12T01:22:39.059716Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306 on view 17302504105241960:24.'
2024-11-12T01:23:11.364123Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 172.31.16.11:3306, 10.184.0.8:3306 on view 17302504105241960:25.'
2024-11-12T02:35:36.359010Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 172.31.16.11:3306 has become unreachable.'
2024-11-12T02:35:39.045430Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 172.31.16.11:33061. Please check the connection status to this member'
2024-11-12T02:35:39.045934Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 172.31.16.11:3306 is reachable again.'
2024-11-12T02:36:48.030058Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 172.31.16.11:3306'
2024-11-12T02:36:48.030177Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.184.0.16:3306, 10.184.0.10:3306, 10.184.0.8:3306 on view 17302504105241960:26.'
2024-11-12T02:38:17.985554Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.184.0.16:3306'
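For completeness, the stop/restart sequence mentioned above is roughly this (run on a replica node):

-- Take the replica out of the group; this is what unblocked the primary.
STOP GROUP_REPLICATION;

-- From the primary, confirm the remaining members and their states:
SELECT member_host, member_port, member_state, member_role
FROM performance_schema.replication_group_members;

-- Bringing the replica back is the step that triggered the second lock-up:
START GROUP_REPLICATION;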
[12 Nov 2024 4:58]
Andryan Gouw
Strangely enough, whenever the second issue happens (it has happened three times now), SHOW FULL PROCESSLIST returns a list of stuck write operations with the status "Waiting for Binlog Group Commit ticket". If I try inserting a new row into a table using the mysql CLI, it gets stuck waiting for the insert to complete, but if I just send CTRL-C, the query will somehow proceed and return: Query OK, 1 row affected (1.60 sec). Then the other sessions' queries complete after previously stalling forever (SHOW FULL PROCESSLIST is empty at this stage). I managed to repeat this 3 times.
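For reference, this is roughly how I pull out the stuck sessions when it happens (filtering on the state string reported above):

-- Sessions stuck waiting for a binlog group commit ticket, longest-waiting first:
SELECT id, user, db, time, state, info
FROM information_schema.processlist
WHERE state = 'Waiting for Binlog Group Commit ticket'
ORDER BY time DESC;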
[12 Nov 2024 5:01]
Andryan Gouw
Sorry for the confusion, I had the wrong date. It should have been "since Nov 5" and "6 days" instead of "9 days". The frequency has clearly reduced with less frequent mysqldumps, but now it seems there is a new problem, as I am not able to restart group replication at all (even with just 1 replica) without causing the primary node to lock up.