| Bug #113137 | MySQL InnoDB Cluster hangs forever | | |
|---|---|---|---|
| Submitted: | 20 Nov 2023 4:04 | Modified: | 27 Nov 2023 3:07 |
| Reporter: | zetang zeng (OCA) | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical) |
| Version: | 5.7.43 | OS: | Linux |
| Assigned to: | | CPU Architecture: | Any |
[21 Nov 2023 9:06]
zetang zeng
Is the `stop group_replication` thread waiting for the `xcom_taskmain_startup` thread?
It seems to be because there is an inconsistency in the xcom task loop (`active_tasks` is 31, but the task linked list is empty):
```
(gdb) p active_tasks
$2 = 31
(gdb) p &tasks
$3 = (linkage *) 0x7f2b81e580a0 <tasks>
(gdb) p tasks
$4 = {type = 0, suc = 0x7f2b81e580a0 <tasks>, pred = 0x7f2b81e580a0 <tasks>}
```
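A minimal sketch of what that state implies, assuming a simplified scheduler with the same shape as the structures above (the `linkage` layout and loop condition here are only inferred from the gdb output, not copied from the XCom source):
```c
/* Illustrative sketch only -- not the MySQL/XCom source.  A circular
 * doubly-linked list of runnable tasks plus a separate live-task counter;
 * the loop is assumed to exit only when the counter reaches zero. */
#include <stdio.h>

typedef struct linkage {
  unsigned int type;
  struct linkage *suc;  /* successor */
  struct linkage *pred; /* predecessor */
} linkage;

/* Empty circular list: suc == pred == &tasks, matching the gdb dump above. */
static linkage tasks = {0, &tasks, &tasks};
static int active_tasks = 0; /* bumped on task creation, dropped on termination */

static int link_empty(const linkage *l) { return l->suc == l; }

/* Simplified main loop: if tasks disappear from the list without the
 * counter being decremented, this loop can never terminate, so the thread
 * running it can never be joined. */
static void task_loop(void) {
  while (active_tasks > 0) {
    if (link_empty(&tasks)) {
      /* nothing runnable, yet active_tasks > 0: poll and loop again */
      continue;
    }
    /* ... run tasks.suc, decrement active_tasks when a task finishes ... */
  }
}

int main(void) {
  active_tasks = 31; /* the counter value seen in gdb, with an empty list */
  printf("active_tasks=%d, list empty=%d -> task_loop() would never return\n",
         active_tasks, link_empty(&tasks));
  (void)task_loop; /* not called here, since it would spin forever */
  return 0;
}
```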
[23 Nov 2023 21:36]
MySQL Verification Team
Hi. First, I dropped the severity to S3, as this is not an S2 issue. Secondly, while I can reproduce this issue by doing exactly what you stated, I do not see how this is a "regular thing that can happen in real life" (hence it cannot be S2), and I am not sure this is a bug, as this is not a normal situation. If I kill the network properly (e.g. by removing the Ethernet cable), this issue does not reproduce. Anyhow, I will verify the report and let the GR team decide whether they think this is a bug or whether there is something they can improve upon. Thank you for the report.
[27 Nov 2023 3:04]
zetang zeng
Yep, I agree with you that this reproduction case is too rare to happen in real life. But we did meet a similar situation in this case (https://sourceware.org/bugzilla/show_bug.cgi?id=30977), which we failed to reproduce. I hope the reason behind this problem (an inconsistency in the GCS task state?) is also the root cause of that one (https://sourceware.org/bugzilla/show_bug.cgi?id=30977).
[27 Nov 2023 3:07]
zetang zeng
Oh sorry, I gave the wrong link in the last message: https://bugs.mysql.com/bug.php?id=112277

Description:
Reproduced on both CentOS and Debian.

OS: CentOS 7 or Debian 10
Kernel:
- CentOS 7: Linux iv-ycjlolti2h8rx7ci6zct 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Debian 10: Linux iv-ybutt8i9bk8rx7uutcmp 4.19.0-24-amd64 #1 SMP Debian 4.19.282-1 (2023-04-29) x86_64 GNU/Linux

After blocking the network with iptables for 30 minutes, `STOP GROUP_REPLICATION` and the cluster status check via mysqlsh block forever. The thread dump shows two Gcs_xcom threads waiting on `pthread_join` (but the thread those two threads are waiting for is not in the thread dump).

How to repeat:
- Deploy a three-node cluster on three CentOS 7 machines.
- Check the cluster status; everything is fine.
- Use the following commands to block the network on **all three machines** for 30 minutes:
```
sudo iptables -t filter -S --wait
sudo iptables -t filter -N CHAOS_HOST --wait
sudo iptables -t filter -A INPUT -j CHAOS_HOST --wait
sudo iptables -t filter -A CHAOS_HOST -p tcp -m multiport --dports 10022,2022,2021,22 -j ACCEPT --wait
sudo iptables -t filter -A CHAOS_HOST -j DROP --wait
sudo iptables -t filter -S CHAOS_HOST --wait
sudo ip6tables -t filter -S --wait
sudo ip6tables -t filter -N CHAOS_HOST --wait
sudo ip6tables -t filter -A INPUT -j CHAOS_HOST --wait
sudo ip6tables -t filter -A CHAOS_HOST -p tcp -m multiport --dports 10022,2022,2021,22 -j ACCEPT --wait
sudo ip6tables -t filter -A CHAOS_HOST -j DROP --wait
sudo ip6tables -t filter -S CHAOS_HOST --wait
```
- Recover the network on all three nodes:
```
sudo iptables -t filter -S --wait
sudo iptables -t filter -S INPUT --wait
sudo iptables -t filter -D INPUT -j CHAOS_HOST --wait
sudo iptables -t filter -F CHAOS_HOST --wait
sudo iptables -t filter -X CHAOS_HOST --wait
sudo ip6tables -t filter -S --wait
sudo ip6tables -t filter -S INPUT --wait
sudo ip6tables -t filter -D INPUT -j CHAOS_HOST --wait
sudo ip6tables -t filter -F CHAOS_HOST --wait
sudo ip6tables -t filter -X CHAOS_HOST --wait
```
- Run `STOP GROUP_REPLICATION` on **all three nodes** to try to rebuild the cluster; some nodes block the request forever:
```
mysql -uroot -p{{mysql_root_password}} -e 'STOP GROUP_REPLICATION;'
```
- Check the status again; some nodes block the request forever.
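To illustrate the hang mechanism described above, here is a small self-contained sketch (plain POSIX threads behaviour, not MySQL code) of how a caller of `pthread_join()` blocks forever when the joined thread's loop never terminates:
```c
/* Illustrative sketch only: a joining thread blocks indefinitely in
 * pthread_join() when the target thread never exits its loop, which is the
 * pattern suggested by the two Gcs_xcom threads stuck in pthread_join. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int active_tasks = 1; /* never reaches 0, like the inconsistent counter above */

static void *worker(void *arg) {
  (void)arg;
  while (active_tasks > 0) /* loop condition never becomes false */
    sleep(1);
  return NULL;
}

int main(void) {
  pthread_t tid;
  pthread_create(&tid, NULL, worker, NULL);
  printf("joining worker...\n");
  pthread_join(tid, NULL); /* intentionally blocks forever in this demo */
  return 0;
}
```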