Bug #113137 MySQL InnoDB cluster hangs forever
Submitted: 20 Nov 2023 4:04    Modified: 27 Nov 2023 3:07
Reporter: zetang zeng (OCA)
Status: Verified
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 5.7.43    OS: Linux
Assigned to:    CPU Architecture: Any

[20 Nov 2023 4:04] zetang zeng
Description:
Reproduced on both CentOS and Debian.

OS: CentOS 7 or Debian 10
kernel: 
- centos 7: Linux iv-ycjlolti2h8rx7ci6zct 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- debian 10: Linux iv-ybutt8i9bk8rx7uutcmp 4.19.0-24-amd64 #1 SMP Debian 4.19.282-1 (2023-04-29) x86_64 GNU/Linux

After blocking the network with iptables for 30 minutes, `STOP GROUP_REPLICATION` and cluster status checks via mysqlsh block forever.

The thread dump shows two Gcs_xcom threads waiting on `pthread_join` (but the threads those two threads are waiting for are not in the thread dump).
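
For reference, a backtrace dump like this can be captured with gdb in batch mode (an assumption about tooling; the reporter may have collected it differently):

```
# Hypothetical capture command: attach to the hung mysqld and dump all thread backtraces.
sudo gdb --batch -p "$(pidof mysqld)" -ex 'thread apply all bt'
```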

How to repeat:

- Deploy a three-node cluster on three CentOS 7 machines
- check cluster status and confirm all is fine
- use the following commands to block the network on **all three machines** for 30 min

```
sudo iptables -t filter -S --wait
sudo iptables -t filter -N CHAOS_HOST --wait
sudo iptables -t filter -A INPUT -j CHAOS_HOST --wait
sudo iptables -t filter -A CHAOS_HOST -p tcp -m multiport --dports 10022,2022,2021,22 -j ACCEPT --wait
sudo iptables -t filter -A CHAOS_HOST -j DROP --wait
sudo iptables -t filter -S CHAOS_HOST --wait
sudo ip6tables -t filter -S --wait
sudo ip6tables -t filter -N CHAOS_HOST --wait
sudo ip6tables -t filter -A INPUT -j CHAOS_HOST --wait
sudo ip6tables -t filter -A CHAOS_HOST -p tcp -m multiport --dports 10022,2022,2021,22 -j ACCEPT --wait
sudo ip6tables -t filter -A CHAOS_HOST -j DROP --wait
sudo ip6tables -t filter -S CHAOS_HOST --wait

```

- recover the network on all three nodes

```
sudo iptables -t filter -S --wait
sudo iptables -t filter -S INPUT --wait
sudo iptables -t filter -D INPUT -j CHAOS_HOST --wait
sudo iptables -t filter -F CHAOS_HOST --wait
sudo iptables -t filter -X CHAOS_HOST --wait
sudo ip6tables -t filter -S --wait
sudo ip6tables -t filter -S INPUT --wait
sudo ip6tables -t filter -D INPUT -j CHAOS_HOST --wait
sudo ip6tables -t filter -F CHAOS_HOST --wait
sudo ip6tables -t filter -X CHAOS_HOST --wait
```

- `STOP GROUP_REPLICATION` on **all three nodes** to try to rebuild the cluster, but some nodes block the request forever

```
mysql -uroot -p{{mysql_root_password}} -e 'STOP GROUP_REPLICATION;'
```

- check status again; some nodes block the request forever.
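
The status check here was done via mysqlsh; a hypothetical invocation (placeholder host, not from the original report) would be:

```
# Placeholder URI; run from MySQL Shell in JavaScript mode against any node.
mysqlsh --uri root@{{node_host}}:3306 --js -e 'print(dba.getCluster().status())'
```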
[21 Nov 2023 9:06] zetang zeng
Is the `STOP GROUP_REPLICATION` thread waiting for the `xcom_taskmain_startup` thread?

It seems to be because there is an inconsistency in the xcom task loop (`active_tasks` is 31, but the task linked list is empty):

```
(gdb) p active_tasks
$2 = 31
(gdb) p &tasks
$3 = (linkage *) 0x7f2b81e580a0 <tasks>
(gdb) p tasks
$4 = {type = 0, suc = 0x7f2b81e580a0 <tasks>, pred = 0x7f2b81e580a0 <tasks>} 
```
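
In other words, the counter says there are still live tasks while the run-queue ring is empty. A minimal self-contained C sketch of that invariant (illustrative only, assuming the intrusive circular list layout shown in the gdb output above; this is not the actual xcom source):

```c
#include <stdio.h>

/* Illustrative layout matching the gdb output above (not the real xcom code). */
typedef struct linkage {
    unsigned type;
    struct linkage *suc;  /* successor in the circular list */
    struct linkage *pred; /* predecessor in the circular list */
} linkage;

static linkage tasks;          /* list head of runnable tasks */
static int active_tasks = 31;  /* counter value observed in gdb */

/* An empty ring is one where the head points back to itself, which is exactly
 * what "suc = pred = &tasks" in the gdb output means. */
static int list_empty(const linkage *head) {
    return head->suc == head && head->pred == head;
}

int main(void) {
    tasks.suc = tasks.pred = &tasks; /* state seen in the hung process */

    /* If the task scheduler keeps running "while active_tasks > 0" but the
     * runnable list is empty, it can neither make progress nor terminate, so
     * whoever pthread_join()s that thread blocks forever. */
    printf("list_empty=%d active_tasks=%d -> inconsistent state\n",
           list_empty(&tasks), active_tasks);
    return 0;
}
```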
[23 Nov 2023 21:36] MySQL Verification Team
Hi,

First, I dropped the severity to S3 as this is not an S2 issue.
Secondly, while I can reproduce this issue by doing exactly what you described, I do not see how this is a "regular thing that can happen in real life" (hence it cannot be S2), and I am not sure this is a bug, as this is not a normal situation. If I kill the network properly (e.g. by removing the Ethernet cable), this issue does not reproduce. Anyhow, I will verify the report and let the GR team decide whether they think this is a bug or there is something they can improve upon.

Thank you for the report
[27 Nov 2023 3:04] zetang zeng
Yep, I agree with you that this reproduction case is too rare to happen in real life. But we did meet a similar situation in this case (https://sourceware.org/bugzilla/show_bug.cgi?id=30977), which we failed to reproduce.

Hopefully the cause behind this problem (the inconsistency in the GCS task list?) is also the root cause of that one (https://sourceware.org/bugzilla/show_bug.cgi?id=30977).
[27 Nov 2023 3:07] zetang zeng
Oh sorry, I gave the wrong link in the last message:

https://bugs.mysql.com/bug.php?id=112277