Bug #98948 Recovering Innodb cluster from complete outage hangs and fails to recover
Submitted: 13 Mar 2020 20:48 Modified: 24 Jul 2020 15:51
Reporter: Bradley Pearce Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.19 OS:Windows
Assigned to: CPU Architecture:Any

[13 Mar 2020 20:48] Bradley Pearce
Description:
Using the MySQL community installer, I installed an InnoDB cluster of 3 nodes. When a single server fails, or even 2 of the 3, whether master or slave, the cluster continues to operate normally and recovers when the nodes are revived.

However, when testing a complete outage of all 3 nodes and then attempting to revive the cluster from MySQL Shell using dba.rebootClusterFromCompleteOutage(), the command hangs in Shell with a note saying it is stopping active GR. This seems to render the cluster unrecoverable, and the cluster must be reinstalled.

How to repeat:
- Take all servers in an InnoDB cluster down.

- Start all the servers, then run the dba.rebootClusterFromCompleteOutage() function.

- MySQL Shell hangs and fails to recover the cluster.

Suggested fix:
dba.rebootClusterFromCompleteOutage() should recover the cluster, or return an error if it fails to do so.
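For reference, the sequence that triggers the hang looks roughly like this in MySQL Shell (JS mode); the host name, port, and account are placeholders for my setup:

```
// Once all three servers are back up, connect to any former member
// (hostname/port/account are illustrative)
shell.connect('clusterAdmin@node1:3306')

// Attempt to bring the cluster back after the full outage.
// This is where the Shell hangs with
// "NOTE: Cancelling active GR auto-initialization at <host>:<port>"
var cluster = dba.rebootClusterFromCompleteOutage()

// Expected (but not observed): the cluster comes back ONLINE
cluster.status()
```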
[23 Mar 2020 20:47] MySQL Verification Team
Hi

Thanks for the report. I could not reproduce this on Linux. Let me try to reproduce this on Windows.

Bogdan
[23 Mar 2020 21:07] Bradley Pearce
Thanks for your response. Unfortunately, I haven't verified whether this happens on Linux, as we don't have a Linux environment to use for this purpose.

If it does occur on Windows, could this be a bug in the Windows version of 8.0.19?
[24 Mar 2020 0:01] MySQL Verification Team
Hi,

Let's see if I can reproduce it on Windows and we'll go from there. Just as you don't have Linux handy, I don't really use Windows, so I need to set up a system just for this purpose :)

all best
Bogdan
[1 Apr 2020 10:12] Peter Johansson
I have the same problem running sandbox 8.0.19 on windows 10.

After running dba.rebootClusterFromCompleteOutage(), the shell hangs on "NOTE: Cancelling active GR auto-initialization at HP-computer:3310".

Peter
[2 Apr 2020 8:56] Peter Johansson
I waited for the cluster to start, and after about 30 minutes it finally did.
Error log attached.
[2 Apr 2020 16:53] MySQL Verification Team
Hi Bradley, Peter,

I'm not reproducing this on Windows 10.

Peter, I don't see much useful info in the log, except that it did start.

As for the connection issues it might have, I can't say; they could be Windows-related (antivirus, firewall, ...), but I can't reproduce this on my Win10 system.

all best
Bogdan
[3 Apr 2020 4:28] MySQL Verification Team
Hi Bradley, Peter,

A colleague of mine actually managed to reproduce this, we are working on a fix.

Thanks
Bogdan
[8 Jun 2020 6:30] vijayakumar kommula
Hi ,

We are using Windows 2019 with the 8.0.20 commercial edition and encounter the same error. Do you have any alternative solution, or a way to minimize the errors?

Regards,
Vijay
[17 Jun 2020 19:02] Omer Niah
Hey guys, 

I can still see this on version 8.0.20 on CentOS 8.0. It happens if you don't stop MySQL properly and shut down all nodes.

regards
omer
[24 Jul 2020 15:51] Bradley Pearce
Hello, 

Has this been fixed in 8.0.20 or 8.0.21? 

Thanks for your help 

Kind regards,

Brad
[13 Aug 2020 16:03] Robert Azzopardi
Hi,

we have the same situation: when recovering from an outage using dba.rebootClusterFromCompleteOutage(), it states

NOTE: Cancelling active GR auto-initialization at xxx.xxx.xxx.xxx:3306

After 30 minutes it completes and the cluster is successfully rebooted. This issue appeared as soon as we upgraded the cluster from 8.0.18 to 8.0.21.

We have replicated this in various Windows environments, and it always produces the same behavior.

Is there a fix for this issue?

Thanks

Robert
[11 Sep 2020 15:28] lionel mazeyrat
We upgraded from 8.0.20 to 8.0.21, 3 nodes on Windows Server 2016, and we see the same behavior.

dba.rebootClusterFromCompleteOutage() takes 30 minutes.
[14 Sep 2020 20:08] Romain Brenet
Hello, we have the same issue with Windows 2019 and MySQL Server 8.0.21.

rebootClusterFromCompleteOutage() => 30 minutes

Thanks. 

Have a nice day
[7 Oct 2020 9:19] Frieder Mentele
Hello, we have the same issue with Windows 2019 and MySQL Server 8.0.21.
I installed 8.0.18 and tested: working like a charm.

8.0.21 rebootClusterFromCompleteOutage() => 30 minutes
8.0.18 rebootClusterFromCompleteOutage() => 1 minute
[11 Feb 2021 3:54] Keith Lammers
Just adding a note to mention that I am running into this issue with MySQL on Windows as well. MySQL Server and Shell are both 8.0.23 on all 3 cluster instances.
[7 May 2021 13:08] Eduardo Ortega
Affects me on MySQL 8.0.23 for Linux
[7 May 2021 18:35] Andrew Garner
This also affects me using MySQL 8.0.24 on Linux.
[9 Dec 2021 10:45] Florian Apolloner
We are also seeing this on 8.0.26; we are testing in docker-compose with this setup:

```
version: "3"

services:
  mysql1:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=1
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data1:/var/lib/mysql:rw,z
    hostname: mysql1

  mysql2:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=2
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data2:/var/lib/mysql:rw,z
    hostname: mysql2

  mysql3:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=3
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data3:/var/lib/mysql:rw,z
    hostname: mysql3 
```

and this my.cnf:

```
[mysqld]

skip-host-cache
skip-name-resolve
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
secure-file-priv=/var/lib/mysql-files
user=mysql

pid-file=/var/run/mysqld/mysqld.pid

binlog_transaction_dependency_tracking=WRITESET
replica_preserve_commit_order=ON
replica_parallel_type=LOGICAL_CLOCK
enforce_gtid_consistency=ON
gtid_mode=ON

#plugin_load = group_replication.so
#group_replication_autorejoin_tries=0
#group_replication_components_stop_timeout=2
#group_replication_communication_debug_options=GCS_DEBUG_ALL
```
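The commented-out options above point at the GR auto-start behavior that the Shell reports it is cancelling. A possible mitigation, which is an assumption on my side and not a confirmed fix, is to keep Group Replication from auto-starting and auto-rejoining on boot, so that dba.rebootClusterFromCompleteOutage() does not have to cancel an in-progress auto-initialization:

```
[mysqld]
# Don't auto-start Group Replication when mysqld boots;
# let MySQL Shell bring the group up explicitly instead.
group_replication_start_on_boot=OFF

# Don't retry auto-rejoin after an expel/outage
# (the default changed from 0 to 3 in 8.0.21).
group_replication_autorejoin_tries=0
```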
[5 Sep 2022 12:41] ASP Serveur
Hello,

I have the same problem running MySQL 8.0.20 on Debian 10.

After running dba.rebootClusterFromCompleteOutage(), the shell hangs on "NOTE: Cancelling active GR auto-initialization at mysql_node1:3310".
I waited more than 45 minutes, but nothing happened!

Do you have any news on this subject?

Thanks. 

Have a nice day
[9 Nov 2022 5:14] Khanh Van Chu
I can still see this on version 8.0.31 on CentOS 9.0.
I set up 3 new instances; createCluster and addInstance work fine. Then I restart all 3 instances => MySQL Shell hangs on: NOTE: Cancelling active GR auto-initialization at mysql-node1:3306

Regards
[14 Jan 2023 1:11] Aray Chou
I suffer from the same problem.

And I found another clue: after rebooting all the Linux servers, mysqld uses 100% of one CPU core.

Tasks: 113 total,   2 running, 111 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.9 us, 23.2 sy,  0.0 ni, 48.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  4907100 total,  3414200 free,  1141520 used,   351380 buff/cache
KiB Swap:  2097148 total,  2093556 free,     3592 used.  3692584 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9968 polkitd   20   0 3995296 721876  24552 S 101.7 14.7  23:03.92 mysqld
    1 root      20   0  125764   3620   1908 S   0.0  0.1   0:01.27 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd

[root@centos-61 ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@centos-61 ~]# uname -a
Linux centos-61 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

MySQL version: 8.0.31

I use the following commands to start the Docker container running MySQL.

mkdir -p /opt/mysql/data
chown -R 999:999 /opt/mysql
chmod 700 /opt/mysql

docker run -d --name mysql \
    --cpus 2 \
    --memory 1.5GB \
    --network host \
    --restart unless-stopped \
    -v /opt/mysql/data:/var/lib/mysql \
    -e MYSQL_ROOT_PASSWORD=somePassword \
    --security-opt seccomp=unconfined \
    mysql:8 \
    --innodb-dedicated-server=ON \
    --group-replication-consistency=AFTER

docker exec -it mysql mysql -p

CREATE USER 'cluster_root'@'%' IDENTIFIED BY 'somePassword';
GRANT ALL PRIVILEGES ON *.* TO 'cluster_root'@'%' WITH GRANT OPTION; 
show global variables like 'innodb_dedicated_server';
show global variables like 'group_replication_consistency';

mysqlsh

dba.configureInstance('cluster_root@debian-101:3306')
dba.configureInstance('cluster_root@debian-102:3306')
dba.configureInstance('cluster_root@debian-103:3306')

docker restart mysql

mysqlsh

shell.connect('cluster_root@debian-101:3306')
cluster = dba.createCluster('my_innodb_cluster');
cluster.addInstance('debian-102:3306')
cluster.addInstance('debian-103:3306')
[7 Sep 2023 14:21] Christos Vlachos
I still encounter this problem (stuck at "NOTE: Cancelling active GR auto-initialization at nodename:port"), running MySQL Ver 8.0.30 for Linux on x86_64 (MySQL Community Server - GPL) on CentOS 7.

But I have found a solution!

This problem occurs for me when the MySQL service is running on all nodes (after a restart of all of them) and I run dba.rebootClusterFromCompleteOutage() on the last active node in order to bring the cluster up.

But!

If I stop the services on the other 2 nodes and reboot the cluster only on the last primary node (keeping the metadata of the other 2 nodes), the cluster reboots successfully. After that, I start the services on the other 2 nodes, they rejoin the cluster immediately, and voilà!

I hope this approach will help you too.
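For anyone wanting to try it, the workaround above boils down to something like the following; the service name, host names, and account are placeholders for my setup:

```
# On the two secondaries only: stop MySQL so just the last primary is up
systemctl stop mysqld          # run on node2 and node3

# On the last primary, reboot the cluster from MySQL Shell (JS mode)
mysqlsh
shell.connect('clusterAdmin@node1:3306')
var cluster = dba.rebootClusterFromCompleteOutage()

# Then start MySQL on the other two nodes; they rejoin by themselves
systemctl start mysqld         # run on node2 and node3
cluster.status()
```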