Bug #98948 Recovering Innodb cluster from complete outage hangs and fails to recover
Submitted: 13 Mar 2020 20:48 Modified: 24 Jul 2020 15:51
Reporter: Bradley Pearce Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.19 OS:Windows
Assigned to: CPU Architecture:Any

[13 Mar 2020 20:48] Bradley Pearce
Description:
Using the MySQL community installer, I installed an InnoDB cluster of 3 nodes. When a single server fails, or even 2 of the 3, whether master or slave, the cluster continues to operate normally and recovers when the nodes are revived.

However, when testing a complete outage of all 3 nodes and then attempting to revive the cluster from MySQL Shell using dba.rebootClusterFromCompleteOutage(), the command hangs in Shell with a note saying it is stopping active GR. This seems to render the cluster unrecoverable, and the cluster must be reinstalled.

How to repeat:
- Take all servers in an InnoDB cluster down.

- Start all the servers, then run the dba.rebootClusterFromCompleteOutage() function.

- MySQL Shell hangs and fails to recover the cluster.

Suggested fix:
dba.rebootClusterFromCompleteOutage() should recover the cluster, or return an error if it fails to do so.
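For reference, the sequence that triggers the hang looks roughly like this in MySQL Shell (JS mode); the host name, port, and account are placeholders for my setup:

```
// Once all three servers are back up, connect to any former member
// (hostname/port/account are illustrative)
shell.connect('clusterAdmin@node1:3306')

// Attempt to bring the cluster back after the full outage.
// This is where the Shell hangs with
// "NOTE: Cancelling active GR auto-initialization at <host>:<port>"
var cluster = dba.rebootClusterFromCompleteOutage()

// Expected (but not observed): the cluster comes back ONLINE
cluster.status()
```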
[23 Mar 2020 20:47] MySQL Verification Team
Hi

Thanks for the report. I could not reproduce this on Linux. Let me try to reproduce this on Windows.

Bogdan
[23 Mar 2020 21:07] Bradley Pearce
Thanks for your response. Unfortunately, I haven't verified whether this happens on Linux, as we don't have a Linux environment to use for this purpose.

If it does occur on Windows, could this be a bug in the Windows version of 8.0.19?
[24 Mar 2020 0:01] MySQL Verification Team
Hi,

Let's see if I can reproduce it on Windows and we'll go from there. Just as you don't have Linux handy, I don't really use Windows, so I need to set up a system just for this purpose :)

all best
Bogdan
[1 Apr 2020 10:12] Peter Johansson
I have the same problem running sandbox 8.0.19 on windows 10.

After running dba.rebootClusterFromCompleteOutage(), the shell hangs on "NOTE: Cancelling active GR auto-initialization at HP-computer:3310".

Peter
[2 Apr 2020 8:56] Peter Johansson
I waited for the cluster to start, and after about 30 minutes it finally did.
Error log attached.
[2 Apr 2020 16:53] MySQL Verification Team
Hi Bradley, Peter,

I'm not reproducing this on Windows 10.

Peter, I don't see much useful info in the log, except that it did start.

As for the connection issues it might have, I can't say; they could be Windows-related (antivirus, firewall, ...), but I can't reproduce this on my Win10 system.

all best
Bogdan
[3 Apr 2020 4:28] MySQL Verification Team
Hi Bradley, Peter,

A colleague of mine actually managed to reproduce this, we are working on a fix.

Thanks
Bogdan
[8 Jun 2020 6:30] vijayakumar kommula
Hi ,

We are using Windows 2019 with the 8.0.20 commercial edition and encounter the same error. Do you have any alternative solution, or a way to minimize the errors?

Regards,
Vijay
[17 Jun 2020 19:02] Omer Niah
Hey guys, 

I can still see this on version 8.0.20 on CentOS 8.0. It happens if you don't stop MySQL properly and shut down all nodes.

regards
omer
[24 Jul 2020 15:51] Bradley Pearce
Hello, 

Has this been fixed in 8.0.20 or 8.0.21? 

Thanks for your help 

Kind regards,

Brad
[13 Aug 2020 16:03] Robert Azzopardi
Hi,

we have the same situation: when recovering from an outage using dba.rebootClusterFromCompleteOutage(), it states

NOTE: Cancelling active GR auto-initialization at xxx.xxx.xxx.xxx:3306

After 30 minutes it completes and the cluster is successfully rebooted. This issue appeared as soon as we upgraded the cluster from 8.0.18 to 8.0.21.

We have replicated this in various Windows environments, and it always produces the same behavior.

Is there a fix for this issue?

Thanks

Robert
[11 Sep 2020 15:28] lionel mazeyrat
We upgraded from 8.0.20 to 8.0.21, 3 nodes on Windows Server 2016, and we see the same behavior.

dba.rebootClusterFromCompleteOutage() takes 30 minutes.
[14 Sep 2020 20:08] Romain Brenet
Hello, we have the same issue with Windows 2019 and MySQL Server 8.0.21.

rebootClusterFromCompleteOutage() => 30 minutes

Thanks. 

Have a nice day
[7 Oct 2020 9:19] Frieder Mentele
Hello, we have the same issue with Windows 2019 and MySQL Server 8.0.21.
I installed 8.0.18 and tested: working like a charm.

8.0.21 rebootClusterFromCompleteOutage() => 30 minutes
8.0.18 rebootClusterFromCompleteOutage() => 1 minute
[11 Feb 2021 3:54] Keith Lammers
Just adding a note to mention that I am running into this issue with MySQL on Windows as well. MySQL Server and Shell are both 8.0.23 on all 3 cluster instances.
[7 May 2021 13:08] Eduardo Ortega
Affects me on MySQL 8.0.23 for Linux
[7 May 2021 18:35] Andrew Garner
This also affects me using MySQL 8.0.24 on Linux.
[9 Dec 2021 10:45] Florian Apolloner
We are also seeing this on 8.0.26; we are testing in docker-compose with this setup:

```
version: "3"

services:
  mysql1:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=1
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data1:/var/lib/mysql:rw,z
    hostname: mysql1

  mysql2:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=2
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data2:/var/lib/mysql:rw,z
    hostname: mysql2

  mysql3:
    image: docker.io/mysql/mysql-server:8.0.26
    env_file: mysql.env
    stop_grace_period: 1m
    command: --server_id=3
    volumes:
      - ./my.cnf:/etc/my.cnf:ro,z
      - ./data3:/var/lib/mysql:rw,z
    hostname: mysql3 
```

and this my.cnf:

```
[mysqld]

skip-host-cache
skip-name-resolve
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
secure-file-priv=/var/lib/mysql-files
user=mysql

pid-file=/var/run/mysqld/mysqld.pid

binlog_transaction_dependency_tracking=WRITESET
replica_preserve_commit_order=ON
replica_parallel_type=LOGICAL_CLOCK
enforce_gtid_consistency=ON
gtid_mode=ON

#plugin_load = group_replication.so
#group_replication_autorejoin_tries=0
#group_replication_components_stop_timeout=2
#group_replication_communication_debug_options=GCS_DEBUG_ALL
```
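The commented-out options above point at the GR auto-start behavior that the Shell reports it is cancelling. A possible mitigation, which is an assumption on my side and not a confirmed fix, is to keep Group Replication from auto-starting and auto-rejoining on boot, so that dba.rebootClusterFromCompleteOutage() does not have to cancel an in-progress auto-initialization:

```
[mysqld]
# Don't auto-start Group Replication when mysqld boots;
# let MySQL Shell bring the group up explicitly instead.
group_replication_start_on_boot=OFF

# Don't retry auto-rejoin after an expel/outage
# (the default changed from 0 to 3 in 8.0.21).
group_replication_autorejoin_tries=0
```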
[5 Sep 2022 12:41] ASP Serveur
Hello,

I have the same problem running MySQL 8.0.20 on Debian 10.

After running dba.rebootClusterFromCompleteOutage(), the shell hangs on "NOTE: Cancelling active GR auto-initialization at mysql_node1:3310".
I waited more than 45 minutes, but nothing happened!

Do you have any news on this subject?

Thanks. 

Have a nice day
[9 Nov 2022 5:14] Khanh Van Chu
I can still see this on version 8.0.31 on CentOS 9.0.
I set up 3 new instances; createCluster and addInstance work fine. Then I restart all 3 instances => MySQL Shell hangs on: NOTE: Cancelling active GR auto-initialization at mysql-node1:3306

Regards
[14 Jan 2023 1:11] Aray Chou
I suffer from the same problem.

And I found another clue: after rebooting all the Linux servers, mysqld uses 100% of one CPU core.

Tasks: 113 total,   2 running, 111 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.9 us, 23.2 sy,  0.0 ni, 48.7 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  4907100 total,  3414200 free,  1141520 used,   351380 buff/cache
KiB Swap:  2097148 total,  2093556 free,     3592 used.  3692584 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9968 polkitd   20   0 3995296 721876  24552 S 101.7 14.7  23:03.92 mysqld
    1 root      20   0  125764   3620   1908 S   0.0  0.1   0:01.27 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd

[root@centos-61 ~]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
[root@centos-61 ~]# uname -a
Linux centos-61 3.10.0-1160.el7.x86_64 #1 SMP Mon Oct 19 16:18:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

MySQL version: 8.0.31

I use the following commands to start the Docker container running MySQL.

mkdir -p /opt/mysql/data
chown -R 999:999 /opt/mysql
chmod 700 /opt/mysql

docker run -d --name mysql \
    --cpus 2 \
    --memory 1.5GB \
    --network host \
    --restart unless-stopped \
    -v /opt/mysql/data:/var/lib/mysql \
    -e MYSQL_ROOT_PASSWORD=somePassword \
    --security-opt seccomp=unconfined \
    mysql:8 \
    --innodb-dedicated-server=ON \
    --group-replication-consistency=AFTER

docker exec -it mysql mysql -p

CREATE USER 'cluster_root'@'%' IDENTIFIED BY 'somePassword';
GRANT ALL PRIVILEGES ON *.* TO 'cluster_root'@'%' WITH GRANT OPTION; 
show global variables like 'innodb_dedicated_server';
show global variables like 'group_replication_consistency';

mysqlsh

dba.configureInstance('cluster_root@debian-101:3306')
dba.configureInstance('cluster_root@debian-102:3306')
dba.configureInstance('cluster_root@debian-103:3306')

docker restart mysql

mysqlsh

shell.connect('cluster_root@debian-101:3306')
cluster = dba.createCluster('my_innodb_cluster');
cluster.addInstance('debian-102:3306')
cluster.addInstance('debian-103:3306')
[7 Sep 2023 14:21] Christos Vlachos
I still encounter this problem (stuck at "NOTE: Cancelling active GR auto-initialization at nodename:port"), running MySQL Ver 8.0.30 for Linux on x86_64 (MySQL Community Server - GPL) on CentOS 7.

But I have found a solution!

This problem occurs for me when the MySQL service is running on all nodes (after a restart of all of them) and I run dba.rebootClusterFromCompleteOutage() on the last active node in order to bring the cluster up.

But!

If I stop the services on the other 2 nodes and reboot the cluster only on the last primary node (keeping the metadata of the other 2 nodes), the cluster reboots successfully. After that, I start the services on the other 2 nodes, they rejoin the cluster immediately, and voilà!

I hope this approach will help you too.
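For anyone wanting to try it, the workaround above boils down to something like the following; the service name, host names, and account are placeholders for my setup:

```
# On the two secondaries only: stop MySQL so just the last primary is up
systemctl stop mysqld          # run on node2 and node3

# On the last primary, reboot the cluster from MySQL Shell (JS mode)
mysqlsh
shell.connect('clusterAdmin@node1:3306')
var cluster = dba.rebootClusterFromCompleteOutage()

# Then start MySQL on the other two nodes; they rejoin by themselves
systemctl start mysqld         # run on node2 and node3
cluster.status()
```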