Bug #108339 dba.rebootClusterFromCompleteOutage() hanging
Submitted: 30 Aug 2022 18:16
Modified: 25 Jan 2023 16:11
Reporter: Jay Janssen
Status: Closed
Severity: S2 (Serious)
Category: Shell General / Core Client
Version: 8.0.30
OS: Any
CPU Architecture: Any

[30 Aug 2022 18:16] Jay Janssen
Description:
I have a 3-node cluster whose nodes all experienced an OOM kill. All 3 nodes restarted successfully.

I connect with MySQL Shell to the node that was the primary and try to run dba.rebootClusterFromCompleteOutage(), but it hangs.

2022-08-30T18:04:43.035625Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.30'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - GPL.
2022-08-30T18:04:43.038508Z 11 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.

<dba.rebootClusterFromCompleteOutage() run here>

2022-08-30T18:05:43.044820Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2022-08-30T18:05:43.044918Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'

So I kill that node and try to recover from another, but I get the same result. I don't know at this point whether skipping the original primary node from the start would have helped.

How to repeat:
This is reproducible in my environment, at least with OOM kills.  I haven't experimented with other scenarios to know exactly how to reproduce it.
[30 Aug 2022 19:57] Alfredo Kojima
Could you start the shell with --dba-log-sql=2 --verbose=2, try reboot again and paste the last few lines before the hang?
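
For example, with placeholder connection parameters:

mysqlsh --dba-log-sql=2 --verbose=2 root@localhost:3306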
[30 Aug 2022 19:58] Alfredo Kojima
Also, what was the last thing the shell printed before the hang?
[30 Aug 2022 20:08] Jay Janssen
verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT(@gtidset_b, @gtidset_a)
verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT('aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-148:1000139-1000147,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-8', 'aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-3735668,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-18') = ''
verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('persisted_globals_load')
Rejoining instance '10.162.254.151:3306' to cluster 'jaytest-staging-002-use1'...

verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: SELECT plugin_status FROM information_schema.plugins WHERE plugin_name = 'group_replication'
verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('super_read_only')
NOTE: Cancelling active GR auto-initialization at 10.162.254.151:3306
verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: STOP GROUP_REPLICATION

This seems to be where it is hanging.

Here's the end of the log on the 10.162.254.151 instance, if that helps, though it doesn't have any entries at or after the last shell line:

[root@ip-10-162-254-151 ~]# tail -f /var/log/mysqld.log
2022-08-30T20:03:16.724119Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-08-30T20:03:16.724178Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
2022-08-30T20:03:16.735000Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-08-30T20:03:16.735980Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
2022-08-30T20:03:24.337927Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2022-08-30T20:03:24.338021Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
2022-08-30T20:03:29.784694Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-08-30T20:03:29.784748Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
2022-08-30T20:03:29.795698Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-08-30T20:03:29.796715Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
[30 Aug 2022 20:23] Alfredo Kojima
The hang while waiting for 

NOTE: Cancelling active GR auto-initialization at ...

is expected, because GR is doing its start_on_boot routine, trying to connect to other members, but none of them are ONLINE, so they all eventually time out. The cancellation via STOP GROUP_REPLICATION is meant to shorten that waiting time, but it's still a bit long. That said, I usually only see hangs of a minute or so; if you're seeing longer freezes, there might be something in your environment or test case interfering.
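
To check whether that auto-start routine is active on an instance, something like this from the shell's JS mode works (any open session object will do):

 MySQL  localhost:33060+ ssl  JS > session.runSql("SHOW GLOBAL VARIABLES LIKE 'group_replication_start_on_boot'")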

Also, there's a manualStartOnBoot option for createCluster() that you can set to true to disable start_on_boot, but whether that helps depends on your environment. For example, our k8s operator uses that and handles rejoins externally.
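
For illustration, that would look something like this when creating a cluster (the cluster name is a placeholder):

 MySQL  localhost:33060+ ssl  JS > dba.createCluster('mycluster', {manualStartOnBoot: true})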
[2 Sep 2022 10:58] Miguel Araujo
Hi,

Can you please check the process list of the instance that hung?
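
For example, from the shell's JS mode (SQL mode's SHOW PROCESSLIST\G works just as well):

 MySQL  localhost:33060+ ssl  JS > session.runSql("SHOW PROCESSLIST")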

I was able to reproduce the problem, although it took me several attempts, so it seems random.

Anyway, I saw in the process list that "STOP group_replication" was blocked:

         Id: 52
       User: root
       Host: localhost:51110
         db: NULL
    Command: Query
       Time: 447
      State: starting
       Info: STOP GROUP_REPLICATION

Because it's waiting on:

         Id: 59
       User: mysql_innodb_cluster_639490406
       Host: localhost:59618
         db: NULL
    Command: Group Replication Data Stream subscription
       Time: 591
      State: starting
       Info: NULL

Any attempt to kill "Group Replication Data Stream subscription" just results in a new process being spawned.

It's a deadlock in Group Replication.
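
The same blockage can also be seen via performance_schema; a sketch (not part of the original report):

 MySQL  localhost:33060+ ssl  JS > session.runSql("SELECT processlist_id, processlist_command, processlist_state, processlist_info FROM performance_schema.threads WHERE processlist_info LIKE 'STOP GROUP_REPLICATION%' OR processlist_command LIKE 'Group Replication%'")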
[2 Sep 2022 11:27] Jay Janssen
I no longer have the test cluster that created this issue, but it sounds like a good find on your side.  

If I end up reproducing the scenario, I will attempt to check as you did.
[2 Sep 2022 16:25] Jay Janssen
Possibly a similar issue. I had an OOM on two out of 3 of my cluster members. On the remaining member I used cluster.forceQuorumUsingPartitionOf(), which worked.

However, I then tried cluster.rejoinInstance() on one of the OOM'ed members, whose mysqld had since restarted:

 MySQL  localhost:33060+ ssl  JS > cluster.rejoinInstance("10.162.255.158:3306")
Validating instance configuration at 10.162.255.158:3306...

This instance reports its own address as 10.162.255.158:3306

Instance configuration is suitable.
Rejoining instance '10.162.255.158:3306' to cluster 'jaytest-staging-002-use1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_3683005888'@'%' already existed at instance '10.162.254.197:3306'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100%

The processlist on the 10.162.255.158 host is:

 MySQL  localhost:33060+ ssl  SQL > show processlist\G
*************************** 1. row ***************************
     Id: 6
   User: event_scheduler
   Host: localhost
     db: NULL
Command: Daemon
   Time: 1380
  State: Waiting on empty queue
   Info: NULL

... skipped some router connections....

*************************** 8. row ***************************
     Id: 1540
   User: config
   Host: 10.162.254.197:35194
     db: NULL
Command: Sleep
   Time: 295
  State:
   Info: NULL
*************************** 9. row ***************************
     Id: 1545
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 295
  State: waiting for handler commit
   Info: Group replication applier module
*************************** 10. row ***************************
     Id: 1564
   User: mysql_innodb_cluster_3683005888
   Host: 10.162.254.197:47640
     db: NULL
Command: Group Replication Data Stream subscription
   Time: 291
  State: Connection delegated to Group Replication
   Info: NULL
*************************** 11. row ***************************
     Id: 1589
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 281
  State: Replica has read all relay log; waiting for more updates
   Info: NULL
*************************** 12. row ***************************
     Id: 1590
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 281
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 13. row ***************************
     Id: 1591
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 281
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 14. row ***************************
     Id: 1592
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 281
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 15. row ***************************
     Id: 1593
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 281
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 16. row ***************************
     Id: 1594
   User: system user
   Host: connecting host
     db: NULL
Command: Connect
   Time: 281
  State: Waiting for source to send event
   Info: NULL
*************************** 17. row ***************************
     Id: 1595
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 274
  State: Replica has read all relay log; waiting for more updates
   Info: NULL
*************************** 18. row ***************************
     Id: 1596
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 285
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 19. row ***************************
     Id: 1597
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 284
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 20. row ***************************
     Id: 1598
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 2030
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 21. row ***************************
     Id: 1599
   User: system user
   Host:
     db: NULL
Command: Query
   Time: 306
  State: Waiting for an event from Coordinator
   Info: NULL
*************************** 22. row ***************************
     Id: 1785
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: init
   Info: PLUGIN: show processlist
22 rows in set (0.0003 sec)

and its log just shows this:

2022-09-02T16:23:10.583929Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:23:20.595157Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:23:30.606396Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:23:40.617479Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:23:50.628975Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:24:00.640109Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:24:10.651300Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:24:20.662620Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
[2 Sep 2022 16:33] Jay Janssen
And interestingly, I'm getting 'Too many connections' on the primary node that I recovered with forceQuorum:

MySQL Shell 8.0.30-g1

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
WARNING: Using a password on the command line interface can be insecure.
Creating a session to 'config@localhost'
MySQL Error 1040: Too many connections

Its log reads:

2022-09-02T16:17:59.857930Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2022-09-02T16:17:59.858621Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.162.255.158:3306'
2022-09-02T16:17:59.858678Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.254.197:3306 on view 16620457143435462:29.'
2022-09-02T16:18:05.569499Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:18:15.581613Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
2022-09-02T16:18:21.829228Z 16700 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-09-02T16:18:34.278971Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.255.158:3306, 10.162.254.197:3306 on view 16620457143435462:30.'
2022-09-02T16:27:19.511251Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '34: Error initializing session for connection: Too many connections'
2022-09-02T16:28:04.224906Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '36: Error initializing session for connection: Too many connections'
2022-09-02T16:28:49.009037Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '37: Error initializing session for connection: Too many connections'

2022-09-02T16:17:59.858678Z was about when I issued the forceQuorum. 

I stopped all my mysqlrouter instances to see if it was them causing the issue, but I still can't get into this instance.
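
(As an aside: had the server's admin interface been configured beforehand, e.g. with the following in my.cnf, an account holding the SERVICE_CONNECTION_ADMIN privilege could still connect on the default admin port 33062 even with max_connections exhausted:)

[mysqld]
admin-address=127.0.0.1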
[7 Sep 2022 12:51] Jay Janssen
Another update: I am seeing the same issue again, where the rebootCluster call hangs indefinitely. I indeed see similar processlist entries on the primary node:

*************************** 10. row ***************************
     Id: 127
   User: config
   Host: 127.0.0.1:60490
     db: NULL
Command: Query
   Time: 291
  State: starting
   Info: STOP GROUP_REPLICATION
*************************** 11. row ***************************
     Id: 137
   User: mysql_innodb_cluster_2448256127
   Host: 10.162.255.29:50432
     db: NULL
Command: Group Replication Data Stream subscription
   Time: 287
  State: starting
   Info: NULL
*************************** 12. row ***************************
     Id: 150
   User: mysql_innodb_cluster_87169661
   Host: 10.162.255.199:48958
     db: NULL
Command: Group Replication Data Stream subscription
   Time: 274
  State: starting
   Info: NULL
[7 Sep 2022 12:55] Jay Janssen
Are you aware of any kind of workaround? I'd prefer not to use manualStartOnBoot, and in any case that doesn't help on an existing cluster.

Not being able to recover clusters is pretty painful.
[14 Sep 2022 20:03] Jay Janssen
I can reproduce this on 8.0.30 simply by taking a 3-node cluster, running kill -9 on mysqld on all of the nodes, and issuing the rebootCluster call once they all restart.

I would swear I did not have this issue like this in 8.0.29.  If I get a chance to deploy an 8.0.29 cluster I will submit evidence.
[14 Sep 2022 20:03] Jay Janssen
Updating to S2; this is currently my most critical bug with InnoDB Cluster.
[20 Sep 2022 17:32] Jay Janssen
I found an older 8.0.29 cluster that I had been using prior to upgrading to 8.0.30. I did a kill -9 on all instances, executed dba.rebootClusterFromCompleteOutage(), and it came right back up, no problem.

[ec2-user@ip-10-162-0-229 bin]$ mysqlsh -u ods.cluster.config
MySQL Shell 8.0.29

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
Creating a session to 'ods.cluster.config@localhost'
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 1124 (X protocol)
Server version: 8.0.29 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  localhost:33060+ ssl  JS > dba.rebootClusterFromCompleteOutage()
Restoring the cluster 'jay-test2-east' from complete outage...

The instance '10.162.0.219:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y

The instance '10.162.0.248:3306' was part of the cluster configuration.
Would you like to rejoin it to the cluster? [y/N]: y

* Waiting for seed instance to become ONLINE...
10.162.0.229:3306 was restored.
Rejoining '10.162.0.219:3306' to the cluster.
Rejoining instance '10.162.0.219:3306' to cluster 'jay-test2-east'...

The instance '10.162.0.219:3306' was successfully rejoined to the cluster.

Rejoining '10.162.0.248:3306' to the cluster.
Rejoining instance '10.162.0.248:3306' to cluster 'jay-test2-east'...

The instance '10.162.0.248:3306' was successfully rejoined to the cluster.

The cluster was successfully rebooted.

<Cluster:jay-test2-east>

This is the mysqld log from the node I issued the reboot from:

2022-09-20T17:27:58.454120Z 1125 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2022-09-20T17:27:58.455146Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2022-09-20T17:27:58.455167Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2022-09-20T17:27:58.456720Z 1142 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2022-09-20T17:27:59.616928Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2022-09-20T17:28:00.617581Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306 on view 16636948786163307:1.'
2022-09-20T17:28:25.910616Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
2022-09-20T17:28:25.910772Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 10.162.0.229:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
2022-09-20T17:28:26.211494Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_disable_super_read_only_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "1" will be run.'
2022-09-20T17:28:26.211654Z 1150 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2022-09-20T17:28:26.211728Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_start_failover_channels_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "10" will be run.'
2022-09-20T17:28:26.211939Z 1200 [System] [MY-011510] [Repl] Plugin group_replication reported: 'This server is working as primary member.'
2022-09-20T17:28:28.727883Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:2.'
2022-09-20T17:28:30.737382Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.248:3306, 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:3.'
2022-09-20T17:28:31.836076Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.248:3306 was declared online within the replication group.'
2022-09-20T17:28:55.017376Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.219:3306 was declared online within the replication group.'

[ec2-user@ip-10-162-0-229 bin]$ cat /etc/my.cnf /etc/my.cnf.d/*
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock

log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

!includedir /etc/my.cnf.d

innodb-dedicated-server=on
skip-name-resolve
[mysqld]
report-host=10.162.0.229

Happy to provide more detail, but
[22 Sep 2022 15:46] Tiago Jorge
Posted by developer:
 
> I found an older 8.0.29 cluster that I had been using prior to upgrading to
> 8.0.30.  I did a kill -9 on all instances, executed
> dba.rebootClusterFromCompleteOutage and it came right back up no problem.

You are correct. From our findings, this only occurs if you use the MySQL communication stack, which became the default stack in InnoDB Cluster as of version 8.0.30 (https://dev.mysql.com/doc/refman/8.0/en/group-replication-connection-security.html).
[22 Sep 2022 15:50] Miguel Araujo
Posted by developer:
 
The documentation about the Communication Stack handling in the AdminAPI can be found at https://dev.mysql.com/doc/mysql-shell/8.0/en/shell-admin-api-communication-stack.html

To use the non-default 'XCOM' stack, use the 'communicationStack' option of dba.createCluster() and/or <ClusterSet>.createReplicaCluster().
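
For example (the cluster name here is a placeholder):

 MySQL  localhost:33060+ ssl  JS > dba.createCluster('mycluster', {communicationStack: 'XCOM'})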
[25 Jan 2023 16:11] Jon Stephens
Documented fix as follows in the MySQL 8.0.32 changelog:

    In a 3 node cluster, all nodes were killed due to running out of
    memory. Subsequently, after all nodes were restarted
    successfully, attempting to bring the cluster back online caused
    the node that had been the primary to hang.

Closed.