Bug #108339 | dba.rebootClusterFromCompleteOutage() hanging | |
---|---|---|---
Submitted: | 30 Aug 2022 18:16 | Modified: | 25 Jan 2023 16:11
Reporter: | Jay Janssen | Email Updates: |
Status: | Closed | Impact on me: |
Category: | Shell General / Core Client | Severity: | S2 (Serious)
Version: | 8.0.30 | OS: | Any
Assigned to: | | CPU Architecture: | Any
[30 Aug 2022 18:16]
Jay Janssen
[30 Aug 2022 19:57]
Alfredo Kojima
Could you start the shell with --dba-log-sql=2 --verbose=2, try the reboot again, and paste the last few lines printed before the hang?
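For reference, the same logging can also be turned on from inside an already running JS session; a minimal sketch (the shell option names mirror the command-line flags):

```js
// Raise AdminAPI verbosity and SQL logging, then re-run the reboot and watch
// the verbose output for the last SQL statement issued before the hang.
shell.options['verbose'] = 2;
shell.options['dba.logSql'] = 2;

dba.rebootClusterFromCompleteOutage();
```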
[30 Aug 2022 19:58]
Alfredo Kojima
Also, what was the last thing the shell printed before the hang?
[30 Aug 2022 20:08]
Jay Janssen
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT(@gtidset_b, @gtidset_a)
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT('aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-148:1000139-1000147,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-8', 'aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-3735668,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-18') = ''
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('persisted_globals_load')
    Rejoining instance '10.162.254.151:3306' to cluster 'jaytest-staging-002-use1'...
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: SELECT plugin_status FROM information_schema.plugins WHERE plugin_name = 'group_replication'
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('super_read_only')
    NOTE: Cancelling active GR auto-initialization at 10.162.254.151:3306
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: STOP GROUP_REPLICATION

This seems to be where it is hanging. Here's the end of the log on the 10.162.254.151 instance if that helps, though it doesn't have any entries at or after the last shell line:

    [root@ip-10-162-254-151 ~]# tail -f /var/log/mysqld.log
    2022-08-30T20:03:16.724119Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:16.724178Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:16.735000Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:16.735980Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:24.337927Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
    2022-08-30T20:03:24.338021Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
    2022-08-30T20:03:29.784694Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:29.784748Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:29.795698Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:29.796715Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
[30 Aug 2022 20:23]
Alfredo Kojima
The wait at "NOTE: Cancelling active GR auto-initialization at ..." is expected: GR is running its start_on_boot routine and trying to connect to the other members, but none of them are ONLINE, so the attempts all eventually time out. The cancellation via STOP GROUP_REPLICATION is meant to shorten that wait, but it can still take a while. That said, I usually only see hangs of a minute or so; if you're seeing longer freezes, something in your environment or test case may be interfering. Also, there's a manualStartOnBoot option for createCluster() that you can set to true to disable start_on_boot, but whether that helps depends on your environment. For example, our k8s operator uses it and handles rejoins externally.
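For illustration, a minimal sketch of that option (the cluster name is a placeholder; whether this fits depends on how rejoins after a restart are driven in your environment):

```js
// With manualStartOnBoot, group_replication_start_on_boot is left OFF on the
// members, so rejoins after a server restart must be handled externally
// (e.g. by an operator) instead of by GR auto-initialization.
var cluster = dba.createCluster('mycluster', {manualStartOnBoot: true});
```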
[2 Sep 2022 10:58]
Miguel Araujo
Hi,

Can you please check the process list of the instance that hung? I was able to reproduce the problem, although it took me several attempts, so it seems random. Anyway, I saw in the process list that "STOP GROUP_REPLICATION" was blocked:

    Id: 52  User: root  Host: localhost:51110  db: NULL  Command: Query  Time: 447  State: starting  Info: STOP GROUP_REPLICATION

because it's waiting on:

    Id: 59  User: mysql_innodb_cluster_639490406  Host: localhost:59618  db: NULL  Command: Group Replication Data Stream subscription  Time: 591  State: starting  Info: NULL

Any attempt to kill the "Group Replication Data Stream subscription" thread just results in a new process being spawned. It's a deadlock in Group Replication.
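For reference, a minimal sketch of how those two threads can be located from mysqlsh (assumes an open session to the hung instance and performance_schema.processlist, available in 8.0.22+):

```js
// Look for the blocked STOP GROUP_REPLICATION statement and the GR
// "Data Stream subscription" thread it is waiting on.
var res = session.runSql(
    "SELECT ID, USER, HOST, COMMAND, TIME, STATE, INFO" +
    "  FROM performance_schema.processlist" +
    " WHERE INFO LIKE 'STOP GROUP_REPLICATION%'" +
    "    OR COMMAND = 'Group Replication Data Stream subscription'");
shell.dumpRows(res);
```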
[2 Sep 2022 11:27]
Jay Janssen
I no longer have the test cluster that created this issue, but it sounds like a good find on your side. If I end up reproducing the scenario, I will attempt to check as you did.
[2 Sep 2022 16:25]
Jay Janssen
Possibly a similar issue. I had an OOM on two out of 3 of my cluster members. On the remaining member I used cluster.forceQuorumUsingPartitionOf(), which worked. However, I then tried rejoinInstance() on one of the OOM'ed members that had since restarted mysqld:

    MySQL localhost:33060+ ssl JS > cluster.rejoinInstance("10.162.255.158:3306")
    Validating instance configuration at 10.162.255.158:3306...
    This instance reports its own address as 10.162.255.158:3306
    Instance configuration is suitable.
    Rejoining instance '10.162.255.158:3306' to cluster 'jaytest-staging-002-use1'...
    Re-creating recovery account...
    NOTE: User 'mysql_innodb_cluster_3683005888'@'%' already existed at instance '10.162.254.197:3306'. It will be deleted and created again with a new password.
    * Waiting for the Cluster to synchronize with the PRIMARY Cluster...
    ** Transactions replicated ############################################################ 100%

The processlist on the 10.162.255.158 host is:

    MySQL localhost:33060+ ssl SQL > show processlist\G
    *************************** 1. row ***************************
    Id: 6  User: event_scheduler  Host: localhost  db: NULL  Command: Daemon  Time: 1380  State: Waiting on empty queue  Info: NULL
    ... skipped some router connections....
    *************************** 8. row ***************************
    Id: 1540  User: config  Host: 10.162.254.197:35194  db: NULL  Command: Sleep  Time: 295  State:  Info: NULL
    *************************** 9. row ***************************
    Id: 1545  User: system user  Host:  db: NULL  Command: Connect  Time: 295  State: waiting for handler commit  Info: Group replication applier module
    *************************** 10. row ***************************
    Id: 1564  User: mysql_innodb_cluster_3683005888  Host: 10.162.254.197:47640  db: NULL  Command: Group Replication Data Stream subscription  Time: 291  State: Connection delegated to Group Replication  Info: NULL
    *************************** 11. row ***************************
    Id: 1589  User: system user  Host:  db: NULL  Command: Query  Time: 281  State: Replica has read all relay log; waiting for more updates  Info: NULL
    *************************** 12. row ***************************
    Id: 1590  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 13. row ***************************
    Id: 1591  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 14. row ***************************
    Id: 1592  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 15. row ***************************
    Id: 1593  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 16. row ***************************
    Id: 1594  User: system user  Host: connecting host  db: NULL  Command: Connect  Time: 281  State: Waiting for source to send event  Info: NULL
    *************************** 17. row ***************************
    Id: 1595  User: system user  Host:  db: NULL  Command: Query  Time: 274  State: Replica has read all relay log; waiting for more updates  Info: NULL
    *************************** 18. row ***************************
    Id: 1596  User: system user  Host:  db: NULL  Command: Query  Time: 285  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 19. row ***************************
    Id: 1597  User: system user  Host:  db: NULL  Command: Query  Time: 284  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 20. row ***************************
    Id: 1598  User: system user  Host:  db: NULL  Command: Query  Time: 2030  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 21. row ***************************
    Id: 1599  User: system user  Host:  db: NULL  Command: Query  Time: 306  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 22. row ***************************
    Id: 1785  User: root  Host: localhost  db: NULL  Command: Query  Time: 0  State: init  Info: PLUGIN: show processlist
    22 rows in set (0.0003 sec)

and its log just shows this:

    2022-09-02T16:23:10.583929Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:20.595157Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:30.606396Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:40.617479Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:50.628975Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:00.640109Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:10.651300Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:20.662620Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
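For reference, a minimal sketch of the recovery sequence described above (the addresses and the admin account are the ones from this report and would differ elsewhere):

```js
// Restore quorum using the single surviving member, then try to bring the
// restarted member back into the group (the step that stalled here).
var cluster = dba.getCluster();
cluster.forceQuorumUsingPartitionOf('config@10.162.254.197:3306');
cluster.rejoinInstance('10.162.255.158:3306');
```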
[2 Sep 2022 16:33]
Jay Janssen
And interestingly, I'm getting "Too many connections" on the primary node that I recovered with forceQuorum:

    MySQL Shell 8.0.30-g1
    Copyright (c) 2016, 2022, Oracle and/or its affiliates.
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
    Other names may be trademarks of their respective owners.
    Type '\help' or '\?' for help; '\quit' to exit.
    WARNING: Using a password on the command line interface can be insecure.
    Creating a session to 'config@localhost'
    MySQL Error 1040: Too many connections

Its log reads:

    2022-09-02T16:17:59.857930Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
    2022-09-02T16:17:59.858621Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.162.255.158:3306'
    2022-09-02T16:17:59.858678Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.254.197:3306 on view 16620457143435462:29.'
    2022-09-02T16:18:05.569499Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:18:15.581613Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:18:21.829228Z 16700 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
    2022-09-02T16:18:34.278971Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.255.158:3306, 10.162.254.197:3306 on view 16620457143435462:30.'
    2022-09-02T16:27:19.511251Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '34: Error initializing session for connection: Too many connections'
    2022-09-02T16:28:04.224906Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '36: Error initializing session for connection: Too many connections'
    2022-09-02T16:28:49.009037Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '37: Error initializing session for connection: Too many connections'

2022-09-02T16:17:59.858678Z was about when I issued the forceQuorum. I stopped all my mysqlrouter instances to see if they were causing the issue, but I still can't get into this instance.
[7 Sep 2022 12:51]
Jay Janssen
Another update: I am seeing the same issue again, where the rebootCluster call hangs indefinitely. I indeed see similar processlist entries on the primary node:

    *************************** 10. row ***************************
    Id: 127  User: config  Host: 127.0.0.1:60490  db: NULL  Command: Query  Time: 291  State: starting  Info: STOP GROUP_REPLICATION
    *************************** 11. row ***************************
    Id: 137  User: mysql_innodb_cluster_2448256127  Host: 10.162.255.29:50432  db: NULL  Command: Group Replication Data Stream subscription  Time: 287  State: starting  Info: NULL
    *************************** 12. row ***************************
    Id: 150  User: mysql_innodb_cluster_87169661  Host: 10.162.255.199:48958  db: NULL  Command: Group Replication Data Stream subscription  Time: 274  State: starting  Info: NULL
[7 Sep 2022 12:55]
Jay Janssen
Are you aware of any kind of workaround? I'd prefer not to use manualStartOnBoot, and it doesn't help on an existing cluster anyway. Not being able to recover clusters is pretty painful.
[14 Sep 2022 20:03]
Jay Janssen
I can reproduce this on 8.0.30 by simply having 3 nodes in a cluster, doing a kill -9 on mysqld on all of them, and issuing the rebootCluster once they all restart. I would swear I did not have this issue in 8.0.29. If I get a chance to deploy an 8.0.29 cluster, I will submit evidence.
[14 Sep 2022 20:03]
Jay Janssen
Updating to S2; this is currently my most critical bug with InnoDB Cluster.
[20 Sep 2022 17:32]
Jay Janssen
I found an older 8.0.29 cluster that I had been using prior to upgrading to 8.0.30. I did a kill -9 on all instances, executed dba.rebootClusterFromCompleteOutage(), and it came right back up, no problem.

    [ec2-user@ip-10-162-0-229 bin]$ mysqlsh -u ods.cluster.config
    MySQL Shell 8.0.29
    Copyright (c) 2016, 2022, Oracle and/or its affiliates.
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
    Other names may be trademarks of their respective owners.
    Type '\help' or '\?' for help; '\quit' to exit.
    Creating a session to 'ods.cluster.config@localhost'
    Fetching schema names for autocompletion... Press ^C to stop.
    Your MySQL connection id is 1124 (X protocol)
    Server version: 8.0.29 MySQL Community Server - GPL
    No default schema selected; type \use <schema> to set one.
    MySQL localhost:33060+ ssl JS > dba.rebootClusterFromCompleteOutage()
    Restoring the cluster 'jay-test2-east' from complete outage...
    The instance '10.162.0.219:3306' was part of the cluster configuration.
    Would you like to rejoin it to the cluster? [y/N]: y
    The instance '10.162.0.248:3306' was part of the cluster configuration.
    Would you like to rejoin it to the cluster? [y/N]: y
    * Waiting for seed instance to become ONLINE...
    10.162.0.229:3306 was restored.
    Rejoining '10.162.0.219:3306' to the cluster.
    Rejoining instance '10.162.0.219:3306' to cluster 'jay-test2-east'...
    The instance '10.162.0.219:3306' was successfully rejoined to the cluster.
    Rejoining '10.162.0.248:3306' to the cluster.
    Rejoining instance '10.162.0.248:3306' to cluster 'jay-test2-east'...
    The instance '10.162.0.248:3306' was successfully rejoined to the cluster.
    The cluster was successfully rebooted.
    <Cluster:jay-test2-east>

This is the mysqld log from the node I issued the reboot from:

    2022-09-20T17:27:58.454120Z 1125 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
    2022-09-20T17:27:58.455146Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
    2022-09-20T17:27:58.455167Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
    2022-09-20T17:27:58.456720Z 1142 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
    2022-09-20T17:27:59.616928Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
    2022-09-20T17:28:00.617581Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306 on view 16636948786163307:1.'
    2022-09-20T17:28:25.910616Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
    2022-09-20T17:28:25.910772Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 10.162.0.229:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
    2022-09-20T17:28:26.211494Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_disable_super_read_only_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "1" will be run.'
    2022-09-20T17:28:26.211654Z 1150 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
    2022-09-20T17:28:26.211728Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_start_failover_channels_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "10" will be run.'
    2022-09-20T17:28:26.211939Z 1200 [System] [MY-011510] [Repl] Plugin group_replication reported: 'This server is working as primary member.'
    2022-09-20T17:28:28.727883Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:2.'
    2022-09-20T17:28:30.737382Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.248:3306, 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:3.'
    2022-09-20T17:28:31.836076Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.248:3306 was declared online within the replication group.'
    2022-09-20T17:28:55.017376Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.219:3306 was declared online within the replication group.'

    [ec2-user@ip-10-162-0-229 bin]$ cat /etc/my.cnf /etc/my.cnf.d/*
    [mysqld]
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    log-error=/var/log/mysqld.log
    pid-file=/var/run/mysqld/mysqld.pid
    !includedir /etc/my.cnf.d
    innodb-dedicated-server=on
    skip-name-resolve
    [mysqld]
    report-host=10.162.0.229

Happy to provide more detail, but
[22 Sep 2022 15:46]
Tiago Jorge
Posted by developer:

> I found an older 8.0.29 cluster that I had been using prior to upgrading to
> 8.0.30. I did a kill -9 on all instances, executed
> dba.rebootClusterFromCompleteOutage and it came right back up no problem.

You are correct. From our findings, this only occurs when the MySQL communication stack is used, which became the default stack for InnoDB Cluster in 8.0.30 (https://dev.mysql.com/doc/refman/8.0/en/group-replication-connection-security.html).
[22 Sep 2022 15:50]
Miguel Araujo
Posted by developer:

The documentation about the communication stack handling in the AdminAPI can be found at https://dev.mysql.com/doc/mysql-shell/8.0/en/shell-admin-api-communication-stack.html

To use the non-default 'XCOM' stack, use the 'communicationStack' option at dba.createCluster() and/or <ClusterSet>.createReplicaCluster().
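For illustration, a minimal sketch of that option (the cluster name is a placeholder):

```js
// Create the cluster on the legacy XCOM communication stack instead of the
// MYSQL stack that is the default as of 8.0.30.
var cluster = dba.createCluster('mycluster', {communicationStack: 'XCOM'});
```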
[25 Jan 2023 16:11]
Jon Stephens
Documented fix as follows in the MySQL 8.0.32 changelog:

> In a 3 node cluster, all nodes were killed due to running out of memory. Subsequently, after all nodes were restarted successfully, attempting to bring the cluster back online caused the node that had been the primary to hang.

Closed.