Bug #108339 | dba.rebootClusterFromCompleteOutage() hanging | |
---|---|---|---
Submitted: | 30 Aug 2022 18:16 | Modified: | 25 Jan 2023 16:11
Reporter: | Jay Janssen | Email Updates: |
Status: | Closed | Impact on me: |
Category: | Shell General / Core Client | Severity: | S2 (Serious)
Version: | 8.0.30 | OS: | Any
Assigned to: | | CPU Architecture: | Any
[30 Aug 2022 18:16]
Jay Janssen
[30 Aug 2022 19:57]
Alfredo Kojima
Could you start the shell with --dba-log-sql=2 --verbose=2, try the reboot again, and paste the last few lines printed before the hang?
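For reference, the same logging can also be turned on from inside an already running JS session; a minimal sketch (the shell option names mirror the command-line flags):

```js
// Raise AdminAPI verbosity and SQL logging, then re-run the reboot and watch
// the verbose output for the last SQL statement issued before the hang.
shell.options['verbose'] = 2;
shell.options['dba.logSql'] = 2;

dba.rebootClusterFromCompleteOutage();
```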
[30 Aug 2022 19:58]
Alfredo Kojima
Also, what was the last thing the shell printed before the hang?
[30 Aug 2022 20:08]
Jay Janssen
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT(@gtidset_b, @gtidset_a)
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=366: SQL: SELECT GTID_SUBTRACT('aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-148:1000139-1000147,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-8', 'aa924afc-2863-11ed-bbe9-0af98151b5c9:1-4,\ncdc0e103-2863-11ed-af6a-0af98151b5c9:1-3735668,\ncdc0e58f-2863-11ed-af6a-0af98151b5c9:1-18') = ''
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('persisted_globals_load')
    Rejoining instance '10.162.254.151:3306' to cluster 'jaytest-staging-002-use1'...
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: SELECT plugin_status FROM information_schema.plugins WHERE plugin_name = 'group_replication'
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: show GLOBAL variables where `variable_name` in ('super_read_only')
    NOTE: Cancelling active GR auto-initialization at 10.162.254.151:3306
    verbose: 2022-08-30T20:06:17Z: Dba.rebootClusterFromCompleteOutage: tid=166: SQL: STOP GROUP_REPLICATION

This seems to be where it is hanging. Here's the end of the log on the 10.162.254.151 instance if that helps, though it doesn't have any entries at or after the last shell line:

    [root@ip-10-162-254-151 ~]# tail -f /var/log/mysqld.log
    2022-08-30T20:03:16.724119Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:16.724178Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:16.735000Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:16.735980Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:24.337927Z 2 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
    2022-08-30T20:03:24.338021Z 2 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
    2022-08-30T20:03:29.784694Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:29.784748Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.255.203:3306 when joining a group. My local port is: 3306.'
    2022-08-30T20:03:29.795698Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-08-30T20:03:29.796715Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.162.254.40:3306 when joining a group. My local port is: 3306.'
[30 Aug 2022 20:23]
Alfredo Kojima
The wait at "NOTE: Cancelling active GR auto-initialization at ..." is expected: GR is running its start_on_boot routine and trying to connect to the other members, but none of them are ONLINE, so the attempts all eventually time out. The cancellation via STOP GROUP_REPLICATION is meant to shorten that wait, but it can still take a while. That said, I usually only see hangs of a minute or so; if you're seeing longer freezes, something in your environment or test case may be interfering. Also, there's a manualStartOnBoot option for createCluster() that you can set to true to disable start_on_boot, but whether that helps depends on your environment. For example, our k8s operator uses it and handles rejoins externally.
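For illustration, a minimal sketch of that option (the cluster name is a placeholder; whether this fits depends on how rejoins after a restart are driven in your environment):

```js
// With manualStartOnBoot, group_replication_start_on_boot is left OFF on the
// members, so rejoins after a server restart must be handled externally
// (e.g. by an operator) instead of by GR auto-initialization.
var cluster = dba.createCluster('mycluster', {manualStartOnBoot: true});
```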
[2 Sep 2022 10:58]
Miguel Araujo
Hi,

Can you please check the process list of the instance that hung? I was able to reproduce the problem, although it took me several attempts, so it seems random. Anyway, I saw in the process list that "STOP GROUP_REPLICATION" was blocked:

    Id: 52  User: root  Host: localhost:51110  db: NULL  Command: Query  Time: 447  State: starting  Info: STOP GROUP_REPLICATION

because it's waiting on:

    Id: 59  User: mysql_innodb_cluster_639490406  Host: localhost:59618  db: NULL  Command: Group Replication Data Stream subscription  Time: 591  State: starting  Info: NULL

Any attempt to kill the "Group Replication Data Stream subscription" thread just results in a new process being spawned. It's a deadlock in Group Replication.
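For reference, a minimal sketch of how those two threads can be located from mysqlsh (assumes an open session to the hung instance and performance_schema.processlist, available in 8.0.22+):

```js
// Look for the blocked STOP GROUP_REPLICATION statement and the GR
// "Data Stream subscription" thread it is waiting on.
var res = session.runSql(
    "SELECT ID, USER, HOST, COMMAND, TIME, STATE, INFO" +
    "  FROM performance_schema.processlist" +
    " WHERE INFO LIKE 'STOP GROUP_REPLICATION%'" +
    "    OR COMMAND = 'Group Replication Data Stream subscription'");
shell.dumpRows(res);
```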
[2 Sep 2022 11:27]
Jay Janssen
I no longer have the test cluster that created this issue, but it sounds like a good find on your side. If I end up reproducing the scenario, I will attempt to check as you did.
[2 Sep 2022 16:25]
Jay Janssen
Possibly a similar issue. I had an OOM on two out of 3 of my cluster members. On the remaining member I used cluster.forceQuorumUsingPartitionOf(), which worked. However, I then tried rejoinInstance() on one of the OOM'ed members that had since restarted mysqld:

    MySQL localhost:33060+ ssl JS > cluster.rejoinInstance("10.162.255.158:3306")
    Validating instance configuration at 10.162.255.158:3306...
    This instance reports its own address as 10.162.255.158:3306
    Instance configuration is suitable.
    Rejoining instance '10.162.255.158:3306' to cluster 'jaytest-staging-002-use1'...
    Re-creating recovery account...
    NOTE: User 'mysql_innodb_cluster_3683005888'@'%' already existed at instance '10.162.254.197:3306'. It will be deleted and created again with a new password.
    * Waiting for the Cluster to synchronize with the PRIMARY Cluster...
    ** Transactions replicated ############################################################ 100%

The processlist on the 10.162.255.158 host is:

    MySQL localhost:33060+ ssl SQL > show processlist\G
    *************************** 1. row ***************************
    Id: 6  User: event_scheduler  Host: localhost  db: NULL  Command: Daemon  Time: 1380  State: Waiting on empty queue  Info: NULL
    ... skipped some router connections....
    *************************** 8. row ***************************
    Id: 1540  User: config  Host: 10.162.254.197:35194  db: NULL  Command: Sleep  Time: 295  State:  Info: NULL
    *************************** 9. row ***************************
    Id: 1545  User: system user  Host:  db: NULL  Command: Connect  Time: 295  State: waiting for handler commit  Info: Group replication applier module
    *************************** 10. row ***************************
    Id: 1564  User: mysql_innodb_cluster_3683005888  Host: 10.162.254.197:47640  db: NULL  Command: Group Replication Data Stream subscription  Time: 291  State: Connection delegated to Group Replication  Info: NULL
    *************************** 11. row ***************************
    Id: 1589  User: system user  Host:  db: NULL  Command: Query  Time: 281  State: Replica has read all relay log; waiting for more updates  Info: NULL
    *************************** 12. row ***************************
    Id: 1590  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 13. row ***************************
    Id: 1591  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 14. row ***************************
    Id: 1592  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 15. row ***************************
    Id: 1593  User: system user  Host:  db: NULL  Command: Connect  Time: 281  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 16. row ***************************
    Id: 1594  User: system user  Host: connecting host  db: NULL  Command: Connect  Time: 281  State: Waiting for source to send event  Info: NULL
    *************************** 17. row ***************************
    Id: 1595  User: system user  Host:  db: NULL  Command: Query  Time: 274  State: Replica has read all relay log; waiting for more updates  Info: NULL
    *************************** 18. row ***************************
    Id: 1596  User: system user  Host:  db: NULL  Command: Query  Time: 285  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 19. row ***************************
    Id: 1597  User: system user  Host:  db: NULL  Command: Query  Time: 284  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 20. row ***************************
    Id: 1598  User: system user  Host:  db: NULL  Command: Query  Time: 2030  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 21. row ***************************
    Id: 1599  User: system user  Host:  db: NULL  Command: Query  Time: 306  State: Waiting for an event from Coordinator  Info: NULL
    *************************** 22. row ***************************
    Id: 1785  User: root  Host: localhost  db: NULL  Command: Query  Time: 0  State: init  Info: PLUGIN: show processlist
    22 rows in set (0.0003 sec)

and its log just shows this:

    2022-09-02T16:23:10.583929Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:20.595157Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:30.606396Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:40.617479Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:23:50.628975Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:00.640109Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:10.651300Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:24:20.662620Z 0 [ERROR] [MY-013780] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error establishing connection. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
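For reference, a minimal sketch of the recovery sequence described above (the addresses and the admin account are the ones from this report and would differ elsewhere):

```js
// Restore quorum using the single surviving member, then try to bring the
// restarted member back into the group (the step that stalled here).
var cluster = dba.getCluster();
cluster.forceQuorumUsingPartitionOf('config@10.162.254.197:3306');
cluster.rejoinInstance('10.162.255.158:3306');
```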
[2 Sep 2022 16:33]
Jay Janssen
And interestingly, I'm getting "Too many connections" on the primary node that I recovered with forceQuorum:

    MySQL Shell 8.0.30-g1
    Copyright (c) 2016, 2022, Oracle and/or its affiliates.
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
    Other names may be trademarks of their respective owners.
    Type '\help' or '\?' for help; '\quit' to exit.
    WARNING: Using a password on the command line interface can be insecure.
    Creating a session to 'config@localhost'
    MySQL Error 1040: Too many connections

Its log reads:

    2022-09-02T16:17:59.857930Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
    2022-09-02T16:17:59.858621Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.162.255.158:3306'
    2022-09-02T16:17:59.858678Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.254.197:3306 on view 16620457143435462:29.'
    2022-09-02T16:18:05.569499Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:18:15.581613Z 0 [ERROR] [MY-013781] [Repl] Plugin group_replication reported: 'Failed to establish MySQL client connection in Group Replication. Error sending connection delegation command. Please refer to the manual to make sure that you configured Group Replication properly to work with MySQL Protocol connections.'
    2022-09-02T16:18:21.829228Z 16700 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
    2022-09-02T16:18:34.278971Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.255.158:3306, 10.162.254.197:3306 on view 16620457143435462:30.'
    2022-09-02T16:27:19.511251Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '34: Error initializing session for connection: Too many connections'
    2022-09-02T16:28:04.224906Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '36: Error initializing session for connection: Too many connections'
    2022-09-02T16:28:49.009037Z 0 [Warning] [MY-011306] [Server] Plugin mysqlx reported: '37: Error initializing session for connection: Too many connections'

2022-09-02T16:17:59.858678Z was about when I issued the forceQuorum. I stopped all my mysqlrouter instances to see if they were causing the issue, but I still can't get into this instance.
[7 Sep 2022 12:51]
Jay Janssen
Another update: I am seeing the same issue again, where the rebootCluster call hangs indefinitely. I indeed see similar processlist entries on the primary node:

    *************************** 10. row ***************************
    Id: 127  User: config  Host: 127.0.0.1:60490  db: NULL  Command: Query  Time: 291  State: starting  Info: STOP GROUP_REPLICATION
    *************************** 11. row ***************************
    Id: 137  User: mysql_innodb_cluster_2448256127  Host: 10.162.255.29:50432  db: NULL  Command: Group Replication Data Stream subscription  Time: 287  State: starting  Info: NULL
    *************************** 12. row ***************************
    Id: 150  User: mysql_innodb_cluster_87169661  Host: 10.162.255.199:48958  db: NULL  Command: Group Replication Data Stream subscription  Time: 274  State: starting  Info: NULL
[7 Sep 2022 12:55]
Jay Janssen
Are you aware of any kind of workaround? I'd prefer not to use manualStartOnBoot, and it doesn't help on an existing cluster anyway. Not being able to recover clusters is pretty painful.
[14 Sep 2022 20:03]
Jay Janssen
I can reproduce this on 8.0.30 by simply having 3 nodes in a cluster, doing a kill -9 on mysqld on all of them, and issuing the rebootCluster once they all restart. I would swear I did not have this issue in 8.0.29. If I get a chance to deploy an 8.0.29 cluster, I will submit evidence.
[14 Sep 2022 20:03]
Jay Janssen
Updating to S2; this is currently my most critical bug with InnoDB Cluster.
[20 Sep 2022 17:32]
Jay Janssen
I found an older 8.0.29 cluster that I had been using prior to upgrading to 8.0.30. I did a kill -9 on all instances, executed dba.rebootClusterFromCompleteOutage(), and it came right back up, no problem.

    [ec2-user@ip-10-162-0-229 bin]$ mysqlsh -u ods.cluster.config
    MySQL Shell 8.0.29
    Copyright (c) 2016, 2022, Oracle and/or its affiliates.
    Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
    Other names may be trademarks of their respective owners.
    Type '\help' or '\?' for help; '\quit' to exit.
    Creating a session to 'ods.cluster.config@localhost'
    Fetching schema names for autocompletion... Press ^C to stop.
    Your MySQL connection id is 1124 (X protocol)
    Server version: 8.0.29 MySQL Community Server - GPL
    No default schema selected; type \use <schema> to set one.
    MySQL localhost:33060+ ssl JS > dba.rebootClusterFromCompleteOutage()
    Restoring the cluster 'jay-test2-east' from complete outage...
    The instance '10.162.0.219:3306' was part of the cluster configuration.
    Would you like to rejoin it to the cluster? [y/N]: y
    The instance '10.162.0.248:3306' was part of the cluster configuration.
    Would you like to rejoin it to the cluster? [y/N]: y
    * Waiting for seed instance to become ONLINE...
    10.162.0.229:3306 was restored.
    Rejoining '10.162.0.219:3306' to the cluster.
    Rejoining instance '10.162.0.219:3306' to cluster 'jay-test2-east'...
    The instance '10.162.0.219:3306' was successfully rejoined to the cluster.
    Rejoining '10.162.0.248:3306' to the cluster.
    Rejoining instance '10.162.0.248:3306' to cluster 'jay-test2-east'...
    The instance '10.162.0.248:3306' was successfully rejoined to the cluster.
    The cluster was successfully rebooted.
    <Cluster:jay-test2-east>

This is the mysqld log from the node I issued the reboot from:

    2022-09-20T17:27:58.454120Z 1125 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
    2022-09-20T17:27:58.455146Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
    2022-09-20T17:27:58.455167Z 1125 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
    2022-09-20T17:27:58.456720Z 1142 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
    2022-09-20T17:27:59.616928Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
    2022-09-20T17:28:00.617581Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306 on view 16636948786163307:1.'
    2022-09-20T17:28:25.910616Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'
    2022-09-20T17:28:25.910772Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 10.162.0.229:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
    2022-09-20T17:28:26.211494Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_disable_super_read_only_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "1" will be run.'
    2022-09-20T17:28:26.211654Z 1150 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
    2022-09-20T17:28:26.211728Z 1150 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_start_failover_channels_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "10" will be run.'
    2022-09-20T17:28:26.211939Z 1200 [System] [MY-011510] [Repl] Plugin group_replication reported: 'This server is working as primary member.'
    2022-09-20T17:28:28.727883Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:2.'
    2022-09-20T17:28:30.737382Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.162.0.248:3306, 10.162.0.229:3306, 10.162.0.219:3306 on view 16636948786163307:3.'
    2022-09-20T17:28:31.836076Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.248:3306 was declared online within the replication group.'
    2022-09-20T17:28:55.017376Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 10.162.0.219:3306 was declared online within the replication group.'

    [ec2-user@ip-10-162-0-229 bin]$ cat /etc/my.cnf /etc/my.cnf.d/*
    [mysqld]
    datadir=/var/lib/mysql
    socket=/var/lib/mysql/mysql.sock
    log-error=/var/log/mysqld.log
    pid-file=/var/run/mysqld/mysqld.pid
    !includedir /etc/my.cnf.d
    innodb-dedicated-server=on
    skip-name-resolve
    [mysqld]
    report-host=10.162.0.229

Happy to provide more detail, but
[22 Sep 2022 15:46]
Tiago Jorge
Posted by developer:

> I found an older 8.0.29 cluster that I had been using prior to upgrading to
> 8.0.30. I did a kill -9 on all instances, executed
> dba.rebootClusterFromCompleteOutage and it came right back up no problem.

You are correct. From our findings, this only occurs when the MySQL communication stack is used, which became the default stack for InnoDB Cluster in 8.0.30 (https://dev.mysql.com/doc/refman/8.0/en/group-replication-connection-security.html).
[22 Sep 2022 15:50]
Miguel Araujo
Posted by developer:

The documentation about the communication stack handling in the AdminAPI can be found at https://dev.mysql.com/doc/mysql-shell/8.0/en/shell-admin-api-communication-stack.html

To use the non-default 'XCOM' stack, use the 'communicationStack' option at dba.createCluster() and/or <ClusterSet>.createReplicaCluster().
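For illustration, a minimal sketch of that option (the cluster name is a placeholder):

```js
// Create the cluster on the legacy XCOM communication stack instead of the
// MYSQL stack that is the default as of 8.0.30.
var cluster = dba.createCluster('mycluster', {communicationStack: 'XCOM'});
```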
[25 Jan 2023 16:11]
Jon Stephens
Documented fix as follows in the MySQL 8.0.32 changelog:

> In a 3 node cluster, all nodes were killed due to running out of memory. Subsequently, after all nodes were restarted successfully, attempting to bring the cluster back online caused the node that had been the primary to hang.

Closed.