Bug #108417 | Innodb cluster failed after primary removed | ||
---|---|---|---|
Submitted: | 7 Sep 2022 15:27 | Modified: | 20 Sep 2022 11:26 |
Reporter: | Jay Janssen | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical) |
Version: | 8.0.30 | OS: | Any |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[7 Sep 2022 15:27]
Jay Janssen
[8 Sep 2022 9:39]
MySQL Verification Team
Hi, Can you please upload the full config and full log, as I am not able to reproduce this. Can you elaborate on "The Primary node was shutdown"? Did you run shutdown -h manually, use mysqladmin shutdown, or something else? Thanks
[8 Sep 2022 11:19]
Jay Janssen
The nodes run on EC2. IIRC the instance the primary node was on got terminated. I'm not able to reproduce it either, but I had not seen that issue before so I wanted to get it recorded here. This log is what I have left; I've cycled the instances fully. The config was more or less this:

[jayj@ip-10-162-254-200 ~]$ cat /etc/my.cnf /etc/my.cnf.d/*
[mysqld]
datadir=/data/mysql
socket=/data/mysql/mysql.sock
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
!includedir /etc/my.cnf.d
skip-name-resolve
# Innodb
innodb-dedicated-server=on
# Replication
binlog_expire_logs_seconds=604800
# Cluster / Group replication
loose_group_replication_paxos_single_leader=ON
# Dynamic settings based on available resources
[mysqld]
max-connections=2636
[mysqld]
report-host=10.162.254.200
[8 Sep 2022 13:21]
MySQL Verification Team
Hi, I can't reproduce this, but it looks like one of a set of known issues with AWS installations. For some reason the network between EC2 instances is "low quality" and can introduce intermittent problems like this, something I have never seen on GC or OC, only on AWS and Azure. Since it is an intermittent networking issue, catching it in action is almost impossible. We introduced configurable timeouts for Group Replication, so you can try increasing those to prevent this from happening again. kind regards
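The comment above does not name the specific timeouts; the following is a sketch of the Group Replication timeout variables that exist in MySQL 8.0 for tolerating flaky networking. The values shown are illustrative, not recommendations, and should be tuned to the network behavior actually observed:

-- Tolerate longer unreachability before a member is expelled (seconds).
SET GLOBAL group_replication_member_expel_timeout = 30;
-- Let an expelled member try to rejoin automatically instead of erroring out.
SET GLOBAL group_replication_autorejoin_tries = 3;
-- How long a member in a minority partition waits before giving up (seconds).
SET GLOBAL group_replication_unreachable_majority_timeout = 30;

Raising the expel timeout trades faster failure detection for resilience to short network blips, which is usually the right trade-off on cloud networks.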
[13 Sep 2022 11:40]
MySQL Verification Team
Hi, I discussed this with our Group Replication and replication teams, and their answer summarizes as follows. The message is related to two things that the server does:

1. When the server starts, it loads the replication repositories, aka 'slave_relay_log_info' / 'slave_master_info'.
2. Later, the server starts the replication threads: either during server start when --skip-replica-start is not used, or during START REPLICA.

The error is generated during (2) whenever there was an earlier error during (1) that made it fail to load the repositories. The possible reasons are mostly unusual conditions like file corruption, and there ought to be a previous message emitted when (1) failed. This does not appear to be a bug, and there is no way to reproduce it. What I can suggest is that you open a support ticket with the MySQL Support team and have them analyze your log files to see what kind of issues your system may have, but unless you have a procedure to reproduce this issue, we do not believe it is a bug but expected behavior. kind regards
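To check step (1) from the explanation above after a restart, one way is to inspect the repository tables directly and search the error log for an earlier repository-load failure. This is a sketch assuming the default TABLE repository type in 8.0 and a server new enough (8.0.22+) to expose performance_schema.error_log:

SELECT Channel_name, Number_of_lines FROM mysql.slave_relay_log_info;
SELECT Channel_name, Host, Port FROM mysql.slave_master_info;
-- Look for the earlier message emitted when loading the repositories failed:
SELECT logged, prio, error_code, data
FROM performance_schema.error_log
WHERE data LIKE '%repositor%'
ORDER BY logged DESC LIMIT 10;

If the repository tables are empty or truncated for a channel that should exist, that points at the step (1) failure that later surfaces during step (2).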
[14 Sep 2022 19:39]
Jay Janssen
FWIW, I just had this happen again as I was refreshing all the instances in my cluster
[14 Sep 2022 20:23]
MySQL Verification Team
Hi, On its own this is not a bug; it is an error that happens after another issue, as I believe I explained. That previous issue usually has to do with a hardware error or misconfiguration, and for that the ideal solution is to contact the support team and supply them with the full log.
[16 Sep 2022 18:37]
MySQL Verification Team
If you could upload full logs and some additional data about the setup, we might be able to figure something out - details on how you configured and used async replication channel failover, i.e. the output of these:

SELECT * FROM performance_schema.replication_connection_configuration;
SELECT * FROM performance_schema.replication_applier_configuration;
SELECT * FROM performance_schema.replication_asynchronous_connection_failover;
SELECT * FROM performance_schema.replication_asynchronous_connection_failover_managed;
# Exclude the credentials when selecting from slave_master_info.
SELECT Number_of_lines, Master_log_name, Master_log_pos, Host, Port,
       Connect_retry, Enabled_ssl, Ssl_ca, Ssl_capath, Ssl_cert, Ssl_cipher,
       Ssl_key, Ssl_verify_server_cert, Heartbeat, Bind, Ignored_server_ids,
       Uuid, Retry_count, Ssl_crl, Ssl_crl_path, Enabled_auto_position
FROM mysql.slave_master_info;
SELECT * FROM mysql.slave_relay_log_info;
SELECT * FROM mysql.replication_asynchronous_connection_failover;
SELECT * FROM mysql.replication_asynchronous_connection_failover_managed;
[20 Sep 2022 11:26]
Jay Janssen
If I end up reproducing the bug, I will provide what you requested. I no longer have the cluster that had this issue.