Bug #108417  InnoDB Cluster failed after primary removed
Submitted: 7 Sep 2022 15:27    Modified: 20 Sep 2022 11:26
Reporter: Jay Janssen          Email Updates:
Status: Not a Bug              Impact on me: None
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.30                OS: Any
Assigned to: MySQL Verification Team         CPU Architecture: Any

[7 Sep 2022 15:27] Jay Janssen
Description:

The primary in the three-node cluster was 10.160.132.7. The primary node was shut down, but the cluster was stable before this. The following log is from the 10.160.133.191 instance that was elected as the new primary. The cluster was a replica cluster in a ClusterSet.

2022-09-07T15:12:15.452151Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 10.160.132.7:3306'
2022-09-07T15:12:15.452184Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 10.160.132.7:3306 left the group. Electing new Primary.'
2022-09-07T15:12:15.452279Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 10.160.133.191:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
2022-09-07T15:12:15.452409Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.160.133.191:3306, 10.160.132.215:3306 on view 16619763586084429:44.'
2022-09-07T15:12:15.453466Z 19 [System] [MY-013731] [Repl] Plugin group_replication reported: 'The member action "mysql_start_failover_channels_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "10" will be run.'
2022-09-07T15:12:15.453579Z 19 [ERROR] [MY-013124] [Repl] Slave SQL for channel 'clusterset_replication': Slave failed to initialize relay log info structure from the repository, Error_code: MY-013124
2022-09-07T15:12:15.453599Z 19 [ERROR] [MY-013733] [Repl] Plugin group_replication reported: 'The member action "mysql_start_failover_channels_if_primary" for event "AFTER_PRIMARY_ELECTION" with priority "10" failed. Please check previous messages in the error log for hints about what could have caused this failure.'
2022-09-07T15:12:15.453667Z 19 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2022-09-07T15:12:19.803813Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
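
For reference, this is roughly how the group membership and the state of the failed channel could be checked on the new primary (a sketch only; the channel name 'clusterset_replication' is taken from the log above):

# Group membership and roles as seen by this member
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;

# State of the clusterset replication channel that failed to initialize
SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_applier_status_by_coordinator
 WHERE CHANNEL_NAME = 'clusterset_replication';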

How to repeat:
Not sure I can; I've never seen this before.
[8 Sep 2022 9:39] MySQL Verification Team
Hi,

Can you please upload the full config and the full log, as I am not able to reproduce this?

Can you elaborate on "The Primary node was shutdown"? Did you shut down the host (shutdown -h), run mysqladmin shutdown, or something else?

Thanks
[8 Sep 2022 11:19] Jay Janssen
The nodes run on EC2.  IIRC the instance the primary node was on got terminated.  

I'm not able to reproduce it either, but I had not seen that issue before, so I wanted to get it recorded here. This log is all I have left; I've since cycled the instances fully.

The config was more or less this:

[jayj@ip-10-162-254-200 ~]$ cat /etc/my.cnf /etc/my.cnf.d/*
[mysqld]
datadir=/data/mysql
socket=/data/mysql/mysql.sock

log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

!includedir /etc/my.cnf.d

skip-name-resolve

# Innodb
innodb-dedicated-server=on

# Replication
binlog_expire_logs_seconds=604800

# Cluster / Group replication
loose_group_replication_paxos_single_leader=ON

# Dynamic settings based on available resources
[mysqld]
max-connections=2636
[mysqld]
report-host=10.162.254.200
[8 Sep 2022 13:21] MySQL Verification Team
Hi,

I can't reproduce this, but it looks like one of a set of known issues with AWS installations. For some reason the network between EC2 instances is "low quality" and can introduce intermittent issues like this one, something I have never seen on GC or OC, only on AWS and Azure. As it is an intermittent networking issue, catching it in action is almost impossible. We introduced configurable timeouts for group replication, so you can try increasing those to prevent this from happening again.
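
For example, something like this (illustrative values only, to be tuned to your environment; these are the standard group replication timeout/rejoin variables):

# Illustrative values only: give an unreachable member more time before it is
# expelled, and let an expelled member try to rejoin automatically.
SET PERSIST group_replication_member_expel_timeout = 30;  # default is 5 (seconds) in 8.0.21+
SET PERSIST group_replication_autorejoin_tries = 5;       # default is 3 in 8.0.21+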

kind regards
[13 Sep 2022 11:40] MySQL Verification Team
Hi,

I discussed with our group replication and replication teams and their answer summarizes as:

The message is related to two things that the server does:
1. When the server starts, it loads the replication repositories, aka 'slave_relay_log_info' / 'slave_master_info'.
2. Later, the server starts replication threads: either during server start when --skip-replica-start is not used, or during START REPLICA.

The error is generated during (2), whenever there was an earlier error during (1) that made it fail to load the repositories. The possible reasons are mostly unusual conditions like file corruption, and there ought to be a previous message emitted when (1) failed. 
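
As a sketch, you can also inspect the repositories for the affected channel directly (table and column names as in 8.0; the channel name is taken from the log in this report):

# Check whether the relay log info repository still has a row for the channel
SELECT Channel_name, Number_of_workers, Relay_log_name, Relay_log_pos
  FROM mysql.slave_relay_log_info
 WHERE Channel_name = 'clusterset_replication';

# Connection metadata for the same channel (credentials excluded)
SELECT Channel_name, Host, Port, Enabled_auto_position
  FROM mysql.slave_master_info
 WHERE Channel_name = 'clusterset_replication';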

This does not appear to be a bug, and there is no way to reproduce it. What I can suggest is that you open a support ticket with the MySQL Support team and analyze your log files with them to see what kind of issues you may have with your system; unless you have a procedure to reproduce this issue, we do not believe it is a bug but expected behavior.

kind regards
[14 Sep 2022 19:39] Jay Janssen
FWIW, I just had this happen again as I was refreshing all the instances in my cluster.
[14 Sep 2022 20:23] MySQL Verification Team
Hi,

On its own this is not a bug; it is an error that happens after another issue, as I believe I explained. That previous issue usually has to do with a hardware error or misconfiguration, and for that the ideal solution is to contact the support team and supply them with the full log.
[16 Sep 2022 18:37] MySQL Verification Team
If you could upload full logs and some additional data about the setup, we might be able to figure something out:

- details on how you configured and used async replication channel failover
- these:
SELECT * FROM performance_schema.replication_connection_configuration;
SELECT * FROM performance_schema.replication_applier_configuration;
SELECT * FROM performance_schema.replication_asynchronous_connection_failover;
SELECT * FROM performance_schema.replication_asynchronous_connection_failover_managed;
# Exclude the credentials when selecting from slave_master_info.
SELECT Number_of_lines, Master_log_name, Master_log_pos, Host, Port, Connect_retry, Enabled_ssl, Ssl_ca, Ssl_capath, Ssl_cert, Ssl_cipher, Ssl_key, Ssl_verify_server_cert, Heartbeat, Bind, Ignored_server_ids, Uuid, Retry_count, Ssl_crl, Ssl_crl_path, Enabled_auto_position FROM mysql.slave_master_info;
SELECT * FROM mysql.slave_relay_log_info;
SELECT * FROM mysql.replication_asynchronous_connection_failover;
SELECT * FROM mysql.replication_asynchronous_connection_failover_managed;
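
In addition, if the server is still around, the state of the member action that failed in your log can be checked (the performance_schema.replication_group_member_actions table exists in 8.0.26+; the action name is taken from your log):

# State of the member action that failed in the reported log
SELECT name, event, enabled, priority, error_handling
  FROM performance_schema.replication_group_member_actions
 WHERE name = 'mysql_start_failover_channels_if_primary';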
[20 Sep 2022 11:26] Jay Janssen
If I end up reproducing the bug, I will provide what you requested. I no longer have the cluster that had this issue.