MySQL Bugs: #97764: InnoDB Cluster is NOT tolerate to even one failure when using consistency BEFORE

Bug #97764	InnoDB Cluster is NOT tolerate to even one failure when using consistency BEFORE
Submitted:	25 Nov 2019 6:39	Modified:	18 Jul 2020 11:21
Reporter:	Yoseph Phillips
Status:	Closed
Category:	Shell AdminAPI InnoDB Cluster / ReplicaSet	Severity:	S1 (Critical)
Version:	8.0.18	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
InnoDB Cluster does NOT tolerate even one failure when using consistency BEFORE.

We have 3 instances (all using MySQL and MySQL Shell 8.0.18) in our InnoDB Cluster using single primary mode, consistency BEFORE and autoRejoinTries of 10.
We also have MySQL Router 8.0.18 configured to use these 3 instances.

It seems that every time the primary node fails, the instance trying to become primary ends up with a growing list of connections from MySQL Router, all stuck on calling 'SELECT * FROM mysql_innodb_cluster_metadata.schema_version;'.

The entire cluster will not respond to any calls, and 'var cluster = dba.getCluster('cluster')' just hangs.

Likewise, when the MySQL service on the old primary node comes online again, it cannot rejoin the cluster, and the cluster remains hung.

We have seen this happen many times on Linux after unexpected failures of the primary node. We can also reproduce it every time on Windows by deliberately killing the MySQL service of the primary node.

Also, mysqlrouter.log grows out of control.

This might be related to Bug #97279, so fixing one might fix the other as well.

How to repeat:
* Create the InnoDB Cluster:
var cluster = dba.createCluster('cluster')
cluster.addInstance('clusteradmin@xxx.xxx.xxx.xxx:xxxx')
cluster.addInstance('clusteradmin@yyy.yyy.yyy.yyy:yyyy')
cluster.setOption('consistency', 'BEFORE')
cluster.setOption('autoRejoinTries', 10)

* Bootstrap MySQL Router

* Stop the MySQL service of the PRIMARY node.

* Try to access the cluster.

* Look at the MySQL Process list on both of the remaining two instances.

* Look at the mysqlrouter.log.

* Start the MySQL service again.

* Try to access the cluster.

* Look at the MySQL Process list on all 3 instances.

* Look at the mysqlrouter.log.
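
The process-list checks in the steps above can be made repeatable with a small helper. The following is a minimal sketch, not part of the original report. It assumes the rows of SHOW PROCESSLIST have already been fetched into plain objects whose Info and Time fields match the SHOW PROCESSLIST column names. The function name countStuckMetadataQueries and the 10-second threshold are hypothetical, and the sample rows are made up.

```javascript
// Hypothetical helper: counts process-list rows that look like the stuck
// Router metadata query described in this report.
// Assumption: each row is a plain object with the SHOW PROCESSLIST columns
// `Info` (statement text, may be null) and `Time` (seconds in current state).
function countStuckMetadataQueries(rows, minSeconds = 10) {
  const needle = "mysql_innodb_cluster_metadata.schema_version";
  return rows.filter(
    (row) =>
      typeof row.Info === "string" &&
      row.Info.includes(needle) &&
      Number(row.Time) >= minSeconds
  ).length;
}

// Made-up rows for illustration only (not captured from a real server).
const sampleRows = [
  { Id: 11, Info: "SELECT * FROM mysql_innodb_cluster_metadata.schema_version", Time: 42 },
  { Id: 12, Info: "SELECT * FROM mysql_innodb_cluster_metadata.schema_version", Time: 3 },
  { Id: 13, Info: null, Time: 100 },
];
console.log(countStuckMetadataQueries(sampleRows)); // prints 1
```

A count that keeps rising between successive checks on the instance trying to become primary would match the symptom described above.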

Hello Yoseph Phillips,

Thank you for the report and feedback.

regards,
Umesh

Posted by developer:
 
Fixed as of the upcoming MySQL Router 8.0.20 release, and here's the proposed changelog entry from the documentation team:

Internal metadata queries were affected by global MySQL Server settings;
but now Router explicitly sets session parameters to make metadata queries
and updates consistent. These settings are group_replication_consistency,
autocommit, sql_mode, character_set_client, character_set_results, and
character_set_connection.
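
To illustrate the idea in the changelog entry (pinning session-level values so that global server settings no longer influence a connection), here is a minimal sketch that builds a single SET statement for a session. It is not Router's actual code. The function name buildSessionSetStatement is hypothetical, and the values in the example ('EVENTUAL' and 1) are assumptions for illustration; the changelog entry names the variables but not the values Router uses.

```javascript
// Hypothetical helper: builds one SET statement that pins session-level
// values for the given system variables.
// Assumption: `settings` contains only trusted, constant values; this is
// not a general-purpose SQL escaping routine.
function buildSessionSetStatement(settings) {
  const assignments = Object.entries(settings).map(([name, value]) => {
    if (!/^[a-z_]+$/.test(name)) {
      throw new Error(`unexpected variable name: ${name}`);
    }
    const literal =
      typeof value === "number"
        ? String(value)
        : `'${String(value).replace(/'/g, "''")}'`;
    return `@@SESSION.${name} = ${literal}`;
  });
  return `SET ${assignments.join(", ")}`;
}

// Illustrative values only; they are not taken from the changelog entry.
console.log(
  buildSessionSetStatement({
    group_replication_consistency: "EVENTUAL",
    autocommit: 1,
  })
);
```

Running such a statement right after connecting means the session no longer inherits a global group_replication_consistency of BEFORE, which is the setting involved in this report.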

Thank you for the bug report.

This was still causing us issues in 8.0.20.

It looks like these other fixes in 8.0.21 might be the remainder of the solution to this issue:
•	Fixed Replication issue 1: A global value that is set for the group_replication_consistency system variable, which controls all user connections, is applied on Group Replication's internal connections to MySQL Server modules using the SQL API, which are handled in a similar way to user connections. This could sometimes lead to Group Replication reporting the use of group_replication_consistency as an error, for example when checking the clone plugin status during distributed recovery. Group Replication's internal connections using the SQL API are now configured to use the consistency level EVENTUAL, which matches the behavior before the group_replication_consistency option was available, and does not cause an error message. (Bug #31303354, Bug #99345)
•	Fixed Replication issue 2: If a group's consistency level (set by the group_replication_consistency system variable) was set to BEFORE or BEFORE_AND_AFTER, it was possible for a deadlock to occur in the event of a primary failover. The primary failover is now registered differently to avoid this situation. (Bug #31175066, Bug #98643)
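
Based only on the two changelog excerpts above, a deployment check could be sketched as follows. This is an illustrative helper, not an official compatibility test: the function names are hypothetical, and the 8.0.21 cut-off and the affected consistency levels are taken solely from the text quoted above.

```javascript
// Parses a dotted version such as "8.0.18" into three integers.
function parseVersion(text) {
  const parts = String(text).split(".");
  if (parts.length !== 3 || parts.some((p) => !/^\d+$/.test(p))) {
    throw new Error(`unsupported version format: ${text}`);
  }
  return parts.map(Number);
}

// Returns true when `version` is strictly older than `reference`.
function isOlderThan(version, reference) {
  const a = parseVersion(version);
  const b = parseVersion(reference);
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] < b[i];
  }
  return false;
}

// Hypothetical check: BEFORE or BEFORE_AND_AFTER on a server older than
// 8.0.21 is the combination the quoted changelog says could deadlock
// during a primary failover.
function mayHitFailoverDeadlock(serverVersion, consistency) {
  const level = String(consistency).toUpperCase();
  const risky = level === "BEFORE" || level === "BEFORE_AND_AFTER";
  return risky && isOlderThan(serverVersion, "8.0.21");
}

console.log(mayHitFailoverDeadlock("8.0.18", "BEFORE")); // prints true
console.log(mayHitFailoverDeadlock("8.0.21", "BEFORE")); // prints false
```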

Yoseph, Can you confirm:

1. Is this still happening with the same test case as described initially? If not, please give us the steps to reproduce.
2. What version of Server/Shell/Router is being used?

Yes, the steps to reproduce the issue are the same.
We have reproduced this on Linux using 8.0.20 for MySQL, MySQL Shell and MySQL Router. We used a fresh install rather than upgrading from 8.0.19, so that nothing left over could be causing the issue.
Things seem to have improved since 8.0.18, as it is not failing as often as it was. However, during a demo to a client using the Enterprise Edition, when we stopped the MySQL service of the primary node, the cluster hung and neither of the two secondaries took over the role of the primary node. After restarting the MySQL service, we were also unable to rejoin the node to the cluster. We could not check the process lists because this happened during the demo. We have not checked the logs as MySQL 8.0.21 has just been released.

We have now installed MySQL 8.0.21 on Windows. So far so good. Hopefully we will not be able to reproduce the issue there. We also plan to install 8.0.21 on the test Linux environment next week.