Bug #97279 Conflicting Transaction Sets following Complete Outage of InnoDB Cluster
Submitted: 18 Oct 2019 6:01    Modified: 20 Dec 2019 15:25
Reporter: Yoseph Phillips      Email Updates:
Status: Closed                 Impact on me: None
Category: Shell AdminAPI InnoDB Cluster / ReplicaSet    Severity: S1 (Critical)
Version: 8.0.18                OS: Windows
Assigned to:                   CPU Architecture: Any

[18 Oct 2019 6:01] Yoseph Phillips
Description:
Conflicting Transaction Sets following Complete Outage of InnoDB Cluster.

We have 3 instances (all using MySQL and MySQL Shell 8.0.18) in our InnoDB Cluster using single primary mode.
Following a complete outage, all 3 instances started again; however, the InnoDB Cluster didn't restart.
All 3 instances ended up in R/W mode as standalone instances (metadata exists, instances belong to that metadata, but GR is not active).
This is effectively a split-brain scenario, as writes were then made to different instances.

Now calling dba.rebootClusterFromCompleteOutage() results in:
ERROR: Conflicting transaction sets between xxx.xxx.xxx.xxx:xxxx and yyy.yyy.yyy.yyy:yyyy
Dba.rebootClusterFromCompleteOutage: Conflicting transaction sets between xxx.xxx.xxx.xxx:xxxx and yyy.yyy.yyy.yyy:yyyy (MYSQLSH 51152)

How to repeat:
* Create the InnoDB Cluster:
var cluster = dba.createCluster('cluster')
cluster.addInstance('clusteradmin@xxx.xxx.xxx.xxx:xxxx')
cluster.addInstance('clusteradmin@yyy.yyy.yyy.yyy:yyyy')
cluster.setOption("consistency", "BEFORE")

* Stop the MySQL Service on all 3 servers

* Restart the MySQL Service on all 3 servers

* INSERT some random data into the same table on all 3 instances (a sketch of these divergent writes follows these steps)

* Try calling dba.rebootClusterFromCompleteOutage()
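
A minimal sketch of the divergent writes from the INSERT step, run in MySQL Shell's JS mode. The table test.t1 and its val column are hypothetical placeholders, as are the masked connection URIs (only the two masked addresses from the report are shown; the third member would be handled the same way):

// Hypothetical divergent writes while all members are (incorrectly) R/W.
['clusteradmin@xxx.xxx.xxx.xxx:xxxx', 'clusteradmin@yyy.yyy.yyy.yyy:yyyy'].forEach(function (uri) {
  var s = mysql.getClassicSession(uri);
  // Each member receives a different row, so their GTID sets diverge.
  s.runSql('INSERT INTO test.t1 (val) VALUES (?)', [Math.floor(Math.random() * 1000)]);
  s.close();
});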

Suggested fix:
Following a complete outage, we would expect all instances to restart in RO mode, not in R/W mode.

Then, ideally, once all (or possibly a majority, if there is no possibility of a split brain) of the instances in the metadata have been restarted, the InnoDB Cluster should be restarted automatically.
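
A minimal sketch of the check behind this expectation, run in MySQL Shell's JS mode against each restarted instance (the URI is a masked placeholder from the report):

// Verify whether a restarted member came back read-only or R/W.
var s = mysql.getClassicSession('clusteradmin@xxx.xxx.xxx.xxx:xxxx');
var row = s.runSql('SELECT @@global.super_read_only AS sro, @@global.read_only AS ro').fetchOne();
// Per the suggested fix, both should be 1 until the cluster is rebooted;
// in the reported scenario the instances came back as 0 (R/W).
print('super_read_only=' + row.sro + ', read_only=' + row.ro + '\n');
s.close();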
[25 Oct 2019 10:08] MySQL Verification Team
Hi,

Thanks for your report, verified as described.
[11 Nov 2019 7:23] Yoseph Phillips
Changing severity to S1 (Critical) as we are seeing similar behavior on Linux, also using 8.0.18, with the same setup as on Windows. This is making InnoDB Cluster almost unusable for us. It was running fine for about 39 hours, then one server could not reach the other two servers for some reason. It was then able to reconnect, and shortly after, the other two servers expelled it from the group. Looking at the router logs, it would appear that the system ran for about the next 18 hours with just two servers (one R/W and one RO). Then at some stage all 3 servers ended up in a state similar to this bug report - however, this time we were able to use dba.rebootClusterFromCompleteOutage(). We have lots of information in all of these logs.
[25 Nov 2019 6:40] Yoseph Phillips
This might be related to bug #97764, so fixing one might fix the other as well.
[25 Nov 2019 22:32] MySQL Verification Team
This bug appears to affect 8.0.17 and later, as the affected code was introduced with WL#13176.

There are 2 faults with this bug:

1) SUBSET GTID sets are erroneously reported as INTERSECTS and therefore cause an error in Shell
2) When asked (or told via removeInstances) to remove instances from the cluster, Shell ignores affirmative responses and still puts all instances in the list to be considered for primary election

The workaround for this is as follows (a sketch of the GTID-subset check for step a appears after the list):

a) Shut down the instance that contains the GTID subset (typically a secondary that was behind, or a primary that was ejected from the cluster)
b) Execute rebootClusterFromCompleteOutage to bring the cluster up
c) Start the instance from step a)
d) Execute a cluster.rescan() to bring the instance back into the cluster
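
A minimal sketch of the subset check referenced in step a, run in MySQL Shell's JS mode. The connection URIs are the masked placeholders from the report, and GTID_SUBSET is the server function used here for the comparison; this illustrates the workaround, it is not the Shell's internal check:

// Compare the executed GTID sets of two members to find the subset instance.
var a = mysql.getClassicSession('clusteradmin@xxx.xxx.xxx.xxx:xxxx');
var b = mysql.getClassicSession('clusteradmin@yyy.yyy.yyy.yyy:yyyy');
var gtidA = a.runSql('SELECT @@global.gtid_executed AS g').fetchOne().g;
var gtidB = b.runSql('SELECT @@global.gtid_executed AS g').fetchOne().g;
// GTID_SUBSET(x, y) = 1 means every transaction in x is also in y.
var aInB = a.runSql('SELECT GTID_SUBSET(?, ?) AS s', [gtidA, gtidB]).fetchOne().s;
var bInA = a.runSql('SELECT GTID_SUBSET(?, ?) AS s', [gtidB, gtidA]).fetchOne().s;
print('A subset of B: ' + aInB + ', B subset of A: ' + bInA + '\n');
a.close(); b.close();
// Then: shut down the subset instance (a), run
// dba.rebootClusterFromCompleteOutage() on a remaining member (b),
// start the stopped instance again (c), and run cluster.rescan() (d).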
[11 Dec 2019 15:38] Kenny Gryp
* After the complete outage, all servers come online and they try to form a group. 
* This is not possible as there is no majority partition
* After a timeout group replication is stopped
* It seems that in this case, members are in R/W mode, and that should not happen. They should be going into super_read_only=ON mode
* Because some members were R/W, some connections were performing writes to multiple members, effectively creating a split brain

What has to happen is to make sure all servers have compatible GTID sets: take the transaction(s) from one conflicting member and apply them manually on the other.

Then use `rebootClusterFromCompleteOutage()`.
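
A minimal sketch of finding the transactions that have to be reconciled, run in MySQL Shell's JS mode; the URIs are the masked placeholders from the report, and the actual statements would still have to be located (for example in the binary logs) and reviewed before being replayed:

// List the GTIDs executed on one member but missing on the other.
var a = mysql.getClassicSession('clusteradmin@xxx.xxx.xxx.xxx:xxxx');
var b = mysql.getClassicSession('clusteradmin@yyy.yyy.yyy.yyy:yyyy');
var gtidA = a.runSql('SELECT @@global.gtid_executed AS g').fetchOne().g;
var gtidB = b.runSql('SELECT @@global.gtid_executed AS g').fetchOne().g;
print('Only on A: ' + a.runSql('SELECT GTID_SUBTRACT(?, ?) AS d', [gtidA, gtidB]).fetchOne().d + '\n');
print('Only on B: ' + a.runSql('SELECT GTID_SUBTRACT(?, ?) AS d', [gtidB, gtidA]).fetchOne().d + '\n');
a.close(); b.close();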
[12 Dec 2019 10:29] Yoseph Phillips
* With InnoDB Cluster, the servers know how many instances there are in the cluster (from the metadata) and so after two of them manage to reconnect then a majority partition is possible.
* Ideally GR shouldn't be stopped after a timeout, and autoRejoinTries could be used to decide how many times to retry before giving up. We are happy to set autoRejoinTries to a very high number to help the instances reconnect (see the sketch after these points).
* Agreed that 'It seems that in this case, members are in R/W mode, and that should not happen. They should be going into super_read_only=ON mode' - this is the main point of this bug report. 
* We managed to reset one instance back to the state of another in October; the point of this bug report is to prevent it happening again, not to recover from a split brain.
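
A minimal sketch of raising the rejoin attempts mentioned in the second point, run in MySQL Shell's JS mode; the cluster name is the one from the "How to repeat" section, and 2016 is the documented maximum for this option:

// Raise the number of automatic rejoin attempts cluster-wide.
var cluster = dba.getCluster('cluster');
cluster.setOption('autoRejoinTries', 2016);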
[13 Dec 2019 19:33] Kenny Gryp
I agree this is a bug that should be fixed.

To comment on your first bullet points:
> * With InnoDB Cluster, the servers know how many instances there are in the cluster (from the metadata) and so after two of them manage to reconnect then a majority partition is possible.
> * Ideally GR shouldn't be stopped after a timeout and autoRejoinTries could be used to decide how many times to retry before giving up. We are happy to set autoRejoinTries to a very high number to help the instances to reconnect.

At this moment, with 8.0.18 being the latest release, Group Replication does not automatically recover from all members being down due to a power outage.
Unfortunately, adding this functionality is not as simple as reading the metadata, as you pointed out. I suggest creating a separate feature request for this.
[15 Dec 2019 22:43] Yoseph Phillips
We were just pointing out that it is possible to read that info from the metadata, not that it would be simple to do (we would assume that if it were simple, this issue wouldn't exist in the first place).

Once this bug report and https://bugs.mysql.com/bug.php?id=97764 have both been fixed, we can check whether we need to add another bug report. In our case there wasn't a complete power outage; it seems that each instance went offline over a long period of time, possibly due to https://bugs.mysql.com/bug.php?id=97764. Hence, if 97764 is fixed and we likewise no longer have to worry about offline instances ending up in R/W mode in 8.0.19, then being able to recover from a complete outage would be much less important for us.
[16 Dec 2019 11:41] Kenny Gryp
Yoseph: Until the issue is resolved in Shell, manually run 'SET PERSIST super_read_only=ON' to ensure members do not go back into R/W mode when Group Replication fails to start.
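
A minimal sketch of applying this interim measure to each member from MySQL Shell's JS mode; the URI list is a placeholder matching the masked addresses in the report:

// Persist super_read_only on each member and confirm it was recorded.
['clusteradmin@xxx.xxx.xxx.xxx:xxxx', 'clusteradmin@yyy.yyy.yyy.yyy:yyyy'].forEach(function (uri) {
  var s = mysql.getClassicSession(uri);
  s.runSql('SET PERSIST super_read_only=ON');
  // The persisted value survives a restart; it is visible here:
  var row = s.runSql("SELECT variable_value AS v FROM performance_schema.persisted_variables WHERE variable_name = 'super_read_only'").fetchOne();
  print(uri + ' persisted super_read_only=' + row.v + '\n');
  s.close();
});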
[20 Dec 2019 15:25] David Moss
Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 8.0.19 changelog:

When an instance restarted, for example after a complete outage, it could have super_read_only disabled. This meant that instances which were not the primary could be written to, resulting in the instances no longer being in synchrony. This could result in dba.rebootClusterFromCompleteOutage() failing with a Conflicting transaction sets error. The fix ensures that all instances have super_read_only=1 persisted while they belong to the cluster, either through SET PERSIST_ONLY, or through dba.configureLocalInstance() for instances which do not support persisting.
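
A minimal sketch of the mechanism the changelog describes, approximated manually in MySQL Shell's JS mode (the URI is a masked placeholder; the actual fix applies this inside the AdminAPI, not via user commands):

// PERSIST_ONLY writes the setting to mysqld-auto.cnf without changing the
// running value, so the member comes back with super_read_only=ON after a restart.
var s = mysql.getClassicSession('clusteradmin@xxx.xxx.xxx.xxx:xxxx');
s.runSql('SET PERSIST_ONLY super_read_only=ON');
s.close();
// For servers that cannot persist variables, the changelog points to running
// dba.configureLocalInstance() against that instance instead.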