Bug #99133 Group Replication Performance Degradation with partial network outage
Submitted: 31 Mar 2020 15:08 Modified: 7 Apr 2020 13:38
Reporter: Tibor Korocz
Status: Verified
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.19    OS: Any
Assigned to:    CPU Architecture: Any

[31 Mar 2020 15:08] Tibor Korocz
Description:
Hi, 

I have a three node InnoDB Cluster. mysql1,mysql2,mysql3

mysql2 is the primary; mysql1 and mysql3 are the readers.

If we simulate a partial network outage, for example with iptables:

Running this on mysql3:

mysql3# iptables -A INPUT -s mysql2 -j DROP; \
        iptables -A OUTPUT -d mysql2 -j DROP

mysql3 will still get all the changes made on mysql2, because mysql1 is
going to act as a relay node and send all the changes to mysql3. You can
confirm this even with tcpdump.
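
For example, something like this on mysql3 (a sketch; the interface name eth0 and the default group communication port 33061 are assumptions, adjust for your setup):

mysql3# tcpdump -i eth0 -nn host mysql1 and port 33061
# While mysql2 is blocked, group communication traffic carrying the
# changes still arrives on mysql3 from mysql1.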

However, it has a huge performance impact. Before I cut the network I was able to insert 60-80 rows per second; after that, only 1-3 rows per second, which is a huge degradation.

Also, cluster.status() on mysql2 reports that mysql3 is not reachable, while mysql1 reports that all the nodes are Online, which is also interesting: in a cluster I would love to see all the nodes report the same cluster status, except when a node is totally isolated.
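
One way to compare the per-node views (a sketch; the root account and ports are placeholders):

for h in mysql1 mysql2 mysql3; do
  echo "== cluster.status() as seen from $h =="
  mysqlsh --uri root@$h:3306 --js -e "print(dba.getCluster().status())"
done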

How to repeat:
Create a three-node InnoDB Cluster (mysql1, mysql2, mysql3); a minimal sketch for this step follows.
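
A minimal MySQL Shell sketch (hostnames and the clusteradmin account are placeholders; assumes the instances are already prepared with dba.configureInstance()):

mysqlsh --uri clusteradmin@mysql2:3306 --js <<'EOF'
var cluster = dba.createCluster('lab');  // mysql2 becomes the primary
cluster.addInstance('clusteradmin@mysql1:3306', {recoveryMethod: 'clone'});
cluster.addInstance('clusteradmin@mysql3:3306', {recoveryMethod: 'clone'});
print(cluster.status());
EOF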
Then create a table on the primary:

CREATE TABLE `lab` (
  `id` int NOT NULL AUTO_INCREMENT,
  `hostname` varchar(20) DEFAULT NULL,
  `created_at` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_created` (`created_at`)
) ENGINE=InnoDB;

Insert some data in a loop on mysql2:

while true; do
  mysql -usbtest -pxxxxx -P3306 -h127.0.0.1 \
    -e "INSERT INTO sysbench.lab (hostname) VALUES (@@hostname)"
done 2>/dev/null

On mysql2, also start another loop to show roughly how many rows are inserted per second:

while true; do
  mysql -BN -usbtest -pxxxxx -P3306 -hmysql2 -e \
    "SELECT 'mysql2', COUNT(*), NOW() FROM sysbench.lab WHERE created_at BETWEEN NOW() - INTERVAL 1 SECOND AND NOW()"
  sleep 1
done 2>/dev/null
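
An equivalent way to watch the insert rate (a sketch; same placeholder credentials) is to sample the Com_insert counter once per second:

mysqladmin -usbtest -pxxxxx -h127.0.0.1 -P3306 \
    extended-status -r -i 1 | grep -w Com_insert
# -r (--relative) prints per-interval deltas, so each sample is
# roughly the number of INSERTs in the last second.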

Cut the network between a reader and the primary:

mysql3# iptables -A INPUT -s mysql2 -j DROP; \
        iptables -A OUTPUT -d mysql2 -j DROP

You will see the impact immediately:

mysql2	48	2020-03-31 12:27:15
mysql2	50	2020-03-31 12:27:16
mysql2	51	2020-03-31 12:27:17
mysql2	51	2020-03-31 12:27:18
mysql2	52	2020-03-31 12:27:19
mysql2	53	2020-03-31 12:27:20
mysql2	54	2020-03-31 12:27:21
mysql2	55	2020-03-31 12:27:22
mysql2	56	2020-03-31 12:27:23
mysql2	56	2020-03-31 12:27:24
mysql2	26	2020-03-31 12:27:25
mysql2	8	2020-03-31 12:27:26
mysql2	7	2020-03-31 12:27:27
mysql2	8	2020-03-31 12:27:28
mysql2	4	2020-03-31 12:27:29
mysql2	2	2020-03-31 12:27:30
mysql2	2	2020-03-31 12:27:31
mysql2	2	2020-03-31 12:27:32
mysql2	2	2020-03-31 12:27:33
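
To undo the simulated partition afterwards (assuming these are the only DROP rules added for mysql2):

mysql3# iptables -D INPUT -s mysql2 -j DROP; \
        iptables -D OUTPUT -d mysql2 -j DROP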

Suggested fix:
I am not sure what is causing this degradation, but a partial network failure should not impact performance this badly.
[7 Apr 2020 13:38] MySQL Verification Team
Hi Tibor,

Thanks for the report. I can reproduce this, so I am verifying it.
I was trying to figure out whether I'd agree that this is a bug, and I tend to agree with you that it is. I'm verifying it, but we'll see what our GR team has to say about it.

Again, thanks for reporting this and for providing an excellent test case.

good health
Bogdan
[30 Apr 2020 14:06] Boris R
Any update on this? Does this also affect 8.0.20? This is a serious issue.
[11 Jun 2020 11:45] MySQL Verification Team
Bug #99830 is marked as duplicate of this one.
[31 Aug 2022 17:45] Matthew Boehm
Updates? This is still a HUGE issue in 8.0.28 when the network partitions.
[15 Sep 2022 16:29] Kenny Gryp
Please try this scenario with group_replication_paxos_single_leader=ON (https://dev.mysql.com/doc/refman/8.0/en/group-replication-single-consensus-leader.html)

(The cluster.status() output no longer reports inconsistent status depending on where you poll.)
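
A quick way to check and persist the setting (a sketch; the variable exists as of 8.0.27, and a change only takes effect after a full reboot of the group):

mysql -e "SELECT @@group_replication_paxos_single_leader"
mysql -e "SET PERSIST group_replication_paxos_single_leader = ON"
# Then restart Group Replication on all members, e.g. with
# dba.rebootClusterFromCompleteOutage() in MySQL Shell.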
[30 Jan 17:45] Matthew Boehm
MySQL 8.0.35 - group_replication_paxos_single_leader=ON does not help; the above issue is still observed. node1 sees all nodes online; node2 only sees node1; node3 only sees node2. Quorum cannot be agreed upon, nodes are not evicted, and transactions are not certified on the partially blocked node.
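
One way to capture each node's view during the partition (a sketch; hostnames and credentials are placeholders):

for h in node1 node2 node3; do
  echo "== membership as seen from $h =="
  mysql -uroot -pxxxxx -h$h -e \
    "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members"
done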
[30 Jan 17:50] Matthew Boehm
group_replication_unreachable_majority_timeout also does not help, because the current primary still sees a majority.