Bug #99133 Group Replication Performance Degradation with partial network outage
Submitted: 31 Mar 2020 15:08 Modified: 7 Apr 2020 13:38
Reporter: Tibor Korocz
Status: Verified
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version: 8.0.19    OS: Any
Assigned to:    CPU Architecture: Any

[31 Mar 2020 15:08] Tibor Korocz
Description:
Hi, 

I have a three node InnoDB Cluster. mysql1,mysql2,mysql3

mysql2 is the primary; mysql1 and mysql3 are the readers.

If we simulate a partial network outage, for example with iptables:

Running this on mysql3:

mysql3# iptables -A INPUT -s mysql2 -j DROP; \
        iptables -A OUTPUT -d mysql2 -j DROP

mysql3 will still get all the changes made on mysql2, because mysql1 is
going to act as a relay node and send all the changes to mysql3. You can
confirm this even with tcpdump.
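
For example, something like this on mysql3 (a sketch; the interface name eth0 and the default group communication port 33061 are assumptions, adjust for your setup):

mysql3# tcpdump -i eth0 -nn host mysql1 and port 33061
# While mysql2 is blocked, group communication traffic carrying the
# changes still arrives on mysql3 from mysql1.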

However, it has a huge performance impact. Before I cut the network I was able to insert 60-80 rows per second; after that, only 1-3 rows per second, which is a huge degradation.

Also, cluster.status() on mysql2 reports that mysql3 is not reachable, while mysql1 reports that all the nodes are Online, which is also interesting: in a cluster I would love to see all the nodes report the same cluster status, except when a node is totally isolated.
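
One way to compare the per-node views (a sketch; the root account and ports are placeholders):

for h in mysql1 mysql2 mysql3; do
  echo "== cluster.status() as seen from $h =="
  mysqlsh --uri root@$h:3306 --js -e "print(dba.getCluster().status())"
done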

How to repeat:
Create a three-node InnoDB Cluster (mysql1, mysql2, mysql3); a minimal sketch for this step follows.
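
A minimal MySQL Shell sketch (hostnames and the clusteradmin account are placeholders; assumes the instances are already prepared with dba.configureInstance()):

mysqlsh --uri clusteradmin@mysql2:3306 --js <<'EOF'
var cluster = dba.createCluster('lab');  // mysql2 becomes the primary
cluster.addInstance('clusteradmin@mysql1:3306', {recoveryMethod: 'clone'});
cluster.addInstance('clusteradmin@mysql3:3306', {recoveryMethod: 'clone'});
print(cluster.status());
EOF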
Then create a table on the primary:

CREATE TABLE `lab` (
  `id` int NOT NULL AUTO_INCREMENT,
  `hostname` varchar(20) DEFAULT NULL,
  `created_at` datetime DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_created` (`created_at`)
) ENGINE=InnoDB;

Insert some data in a loop on mysql2:

while true; do
  mysql -usbtest -pxxxxx -P3306 -h127.0.0.1 \
    -e "INSERT INTO sysbench.lab (hostname) VALUES (@@hostname)"
done 2>/dev/null

On mysql2, also start another loop to show roughly how many rows are inserted per second:

while true; do
  mysql -BN -usbtest -pxxxxx -P3306 -hmysql2 -e \
    "SELECT 'mysql2', COUNT(*), NOW() FROM sysbench.lab WHERE created_at BETWEEN NOW() - INTERVAL 1 SECOND AND NOW()"
  sleep 1
done 2>/dev/null
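
An equivalent way to watch the insert rate (a sketch; same placeholder credentials) is to sample the Com_insert counter once per second:

mysqladmin -usbtest -pxxxxx -h127.0.0.1 -P3306 \
    extended-status -r -i 1 | grep -w Com_insert
# -r (--relative) prints per-interval deltas, so each sample is
# roughly the number of INSERTs in the last second.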

Cut the network between a reader and the primary:

mysql3# iptables -A INPUT -s mysql2 -j DROP; \
        iptables -A OUTPUT -d mysql2 -j DROP

You will see the impact immediately:

mysql2	48	2020-03-31 12:27:15
mysql2	50	2020-03-31 12:27:16
mysql2	51	2020-03-31 12:27:17
mysql2	51	2020-03-31 12:27:18
mysql2	52	2020-03-31 12:27:19
mysql2	53	2020-03-31 12:27:20
mysql2	54	2020-03-31 12:27:21
mysql2	55	2020-03-31 12:27:22
mysql2	56	2020-03-31 12:27:23
mysql2	56	2020-03-31 12:27:24
mysql2	26	2020-03-31 12:27:25
mysql2	8	2020-03-31 12:27:26
mysql2	7	2020-03-31 12:27:27
mysql2	8	2020-03-31 12:27:28
mysql2	4	2020-03-31 12:27:29
mysql2	2	2020-03-31 12:27:30
mysql2	2	2020-03-31 12:27:31
mysql2	2	2020-03-31 12:27:32
mysql2	2	2020-03-31 12:27:33
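
To undo the simulated partition afterwards (assuming these are the only DROP rules added for mysql2):

mysql3# iptables -D INPUT -s mysql2 -j DROP; \
        iptables -D OUTPUT -d mysql2 -j DROP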

Suggested fix:
I am not sure what is causing this degradation, but a partial network failure should not impact performance this badly.
[7 Apr 2020 13:38] MySQL Verification Team
Hi Tibor,

Thanks for the report. I can reproduce this, so I am verifying it.
I was trying to figure out whether I'd agree that this is a bug, and I tend to agree with you that it is. I'm verifying it, but we'll see what our GR team has to say about it.

Again, thanks for reporting this and for providing an excellent test case.

good health
Bogdan
[30 Apr 2020 14:06] Boris R
Any update on this? Does this also affect 8.0.20? This is a serious issue.
[11 Jun 2020 11:45] MySQL Verification Team
Bug #99830 is marked as duplicate of this one.
[31 Aug 2022 17:45] Matthew Boehm
Updates? This is still a HUGE issue in 8.0.28 when the network partitions.
[15 Sep 2022 16:29] Kenny Gryp
Please try this scenario with group_replication_paxos_single_leader=ON (https://dev.mysql.com/doc/refman/8.0/en/group-replication-single-consensus-leader.html)

(The cluster.status() output no longer reports inconsistent status depending on where you poll.)
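
A quick way to check and persist the setting (a sketch; the variable exists as of 8.0.27, and a change only takes effect after a full reboot of the group):

mysql -e "SELECT @@group_replication_paxos_single_leader"
mysql -e "SET PERSIST group_replication_paxos_single_leader = ON"
# Then restart Group Replication on all members, e.g. with
# dba.rebootClusterFromCompleteOutage() in MySQL Shell.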
[30 Jan 17:45] Matthew Boehm
MySQL 8.0.35 - group_replication_paxos_single_leader=ON does not help; the above issue is still observed. node1 sees all nodes online; node2 only sees node1; node3 only sees node2. Quorum cannot be agreed upon, nodes are not evicted, and transactions are not certified on the partially blocked node.
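
One way to capture each node's view during the partition (a sketch; hostnames and credentials are placeholders):

for h in node1 node2 node3; do
  echo "== membership as seen from $h =="
  mysql -uroot -pxxxxx -h$h -e \
    "SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members"
done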
[30 Jan 17:50] Matthew Boehm
group_replication_unreachable_majority_timeout also does not help, because the current primary still sees a majority.