Description:
If some secondary nodes in a cluster are delayed in either applying or certifying transactions, then when a new election happens due to primary failure, those delayed nodes should not be promoted.
Or perhaps the amount of delay should act as a negative offset for the member weight? In any case, blindly failing over to a delayed node, even when all weights are the same, is clearly not the best option.
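For context, election priority today is influenced only by the per-member group_replication_member_weight variable (0-100, default 50), so the only way to bias an election is to raise it manually on preferred candidates, e.g.:
set global group_replication_member_weight=70;
A delay-based offset could be subtracted from this value at election time, along the lines suggested above.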
In my test I had 10.124.33.210 as the primary; I then killed it, and this is how the cluster looked afterwards (output is from lefred's gr_info mysqlsh report):
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
| server | role | version | quorum | tx behind | tx to cert | remote tx | local tx |
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
| 10.124.33.210:3306 | SECONDARY | 8.0.28 | YES | 0 | 1 | 0 | 0 |
| 10.124.33.88:3306 | PRIMARY | 8.0.28 | YES | 17291 | 0 | 2141956 | 1 |
| 10.124.33.36:3306 | SECONDARY | 8.0.28 | YES | 17441 | 0 | 857563 | 0 |
| 10.124.33.170:3306 | SECONDARY | 8.0.28 | YES | 0 | 0 | 2159240 | 0 |
| 10.124.33.176:3306 | SECONDARY | 8.0.28 | YES | 15733 | 0 | 1931694 | 0 |
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
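Note that the newly elected primary, 10.124.33.88, was 17291 transactions behind in the applier queue, while 10.124.33.170, which had no backlog at all, was passed over.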
How to repeat:
Set up a cluster in single-primary mode and raise the flow control thresholds so that large delays can accumulate:
set global group_replication_flow_control_applier_threshold=500000;
set global group_replication_flow_control_certifier_threshold=500000;
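Alternatively, disabling flow control entirely should, as far as I can tell, have the same effect:
set global group_replication_flow_control_mode='DISABLED';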
Run a simple sysbench workload; this was mine:
sysbench --threads=4 --tables=2 --table_size=10000 --time=0 --range-size=10 --index-updates=1 --report-interval=1 --db-ps-mode=disable --mysql-host=127.0.0.1 --mysql-user='sb' --mysql-password='secret' --mysql-db=test --rate=400 /usr/share/sysbench/oltp_write_only.lua run
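If the test tables do not exist yet, they need to be created first; assuming the same connection options, something like:
sysbench --tables=2 --table_size=10000 --mysql-host=127.0.0.1 --mysql-user='sb' --mysql-password='secret' --mysql-db=test /usr/share/sysbench/oltp_write_only.lua prepare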
Give some nodes a performance penalty (different I/O sync options, for example) and let them fall behind while observing with the aforementioned mysqlsh report;
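The same delay counters can also be read directly from performance_schema on any member, without the report plugin; the aliases below reflect how I read them against the report's "tx behind" and "tx to cert" columns:
select MEMBER_ID,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE as tx_behind,
       COUNT_TRANSACTIONS_IN_QUEUE as tx_to_cert
from performance_schema.replication_group_member_stats;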
Kill the writer.
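For example, by forcefully terminating the mysqld process on the primary host (assuming a single instance runs there):
kill -9 $(pidof mysqld)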
I am not 100% sure it will repeat every time; apparently it just picked the second node in the list, so I made the 4th node the one without delay (which would have been the ideal candidate).
Suggested fix:
Check the delay of each node as part of the election process and take it into consideration when electing the new primary. This would also keep clusters using BEFORE_ON_PRIMARY_FAILOVER from blocking unnecessarily while a delayed new primary catches up on its backlog.
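For reference, the consistency level mentioned above is the one set with:
set global group_replication_consistency='BEFORE_ON_PRIMARY_FAILOVER';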