Description:
If some secondary nodes in a cluster are delayed in either applying or certifying transactions, then when a new election happens due to primary failure, those delayed nodes should not be promoted.
Or perhaps the amount of delay should act as a negative offset for the member weight? In any case, blindly failing over to a delayed node, even when all weights are the same, is clearly not the best option.
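For context, election priority today is influenced only by the per-member group_replication_member_weight variable (0-100, default 50), so the only way to bias an election is to raise it manually on preferred candidates, e.g.:
set global group_replication_member_weight=70;
A delay-based offset could be subtracted from this value at election time, along the lines suggested above.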
In my test I had 10.124.33.210 as the primary; I then killed it, and this is how the cluster looked afterwards (output is from lefred's gr_info mysqlsh report):
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
| server | role | version | quorum | tx behind | tx to cert | remote tx | local tx |
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
| 10.124.33.210:3306 | SECONDARY | 8.0.28 | YES | 0 | 1 | 0 | 0 |
| 10.124.33.88:3306 | PRIMARY | 8.0.28 | YES | 17291 | 0 | 2141956 | 1 |
| 10.124.33.36:3306 | SECONDARY | 8.0.28 | YES | 17441 | 0 | 857563 | 0 |
| 10.124.33.170:3306 | SECONDARY | 8.0.28 | YES | 0 | 0 | 2159240 | 0 |
| 10.124.33.176:3306 | SECONDARY | 8.0.28 | YES | 15733 | 0 | 1931694 | 0 |
+--------------------+-----------+---------+--------+-----------+------------+-----------+----------+
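Note that the newly elected primary, 10.124.33.88, was 17291 transactions behind in the applier queue, while 10.124.33.170, which had no backlog at all, was passed over.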
How to repeat:
Set up a cluster in single-primary mode and raise the flow control thresholds so that large delays can accumulate:
set global group_replication_flow_control_applier_threshold=500000;
set global group_replication_flow_control_certifier_threshold=500000;
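Alternatively, disabling flow control entirely should, as far as I can tell, have the same effect:
set global group_replication_flow_control_mode='DISABLED';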
Run a simple sysbench workload; this was mine:
sysbench --threads=4 --tables=2 --table_size=10000 --time=0 --range-size=10 --index-updates=1 --report-interval=1 --db-ps-mode=disable --mysql-host=127.0.0.1 --mysql-user='sb' --mysql-password='secret' --mysql-db=test --rate=400 /usr/share/sysbench/oltp_write_only.lua run
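If the test tables do not exist yet, they need to be created first; assuming the same connection options, something like:
sysbench --tables=2 --table_size=10000 --mysql-host=127.0.0.1 --mysql-user='sb' --mysql-password='secret' --mysql-db=test /usr/share/sysbench/oltp_write_only.lua prepare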
Give some nodes a performance penalty (different I/O sync options, for example) and let them fall behind while observing with the aforementioned mysqlsh report;
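The same delay counters can also be read directly from performance_schema on any member, without the report plugin; the aliases below reflect how I read them against the report's "tx behind" and "tx to cert" columns:
select MEMBER_ID,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE as tx_behind,
       COUNT_TRANSACTIONS_IN_QUEUE as tx_to_cert
from performance_schema.replication_group_member_stats;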
Kill the writer.
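For example, by forcefully terminating the mysqld process on the primary host (assuming a single instance runs there):
kill -9 $(pidof mysqld)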
I am not 100% sure it will repeat every time; apparently it just picked the second node in the list, so I made the 4th node the one without delay (which would have been the ideal candidate).
Suggested fix:
Check the delay of each node as part of the election process and take it into consideration when electing the new primary. This would also keep clusters using BEFORE_ON_PRIMARY_FAILOVER from blocking unnecessarily while a delayed new primary catches up on its backlog.
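For reference, the consistency level mentioned above is the one set with:
set global group_replication_consistency='BEFORE_ON_PRIMARY_FAILOVER';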