MySQL Bugs: #110785: Wrong docs on implementing replication failover

Bug #110785	Wrong docs on implementing replication failover
Submitted:	24 Apr 2023 16:43	Modified:	26 Apr 2023 15:55
Reporter:	Mikael Ronström	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The docs currently says this:
Using @latest as the epoch value obtained in the previous step, you can obtain the correct starting position @pos in the correct binary log file @file from the mysql.ndb_binlog_index table on the source. The query shown here gets these from the Position and File columns from the last epoch applied before the logical restore position:

mysqlS> SELECT
     ->     @file:=SUBSTRING_INDEX(File, '/', -1),
     ->     @pos:=Position
     -> FROM mysql.ndb_binlog_index
     -> WHERE epoch > @latest
     -> ORDER BY epoch ASC LIMIT 1;
In the event that there is currently no replication traffic, you can get similar information by running SHOW MASTER STATUS on the source and using the value shown in the Position column of the output for the file whose name has the suffix with the greatest value for all files shown in the File column. In this case, you must determine which file this is and supply the name in the next step manually or by parsing the output with a script.

The above command on ndb_binlog_index often fails if attempted after a backup where the source cluster was inactive at the time of the backup and after it.

How to repeat:
Docs related

Suggested fix:
Change the query to use the following query instead.
This will work as long as there are any records of
epochs before or at the epoch we are attempting to
start the replication from.

mysqlS> SELECT
     ->     @file:=SUBSTRING_INDEX(next_file, '/', -1),
     ->     @pos:=next_position
     -> FROM mysql.ndb_binlog_index
     -> WHERE epoch <= @latest
     -> ORDER BY epoch DESC LIMIT 1;

If none such records exists it must have been purged, but in this case
the original query is quite likely to also fail since it is possible
that the purge removed records required for the start of the replication.

Thanks for the report Mikael,

all best

Posted by developer:

Docs currently describe two use cases related to replication positioning :
1) Failover https://dev.mysql.com/doc/refman/8.0/en/mysql-cluster-replication-failover.html
2) Deployment from backup : https://dev.mysql.com/doc/refman/8.0/en/mysql-cluster-replication-backups.html

In the failover case it is expected that the mysql.ndb_apply_status table contains a valid epoch number, which
can be looked up exactly to find the correct next binlog file + position.
Looking it up exactly (using = epoch) avoids ambiguity introduced by using inequality operators which can silently
skip over epoch gaps.

This bug report is concerned with the second Deployment from backup case.

---

When restoring a backup using ndb_restore and the --restore-epoch option, a pseudo epoch is added to the
mysql.ndb_apply_status of the form <gci>/<0xffffffff>.
This represents a logical position :
- *after* any sane epoch values <gci>/0 ... <gci>/0xfffffffe
- *before* <gci+1>/0

Note that it is not an actual epoch value that a cluster would generate or use as the microGCI is too high.
Any actual stored epoch number will be lower or higher than this, therefore an equality comparison will fail.

This means that it is guaranteed not to be present in the mysql.ndb_binlog_index table(s) of the source site.

Therefore it is required to use an inequality of some sort to find a start file and position.

Cases :
1) Source cluster has some post-backup binlog epochs, mysql.ndb_binlog_index contains entries > pseudo epoch.
2) Source cluster has no post-backup binlog epochs, mysql.ndb_binlog_index contains no entries > pseudo epoch
3) Source cluster has no binlog epochs.
4) Source cluster has purged binlog epochs up to some point after pseudo epoch - gap.

Ideal positioning in each case :
1) Position at start of first following epoch ({File, Position} > epoch)
2) Position after end of last preceding epoch ({next_file, next_position} < epoch)
3) Position at start of source Binlog files.
4) Reject positioning attempt as pseudo epoch is not usable for positioning.

Given that the pseudo epoch can be arbitrarily distant from any preceding or succeeding epoch, it is difficult
in the deployment case to determine whether there may be an epoch gap between it and the epochs available (case 4).

Cases 1 and 2 are trying to compute the same position, but case 1 requires a recorded epoch > the pseudo
epoch and case 2 requires an epoch < the pseudo epoch. In both cases such an epoch may not exist.

Therefore changing the deployment use case documentation to use a statement of form 2) is not guaranteed
to solve all potential problems, but an argument could be made that it might be more or less likely to
succeed than the existing statement of form 1).

The need for more 'decision making' in the positioning led to the filing of the following bug on which
some work has been done :
Bug#34190136 ADDS A STORED ROUTINE TO FETCH BINLOG POSITION AND FILE PATH

---

The statement proposed :

SELECT
-> @file:=SUBSTRING_INDEX(next_file, '/', -1),
-> @pos:=next_position
-> FROM mysql.ndb_binlog_index
-> WHERE epoch <= @latest
-> ORDER BY epoch DESC LIMIT 1;

is of form 2).
If there are no epoch entries preceding the Backup time, but there are entries afterwards then it
will not position correctly.

The use of <= rather than < indicates either that the nature of the pseudo epoch is misunderstood
or that this is intended to also be used as an alternative general channel cutover positioning
statement.

Currently we document a separate statement for the cutover use case using equality rather than an
inequality.
This is done in an attempt to improve the 'strictness' of the cutover positioning, at the cost of
having > 1 statement form to position replication which probably confuses.

---

Summary

- Proposed form may be confusing cutover + B+R deployment use cases or merging them on purpose.
- Proposal works for the case where epochs exist prior to pseudo epoch and not after.
- Proposal does not work for the case where epochs exist after the pseudo epoch and not before.
- Proposal does not work for the case where there are no epochs available.

Need to consider further how documentation can be improved to best inform users without overwhelming with details.

Thx for a thorough explanation. Wasn't aware that the epoch generated
by the Backup wasn't a real epoch. Helps explain why the MTR test case
goes through a bit more complex scheme to handle things.