Bug #119855 Relay Log Recovery after mts_recovery_groups failure should be retried
Submitted: 6 Feb 18:36
Modified: 6 Feb 18:38
Reporter: Jervin Real
Status: Open
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 8.0
OS: Any
Assigned to:
CPU Architecture: Any

[6 Feb 18:36] Jervin Real
Description:
When the server crashes on a binlogless replica, a relay log may contain data that was never synced to disk, so mysql.slave_relay_log_info and mysql.slave_worker_info can end up out of sync with the actual on-disk relay logs.

1. At https://github.com/mysql/mysql-server/blob/mysql-8.0.45/sql/rpl_replica.cc#L6362 the offset passed to relaylog_file_reader.open(linfo.log_file_name, offset) is beyond the on-disk size of the relay log. However, the open still succeeds.
2. At https://github.com/mysql/mysql-server/blob/mysql-8.0.45/sql/rpl_replica.cc#L6369 the resulting ev is nullptr, as expected, and the while loop exits.
3. At https://github.com/mysql/mysql-server/blob/mysql-8.0.45/sql/rpl_replica.cc#L6433, because the MTS gaps determination fails, the function returns an error and init_recovery eventually fails without running recover_relay_log.
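
The behavior in steps 1 and 2 resembles ordinary file I/O, where seeking past the end of a file is not an error and the next read simply returns nothing. The sketch below is only an analogy in Python, not the server's code: read_from_offset, the 100-byte file, and the 19-byte read size are all made up for illustration.

```python
import os
import tempfile

def read_from_offset(path, offset, size=19):
    """Seek to `offset` and try to read `size` bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        f.seek(offset)       # succeeds even when offset is past the end of the file
        return f.read(size)  # returns b"" when there is nothing left to read

# Stand-in for a relay log whose unsynced tail was lost in the crash.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100)
    relay_log = tmp.name

data_inside = read_from_offset(relay_log, 50)   # offset within the file
data_beyond = read_from_offset(relay_log, 500)  # offset beyond the file size
os.remove(relay_log)
```

Here data_beyond is empty even though the open and seek raised no error, which mirrors the open succeeding and ev coming back as nullptr.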

How to repeat:
1. Configure a replica with the default sync_relay_log setting, a maximum relay log size configured, and GTID mode disabled.
2. Run a workload on the source.
3. While the replica is both writing to and reading from the same relay log, execute /bin/echo c > /proc/sysrq-trigger on the host to crash it.

Suggested fix:
Ideally, the server should detect the relay log truncation, refetch the relay logs up to a point, and then retry the MTS gaps recovery.
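
As a rough illustration of the detection part only, the check could compare the position saved in the repository tables with the size of the relay log on disk. This is a sketch in Python under stated assumptions, not a proposed patch: relay_log_is_truncated and the sample sizes are hypothetical, and the real check would have to live in the server's recovery path.

```python
import os
import tempfile

def relay_log_is_truncated(path, recorded_pos):
    """Return True when the saved position lies beyond the file's actual size.

    `recorded_pos` stands for the relay log position read from the applier
    metadata; the function name and signature are hypothetical.
    """
    return recorded_pos > os.path.getsize(path)

# Stand-in relay log: 100 bytes actually reached the disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 100)
    relay_log = tmp.name

ok_at_end = relay_log_is_truncated(relay_log, 100)     # position exactly at end of file
lost_tail = relay_log_is_truncated(relay_log, 4096)    # position past end of file
os.remove(relay_log)
```

A position equal to the file size is treated as valid, since it means the applier had consumed everything that was written. Only a position past the end indicates that the tail of the relay log was lost.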

This problem is similar to https://bugs.mysql.com/bug.php?id=81840, but setting slave_preserve_commit_order = ON (with slave_parallel_type = LOGICAL_CLOCK) does not prevent it from happening.

[6 Feb 18:38] Jervin Real
When the problem is reproduced, the following errors are logged:

[ERROR] [MY-010575] [Repl] Error looking for file after /logs/relaylogs/relay_log.000893.
[ERROR] [MY-010426] [Repl] Replica: Failed to initialize the connection metadata structure for channel ''; its record may still be present in the applier metadata repository, consider deleting it.
[ERROR] [MY-010529] [Repl] Failed to create or recover replication info repositories.