Bug #112218 | relay log event crc check failed on arm platform | |
---|---|---|---
Submitted: | 30 Aug 2023 8:06 | Modified: | 17 Apr 5:00 |
Reporter: | Allen Iverson | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server | Severity: | S1 (Critical) |
Version: | 8.0.28 | OS: | Linux |
Assigned to: | MySQL Verification Team | CPU Architecture: | ARM |
Tags: | arm, corruption, crc check, relay | |
[30 Aug 2023 8:06]
Allen Iverson
[7 Sep 2023 14:36]
MySQL Verification Team
Hi,

More data is needed:

- have you tried 8.0.34 or 8.1?
- what build are you using?
- what OS are you running this build on?
- what hardware are you running this build on?

Thanks
[8 Oct 2023 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[8 Nov 2024 1:35]
Xuyang Zhang
We encountered the same problem on ARM ky10, MySQL 8.0.25...
[8 Nov 2024 8:48]
MySQL Verification Team
Can you please try the latest 8.0? There is really not much we can do to make 8.0.25 work.
[12 Apr 14:07]
Maciej Dobrzanski
I've encountered likely the same issue with 8.0.41 on ARM instances in Amazon RDS. In some cases the crc check failure is mentioned, but in others it is not. I do not have any additional details to share at this time. Two examples:

(1)
2025-04-12T03:38:18.238750Z 11531 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': Event crc check failed! Most likely there is event corruption.
2025-04-12T03:38:18.238841Z 11531 [ERROR] [MY-013121] [Repl] Replica SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121

(2)
2025-04-12T02:51:24.921299Z 11100 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': corrupted data in log event
2025-04-12T02:51:24.921351Z 11100 [ERROR] [MY-013121] [Repl] Replica SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121
[12 Apr 16:22]
Maciej Dobrzanski
Additional information:

- Replication can be restarted and it simply resumes as if nothing happened, just as the original reporter wrote.
- So far no skipped/lost transactions have been found (with limited pt-table-checksum runs, as GTID is not available in this cluster).
- Not sure whether it matters, but while the replica uses ARM, the master is x86.
[15 Apr 12:32]
MySQL Verification Team
I cannot reproduce this. I replicated over a terabyte between two ARM boxes, between x86 and ARM, and between ARM and x86 without a single problem... Since your replication "continues without problem", this seems to be a network problem. Not much we can do about that except detect that there is a problem and stop replication. You can try a VPN between those two machines to circumvent cloud network issues.
[17 Apr 5:00]
Allen Iverson
Reason: The error is not caused by a logic bug in MySQL's replication code, but by the weak memory consistency of the ARM architecture. Add a debug log to MySQL to print the event data when the problem occurs, compare it with the event data in the relay log, and you will find that the two are indeed inconsistent: the event header data becomes 00 00 00 00 00.

ARM employs a weakly consistent memory model. In MySQL's SQL thread, when reading the relay log to determine whether the file size exceeds the current read position, accessing atomic variables without lock protection leads to abnormal data reads and replication interruption.

Thread A (assumed to be the writer thread) completes a data write (assume the data resides in core X's L1 cache). Thread B, running on core Y, might observe the updated locks and atomic variables while the actual data in core X's L1 cache has not yet fully synchronized. To ensure Thread B on core Y sees the latest data:

- After writing, Thread A must execute memory barrier instructions (ensuring the write is propagated to other cores' caches).
- Before reading, Thread B must also execute memory barrier instructions.

Proper usage of C library locks (e.g., pthread_mutex) inherently handles this issue: if both the reader and writer threads ultimately take the same underlying pthread_mutex lock, cache consistency is guaranteed. A minimal sketch of this hazard follows the repeat steps below.

How to repeat:

Environment setup:

1. Prepare 3 ARM-based physical machines/VMs.
2. Deploy 3 MySQL instances on each machine (9 instances total), where each machine's 3 instances form a one-master-two-slaves cluster (random master selection).
3. Ensure the 3 masters are distributed across different physical nodes.
4. Run a sysbench test on each master with high-concurrency reads and writes continuously for several hours.
5. After running for a while, e.g. an hour, the problem occurs.
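For illustration, here is a minimal C++ sketch of the ordering hazard described in the previous comment. This is not MySQL source code; all names (`relay_buf`, `log_end_pos`, `writer_thread`, `reader_thread`) are hypothetical. The point is only that, on a weakly ordered CPU such as ARM, a size check done with relaxed atomics and no lock can succeed before the written event bytes themselves become visible, so the reader sees a zeroed header and the CRC check fails:

```cpp
// Sketch of the described race (hypothetical names; not MySQL source code).
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

static std::vector<unsigned char> relay_buf(1 << 20);  // stands in for the relay log file
static std::atomic<std::size_t> log_end_pos{0};        // bytes known to be written so far

// Plays the role of the writer (IO) thread: append an event, then publish the new size.
void writer_thread() {
  unsigned char event[19] = {0xFE};  // fake event header + payload
  std::size_t pos = log_end_pos.load(std::memory_order_relaxed);
  std::memcpy(relay_buf.data() + pos, event, sizeof(event));
  // Relaxed store, no release barrier: the memcpy above may become visible to
  // other cores *after* the new size does. A correct version would use
  // memory_order_release here, or hold a mutex around both steps.
  log_end_pos.store(pos + sizeof(event), std::memory_order_relaxed);
}

// Plays the role of the reader (SQL) thread: check the size, then read the event header.
void reader_thread(std::size_t read_pos) {
  // Relaxed load, no acquire barrier: seeing the new size does not guarantee
  // that the event bytes written before it are visible yet.
  if (log_end_pos.load(std::memory_order_relaxed) > read_pos) {
    unsigned char header[19];
    std::memcpy(header, relay_buf.data() + read_pos, sizeof(header));
    // On a weakly ordered CPU this header can still read back as all zeroes,
    // which is exactly what a CRC check on the event would then reject.
    (void)header;
  }
}

int main() {
  std::thread w(writer_thread);
  std::thread r(reader_thread, 0);
  w.join();
  r.join();
  return 0;
}
```

Pairing the store with memory_order_release and the load with memory_order_acquire, or protecting both sides with the same pthread_mutex as suggested above, removes the window: once the reader observes the new size, it is also guaranteed to observe the bytes written before it.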
[17 Apr 15:41]
MySQL Verification Team
Hi, Are you reproducing this on 8.0.42? I moved a lot of data (using sysbench) on ARM without reproducing this.