Bug #112218 | relay log event crc check failed on arm platform | |
---|---|---|---
Submitted: | 30 Aug 2023 8:06 | Modified: | 17 Apr 5:00 |
Reporter: | Allen Iverson | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server | Severity: | S1 (Critical) |
Version: | 8.0.28 | OS: | Linux |
Assigned to: | MySQL Verification Team | CPU Architecture: | ARM |
Tags: | arm, corruption, crc check, relay | |
[30 Aug 2023 8:06]
Allen Iverson
[7 Sep 2023 14:36]
MySQL Verification Team
Hi,

More data is needed:

- have you tried 8.0.34 or 8.1?
- what build are you using?
- what OS are you running this build on?
- what hardware are you running this build on?

Thanks
[8 Oct 2023 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[8 Nov 2024 1:35]
Xuyang Zhang
We encountered the same problem on ARM ky10, MySQL 8.0.25...
[8 Nov 2024 8:48]
MySQL Verification Team
Can you please try the latest 8.0? There is really not much we can do to make 8.0.25 work.
[12 Apr 14:07]
Maciej Dobrzanski
I've encountered likely the same issue with 8.0.41 on ARM instances in Amazon RDS. In some cases the crc check failure is mentioned, but in others it is not. I do not have any additional details to share at this time. Two examples:

(1)
2025-04-12T03:38:18.238750Z 11531 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': Event crc check failed! Most likely there is event corruption.
2025-04-12T03:38:18.238841Z 11531 [ERROR] [MY-013121] [Repl] Replica SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121

(2)
2025-04-12T02:51:24.921299Z 11100 [ERROR] [MY-010596] [Repl] Error reading relay log event for channel '': corrupted data in log event
2025-04-12T02:51:24.921351Z 11100 [ERROR] [MY-013121] [Repl] Replica SQL for channel '': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the source's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the replica's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the source's or replica's MySQL code. If you want to check the source's binary log or replica's relay log, you will be able to know their names by issuing 'SHOW REPLICA STATUS' on this replica. Error_code: MY-013121
[12 Apr 16:22]
Maciej Dobrzanski
Additional information:

- Replication can be restarted and it simply resumes as if nothing happened, just as the original reporter wrote.
- So far no skipped/lost transactions have been found (with limited pt-table-checksum runs, as GTID is not available in this cluster).
- Not sure whether it matters, but while the replica uses ARM, the master is x86.
[15 Apr 12:32]
MySQL Verification Team
I cannot reproduce this. I replicated over a terabyte between two ARM boxes, between x86 and ARM, and between ARM and x86 without a single problem... Since your replication "continues without problem", this seems to be a network problem. Not much we can do about that except detect that there is a problem and stop replication. You can try a VPN between those two machines to circumvent cloud network issues.
[17 Apr 5:00]
Allen Iverson
Reason: The error is not caused by a logic bug in MySQL's replication code, but by the weak memory consistency of the ARM architecture. Add a debug log to MySQL to print the event data when the problem occurs, compare it with the event data in the relay log, and you will find that the two are indeed inconsistent: the event header data becomes 00 00 00 00 00.

ARM employs a weakly consistent memory model. In MySQL's SQL thread, when reading the relay log to determine whether the file size exceeds the current read position, accessing atomic variables without lock protection leads to abnormal data reads and replication interruption.

Thread A (assumed to be the writer thread) completes a data write (assume the data resides in core X's L1 cache). Thread B, running on core Y, might observe the updated locks and atomic variables while the actual data in core X's L1 cache has not yet fully synchronized. To ensure Thread B on core Y sees the latest data:

- After writing, Thread A must execute memory barrier instructions (ensuring the write is propagated to other cores' caches).
- Before reading, Thread B must also execute memory barrier instructions.

Proper usage of C library locks (e.g., pthread_mutex) inherently handles this issue: if both the reader and writer threads ultimately take the same underlying pthread_mutex lock, cache consistency is guaranteed. A minimal sketch of this hazard follows the repeat steps below.

How to repeat:

Environment setup:

1. Prepare 3 ARM-based physical machines/VMs.
2. Deploy 3 MySQL instances on each machine (9 instances total), where each machine's 3 instances form a one-master-two-slaves cluster (random master selection).
3. Ensure the 3 masters are distributed across different physical nodes.
4. Run a sysbench test on each master with high-concurrency reads and writes continuously for several hours.
5. After running for a while, e.g. an hour, the problem occurs.
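For illustration, here is a minimal C++ sketch of the ordering hazard described in the previous comment. This is not MySQL source code; all names (`relay_buf`, `log_end_pos`, `writer_thread`, `reader_thread`) are hypothetical. The point is only that, on a weakly ordered CPU such as ARM, a size check done with relaxed atomics and no lock can succeed before the written event bytes themselves become visible, so the reader sees a zeroed header and the CRC check fails:

```cpp
// Sketch of the described race (hypothetical names; not MySQL source code).
#include <atomic>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

static std::vector<unsigned char> relay_buf(1 << 20);  // stands in for the relay log file
static std::atomic<std::size_t> log_end_pos{0};        // bytes known to be written so far

// Plays the role of the writer (IO) thread: append an event, then publish the new size.
void writer_thread() {
  unsigned char event[19] = {0xFE};  // fake event header + payload
  std::size_t pos = log_end_pos.load(std::memory_order_relaxed);
  std::memcpy(relay_buf.data() + pos, event, sizeof(event));
  // Relaxed store, no release barrier: the memcpy above may become visible to
  // other cores *after* the new size does. A correct version would use
  // memory_order_release here, or hold a mutex around both steps.
  log_end_pos.store(pos + sizeof(event), std::memory_order_relaxed);
}

// Plays the role of the reader (SQL) thread: check the size, then read the event header.
void reader_thread(std::size_t read_pos) {
  // Relaxed load, no acquire barrier: seeing the new size does not guarantee
  // that the event bytes written before it are visible yet.
  if (log_end_pos.load(std::memory_order_relaxed) > read_pos) {
    unsigned char header[19];
    std::memcpy(header, relay_buf.data() + read_pos, sizeof(header));
    // On a weakly ordered CPU this header can still read back as all zeroes,
    // which is exactly what a CRC check on the event would then reject.
    (void)header;
  }
}

int main() {
  std::thread w(writer_thread);
  std::thread r(reader_thread, 0);
  w.join();
  r.join();
  return 0;
}
```

Pairing the store with memory_order_release and the load with memory_order_acquire, or protecting both sides with the same pthread_mutex as suggested above, removes the window: once the reader observes the new size, it is also guaranteed to observe the bytes written before it.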
[17 Apr 15:41]
MySQL Verification Team
Hi, Are you reproducing this on 8.0.42? I moved a lot of data (using sysbench) on ARM without reproducing this.