Bug #118229 | Async replica fails in DELETE event with HA_ERR_KEY_NOT_FOUND when using InnoDB cluster (group replication) | ||
---|---|---|---|
Submitted: | 19 May 23:05 | Modified: | 3 Jun 19:31 |
Reporter: | Justin Swanhart | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S3 (Non-critical) |
Version: | 8.4.5, 8.0.41 | OS: | Any |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[19 May 23:05]
Justin Swanhart
[21 May 12:55]
Justin Swanhart
Skipping the transaction by setting @@session.gtid_next and committing an empty transaction heals the problem because in effect, it is running twice and skipping the second time doesn't put anything out of sync.
[21 May 18:42]
MySQL Verification Team
Hi Justin, I tried everything I can imagine to reproduce this but I cannot. Might be something with that backup. I will verify the bug so that replication team can take a look but it is not reproduced and if they can't find anything reviewing the code they can't do much. We would appreciate if you can save that backup that makes a problem reproducible so you can test it on some test system if our team have more questions since I did not manage to reproduce this. Thanks
[24 May 20:15]
Justin Swanhart
I can reproduce this using only the cne_job table by repeatedly dumping and reloading the table in a loop on the master. I don't think the contents of the table really matter, but this is how I got it to reproduce so I've provided the dump to you in an attachment. root@testdb4:~# cat doit.sh until false; do mysql vultr < ./load_and_delete.sql done root@testdb4:~# cat load_and_delete.sql \. cne_job.sql delete from cne_job;
[24 May 20:19]
Justin Swanhart
Note is took a few hours in the loop to produce the error on the replica.
[25 May 11:58]
Justin Swanhart
Here is the binary log from the primary when the SQL thread failed with HA_ERR_KEY_NOT_FOUND. I've included the event before the failure showing the GTID of the upstream server, the next GTID event with the GTID from the replica, and the beginning of the DELETE event which removed the rows but then the SQL thread tries the DELETE again. SET @@SESSION.GTID_NEXT= '048a2d66-37cb-11f0-bcc2-56000571f1e5:111873'/*!*/; # at 544000168p #250524 17:57:17 server id 6000 end_log_pos 544000303 CRC32 0x2ba84c4a Query thread_id=2027 exec_time=0 error_code=0 Xid = 62768999 SET TIMESTAMP=1748109437/*!*/; /*!80013 SET @@session.sql_require_primary_key=0*//*!*/; /*!40000 ALTER TABLE `cne_job` ENABLE KEYS */ /*!*/; # at 544000303 250524 17:57:19 server id 5011 end_log_pos 544000390 CRC32 0x10405153 GTID last_committed=465 sequence_number=466 rbr_only=yes original_committed_timestamp=1748109439738447 immediate_commit_timestamp=1748109439749902 transaction_length=1582808 /*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/; # original_commit_timestamp=1748109439738447 (2025-05-24 17:57:19.738447 UTC) # immediate_commit_timestamp=1748109439749902 (2025-05-24 17:57:19.749902 UTC) /*! 80001 SET @@session.original_commit_timestamp=1748109439738447*//*!*/; /*! 80014 SET @@session.original_server_version=80404*//*!*/; /*! 80014 SET @@session.immediate_server_version=80404*//*!*/; SET @@SESSION.GTID_NEXT= '91e9dec1-357e-11f0-acb6-5600056de5a1:163197'/*!*/; # at 544000390 #250524 17:57:19 server id 5011 end_log_pos 544000479 CRC32 0xbf1f1656 Query thread_id=1693418 exec_time=0 error_code=0 SET TIMESTAMP=1748109439/*!*/; SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=1, @@session.unique_checks=1/*!*/; SET @@session.sql_mode=1149239296/*!*/; SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=2/*!*/; BEGIN /*!*/; # at 544000479 #250524 17:57:19 server id 5011 end_log_pos 544000580 CRC32 0x9e6248a9 Table_map: `vultr`.`cne_job` mapped to number 6227 # has_generated_invisible_primary_key=0 # at 544000580 #250524 17:57:19 server id 5011 end_log_pos 544008701 CRC32 0x91c3e148 Delete_rows: table id 6227 ... #250524 17:57:19 server id 5011 end_log_pos 545583080 CRC32 0xd5b788ce Delete_rows: table id 6227 flags: STMT_END_F ### DELETE FROM `vultr`.`cne_job` ### WHERE ### @1=2307000
[25 May 12:21]
Justin Swanhart
I noticed that the timestamps indicate that the error occurred at the same time of day on both failures, to the second. This to me indicates that the problem may be related to the clock.
[25 May 22:49]
MySQL Verification Team
Awesome, thanks, I'll try reproducing this using your table.
[28 May 7:07]
MySQL Verification Team
Hi Justin, Still not reproduced on my side after 2 days. All nodes run some ntpd or similar time syncing daemon? Do you have something in cron being executed around that time? Thanks
[2 Jun 19:23]
Justin Swanhart
This morning I've confirmed this bug also affects 8.4.5. We are not using ntp or anything unusual related to the time on the servers. I spin up three boxes, install the latest packages and software and set up the cluster. The time corresponds with a cron on one of our application servers that executes: DELETE FROM `vultr`.`cne_job` where created_at < NOW() - INTERVAL 1 MONTH; I've included row from performance_schema.replication_applier_status_by_worker: | | 1 | NULL | OFF | 1032 | Worker 1 failed executing transaction '6ec10a05-f0ff-11ee-a42f-ac1f6b643f90:1341786214' at source log binlog.003307, end_log_pos 952853605; Could not execute Delete_rows event on table vultr.cne_job; Can't find record in 'cne_job', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's source log binlog.003307, end_log_pos 952853605 | 2025-06-02 17:57:20.044743 | 6ec10a05-f0ff-11ee-a42f-ac1f6b643f90:1341786213 | 2025-06-02 17:57:19.966355 | 2025-06-02 17:57:19.966355 | 2025-06-02 17:57:19.977473 | 2025-06-02 17:57:19.988088 | 6ec10a05-f0ff-11ee-a42f-ac1f6b643f90:1341786214 | 2025-06-02 17:57:20.028788 | 2025-06-02 17:57:20.028788 | 2025-06-02 17:57:20.044442 | 0 | 0 | | 0000-00-00 00:00:00.000000 | 0 | 0 | | 0000-00-00 00:00:00.000000 | | | 2 | NULL | OFF | 0 | As you can see it failed at about the same time again. Nothing else deletes from `cne_job` and only a dozen or so insertions per day with some updates. This is a completely new set of machines with a different OS (Ubuntu 22.04). We also have this problem on a regular source/replica replication setup of one of our other servers with a completely different data set. After running for 1 to 2 months the replica will simply fail in a delete with HA_ERR_KEY_NOT_FOUND. Once again I know for certain that the node was read-only and nothing changed data on the node. I've just been resyncing it when it fails, though I imagine I could just skip the error and it would be fine for another two months. I'm not sure what to do to reproduce this bug since a delete loop worked for me but doesn't work for you.
[3 Jun 10:33]
Justin Swanhart
Actually I was wrong, the DELETE comes from an event. Maybe this can help with reproducibility. It is the only event that we have: mysql> show events\G *************************** 1. row *************************** Db: vultr Name: clean_cne_job Definer: REDACTED Time zone: SYSTEM Type: RECURRING Execute at: NULL Interval value: 1 Interval field: DAY Starts: 2023-10-18 13:57:19 Ends: NULL Status: ENABLED Originator: 3015 character_set_client: utf8mb4 collation_connection: utf8mb4_0900_ai_ci Database Collation: utf8mb4_0900_ai_ci 1 row in set (0.00 sec) mysql> show create event clean_cne_job\G *************************** 1. row *************************** Event: clean_cne_job sql_mode: NO_ZERO_IN_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION time_zone: SYSTEM Create Event: CREATE DEFINER=`REDACTED` EVENT `clean_cne_job` ON SCHEDULE EVERY 1 DAY STARTS '2023-10-18 13:57:19' ON COMPLETION NOT PRESERVE ENABLE DO DELETE FROM cne_job WHERE created_at < DATE_SUB(NOW(), INTERVAL 1 MONTH) character_set_client: utf8mb4 collation_connection: utf8mb4_0900_ai_ci Database Collation: utf8mb4_0900_ai_ci 1 row in set (0.01 sec)
[3 Jun 10:55]
Justin Swanhart
I found something else interesting. I examined the master's binary log and found something I've never seen before: mysqlbinlog --base64-output=decode-rows -v -j952853605 binlog.003307 # The proper term is pseudo_replica_mode, but we use this compatibility alias # to make the statement usable on server versions 8.0.24 and older. /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/; /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/; DELIMITER /*!*/; # at 237 #250602 12:27:06 server id 3010 end_log_pos 126 CRC32 0x7f1e743e Start: binlog v 4, server v 8.0.39-30 created 250602 12:27:06 # at 952853605 #250602 13:57:19 server id 3010 end_log_pos 952861769 CRC32 0x484fe982 Delete_rows: table id 2327 # at 952861769 #250602 13:57:19 server id 3010 end_log_pos 952869672 CRC32 0x078749f0 Delete_rows: table id 2327 ... # at 953974596 #250602 13:57:19 server id 3010 end_log_pos 953982760 CRC32 0xaaeb072a Delete_rows: table id 2327 # at 953982760 #250602 13:57:19 server id 3010 end_log_pos 953990924 CRC32 0x00606f72 Delete_rows: table id 2327 # at 953990924 #250602 13:57:19 server id 3010 end_log_pos 953995259 CRC32 0x0f08d89d Delete_rows: table id 2327 flags: STMT_END_F ### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327### Row event for unknown table #2327# at 953995259 #250602 13:57:19 server id 3010 end_log_pos 953995290 CRC32 0xafb22565 Xid = 132160396242 COMMIT/*!*/; I am going to attach the binlog to this ticket.
[3 Jun 10:56]
Justin Swanhart
I think I realize that that is because I started at a position after the table map and asked it to decode rows. So maybe nothing unusual with the binary log after all.
[3 Jun 19:31]
Justin Swanhart
I disabled the event scheduler on the GR boxes and the problem went away, so chalk this up to user error. Sorry to waste your time.
[10 Jun 15:59]
MySQL Verification Team
Many thanks for the update!