Bug #97836 | In group replication foreign key will break slave aplier | ||
---|---|---|---|
Submitted: | 30 Nov 2019 12:54 | Modified: | 5 Dec 2019 7:49 |
Reporter: | zhang chen | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Group Replication | Severity: | S2 (Serious) |
Version: | 8.0.18 | OS: | Linux (CentOS Linux release 7.6.1810) |
Assigned to: | CPU Architecture: | Any (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) | |
Tags: | 1452;slave down |
[30 Nov 2019 12:54]
zhang chen
[3 Dec 2019 7:12]
MySQL Verification Team
Hello Zhang chen, Thank you for the report and test case. Verified as described. regards, Umesh
[5 Dec 2019 7:49]
zhang chen
Is there any way to solve this problem now?
[11 Aug 2020 9:17]
Andrew Ernst
I've seen this same situation play out somewhat randomly in 3-node Group Replication clusters. One of the two secondary nodes will randomly throw this error. I believe it's related to having parallel replication appliers (4). I have not seen this occur when the appliers are reduced to 1. It has been observed in MySQL 8.0.19 in multiple environments (running on CentOS 7.8)
[21 Aug 2020 23:04]
Andrew Ernst
This remains a problem on MySQL 8.0.21. I'm seeing this regardless of cpu type. Parallel appliers are set to 4 in my environment. model name : Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
[24 Feb 2021 22:29]
Keith Lammers
I'm running into this issue still on 8.0.23 as well. Default parallel workers at 4.
[12 Mar 2021 10:42]
Bin Wang
If foreign key is part of another table's primary key,then Group replication could not build correct last_committed value which causes wrong parallel slave working. That's why "slave_parallel_workers=1" all works right and "slave_parallel_workers>1" probably causes slave nodes error.
[24 May 2021 14:51]
Ye Jinrong
I also encountered the same problem with mysql 8.0.25. One difference is that in my case, the foreign key of table b is the same as the primary key of table a. ``` CREATE TABLE `a` ( `a` int NOT NULL, `b` char(10) NOT NULL, `c` char(10) NOT NULL, PRIMARY KEY (`a`,`b`,`c`) ) ENGINE=InnoDB; CREATE TABLE `b` ( `a` int NOT NULL, `b` char(10) NOT NULL, `c` char(10) NOT NULL, PRIMARY KEY (`a`,`b`,`c`), CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`) ) ENGINE=InnoDB; ``` In a 3 nodes group, one secondary node got errors, and the other secondary node is normal. The error message is ``` 2021-05-24T14:43:42.915978Z 286 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'f195537d-19ac-11eb-b29f-5254002eb6d6:109922895'; Could not execute Write_rows event on table mymgr.b; Cannot add or update a child row: a foreign key constraint fails (`mymgr`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452 2021-05-24T14:43:42.916223Z 284 [Warning] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756 2021-05-24T14:43:42.916259Z 284 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.' 2021-05-24T14:43:42.916378Z 281 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.' 2021-05-24T14:43:42.916471Z 281 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.' ```
[30 May 2021 16:11]
Raghavendra V
I am also seeing this issue when running TPCC load tests on MySQL 8.0.25. I have modified the history table with primary keys to avoid "unsupported by plugin" errors MySQL mysql-node01:3306 ssl JS > cluster.status() { "clusterName": "myCluster", "defaultReplicaSet": { "name": "default", "ssl": "REQUIRED", "status": "OK_NO_TOLERANCE", "statusText": "Cluster is NOT tolerant to any failures. 4 members are not active.", "topology": { "mysql-node01:3306": { "address": "mysql-node01:3306", "memberRole": "PRIMARY", "mode": "R/W", "readReplicas": {}, "replicationLag": null, "role": "HA", "status": "ONLINE", "version": "8.0.25" }, "mysql-node02:3306": { "address": "mysql-node02:3306", "instanceErrors": [ "ERROR: GR Applier channel applier stopped with an error: Worker 1 failed executing transaction 'db787556-bfb3-11eb-8c87-59124d4c9815:921'; Could not execute Write_rows event on table tpcc.STOCK; Cannot add or update a child row: a foreign key constraint fails (`tpcc`.`STOCK`, CONSTRAINT `FKEY_STOCK_2` FOREIGN KEY (`S_I_ID`) REFERENCES `ITEM` (`I_ID`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW (1452) at 2021-05-28 18:48:37.355181", "ERROR: group_replication has stopped with an error." ], "memberRole": "SECONDARY", "memberState": "ERROR", "mode": "R/O", "readReplicas": {}, "role": "HA", "status": "(MISSING)", "version": "8.0.25" }, ,........... same error like mysql-node02 for the other 3 nodes.
[7 Jul 2021 13:48]
Roman Vynar
Hello, Same problem. - MySQL versions affected at least 8.0.22, 8.0.25 - Group Replication - slave_parallel_workers default 10 - Delete query relying on FK CASCADE DELETE - It is written ordered to the relay log (1st child row delete, then 2nd parent row) but applied in parallel and can happen to apply in the wrong order breaking the slave applier thread. Shouldn't be ER_NO_REFERENCED_ROW type error considered for slave_retries when it is GR + slave_parallel_workers>0? Temporary fix is whether to not rely on CASCADE DELETE at all or set slave_parallel_workers=0/1 which is not efficient. Thanks, Roman
[7 Jul 2021 14:02]
Roman Vynar
Addition to my previous comment. Not just FK CASCADE DELETE but also FK unordered INSERT. Another use/break case: - Same MySQL versions, Group Replication and slave_parallel_workers. - INSERT into parent table. - INSERT into child table. Then it is Group Replicated to a secondary, written in the right order to relay log but applied in parallel thus sometimes a record to the child table is tried to be inserted before parent and failing the applier thread. Thanks, Roman
[25 Nov 2021 8:45]
Simon Mudd
Setting slave_parallel_workers = 0/1 is not really a fix. In many cases on busy systems it's unacceptable to apply this workaround as replication delay will have an impact on the business which parallel replication was designed to resolve. We're now on 8.0.27 so have had 8 minor version updates since this was reported and I consider this could be a blocker to moving from a master to GR due to the breakages it can cause. So please consider giving this more priority than seems to have happened so far.
[4 Mar 2022 6:34]
Shubham Jaiswal
I have tested the same scenario of zhang chen on MySQL 8.0.28. Still facing the same. If a table having child parent relationship, the replication applier thread will break on secondary nodes if they are very frequent. Please take it on priority, this is a major issue.
[10 May 2022 12:46]
MySQL Verification Team
Bug #107254 marked as duplicate of this one
[8 Jul 2022 7:03]
cc cc
in 2022/7/8 this bug still exists.
[6 Sep 2022 14:19]
Nuno Carvalho
The current information on server SQL layer does not have enough information in order to extract foreign keys dependencies from non key columns. The workaround is only reference complete keys on foreign keys. Example ------- CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c)); CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f), CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d) REFERENCES t1(a)); Table t1 has a primary key composed by three columns: a, b, c. The same happens to table t2, but the foreign key references a single column that is not a key by itself. The workaround is reference the complete key on table `t2` CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c)); CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f), CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d,e,f) REFERENCES t1(a,b,c));
[18 Dec 2023 5:59]
Aristotle Po
Hi MySQL Team, This is still reproducable in 8.0.35, in my case it is referencing to a PK with single column. With below structure : CREATE TABLE `t1` ( `id` int NOT NULL AUTO_INCREMENT, `c1` varchar(25) NOT NULL, PRIMARY KEY (`c1`), UNIQUE KEY id (`id`), KEY `c1` (`c1`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ; CREATE TABLE `t2` ( `id2` int NOT NULL AUTO_INCREMENT, `i1` int NOT NULL, `i2` int NOT NULL, `c2` varchar(45) NOT NULL, `c1` varchar(45) NOT NULL, PRIMARY KEY (`id2`), UNIQUE KEY `Unique` (`i1`,`i2`,`c2`,`c1`), CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ; ################################# # Script used ################################# shell> cat ./t2-bug97836.sh #!/bin/bash count=$2 if [ $# -eq 2 ]; then echo else echo " `basename $0` MYSQL_BINARY ITERATIONS `basename $0` /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 999 " exit; fi MYSQL="$1 test" i=0 while true do uuid=$(uuidgen) i=$((i+1)) $MYSQL -e"INSERT INTO t1(id, c1) VALUES ($i, left('$uuid',25)); INSERT INTO t2 (id2, i1, i2, c2, c1) VALUES ($i, $i, $i, left('$uuid',25), left('$uuid',25));" $MYSQL -e"DELETE FROM t2 WHERE id2 = $i; DELETE FROM t1 WHERE id = $i;" echo $i if [ $i -gt $count ] then exit 0; fi v_node_count=$($MYSQL -Bn --skip-column-names -e"SELECT count(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE';") if [ $v_node_count -lt 3 ]; then echo v_node_count = $v_node_count exit fi done ################################# # Execute script. It exited after 29 iterations. ################################# shell> ./t2-bug97836.sh /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 9999 1 [...] 29 v_node_count = 2 ################################# # Log ################################# 2023-12-18T11:54:34.581632+08:00 16 [ERROR] [MY-010584] [Repl] Replica SQL for channel 'group_replication_applier': Worker 2 failed executing transaction '00024535-bbbb-cccc-dddd-eeeeeeeeeeee:9358'; Could not execute Delete_rows event on table test.t1; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t2`, CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED, Error_code: MY-001451 [..] 2023-12-18T11:54:37.630278+08:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
[18 Jun 16:41]
Gary Whelan
Bug is still present in 8.0.36
[26 Aug 10:03]
Michael De Vijlder
I can confirm this bug also exists in the next LTS, confirmed as present in 8.4.2 This is breaking our cluster on regular base unfortunately, so the fact that this bug is open for 5 years is more than worrisome...
[26 Aug 20:39]
Evan Elias
Question for recent commenters: in your foreign key which triggers this problem, does the referenced table (parent table) have a unique key or primary key which exactly matches the columns in the foreign key constraint? MySQL 8.4 has been made stricter about this, at least when creating a new foreign key constraint. With default server settings, you now need to have an exactly matching unique key on the parent table, or else the FK creation fails with error 6125. I wrote more here https://www.skeema.io/blog/2024/05/14/mysql84-surprises/#foreign-key-restrictions The reason I bring this up is that in the MySQL 8.4.0 release notes, when describing this new foreign key behavior, the release notes reference this bug report. So that may imply this bug won't be fixed, but rather the solution is just to ensure you always have an exactly-matching unique key (or primary key) on the parent table. However, a couple of older the reports above do seem to already have an exactly matching unique key on the parent table... so that's confusing. In one of those cases though there's *also* a redundant non-unique index present, so perhaps that is related or is a separate bug?