Bug #97836 In group replication foreign key will break slave aplier
Submitted: 30 Nov 2019 12:54 Modified: 5 Dec 2019 7:49
Reporter: zhang chen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.18 OS:Linux (CentOS Linux release 7.6.1810)
Assigned to: CPU Architecture:Any (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)
Tags: 1452;slave down

[30 Nov 2019 12:54] zhang chen
Description:
In group replication, a table has foreign keys that depend on another table. If the main database is executed successfully, the slave may execute abnormally and exit the group.

I have tried to commit_order and other configurations, but they are still unsolvable

Currently, it can only be solved by setting slave_parallel_workers to 0, but it seriously affects efficiency.

How to repeat:
create table zxc.a
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c)
);

create table zxc.b
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c),
	CONSTRAINT foreign_key_b2a FOREIGN KEY (a) REFERENCES a (a)
);

use a shell run these:

#bin/sh
count=$1
i=0
while true
do
        let i=i+1
        mysql zxc -e"insert into a (a,b,c) values ($i,1,1);insert into b (a,b,c) values ($i,1,1);"

        mysql zxc -e"delete from b where a=$i;delete from a where a=$i;"
        echo $i
        if [ $i -gt $count ]
        then
                exit 0;
        fi
done

It works normally on the master, but you can see mysql.err in slave like these:
2019-11-30T20:40:52.064868+08:00 335242 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:12492217'; Could not execute Write_rows event on table zxc.b; Cannot add or update a child row: a foreign key constraint fails (`zxc`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`) REFERENCES `a` (`a`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452
[3 Dec 2019 7:12] MySQL Verification Team
Hello Zhang chen,

Thank you for the report and test case.
Verified as described.

regards,
Umesh
[5 Dec 2019 7:49] zhang chen
Is there any way to solve this problem now?
[11 Aug 2020 9:17] Andrew Ernst
I've seen this same situation play out somewhat randomly in 3-node Group Replication clusters.

One of the two secondary nodes will randomly throw this error.

I believe it's related to having parallel replication appliers (4). I have not seen this occur when the appliers are reduced to 1.

It has been observed in MySQL 8.0.19 in multiple environments (running on CentOS 7.8)
[21 Aug 2020 23:04] Andrew Ernst
This remains a problem on MySQL 8.0.21.
I'm seeing this regardless of cpu type.  Parallel appliers are set to 4 in my environment.

model name	: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
[24 Feb 2021 22:29] Keith Lammers
I'm running into this issue still on 8.0.23 as well. Default parallel workers at 4.
[12 Mar 2021 10:42] Bin Wang
If foreign key is part of another table's primary key,then Group replication could not build correct last_committed value which causes wrong parallel slave working.
That's why "slave_parallel_workers=1" all works right and "slave_parallel_workers>1" probably causes slave nodes error.
[24 May 2021 14:51] Ye Jinrong
I also encountered the same problem with mysql 8.0.25.

One difference is that in my case, the foreign key of table b is the same as the primary key of table a.
```
CREATE TABLE `a` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`)
) ENGINE=InnoDB;

CREATE TABLE `b` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`),
  CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)
) ENGINE=InnoDB;
```

In a 3 nodes group, one secondary node got errors, and the other secondary node is normal. The error message is
```
2021-05-24T14:43:42.915978Z 286 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'f195537d-19ac-11eb-b29f-5254002eb6d6:109922895'; Could not execute Write_rows event on table mymgr.b; Cannot add or update a child row: a foreign key constraint fails (`mymgr`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452
2021-05-24T14:43:42.916223Z 284 [Warning] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756
2021-05-24T14:43:42.916259Z 284 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2021-05-24T14:43:42.916378Z 281 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2021-05-24T14:43:42.916471Z 281 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
```
[30 May 2021 16:11] Raghavendra V
I am also seeing this issue when running TPCC load tests on MySQL 8.0.25. I have modified the history table with primary keys to avoid "unsupported by plugin" errors

 MySQL  mysql-node01:3306 ssl  JS > cluster.status()
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 4 members are not active.",
        "topology": {
            "mysql-node01:3306": {
                "address": "mysql-node01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.25"
            },
            "mysql-node02:3306": {
                "address": "mysql-node02:3306",
                "instanceErrors": [
                    "ERROR: GR Applier channel applier stopped with an error: Worker 1 failed executing transaction 'db787556-bfb3-11eb-8c87-59124d4c9815:921'; Could not execute Write_rows event on table tpcc.STOCK; Cannot add or update a child row: a foreign key constraint fails (`tpcc`.`STOCK`, CONSTRAINT `FKEY_STOCK_2` FOREIGN KEY (`S_I_ID`) REFERENCES `ITEM` (`I_ID`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW (1452) at 2021-05-28 18:48:37.355181",
                    "ERROR: group_replication has stopped with an error."
                ],
                "memberRole": "SECONDARY",
                "memberState": "ERROR",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.25"
            },
,...........
same error like mysql-node02 for the other 3 nodes.
[7 Jul 2021 13:48] Roman Vynar
Hello,

Same problem.

- MySQL versions affected at least 8.0.22, 8.0.25
- Group Replication
- slave_parallel_workers default 10
- Delete query relying on FK CASCADE DELETE
- It is written ordered to the relay log (1st child row delete, then 2nd parent row) but applied in parallel and can happen to apply in the wrong order breaking the slave applier thread.

Shouldn't be ER_NO_REFERENCED_ROW type error considered for slave_retries when it is GR + slave_parallel_workers>0?

Temporary fix is whether to not rely on CASCADE DELETE at all or set slave_parallel_workers=0/1 which is not efficient.

Thanks,
Roman
[7 Jul 2021 14:02] Roman Vynar
Addition to my previous comment. 
Not just FK CASCADE DELETE but also FK unordered INSERT.

Another use/break case:
- Same MySQL versions, Group Replication and slave_parallel_workers.
- INSERT into parent table.
- INSERT into child table.

Then it is Group Replicated to a secondary, written in the right order to relay log but applied in parallel thus sometimes a record to the child table is tried to be inserted before parent and failing the applier thread.

Thanks,
Roman
[25 Nov 2021 8:45] Simon Mudd
Setting slave_parallel_workers = 0/1 is not really a fix. In many cases on busy systems it's unacceptable to apply this workaround as replication delay will have an impact on the business which parallel replication was designed to resolve.

We're now on 8.0.27 so have had 8 minor version updates since this was reported and I consider this could be a blocker to moving from a master to GR due to the breakages it can cause.

So please consider giving this more priority than seems to have happened so far.
[4 Mar 2022 6:34] Shubham Jaiswal
I have tested the same scenario of zhang chen on MySQL 8.0.28. Still facing the same.

If a table having child parent relationship, the replication applier thread will break on secondary nodes if they are very frequent.

Please take it on priority, this is a major issue.
[10 May 2022 12:46] MySQL Verification Team
Bug #107254 marked as duplicate of this one
[8 Jul 2022 7:03] cc cc
in 2022/7/8 this bug still exists.
[6 Sep 2022 14:19] Nuno Carvalho
The current information on server SQL layer does not have enough
information in order to extract foreign keys dependencies from non
key columns.
The workaround is only reference complete keys on foreign keys.

Example
-------
CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c));
CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f),
                CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d) REFERENCES t1(a));

Table t1 has a primary key composed by three columns: a, b, c.
The same happens to table t2, but the foreign key references a
single column that is not a key by itself.

The workaround is reference the complete key on table `t2`
CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c));
CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f),
                CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d,e,f) REFERENCES t1(a,b,c));
[18 Dec 2023 5:59] Aristotle Po
Hi MySQL Team,
This is still reproducable in 8.0.35, in my case it is referencing to a PK with single column.
With below structure :

CREATE TABLE `t1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `c1` varchar(25) NOT NULL,
  PRIMARY KEY (`c1`),
  UNIQUE KEY id (`id`),
  KEY `c1` (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;
CREATE TABLE `t2` (
  `id2` int NOT NULL AUTO_INCREMENT,
  `i1` int NOT NULL,
  `i2` int NOT NULL,
  `c2` varchar(45) NOT NULL,
  `c1` varchar(45) NOT NULL,
  PRIMARY KEY (`id2`),
  UNIQUE KEY `Unique` (`i1`,`i2`,`c2`,`c1`),
  CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;

#################################
# Script used
#################################
shell> cat ./t2-bug97836.sh
#!/bin/bash
count=$2
if [ $# -eq 2 ]; then
  echo
else
  echo "
`basename $0` MYSQL_BINARY                             ITERATIONS
`basename $0` /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 999
"
  exit;
fi

MYSQL="$1 test"
i=0
while true
do
        uuid=$(uuidgen)
        i=$((i+1))
        $MYSQL -e"INSERT INTO t1(id, c1) VALUES ($i, left('$uuid',25)); INSERT INTO t2 (id2, i1, i2, c2, c1) VALUES ($i, $i, $i, left('$uuid',25), left('$uuid',25));"

        $MYSQL -e"DELETE FROM t2 WHERE id2 = $i; DELETE FROM t1 WHERE id = $i;"
        echo $i
        if [ $i -gt $count ]
        then
                exit 0;
        fi
        v_node_count=$($MYSQL -Bn --skip-column-names -e"SELECT count(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE';")
        if [ $v_node_count -lt 3 ]; then
          echo v_node_count = $v_node_count
          exit
        fi
done

#################################
# Execute script. It exited after 29 iterations.
#################################
shell> ./t2-bug97836.sh /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 9999
1
[...]
29
v_node_count = 2

#################################
# Log
#################################
2023-12-18T11:54:34.581632+08:00 16 [ERROR] [MY-010584] [Repl] Replica SQL for channel 'group_replication_applier': Worker 2 failed executing transaction '00024535-bbbb-cccc-dddd-eeeeeeeeeeee:9358'; Could not execute Delete_rows event on table test.t1; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t2`, CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED, Error_code: MY-001451
[..]
2023-12-18T11:54:37.630278+08:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'