Bug #97836 In group replication foreign key will break slave aplier
Submitted: 30 Nov 2019 12:54 Modified: 5 Dec 2019 7:49
Reporter: zhang chen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.18 OS:Linux (CentOS Linux release 7.6.1810)
Assigned to: CPU Architecture:Any (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)
Tags: 1452;slave down

[30 Nov 2019 12:54] zhang chen
Description:
In group replication, a table has foreign keys that depend on another table. If the main database is executed successfully, the slave may execute abnormally and exit the group.

I have tried to commit_order and other configurations, but they are still unsolvable

Currently, it can only be solved by setting slave_parallel_workers to 0, but it seriously affects efficiency.

How to repeat:
create table zxc.a
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c)
);

create table zxc.b
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c),
	CONSTRAINT foreign_key_b2a FOREIGN KEY (a) REFERENCES a (a)
);

use a shell run these:

#bin/sh
count=$1
i=0
while true
do
        let i=i+1
        mysql zxc -e"insert into a (a,b,c) values ($i,1,1);insert into b (a,b,c) values ($i,1,1);"

        mysql zxc -e"delete from b where a=$i;delete from a where a=$i;"
        echo $i
        if [ $i -gt $count ]
        then
                exit 0;
        fi
done

It works normally on the master, but you can see mysql.err in slave like these:
2019-11-30T20:40:52.064868+08:00 335242 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:12492217'; Could not execute Write_rows event on table zxc.b; Cannot add or update a child row: a foreign key constraint fails (`zxc`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`) REFERENCES `a` (`a`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452
[3 Dec 2019 7:12] MySQL Verification Team
Hello Zhang chen,

Thank you for the report and test case.
Verified as described.

regards,
Umesh
[5 Dec 2019 7:49] zhang chen
Is there any way to solve this problem now?
[11 Aug 2020 9:17] Andrew Ernst
I've seen this same situation play out somewhat randomly in 3-node Group Replication clusters.

One of the two secondary nodes will randomly throw this error.

I believe it's related to having parallel replication appliers (4). I have not seen this occur when the appliers are reduced to 1.

It has been observed in MySQL 8.0.19 in multiple environments (running on CentOS 7.8)
[21 Aug 2020 23:04] Andrew Ernst
This remains a problem on MySQL 8.0.21.
I'm seeing this regardless of cpu type.  Parallel appliers are set to 4 in my environment.

model name	: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
[24 Feb 22:29] Keith Lammers
I'm running into this issue still on 8.0.23 as well. Default parallel workers at 4.
[12 Mar 10:42] Bin Wang
If foreign key is part of another table's primary key,then Group replication could not build correct last_committed value which causes wrong parallel slave working.
That's why "slave_parallel_workers=1" all works right and "slave_parallel_workers>1" probably causes slave nodes error.
[24 May 14:51] Ye Jinrong
I also encountered the same problem with mysql 8.0.25.

One difference is that in my case, the foreign key of table b is the same as the primary key of table a.
```
CREATE TABLE `a` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`)
) ENGINE=InnoDB;

CREATE TABLE `b` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`),
  CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)
) ENGINE=InnoDB;
```

In a 3 nodes group, one secondary node got errors, and the other secondary node is normal. The error message is
```
2021-05-24T14:43:42.915978Z 286 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'f195537d-19ac-11eb-b29f-5254002eb6d6:109922895'; Could not execute Write_rows event on table mymgr.b; Cannot add or update a child row: a foreign key constraint fails (`mymgr`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452
2021-05-24T14:43:42.916223Z 284 [Warning] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756
2021-05-24T14:43:42.916259Z 284 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2021-05-24T14:43:42.916378Z 281 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2021-05-24T14:43:42.916471Z 281 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
```
[30 May 16:11] Raghavendra V
I am also seeing this issue when running TPCC load tests on MySQL 8.0.25. I have modified the history table with primary keys to avoid "unsupported by plugin" errors

 MySQL  mysql-node01:3306 ssl  JS > cluster.status()
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 4 members are not active.",
        "topology": {
            "mysql-node01:3306": {
                "address": "mysql-node01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.25"
            },
            "mysql-node02:3306": {
                "address": "mysql-node02:3306",
                "instanceErrors": [
                    "ERROR: GR Applier channel applier stopped with an error: Worker 1 failed executing transaction 'db787556-bfb3-11eb-8c87-59124d4c9815:921'; Could not execute Write_rows event on table tpcc.STOCK; Cannot add or update a child row: a foreign key constraint fails (`tpcc`.`STOCK`, CONSTRAINT `FKEY_STOCK_2` FOREIGN KEY (`S_I_ID`) REFERENCES `ITEM` (`I_ID`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW (1452) at 2021-05-28 18:48:37.355181",
                    "ERROR: group_replication has stopped with an error."
                ],
                "memberRole": "SECONDARY",
                "memberState": "ERROR",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.25"
            },
,...........
same error like mysql-node02 for the other 3 nodes.
[7 Jul 13:48] Roman Vynar
Hello,

Same problem.

- MySQL versions affected at least 8.0.22, 8.0.25
- Group Replication
- slave_parallel_workers default 10
- Delete query relying on FK CASCADE DELETE
- It is written ordered to the relay log (1st child row delete, then 2nd parent row) but applied in parallel and can happen to apply in the wrong order breaking the slave applier thread.

Shouldn't be ER_NO_REFERENCED_ROW type error considered for slave_retries when it is GR + slave_parallel_workers>0?

Temporary fix is whether to not rely on CASCADE DELETE at all or set slave_parallel_workers=0/1 which is not efficient.

Thanks,
Roman
[7 Jul 14:02] Roman Vynar
Addition to my previous comment. 
Not just FK CASCADE DELETE but also FK unordered INSERT.

Another use/break case:
- Same MySQL versions, Group Replication and slave_parallel_workers.
- INSERT into parent table.
- INSERT into child table.

Then it is Group Replicated to a secondary, written in the right order to relay log but applied in parallel thus sometimes a record to the child table is tried to be inserted before parent and failing the applier thread.

Thanks,
Roman