MySQL Bugs: #97836: In group replication foreign key will break slave aplier

Bug #97836	In group replication foreign key will break slave aplier
Submitted:	30 Nov 2019 12:54	Modified:	5 Dec 2019 7:49
Reporter:	zhang chen	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	8.0.18	OS:	Linux (CentOS Linux release 7.6.1810)
Assigned to:		CPU Architecture:	Any (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)
Tags:	1452;slave down

Description:
In group replication, a table has foreign keys that depend on another table. If the main database is executed successfully, the slave may execute abnormally and exit the group.

I have tried to commit_order and other configurations, but they are still unsolvable

Currently, it can only be solved by setting slave_parallel_workers to 0, but it seriously affects efficiency.

How to repeat:
create table zxc.a
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c)
);

create table zxc.b
(
	a int,
	b char(10),
	c char(10),
	primary key (a,b,c),
	CONSTRAINT foreign_key_b2a FOREIGN KEY (a) REFERENCES a (a)
);

use a shell run these:

#bin/sh
count=$1
i=0
while true
do
        let i=i+1
        mysql zxc -e"insert into a (a,b,c) values ($i,1,1);insert into b (a,b,c) values ($i,1,1);"

        mysql zxc -e"delete from b where a=$i;delete from a where a=$i;"
        echo $i
        if [ $i -gt $count ]
        then
                exit 0;
        fi
done

It works normally on the master, but you can see mysql.err in slave like these:
2019-11-30T20:40:52.064868+08:00 335242 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:12492217'; Could not execute Write_rows event on table zxc.b; Cannot add or update a child row: a foreign key constraint fails (`zxc`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`) REFERENCES `a` (`a`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452

Hello Zhang chen,

Thank you for the report and test case.
Verified as described.

regards,
Umesh

Is there any way to solve this problem now?

I've seen this same situation play out somewhat randomly in 3-node Group Replication clusters.

One of the two secondary nodes will randomly throw this error.

I believe it's related to having parallel replication appliers (4). I have not seen this occur when the appliers are reduced to 1.

It has been observed in MySQL 8.0.19 in multiple environments (running on CentOS 7.8)

This remains a problem on MySQL 8.0.21.
I'm seeing this regardless of cpu type.  Parallel appliers are set to 4 in my environment.

model name	: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz

I'm running into this issue still on 8.0.23 as well. Default parallel workers at 4.

If foreign key is part of another table's primary key,then Group replication could not build correct last_committed value which causes wrong parallel slave working.
That's why "slave_parallel_workers=1" all works right and "slave_parallel_workers>1" probably causes slave nodes error.

I also encountered the same problem with mysql 8.0.25.

One difference is that in my case, the foreign key of table b is the same as the primary key of table a.
```
CREATE TABLE `a` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`)
) ENGINE=InnoDB;

CREATE TABLE `b` (
  `a` int NOT NULL,
  `b` char(10) NOT NULL,
  `c` char(10) NOT NULL,
  PRIMARY KEY (`a`,`b`,`c`),
  CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)
) ENGINE=InnoDB;
```

In a 3 nodes group, one secondary node got errors, and the other secondary node is normal. The error message is
```
2021-05-24T14:43:42.915978Z 286 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'f195537d-19ac-11eb-b29f-5254002eb6d6:109922895'; Could not execute Write_rows event on table mymgr.b; Cannot add or update a child row: a foreign key constraint fails (`mymgr`.`b`, CONSTRAINT `foreign_key_b2a` FOREIGN KEY (`a`, `b`, `c`) REFERENCES `a` (`a`, `b`, `c`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW, Error_code: MY-001452
2021-05-24T14:43:42.916223Z 284 [Warning] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756
2021-05-24T14:43:42.916259Z 284 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2021-05-24T14:43:42.916378Z 281 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2021-05-24T14:43:42.916471Z 281 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
```

I am also seeing this issue when running TPCC load tests on MySQL 8.0.25. I have modified the history table with primary keys to avoid "unsupported by plugin" errors

 MySQL  mysql-node01:3306 ssl  JS > cluster.status()
{
    "clusterName": "myCluster",
    "defaultReplicaSet": {
        "name": "default",
        "ssl": "REQUIRED",
        "status": "OK_NO_TOLERANCE",
        "statusText": "Cluster is NOT tolerant to any failures. 4 members are not active.",
        "topology": {
            "mysql-node01:3306": {
                "address": "mysql-node01:3306",
                "memberRole": "PRIMARY",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.25"
            },
            "mysql-node02:3306": {
                "address": "mysql-node02:3306",
                "instanceErrors": [
                    "ERROR: GR Applier channel applier stopped with an error: Worker 1 failed executing transaction 'db787556-bfb3-11eb-8c87-59124d4c9815:921'; Could not execute Write_rows event on table tpcc.STOCK; Cannot add or update a child row: a foreign key constraint fails (`tpcc`.`STOCK`, CONSTRAINT `FKEY_STOCK_2` FOREIGN KEY (`S_I_ID`) REFERENCES `ITEM` (`I_ID`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW (1452) at 2021-05-28 18:48:37.355181",
                    "ERROR: group_replication has stopped with an error."
                ],
                "memberRole": "SECONDARY",
                "memberState": "ERROR",
                "mode": "R/O",
                "readReplicas": {},
                "role": "HA",
                "status": "(MISSING)",
                "version": "8.0.25"
            },
,...........
same error like mysql-node02 for the other 3 nodes.

Hello,

Same problem.

- MySQL versions affected at least 8.0.22, 8.0.25
- Group Replication
- slave_parallel_workers default 10
- Delete query relying on FK CASCADE DELETE
- It is written ordered to the relay log (1st child row delete, then 2nd parent row) but applied in parallel and can happen to apply in the wrong order breaking the slave applier thread.

Shouldn't be ER_NO_REFERENCED_ROW type error considered for slave_retries when it is GR + slave_parallel_workers>0?

Temporary fix is whether to not rely on CASCADE DELETE at all or set slave_parallel_workers=0/1 which is not efficient.

Thanks,
Roman

Addition to my previous comment. 
Not just FK CASCADE DELETE but also FK unordered INSERT.

Another use/break case:
- Same MySQL versions, Group Replication and slave_parallel_workers.
- INSERT into parent table.
- INSERT into child table.

Then it is Group Replicated to a secondary, written in the right order to relay log but applied in parallel thus sometimes a record to the child table is tried to be inserted before parent and failing the applier thread.

Thanks,
Roman

Setting slave_parallel_workers = 0/1 is not really a fix. In many cases on busy systems it's unacceptable to apply this workaround as replication delay will have an impact on the business which parallel replication was designed to resolve.

We're now on 8.0.27 so have had 8 minor version updates since this was reported and I consider this could be a blocker to moving from a master to GR due to the breakages it can cause.

So please consider giving this more priority than seems to have happened so far.

I have tested the same scenario of zhang chen on MySQL 8.0.28. Still facing the same.

If a table having child parent relationship, the replication applier thread will break on secondary nodes if they are very frequent.

Please take it on priority, this is a major issue.

Bug #107254 marked as duplicate of this one

in 2022/7/8 this bug still exists.

The current information on server SQL layer does not have enough
information in order to extract foreign keys dependencies from non
key columns.
The workaround is only reference complete keys on foreign keys.

Example
-------
CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c));
CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f),
                CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d) REFERENCES t1(a));

Table t1 has a primary key composed by three columns: a, b, c.
The same happens to table t2, but the foreign key references a
single column that is not a key by itself.

The workaround is reference the complete key on table `t2`
CREATE TABLE t1(a INT, b INT, c INT, PRIMARY KEY (a,b,c));
CREATE TABLE t2(d INT, e INT, f INT, PRIMARY KEY (d,e,f),
                CONSTRAINT foreign_key_t2_t1 FOREIGN KEY (d,e,f) REFERENCES t1(a,b,c));

Hi MySQL Team,
This is still reproducable in 8.0.35, in my case it is referencing to a PK with single column.
With below structure :

CREATE TABLE `t1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `c1` varchar(25) NOT NULL,
  PRIMARY KEY (`c1`),
  UNIQUE KEY id (`id`),
  KEY `c1` (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;
CREATE TABLE `t2` (
  `id2` int NOT NULL AUTO_INCREMENT,
  `i1` int NOT NULL,
  `i2` int NOT NULL,
  `c2` varchar(45) NOT NULL,
  `c1` varchar(45) NOT NULL,
  PRIMARY KEY (`id2`),
  UNIQUE KEY `Unique` (`i1`,`i2`,`c2`,`c1`),
  CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
;

#################################
# Script used
#################################
shell> cat ./t2-bug97836.sh
#!/bin/bash
count=$2
if [ $# -eq 2 ]; then
  echo
else
  echo "
`basename $0` MYSQL_BINARY                             ITERATIONS
`basename $0` /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 999
"
  exit;
fi

MYSQL="$1 test"
i=0
while true
do
        uuid=$(uuidgen)
        i=$((i+1))
        $MYSQL -e"INSERT INTO t1(id, c1) VALUES ($i, left('$uuid',25)); INSERT INTO t2 (id2, i1, i2, c2, c1) VALUES ($i, $i, $i, left('$uuid',25), left('$uuid',25));"

        $MYSQL -e"DELETE FROM t2 WHERE id2 = $i; DELETE FROM t1 WHERE id = $i;"
        echo $i
        if [ $i -gt $count ]
        then
                exit 0;
        fi
        v_node_count=$($MYSQL -Bn --skip-column-names -e"SELECT count(*) FROM performance_schema.replication_group_members WHERE MEMBER_STATE = 'ONLINE';")
        if [ $v_node_count -lt 3 ]; then
          echo v_node_count = $v_node_count
          exit
        fi
done

#################################
# Execute script. It exited after 29 iterations.
#################################
shell> ./t2-bug97836.sh /nfs/sandboxes/my/group_sp_msb_8_0_35/n1 9999
1
[...]
29
v_node_count = 2

#################################
# Log
#################################
2023-12-18T11:54:34.581632+08:00 16 [ERROR] [MY-010584] [Repl] Replica SQL for channel 'group_replication_applier': Worker 2 failed executing transaction '00024535-bbbb-cccc-dddd-eeeeeeeeeeee:9358'; Could not execute Delete_rows event on table test.t1; Cannot delete or update a parent row: a foreign key constraint fails (`test`.`t2`, CONSTRAINT `FK_c1` FOREIGN KEY (`c1`) REFERENCES `t1` (`c1`)), Error_code: 1451; handler error HA_ERR_ROW_IS_REFERENCED, Error_code: MY-001451
[..]
2023-12-18T11:54:37.630278+08:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

Bug is still present in 8.0.36

I can confirm this bug also exists in the next LTS, confirmed as present in 8.4.2

This is breaking our cluster on regular base unfortunately, so the fact that this bug is open for 5 years is more than worrisome...

Question for recent commenters: in your foreign key which triggers this problem, does the referenced table (parent table) have a unique key or primary key which exactly matches the columns in the foreign key constraint?

MySQL 8.4 has been made stricter about this, at least when creating a new foreign key constraint. With default server settings, you now need to have an exactly matching unique key on the parent table, or else the FK creation fails with error 6125. I wrote more here https://www.skeema.io/blog/2024/05/14/mysql84-surprises/#foreign-key-restrictions

The reason I bring this up is that in the MySQL 8.4.0 release notes, when describing this new foreign key behavior, the release notes reference this bug report. So that may imply this bug won't be fixed, but rather the solution is just to ensure you always have an exactly-matching unique key (or primary key) on the parent table.

However, a couple of older the reports above do seem to already have an exactly matching unique key on the parent table... so that's confusing. In one of those cases though there's *also* a redundant non-unique index present, so perhaps that is related or is a separate bug?

Maybe this reasoning model output can help target the changes needed?

Likely Code References for Investigation:

sql/binlog.cc, around MYSQL_BIN_LOG::ordered_commit()

This function calculates the last_committed value for each transaction. The logic must ensure foreign key references in one transaction create a dependency on the transaction that inserted/deleted the parent row. A known deficiency here can result in parallel replication applying child-table transactions before their parent, causing the foreign key error.

sql/log_event.cc, especially Gtid_log_event construction

The GTID event contains last_committed and sequence_number fields. Ensure the event logic is updated to reflect the correct dependency when foreign keys are involved.

Write-set extraction logic

The code that collects write-sets for transactions (e.g., in sql/transaction.cc or via service_rpl_transaction_write_set.h) may need modifications so that foreign key references are treated as a dependency. Otherwise, the parent/child relationship is not recognized, and last_committed remains too low.
Replication Applier (e.g., sql/rpl_slave.cc and sql/rpl_parallel.cc)

Confirm that the coordinator/worker threads (which rely on last_committed for scheduling parallel transactions) can properly serialize transactions with FK dependencies. The primary fix is likely on the master side, but verifying applier logic ensures there are no further ordering oversights.