Bug #94991 multi-table deletes with foreign keys take down all replicas
Submitted: 11 Apr 2019 23:43 Modified: 17 Apr 2019 18:25
Reporter: Trey Raymond Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.7.25, 8.0.15 OS:Any
Assigned to: CPU Architecture:Any

[11 Apr 2019 23:43] Trey Raymond
Description:
consider two tables, a and b, where b references a via a foreign key constraint.  a user attempts a multi-table delete on both, referencing a parent and child row.

-- WITHOUT cascade option in fk, it fails with the following error
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`ops_ia`.`b`, CONSTRAINT `asdf` FOREIGN KEY (`a_id`) REFERENCES `a` (`id`))
-- this ideally would not error and the multi table delete would work as it's deleting the child row so not actually violating anything, but it's not that critical.  the fix is tied into the more serious case, below:

-- WITH cascade option, it's catastrophic, takes all slaves down
Last_SQL_Errno: 1032
Last_SQL_Error: Could not execute Delete_rows event on table ops_ia.b; Can't find record in 'b', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log spare-002-bin.000032, end_log_pos 408791771
-- this is probably because it deletes the record from 'a' with the first row event, which cascades down to 'b', then the second row event can't find it.  slaves shouldn't be processing cascades, though

-- mysqlbinlog output, what the slave sees:
#190411 23:24:26 server id 184354198  end_log_pos 408791523 CRC32 0x490c3efc 	GTID	last_committed=84998	sequence_number=84999	rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= 'ff0d4d97-2bf9-11e9-a7a6-008cfa1eefb4:2694913'/*!*/;
#190411 23:24:26 server id 184354198  end_log_pos 408791599 CRC32 0x4fb55be0 	Query	thread_id=229293	exec_time=0	error_code=0
SET TIMESTAMP=1555025066/*!*/;
SET @@session.pseudo_thread_id=229293/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1342701568/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
#190411 23:24:26 server id 184354198  end_log_pos 408791645 CRC32 0x18192d09 	Table_map: `ops_ia`.`a` mapped to number 3917
#190411 23:24:26 server id 184354198  end_log_pos 408791691 CRC32 0xff6192ab 	Table_map: `ops_ia`.`b` mapped to number 3919
#190411 23:24:26 server id 184354198  end_log_pos 408791731 CRC32 0xaa406089 	Delete_rows: table id 3917
#190411 23:24:26 server id 184354198  end_log_pos 408791771 CRC32 0xb8647900 	Delete_rows: table id 3919 flags: STMT_END_F

BINLOG '
qsyvXBOWBf0KLgAAAF2qXRgAAE0PAAAAAAMABm9wc19pYQABYQABAwAACS0ZGA==
qsyvXBOWBf0KLgAAAIuqXRgAAE8PAAAAAAMABm9wc19pYQABYgABAwAAq5Jh/w==
qsyvXCCWBf0KKAAAALOqXRgAAE0PAAAAAAAAAgABAf4BAAAAiWBAqg==
qsyvXCCWBf0KKAAAANuqXRgAAE8PAAAAAAEAAgABAf4BAAAAAHlkuA==
'/*!*/;
### DELETE FROM `ops_ia`.`a`
### WHERE
###   @1=1
### DELETE FROM `ops_ia`.`b`
### WHERE
###   @1=1
#190411 23:24:26 server id 184354198  end_log_pos 408791802 CRC32 0xaea87584 	Xid = 37138747
COMMIT/*!*/;

How to repeat:
-- ON MASTER:

CREATE TABLE `a` (
  `id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `b` (
  `a_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`a_id`),
  CONSTRAINT `asdf` FOREIGN KEY (`a_id`) REFERENCES `a` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

insert into a set id=1;
insert into b set a_id=1;

delete a,b
from a
join b on b.a_id=a.id
where b.a_id=1;

-- check slaves, they will be stopped on error

ALTERNATE case [still with cascade, without no binlog is written because error on master], if I run:
delete from a where id=1;
then it logs only the 'a' row, and the slave processes the cascade.

#190411 23:34:24 server id 184354198  end_log_pos 408793209 CRC32 0xdf9d3bda 	GTID	last_committed=85005	sequence_number=85006	rbr_only=yes
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
SET @@SESSION.GTID_NEXT= 'ff0d4d97-2bf9-11e9-a7a6-008cfa1eefb4:2694920'/*!*/;
#190411 23:34:24 server id 184354198  end_log_pos 408793285 CRC32 0xa3f787bd 	Query	thread_id=229315	exec_time=0	error_code=0
SET TIMESTAMP=1555025664/*!*/;
SET @@session.pseudo_thread_id=229315/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1342701568/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=33/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
#190411 23:34:24 server id 184354198  end_log_pos 408793331 CRC32 0xc5b28ff1 	Table_map: `ops_ia`.`a` mapped to number 3917
#190411 23:34:24 server id 184354198  end_log_pos 408793371 CRC32 0x3f15ead7 	Delete_rows: table id 3917 flags: STMT_END_F

BINLOG '
AM+vXBOWBf0KLgAAAPOwXRgAAE0PAAAAAAMABm9wc19pYQABYQABAwAA8Y+yxQ==
AM+vXCCWBf0KKAAAABuxXRgAAE0PAAAAAAEAAgABAf4BAAAA1+oVPw==
'/*!*/;
### DELETE FROM `ops_ia`.`a`
### WHERE
###   @1=1
#190411 23:34:24 server id 184354198  end_log_pos 408793402 CRC32 0xc33d10f6 	Xid = 37139216
COMMIT/*!*/;

Suggested fix:
Two fixes:
1. process cascades on the master and log the results as row events, vs processing them on slaves.  this ensures all changes are deterministic (for example - if slaves have different constraints than master).  this would be the same way triggers are currently handled (and everything else in the RBR world).
2. make multi-table deletes aware of constraints, basically so the no-cascade-fk option described would succeed, that would have the effect of making the cascade option succeed as well

Doing only #2 should solve this specific issue, but leaves a large risk of other issues, because nondeterministic replication is bad juju.
[11 Apr 2019 23:52] Trey Raymond
observed on 5.7.25
[12 Apr 2019 8:53] MySQL Verification Team
Hello Trey Raymond,

Thank you for the report and test case.
Verified as described.

Thanks,
Umesh
[12 Apr 2019 16:57] Trey Raymond
interesting note:
changing the order of the tables in the delete spec does nothing, i.e. "delete a,b from..." fails just the same as "delete b,a from..."
however, changing the specified join order does make a difference.  i.e. "from a join b" fails, but "from b join a" succeeds, meaning that child rows are purged first here, and the deletes behave very differently based on that order.
[15 Apr 2019 13:46] Sveta Smirnova
Duplicate of bug #80821?
[17 Apr 2019 18:22] Dmitry Lenev
Posted by developer:
 
Hello!

The issue described above is the same as one described in
bug #23029272 / #80821 "REPLICATION BREAKS IF MULTI-TABLE DELETE
IS USED IN CONJUNCTION WITH FOREIGN KEY".

So I am marking this bug report as a duplicate of the latter.