MySQL Bugs: #26551: Aborted query for non-transactional table breaks replication

Bug #26551	Aborted query for non-transactional table breaks replication
Submitted:	22 Feb 2007 0:50	Modified:	1 May 2007 14:02
Reporter:	Arjen Lentz	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	any	OS:	Any (any)
Assigned to:	Mats Kindahl	CPU Architecture:	Any
Tags:	bfsm_2007_03_01, replication

Description:
If a query on a master is aborted (through kill or other means) but operated on a non-transactional table, it is nevertheless logged, with the appropriate error code.
A slave then executes this query, inevitably breaking replication because
 a) it won't get an error, but that doesn't match up with the errorcode logged and so the SQL thread stops.
 b) the data is out of sync, since the full query was executed on the slave but not on the master.

This handling appears fundamentally wrong, since it never has the desired effect from the user perspective.

How to repeat:
Abort a query on a master on a non-transactional table.

Suggested fix:
Considering replication will break anyway, I'd rather see the slave stop BEFORE this statement, then people will have a better chance of fixing things.

An alternative idea from Sinisa:
Don't log an error, but change the query, so that only rows that made into the master are logged into event log. This is doable for all three situations:
1) For INSERT ... SELECT, that would imply adding LIMIT at the end of the command
2) For multi-row INSERT, that would mean that only rows that made into the master would go into the event
3) For LOAD DATA, a file would be truncated at the point where last row was inserted.

If the Sinisa solution works in all possible cases, it is of course preferred over the "stop" alternative. However, something has to be done to fix up the current behaviour as it is just not practical.

Thank you for the report.

Verified as described.

test case

Attachment: rpl_bug26551.test (application/octet-stream, text), 626 bytes.

Rewriting the statement with a LIMIT last in the case that the query completed partially will not work. Here is a simple example demonstrating why.

  # Setting up the evil scenario. The difference in insert ordering
  # below can be caused by suitable application of ALTER TABLE, or by a
  # DBA patching the slave database due to an error.

  connection master;
  CREATE TABLE t1 (a INT);
  CREATE TABLE t2 (a INT);

  SET SQL_LOG_BIN = 0;
  INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
  SET SQL_LOG_BIN = 1;

  sync_slave_with_master;
  INSERT INTO t1 VALUES (1),(3),(2),(4),(5);

  connection master;

  # Simulate a failing insert that is rewritten with a LIMIT last
  INSERT INTO t2 SELECT * FROM t1 LIMIT 2;
  SELECT * FROM t1 ORDER BY a;
  SELECT * FROM t2 ORDER BY a;
  sync_slave_with_master;
  SELECT * FROM t1 ORDER BY a;
  SELECT * FROM t2 ORDER BY a;

  connection master;
  DROP TABLE IF EXISTS t1, t2;
  sync_slave_with_master;

... and here is the result of the execution

  CREATE TABLE t1 (a INT);
  CREATE TABLE t2 (a INT);
  SET SQL_LOG_BIN = 0;
  INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
  SET SQL_LOG_BIN = 1;
  INSERT INTO t1 VALUES (1),(3),(2),(4),(5);
  INSERT INTO t2 SELECT * FROM t1 LIMIT 2;
  SELECT * FROM t1 ORDER BY a;
  a
  1
  2
  3
  4
  5
  SELECT * FROM t2 ORDER BY a;
  a
  1
  2
  SELECT * FROM t1 ORDER BY a;
  a
  1
  2
  3
  4
  5
  SELECT * FROM t2 ORDER BY a;
  a
  1
  3

The given test case, as written, cannot trigger the bug. When killing the server with a SIGKILL, the statement is aborted in the middle, and therefore not even written to the binary log. I tried switching to using SIGTERM, SIGINT, and SIGHUP, but all failed to generate the aborted statement in the binary log. Instead, I elected to abort a statement using KILL CONNECTION, which caused it to be written to the binary log. In this case, the statement was *not* executed on the slave side, since the ER_NEW_ABORTING_CONNECTION (1184) was generated.

However, it turned out that when using KILL QUERY, the statement *was* executed since that generates an ER_QUERY_INTERRUPTED (1317), which is not in the list of special errors that prevent the statement from being executed.

I have attached a test file that can reproduce the error, and I have changed the title of the bug report to more accurately reflect the actual error.

Test file to check that killing a connection will prevent the statement from executing on the slave

Attachment: rpl_bug26551.test (application/octet-stream, text), 933 bytes.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23750

ChangeSet@1.2428, 2007-04-04 09:30:45+02:00, mats@romeo.(none) +3 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23751

ChangeSet@1.2428, 2007-04-04 09:33:42+02:00, mats@romeo.(none) +4 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/24375

ChangeSet@1.2428, 2007-04-12 14:00:45+02:00, mats@romeo.(none) +3 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.

Pushed into 5.0.42

Pushed into 5.1.18-beta

Noted in 5.0.42, 5.1.18 changelogs.

Aborting a statement on the master that applied to a
non-transactional statement broke replication. The statement was
written to the binary log but not completely executed on the master.
Slaves receiving the statement executed it completely, resulting in
loss of data synchrony. Now an error code is written to the error log
so that the slaves stop without executing the aborted statement.
(That is, replication stops, but synchrony to the point of the stop
is preserved and you can investigate the problem.)