Bug #26551 Aborted query for non-transactional table breaks replication
Submitted: 22 Feb 2007 0:50 Modified: 1 May 2007 14:02
Reporter: Arjen Lentz Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:any OS:Any (any)
Assigned to: Mats Kindahl CPU Architecture:Any
Tags: bfsm_2007_03_01, replication

[22 Feb 2007 0:50] Arjen Lentz
Description:
If a query on a master is aborted (through kill or other means) but operated on a non-transactional table, it is nevertheless logged, with the appropriate error code.
A slave then executes this query, inevitably breaking replication because
 a) it won't get an error, but that doesn't match up with the errorcode logged and so the SQL thread stops.
 b) the data is out of sync, since the full query was executed on the slave but not on the master.

This handling appears fundamentally wrong, since it never has the desired effect from the user perspective.

How to repeat:
Abort a query on a master on a non-transactional table.

Suggested fix:
Considering replication will break anyway, I'd rather see the slave stop BEFORE this statement, then people will have a better chance of fixing things.

An alternative idea from Sinisa:
Don't log an error, but change the query, so that only rows that made into the master are logged into event log. This is doable for all three situations:
1) For INSERT ... SELECT, that would imply adding LIMIT at the end of the command
2) For multi-row INSERT, that would mean that only rows that made into the master would go into the event
3) For LOAD DATA, a file would be truncated at the point where last row was inserted.

If the Sinisa solution works in all possible cases, it is of course preferred over the "stop" alternative. However, something has to be done to fix up the current behaviour as it is just not practical.
[22 Feb 2007 8:59] Sveta Smirnova
Thank you for the report.

Verified as described.
[22 Feb 2007 9:00] Sveta Smirnova
test case

Attachment: rpl_bug26551.test (application/octet-stream, text), 626 bytes.

[2 Apr 2007 7:19] Mats Kindahl
Rewriting the statement with a LIMIT last in the case that the query completed partially will not work. Here is a simple example demonstrating why.

  # Setting up the evil scenario. The difference in insert ordering
  # below can be caused by suitable application of ALTER TABLE, or by a
  # DBA patching the slave database due to an error.

  connection master;
  CREATE TABLE t1 (a INT);
  CREATE TABLE t2 (a INT);

  SET SQL_LOG_BIN = 0;
  INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
  SET SQL_LOG_BIN = 1;

  sync_slave_with_master;
  INSERT INTO t1 VALUES (1),(3),(2),(4),(5);

  connection master;

  # Simulate a failing insert that is rewritten with a LIMIT last
  INSERT INTO t2 SELECT * FROM t1 LIMIT 2;
  SELECT * FROM t1 ORDER BY a;
  SELECT * FROM t2 ORDER BY a;
  sync_slave_with_master;
  SELECT * FROM t1 ORDER BY a;
  SELECT * FROM t2 ORDER BY a;

  connection master;
  DROP TABLE IF EXISTS t1, t2;
  sync_slave_with_master;

... and here is the result of the execution

  CREATE TABLE t1 (a INT);
  CREATE TABLE t2 (a INT);
  SET SQL_LOG_BIN = 0;
  INSERT INTO t1 VALUES (1),(2),(3),(4),(5);
  SET SQL_LOG_BIN = 1;
  INSERT INTO t1 VALUES (1),(3),(2),(4),(5);
  INSERT INTO t2 SELECT * FROM t1 LIMIT 2;
  SELECT * FROM t1 ORDER BY a;
  a
  1
  2
  3
  4
  5
  SELECT * FROM t2 ORDER BY a;
  a
  1
  2
  SELECT * FROM t1 ORDER BY a;
  a
  1
  2
  3
  4
  5
  SELECT * FROM t2 ORDER BY a;
  a
  1
  3
[3 Apr 2007 7:54] Mats Kindahl
The given test case, as written, cannot trigger the bug. When killing the server with a SIGKILL, the statement is aborted in the middle, and therefore not even written to the binary log. I tried switching to using SIGTERM, SIGINT, and SIGHUP, but all failed to generate the aborted statement in the binary log. Instead, I elected to abort a statement using KILL CONNECTION, which caused it to be written to the binary log. In this case, the statement was *not* executed on the slave side, since the ER_NEW_ABORTING_CONNECTION (1184) was generated.

However, it turned out that when using KILL QUERY, the statement *was* executed since that generates an ER_QUERY_INTERRUPTED (1317), which is not in the list of special errors that prevent the statement from being executed.

I have attached a test file that can reproduce the error, and I have changed the title of the bug report to more accurately reflect the actual error.
[3 Apr 2007 7:57] Mats Kindahl
Test file to check that killing a connection will prevent the statement from executing on the slave

Attachment: rpl_bug26551.test (application/octet-stream, text), 933 bytes.

[4 Apr 2007 7:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23750

ChangeSet@1.2428, 2007-04-04 09:30:45+02:00, mats@romeo.(none) +3 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.
[4 Apr 2007 7:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/23751

ChangeSet@1.2428, 2007-04-04 09:33:42+02:00, mats@romeo.(none) +4 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.
[12 Apr 2007 12:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/24375

ChangeSet@1.2428, 2007-04-12 14:00:45+02:00, mats@romeo.(none) +3 -0
  BUG#26551 (Aborted query for non-transactional table breaks replication):
  Added error code ER_QUERY_INTERRUPTED to the list of special errors
  that prevent the slave from starting to execute a query.
[20 Apr 2007 17:16] Bugs System
Pushed into 5.0.42
[20 Apr 2007 17:16] Bugs System
Pushed into 5.1.18-beta
[1 May 2007 14:02] Paul Dubois
Noted in 5.0.42, 5.1.18 changelogs.

Aborting a statement on the master that applied to a
non-transactional statement broke replication. The statement was
written to the binary log but not completely executed on the master.
Slaves receiving the statement executed it completely, resulting in
loss of data synchrony. Now an error code is written to the error log
so that the slaves stop without executing the aborted statement.
(That is, replication stops, but synchrony to the point of the stop
is preserved and you can investigate the problem.)