Bug #50917 slave stops with HA_ERR_KEY_NOT_FOUND
Submitted: 4 Feb 2010 15:23 Modified: 3 Dec 2010 12:19
Reporter: Matthias Leich Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.1.43,5.5.2-m2 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: auto_increment
Triage: Triaged: D2 (Serious)

[4 Feb 2010 15:23] Matthias Leich
Description:
My scipt
--------
--source include/master-slave.inc

--disable_warnings
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
--enable_warnings

# The problem disappears if I remove "auto_increment".
CREATE TABLE test.t1 ( pk int auto_increment, col1 tinyint, primary key (pk));
CREATE TABLE test.t2 ( pk int auto_increment, col1 tinyint, primary key (pk));

let $val= 1;
eval INSERT INTO test.t1 ( col1 , pk) VALUES ( 0 , $val + 1) ;
# It seems to play an important role that I use here $val as
# value for the the primary key autoinc column of
# a *different* table.
eval INSERT INTO test.t2 ( col1 , pk) VALUES ( 0 , $val);
     INSERT INTO test.t2 ( col1 , pk) VALUES ( 1 , NULL);

SET SESSION BINLOG_FORMAT = STATEMENT ;
UPDATE test.t1 SET pk = col1 WHERE pk IS NULL;
SET SESSION BINLOG_FORMAT = ROW;
UPDATE test.t1 SET col1 = 1;

--sync_slave_with_master

# Cleanup
DROP TABLE t1;
DROP TABLE t2;

Result on 5.1.43
mysql-5.1-bugteam revno: 3352 2010-02-02
----------------------------------------
./mysql-test-run  --mem --mysqld=--binlog-format=row mleichfeb03
...
main.mleichfeb03                         [ fail ]
....
=== SHOW SLAVE STATUS ===
...
Last_Error  Could not execute Update_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 1488
...
SHOW BINLOG EVENTS IN 'master-bin.000001';  (edited)
Pos   Event_type  End_log_pos Info
   4  Format_desc  106        Server ver: 5.1.43-debug-log, Binlog ver: 4
 106  Query        192        use `test`; DROP TABLE IF EXISTS t1
 192  Query        278        use `test`; DROP TABLE IF EXISTS t2
 278  Query        418        use `test`; CREATE TABLE test.t1 ( pk int auto_increment, col1 tinyint, primary key (pk))
 418  Query        558        use `test`; CREATE TABLE test.t2 ( pk int auto_increment, col1 tinyint, primary key (pk))
 558  Query        626        BEGIN
 626  Table_map    668        table_id: 23 (test.t1)
 668  Write_rows   703        table_id: 23 flags: STMT_END_F
 703  Query        772        COMMIT
 772  Query        840        BEGIN
 840  Table_map    882        table_id: 24 (test.t2)
 882  Write_rows   917        table_id: 24 flags: STMT_END_F
 917  Query        986        COMMIT
 986  Query       1054        BEGIN
1054  Table_map   1096        table_id: 24 (test.t2)
1096  Write_rows  1131        table_id: 24 flags: STMT_END_F
1131  Query       1200        COMMIT
1200  Intvar      1228        LAST_INSERT_ID=2
1228  Query       1336        use `test`; UPDATE test.t1 SET pk = col1 WHERE pk IS NULL
1336  Query       1404        BEGIN
1404  Table_map   1446        table_id: 23 (test.t1)
1446  Update_rows 1488        table_id: 23 flags: STMT_END_F
1488  Query       1557        COMMIT

Trees showing the problem from above:
-------------------------------------
- 5.1.43
  mysql-5.1-bugteam revno: 3352 2010-02-02
- 5.1.43
  mysql-5.1-rep+3 revno: 3134 2010-01-20
- 5.5.2-m2
  mysql-trunk revno: 2929 2010-01-15

Trees which do not show this problem:
-------------------------------------
- 6.0.14-alpha
  mysql-6.0-codebase-bugfixing revno: 3859 2010-01-31
- 5.5.99-m3
  mysql-next-mr revno: 2966 2010-01-27

So it looks like future releases >= 5.5.99-m3 do not suffer
from the reported problem.

There are various existing bug reports mentioning that
the slave stops because of HA_ERR_KEY_NOT_FOUND.
But most do not look like as if they match the current
scenario.

How to repeat:
See above
[6 Dec 2010 17:17] Luis Soares
Here is an improved test case:

--source include/master-slave.inc

CREATE TABLE t1 ( pk int auto_increment, col1 tinyint, primary key (pk));
CREATE TABLE t2 ( pk int auto_increment, col1 tinyint, primary key (pk));

SET session sql_auto_is_null=1;

-- let $val= 1
-- eval INSERT INTO t1 VALUES ( $val + 1, 0)
-- eval INSERT INTO t2 VALUES ( $val, 0)
        INSERT INTO t2 VALUES ( NULL, 1);

SET SESSION BINLOG_FORMAT = STATEMENT ;
UPDATE t1 SET pk = col1 WHERE pk IS NULL;
-- sync_slave_with_master

-- let $diff_table_1=master:test.t1
-- let $diff_table_2=slave:test.t1
-- source include/diff_tables.inc

-- connection master
SET SESSION BINLOG_FORMAT = ROW;
UPDATE t1 SET col1 = 1;

--sync_slave_with_master

-- let $diff_table_1=master:test.t1
-- let $diff_table_2=slave:test.t1
-- source include/diff_tables.inc

-- connection master
# Cleanup
DROP TABLE t1;
DROP TABLE t2;

--sync_slave_with_master
-- exit

ANALYSIS
========

  Considering the test case posted above...

  Description
  -----------

  When sql_auto_is_null=1 the statement:

   - UPDATE t1 SET pk = col1 WHERE pk IS NULL;

  makes the master to update the row on t1. The slave however, does
  not update any row, because it has no context in auto_inc values.

  In detail, the three INSERTs before this update are replicated using
  ROW format, while the UPDATE is replicated using STMT
  format. Consequently, the slave actually inserts the value the
  master has generated for the second row for t2 INSERT, and not the
  NULL value as the master does. From the slave standpoint, there is
  no NULL values involved so far for the events it has processed
  (which is not true for the master).

  As a consequence, when the slave processes the UPDATE with "pk IS
  NULL" clause, it will not update any row, because there is no
  context referring to auto_inc columns and IS NULL. If obscure, see
  [1].

  NOTE: At this point table t1 diverge on master and slave.

  Later when the second UPDATE is replicated (this one in ROW format)
  the before image contained in the UPDATE_ROWS_EVENT will not match
  any of the rows on the slave's table, because they have
  diverged... This makes the second update to not find any row and
  fail with HA_ERR_KEY_NOT_FOUND.
 
  The case for 5.5+
  -----------------

  In 5.5.3, the default value for sql_auto_is_null was changed from 1
  to 0. This makes the test case pass. However, when set to 1, the
  test fails for the reason outlined in the previous section.

  Conclusion
  ----------

  This is a very obscure combinations of factors here:

    - the special and non-intuitive case for sql_auto_is_null=1 [1],
      which is not the default anymore as of 5.5.3;
    - the fact that replication events are limited, ie, do not carry
      all information and re-create all master server context.
    - the exceptional case for interchanging STMT and ROW format and
      its interaction with sql_auto_is_null=1.
      

  All of this playing together makes replication break.

REFERENCES
==========

[1] http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_sql_auto_is_nul...