Bug #52868 Wrong handling of NULL value during update, replication out of sync
Submitted: 15 Apr 2010 21:03 Modified: 14 Oct 2010 12:28
Reporter: Matthias Leich Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.1.37, 5.1.47,5.5.5-m3,5.6.99-m4 OS:Any
Assigned to: Luis Soares CPU Architecture:Any
Tags: null

[15 Apr 2010 21:03] Matthias Leich
Description:
My script:
----------
--source include/have_binlog_format_row.inc
--source include/master-slave.inc

--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings
CREATE TABLE t1 (
  f10 bigint(20) DEFAULT 0,
  f11 bigint(20) DEFAULT 0,
  f12 bigint(20) DEFAULT 0,
  f13 varchar(1) DEFAULT '',
  f1 bigint(20) DEFAULT 0,
  f4 bigint(20) DEFAULT 0,
  f2 bigint(20) DEFAULT 0,
  f3 bigint(20) DEFAULT 0
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO t1 ( f1, f2 ) VALUES ( 1   , 35 );
INSERT INTO t1 ( f1, f2 ) VALUES ( NULL, 35 );
UPDATE t1 SET f1 = 'a';

SELECT 'On Master',f1,f2 FROM t1;
--sync_slave_with_master
--connection slave
SELECT 'On Slave',f1,f2 FROM t1;

--connection master
DROP TABLE IF EXISTS t1;

Result on 5.1.47 (mysql-5.1 revno: 3446 2010-04-12):
----------------------------------------------------
stop slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
reset master;
reset slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
start slave;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
f10 bigint(20) DEFAULT 0,
f11 bigint(20) DEFAULT 0,
f12 bigint(20) DEFAULT 0,
f13 varchar(1) DEFAULT '',
f1 bigint(20) DEFAULT 0,
f4 bigint(20) DEFAULT 0,
f2 bigint(20) DEFAULT 0,
f3 bigint(20) DEFAULT 0
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO t1 ( f1, f2 ) VALUES ( 1   , 35 );
INSERT INTO t1 ( f1, f2 ) VALUES ( NULL, 35 );
UPDATE t1 SET f1 = 'a';
Warnings:
Warning	1366	Incorrect integer value: 'a' for column 'f1' at row 1
Warning	1366	Incorrect integer value: 'a' for column 'f1' at row 2
   <== These warnings are ok.
SELECT 'On Master',f1,f2 FROM t1;
On Master	f1	f2
On Master	0	35
On Master	0	35
   <== This result is ok.
SELECT 'On Slave',f1,f2 FROM t1;
On Slave	f1	f2
On Slave	0	35
On Slave	NULL	35
   <== Why is f1 NULL for the second row?
       The replication is out of sync.
DROP TABLE IF EXISTS t1;

I will add results for version 5.1+ soon.

How to repeat:
./mtr --mysqld=--binlog-format=row rpl_ml1
[16 Apr 2010 12:19] Matthias Leich
The following trees showed the same wrong behaviour.
5.5.5-m3 (mysql-trunk revno: 3008 2010-04-12)
5.6.99-m4 (mysql-next-mr revno: 3139 2010-04-12)
[21 Apr 2010 12:43] Luis Soares
Analysis
~~~~~~~~

INTRO
=====

  To update records, roughly the slave (1) unpacks the before image
  (BI) from the binary log event current row, (2) iterates over the
  storage engine rows (SER) and compares both. If BI == SER, then the
  row is used to replay the after image (AI) on top of it. Finally,
  the updated row is stored back in the SE.

  This is done at Update_rows_log_event::do_exec_row, which
  relies in Rows_log_event::find_row to search the correct SER.

  BI and SER are matched against each other in record_compare,
  which (in some cases) compares null_bits first and then the
  actual record contents. A special cleanup for unused null_bits
  is done before the actual comparison so that don't care bits
  get the same value for both BI and SER.

PROBLEM DESCRIPTION
===================

** P1. Problem depicted by the bug reporter **

  The null_bits of a record are set according to the X bit, nullable
  fields and Field_bit data. So the problem stems from the fact that
  when:

  - X bit (if any) + N nullable columns + M Field_bit fields = 8

  means that table->s->last_null_bit_pos == 0. At this point, in
  log_event.cc:record_compare, the part that clears unused null_bits,
  assumes that all bits in the null_byte should be cleared, thence
  wrongly overwriting null_bits.

  The bug surfaces at this point, because when fields contents can
  perfectly match a BI except for a difference in a null bit, and
  given that all null bits are (wrongly) cleared before comparing the
  actual record contents, the wrong pair of records can be considered
  to match.

  For the specific case reported, the assignment of f1='a' will
  ultimately result in (for the first row updated):

   - null_bit(f1)=0 && f1=0.

  When replaying the second row update, the BI will match with
  the first record updated, because:

    1. null_bit(f1) = 1 (set wrongly at the beginning of
                         record_compare for SER and BI)

    2. f1 field content for BI (NULL) will match f1 field content
       for SER (0) - see (in record_compare): 
    
       if ((*ptr)->cmp_binary_offset(table->s->rec_buff_length))
  
  This will cause the first row to be updated twice and thence
  the difference in results between master and slave.

** P2. Problem found during analysis around X bit setting. **

  There is still another issue. The null_bytes are computed
  according to the presence of HA_OPTION_PACK_RECORD. If it is
  set, then X bit is not included. So we should also check this
  before clearing the X bit. The following test case proves 
  that this is indeed an issue:

-- source include/master-slave.inc
-- source include/have_binlog_format_row.inc

CREATE TABLE t1 (c1 int, c2 varchar(1) default '');
INSERT INTO t1(c1) VALUES (10);
INSERT INTO t1(c1) VALUES (NULL);
UPDATE t1 SET c1= 0;

SELECT * FROM t1; 
-- sync_slave_with_master
SELECT * FROM t1; 

-- connection master
DROP TABLE t1; 
-- sync_slave_with_master

-- exit

  Results show:

  (...)

INSERT INTO t1(c1) VALUES (NULL);
UPDATE t1 SET c1= 0;
SELECT * FROM t1;
c1	c2
0	
0	
SELECT * FROM t1;
c1	c2
0	
NULL	
DROP TABLE t1;

FIX
===

  To fix the two issues we should do the following in record_compare:

  P1. We only clean the unused bits, if:
      - table->s->null_bytes > 0 &&
      - table->s->last_null_bit_pos > 0

  P2. We only set the X bit if:
      - !(table->s->db_options_in_use & HA_OPTION_PACK_RECORD)
[21 Apr 2010 12:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/106253

3356 Luis Soares	2010-04-21
      BUG#52868: Wrong handling of NULL value during update, replication out
                 of sync
      
      In RBR, sometimes the table->s->last_null_bit_pos can be zero. This
      has impact at the slave when it compares records fetched from the
      storage engine against records in the binary log event. If
      last_null_bit_pos is zero the slave, while comparing in
      log_event.cc:record_compare function, would set all bits in the last
      null_byte to 1 (assumed all 8 were unused) . Thence it would loose the
      ability to distinguish records that were similar in contents except
      for the fact that some field was null in one record, but not in the
      other. Ultimately this would cause wrong matches, and in the specific
      case depicted in the bug report the same record would be updated
      twice, resulting in a lost update.
      
      Additionally, in the record_compare function the slave was setting the
      X bit unconditionally. There are cases that the X bit does not exist
      in the record header. This could also lead to wrong matches between
      records.
      
      We fix both by conditionally resetting the bits: (i) unused null_bits
      are set if last_null_bit_pos > 0; (ii) X bit is set if
      HA_OPTION_PACK_RECORD is in use.
     @ mysql-test/extra/rpl_tests/rpl_record_compare.test
        Shared part of the test case for MyISAM and InnoDB.
     @ mysql-test/suite/rpl/t/rpl_row_rec_comp_innodb.test
        InnoDB test case.
     @ mysql-test/suite/rpl/t/rpl_row_rec_comp_myisam.test
        MyISAM test case. Added also coverage for Field_bits case.
     @ sql/log_event.cc
        Deployed conditional setting of unused bits at record_compare.
     @ sql/log_event_old.cc
        Same change as in log_event.cc.
[20 May 2010 1:23] Luis Soares
Queued in mysql-5.1-bugteam and mysql-pe:
  - http://pb2.norway.sun.com/web.py?template=push_details&push=1270678
  - http://pb2.norway.sun.com/web.py?template=push_details&push=1270677
[28 May 2010 5:46] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (version source revid:alik@sun.com-20100524190136-egaq7e8zgkwb9aqi) (pib:16)
[28 May 2010 6:16] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100524190941-nuudpx60if25wsvx) (version source revid:alik@sun.com-20100524190409-5w4l7mje1wk1c90l) (merge vers: 6.0.14-alpha) (pib:16)
[28 May 2010 6:43] Bugs System
Pushed into 5.5.5-m3 (revid:alik@sun.com-20100524185725-c8k5q7v60i5nix3t) (version source revid:alexey.kopytov@sun.com-20100523204118-0tl3goawu658rxh6) (merge vers: 5.5.5-m3) (pib:16)
[2 Jun 2010 8:49] Bugs System
Pushed into 5.1.48 (revid:georgi.kodinov@oracle.com-20100602084411-2yu607bslbmgufl3) (version source revid:luis.soares@sun.com-20100519235042-1qdf81hc5az90r3o) (merge vers: 5.1.47) (pib:16)
[3 Jun 2010 11:40] Jon Stephens
Documented bugfix in the 5.1.48, 5.5.5, and 6.0.14 changelogs as follows:

        In some cases, attempting to update a column with a value of an
        incompatible type resulted in a mismatch between master and
        slave because the column value was set to its implicit default
        value on the master (as expected), but the same column on the
        slave was set to NULL.

Closed.
[14 Oct 2010 8:26] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:vasil.dimov@oracle.com-20100513074652-0cvlhgkesgbb2bfh) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:41] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:vasil.dimov@oracle.com-20100513074652-0cvlhgkesgbb2bfh) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:56] Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:vasil.dimov@oracle.com-20100513074652-0cvlhgkesgbb2bfh) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 12:28] Jon Stephens
No new changelog entry required. Setting back to Closed state.