MySQL Bugs: #37426: RBR breaks for CHAR() UTF8 fields

Bug #37426	RBR breaks for CHAR() UTF8 fields > 85 chars
Submitted:	16 Jun 2008 10:08	Modified:	4 Nov 2008 13:25
Reporter:	Jan Kneschke	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S3 (Non-critical)
Version:	5.1, 6.0	OS:	Any
Assigned to:	Mats Kindahl	CPU Architecture:	Any

Description:
Row-based replication breaks as soon as CHAR() UTF8 fields with a length > 85 are used. 

How to repeat:
Setup replication with 

  binlog-format=row

On the master execute:

  CREATE TABLE char128_utf8 ( 
    i1 INT NOT NULL, 
    c CHAR(128) CHARACTER SET utf8 NOT NULL, 
    i2 INT NOT NULL);

  INSERT INTO char128_utf8 VALUES ( 1, "123", 1 );

  SELECT * FROM char128_utf8;
  +----+-----+----+
  | i1 | c   | i2 |
  +----+-----+----+
  |  1 | 123 |  1 |
  +----+-----+----+

On the slave execute the same select:

  SELECT * FROM char128_utf8;
  +---------+-----+-------+
  | i1      | c   | i2    |
  +---------+-----+-------+
  |       1 |  12 |   307 |
  | 1585202 |     | 18688 |
  +---------+-----+-------+

Longer explanation:

  SELECT HEX(c) FROM char128_utf8 LIMIT 1;
  +--------+
  | HEX(c) |
  +--------+
  | 003132 |
  +--------+

The 00 is part of the string and the slave doesn't expect it.

Using mysql-binlog-dump from the mysql-proxy svn-trunk to decode the events you get:

  (table-map-event)
  23 00 00 00 00 00 00 00 04 74 65 73 74 00 0c 63  #........test..c
  68 61 72 31 32 38 5f 75 74 66 38 00 03 03 fe 03  har128_utf8.....
  02 fe 80 00                                      ....

which can be translated into:

CREATE TABLE test.char128_utf8 (
  field_0 INT NOT NULL,
  field_1 CHAR(128) NOT NULL,
  field_2 INT NOT NULL
)

  (write-row-event)
  23 00 00 00 00 00 01 00 03 ff f8 01 00 00 00 03  #...............
  00 31 32 33 01 00 00 00                          .123....

  [01 00 00 00] is the i1 == 1
  [03 00] [31 32 33] is c = "123"
  [01 00 00 00] is i2 = 1

The problem is that the slave expects [03] instead of [03 00] as length specifier. So it decodes the sequence as:

  [01 00 00 00] is the i1 == 1
  [03] [00 31 32] becomes c = "\00012"
  [33 01 00 00] is i2 = 307

... with a extra 00 and more broken events.

Suggested fix:
The table-description event stores one byte to encode the max-length information of a CHAR() field. This information is stored as bytes.

For a CHAR(128) UTF8 the max-bytes-length is 128 * 3 (=384), but only the 8 lower bits can be stored in the meta-data: -> 128.

The problem is that row-encoding still knows the 384 and uses 2 bytes for the string-length when it is written to the binlogs. From external position it is impossible to decode the row-events for CHAR() UTF8 columns for char-length > 85

As the table-map event is used to declare the number of bytes needed to in the row-header we either have to 

* store 2 bytes for the length of a char
* use length encoding in the table-map (as in the client-row-format)
* use length encoding in the row-events (instead of fixed 1/2 byte)

This is 5.1 version of Bug#32462

Verified as described by using above test and MySQL 5.1 bzr tree from yesterday.

Master output:
mysql> SELECT * FROM char128_utf8;
+----+-----+----+
| i1 | c   | i2 |
+----+-----+----+
|  1 | 123 |  1 | 
+----+-----+----+

Slave output:

mysql> SELECT * FROM char128_utf8;
+------------+---------------------------------------------------------------------------------------------------+-----+
| i1         | c                                                                                                 | i2  |
+------------+---------------------------------------------------------------------------------------------------+-----+
|          1 |  12                                                                                               | 307 | 
| 1751318528 | r128_utf8       MyISAM                                                                           |   0 | 
+------------+---------------------------------------------------------------------------------------------------+-----+

mysql> SELECT * FROM char128_utf8\G
*************************** 1. row ***************************
i1: 1
 c: 
i2: 307
*************************** 2. row ***************************
i1: 1751318528
 c: r128_utf8
i2: 0

Output from slave by using same test on MySQL 6.0 bzr tree:

SELECT * FROM char128_utf8\G
*************************** 1. row ***************************
i1: 1
 c: 
i2: 307
*************************** 2. row ***************************
i1: 1751318528
 c: r128_utf8
i2: 0

This behaviour only occurs by using data type 'char'.
Other string data types are not effected.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48537

2667 Mats Kindahl	2008-06-25
      BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
      
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48724

2667 Mats Kindahl	2008-06-30
      BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
      
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48751

2667 Mats Kindahl	2008-06-30
      BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
      
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48757

2662 Mats Kindahl	2008-06-30
       BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
            
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48759

2662 Mats Kindahl	2008-06-30
       BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
            
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48760

2662 Mats Kindahl	2008-06-30
       BUG#37426: RBR breaks for CHAR() UTF-8 fields > 85 chars
            
      In order to handle CHAR() fields, 8 bits were reserved for
      the size of the CHAR field. However, instead of denoting the
      number of characters in the field, field_length was used which
      denotes the number of bytes in the field.
      
      Since UTF-8 fields can have three bytes per character (and
      has been extended to have four bytes per character in 6.0),
      an extra two bits have been encoded in the field metadata
      work for fields of type Field_string (i.e., CHAR fields).
      
      Since the metadata word is filled, the extra bits have been
      encoded in the upper 4 bits of the real type (the most 
      significant byte of the metadata word) by computing the
      bitwise xor of the extra two bits. Since the upper 4 bits
      of the real type always is 1111 for Field_string, this 
      means that for fields of length <256, the encoding is
      identical to the encoding used in pre-5.1.26 servers, but
      for lengths of 256 or more, an unrecognized type is formed,
      causing an old slave (that does not handle lengths of 256
      or more) to stop.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/48765

2663 Joerg Bruehe	2008-06-30
      Version 5.1.26 is labeled "rc".

This fix is included in 5.1.26-rc,
and it is important enough (showstopper) so that we want it in the announcement.

I do not know how soon it will be upmerged to 6.0,
so please set it back to "patch queued" when it is documented for 5.1.26-rc.

Noted in 5.1.26 changelog.

Row-based replication broke for utf8 CHAR columns longer than 85
characters. 

Setting report to Patch queued pending push into other trees.

Pushed into 5.1.26

Fix is already documented for 5.1.26 - reset to Patch Pending status, waiting for 6.0 merge.

Pushed into 6.0.7-alpha  (revid:serg@mysql.com-20080722121106-wy84j0yvceyu72zr) (pib:2)

Pushed into 5.1.28  (revid:joerg@mysql.com-20080711185110-l3t04xwds0ac6o1v) (version source revid:joerg@mysql.com-20080711185110-l3t04xwds0ac6o1v) (pib:3)

Noted in 6.0.7 changelog.

Pushed into 6.0.7-alpha  (revid:mats@mysql.com-20080630201118-wr133h32lvbkxyk0) (version source revid:john.embretsen@sun.com-20080724122511-9c0oudz1xrdrs6y6) (pib:3)

Hi Lars,

'Documenting' is for bugfixes that need to be documented by Docs (me).

The Binlog format docs are Internals docs and are maintained by developers. Please either re-open this bug and set the bug category to Documentation *and* assign it to the developer who's responsible for updating the Binary Log portion of the Internals docs (and set yourself as lead) or open a new Docs bug and do these things with it.

Setting a fixed and closed Replication bug to Documenting merely puts it back into my queue, which is not useful if my work with it has already been done, which AFAIK is the case here, since the user-facing issue has been resolved.

Thanks!

Related: BUG#53386.