Bug #16217 corruption importing binlogs using cp932 client charset
Submitted: 5 Jan 2006 8:37 Modified: 13 Mar 2006 21:26
Reporter: Shane Bester (Platinum Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.16 OS:Linux (Linux)
Assigned to: Andrei Elkin

[5 Jan 2006 8:37] Shane Bester
Description:
In MySQL 4.1.16, recovery with binary log causes
character corruption if the following conditions are met;

- Start mysqld with default-character-set=utf8 and binary log enabled.
- Store utf8 characters via mysql commandline tool with utf8 client character set.
(The characters are stored correctly)
- Store cp932 characters via mysql with cp932 client character set.
(Again, the characters are stored correctly)
- Convert binary log of the above steps to SQL statements by mysqlbinlog and execute them via mysql.

In this case, SQL syntax error occurs with client --default-character-set=utf8 and
corruption of characters stored with utf8 client character set occurs with --default-character-set=cp932.

How to repeat:
(1) Prepare MySQL 4.1.16
(2) Start mysqld with --default-character-set=utf8 and --log-bin
(3) Run utf8.sh(attached). The shell script creates table tbl2 and store utf8 chars with utf8 client charset.
(4) Run cp932.sh(attached). The shell script creates table tbl1 and store
cp932 chars with cp932 client charset.
(5) Check binary log using mysqlbinlog.

Here is hex dump of table tbl1 at this moment(results are correct)

SELECT HEX(c_cp932),HEX(c_utf8) FROM tbl1;
+--------------+-------------+
| HEX(c_cp932) | HEX(c_utf8) |
+--------------+-------------+
| 835C | E382BD |
| 8740 | E291A0 |
| 835C | E382BD |
+--------------+-------------+

Here is hex dump of table tbl2 at this moment(results are correct)

SELECT HEX(c_cp932),HEX(c_utf8) FROM tbl2;
+--------------+-------------+
| HEX(c_cp932) | HEX(c_utf8) |
+--------------+-------------+
| 835C | E382BD |
| 8740 | E291A0 |
| 835C | E382BD |
+--------------+-------------+

(6) Drop table tbl1 and tbl2, and convert binary log to SQL statements using mysqlbinlog.

DROP TABLE tbl1;
DROP TABLE tbl2;

mysqlbinlog ... > binlog.log

(7.1) Reimport the binlog with --default-character-set=utf8

mysql -uroot --default-character-set=utf8 < binlog.log
ERROR at line 36: Unknown command '\''.

SELECT HEX(c_cp932),HEX(c_utf8) FROM tbl1;
Empty set (0.00 sec)
--> Nothing is stored into table tbl1

(7.2) Reimport the binlog with --default-character-set=cp932
DROP TABLE tbl1;
DROP TABLE tbl2;

mysql -uroot --default-character-set=cp932 < binlog.log

The mysql command ends without any error but there are
corruptions in table tbl2. Table tbl1 is correctly
recovered.

SELECT HEX(c_cp932),HEX(c_utf8) FROM tbl2;
+--------------+--------------+
| HEX(c_cp932) | HEX(c_utf8) |
+--------------+--------------+
| E382BD | E7B9A7EFBDBD |
| E291 | E7ABAD |
| E382BD | E7B9A7EFBDBD |
+--------------+--------------+

The above corruption doesn't occur in 5.0.18.  

Your MySQL connection id is 9 to server version: 5.0.18-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SELECT HEX(c_cp932),HEX(c_utf8) FROM tbl2;
+--------------+-------------+
| HEX(c_cp932) | HEX(c_utf8) |
+--------------+-------------+
| 835C         | E382BD      |
| 8740         | E291A0      |
| 835C         | E382BD      |
+--------------+-------------+
3 rows in set (0.00 sec)

Suggested fix:
See notes
[1 Feb 2006 19:03] Andrei Elkin
Managed to reproduce both artifacts in 4.1.19 and empty recovery of tbl1 by client with utf8 charset in 5.0.19.
[8 Feb 2006 8:30] Andrei Elkin
About appropriate charset for recovery invocation. Actually this one is exactly the charset of the server.  In the last example theinvocation line is ok as long as mysqld is launched with --default-character-set=utf8.

Shane, you are welcome to discussion though I think "incorrect recovery" is not a bug but rather subtle feature.
Commiting my fix soon for review.
[9 Feb 2006 14:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/2369
[10 Feb 2006 13:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/2443
[12 Feb 2006 6:40] Alexander Barkov
Looks ok to push for me.
[15 Feb 2006 11:14] Andrei Elkin
Fixed in 5.2.0-alpha, 5.1.7-beta, 5.0.19, 4.1.19.

Leaving todo remarks.

Regarding to the new feature of mysql client still the most flexibale approach is to parse `SET vars' statements, where vars are those that affect client charset. 
Such approach would make `charset_name' redundant (better to say its explicit
invocation) as useless would be a change in mysqlbinlog made for this bug.
Unfortunately this way requires rather big changes in mysql client due to parsing.
[13 Mar 2006 21:26] Mike Hillyer
Documented in 4.1.19, 5.0.19, 5.1.7 changelogs:

<listitem>
        <para>
          New <literal>charset</literal> command added to
          <command>mysql</command> command-line client. By typing
          <literal>charset <replaceable>name</replaceable></literal> or
            <literal>\C <replaceable>name</replaceable></literal> (such
            as <literal>\C UTF8</literal>), the client character set can
          be changed without reconnecting. (Bug #16217)
        </para>
      </listitem>

Also updated mysql client commands section of refman.