Bug #26836 LOAD DATA INFILE ignores charset definitions
Submitted: 5 Mar 2007 11:31 Modified: 12 Mar 2007 10:08
Reporter: Thomas Boehme Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.36, 5.1, 5.2-falcon OS:Linux (Linux)
Assigned to: Assigned Account CPU Architecture:Any

[5 Mar 2007 11:31] Thomas Boehme
Description:
I noticed this bug in 5.0.22 first time.

When importing utf8 data into a utf8 table using LOAD DATA LOCAL INFILE statement everything is fine on the master server. But the data are double decoded on the replication slave. This is true when server character set is latin1.

This error also happens when server character set is utf8 and I want to load latin1 data. The result on the slave is different from the one on the master.

How to repeat:
Create file with UTF8 characters. Here is an example hex dump.

mysql> USE test;
mysql> CREATE TABLE `utf8_test` (
>   `id` int(11) NOT NULL,
>   `name` varchar(255) NOT NULL,
>   PRIMARY KEY  (`id`)
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

mysql> insert into utf8_test values
> (1,0x486C61766EC3AD206DC49B73746F205072616861),
> (2,0x5374C59965646FC48D65736BC3BD206B72616A),
> (3,0x4A69686FC48D65736BC3BD206B72616A),
> (4,0x506C7A65C588736BC3BD206B72616A),
> (5,0x4B61726C6F766172736BC3BD206B72616A);

Create a dump file of this table.

boe@quebec:~$ mysqldump --extended-insert=FALSE -h cambridge -u mcw -p test utf8_test > /tmp/utf8_test.sql

I displayed this file in firefox webbrowser with charcoding = utf8. The result was correct.

Now you need a master-slave-replication where both servers have a default charset=latin1;

mysql> show variables like "%char%";
+--------------------------+--------------------------------------------+
| Variable_name            | Value                                      |
+--------------------------+--------------------------------------------+
| character_set_client     | latin1                                     |
| character_set_connection | latin1                                     |
| character_set_database   | latin1                                     |
| character_set_filesystem | binary                                     |
| character_set_results    | latin1                                     |
| character_set_server     | latin1                                     |
| character_set_system     | utf8                                       |
| character_sets_dir       | /mysql/server/5.0.36/share/mysql/charsets/ |
+--------------------------+--------------------------------------------+
8 rows in set (0.00 sec)

Both server display the data in utf8_test the same way when running a select statement on it.

mysql> SET NAMES utf8;
mysql> select * from utf8_test;
+----+----------------------+
| id | name                 |
+----+----------------------+
|  1 | Hlavní město Praha |
|  2 | Středočeský kraj  |
|  3 | Jihočeský kraj     |
|  4 | Plzeňský kraj      |
|  5 | Karlovarský kraj    |
+----+----------------------+
5 rows in set (0.00 sec)

Now I created a outfile using:

mysql> select * into outfile '/tmp/test.utf8_test' from utf8_test;
Query OK, 5 rows affected (0.00 sec)

Again Firefox displays the content correctly in utf8 charset.

Now I replaced everything in the table with the data from the file.

mysql> load data local infile '/tmp/test.utf8_test' replace into table utf8_test;
Query OK, 10 rows affected (0.01 sec)
Records: 5  Deleted: 5  Skipped: 0  Warnings: 0

What I get is double coded stuff in the table. Here in HEX

mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------------------+
| id | hex(name)                                            |
+----+------------------------------------------------------+
|  1 | 486C61766EC383C2AD206DC384E280BA73746F205072616861   |
|  2 | 5374C385E284A265646FC384C28D65736BC383C2BD206B72616A |
|  3 | 4A69686FC384C28D65736BC383C2BD206B72616A             |
|  4 | 506C7A65C385CB86736BC383C2BD206B72616A               |
|  5 | 4B61726C6F766172736BC383C2BD206B72616A               |
+----+------------------------------------------------------+
5 rows in set (0.00 sec)

But setting the database charset before solves this issue:

mysql> set character_set_database =utf8;
mysql> load data local infile '/tmp/test.utf8_test' replace into table utf8_test;
Query OK, 10 rows affected (0.01 sec)
Records: 5  Deleted: 5  Skipped: 0  Warnings: 0

The data is loaded in the right way and a select returns the correct output.

mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------+
| id | hex(name)                                |
+----+------------------------------------------+
|  1 | 486C61766EC3AD206DC49B73746F205072616861 |
|  2 | 5374C59965646FC48D65736BC3BD206B72616A   |
|  3 | 4A69686FC48D65736BC3BD206B72616A         |
|  4 | 506C7A65C588736BC3BD206B72616A           |
|  5 | 4B61726C6F766172736BC3BD206B72616A       |
+----+------------------------------------------+
5 rows in set (0.00 sec)

Running the same select on the replication slave returns the output as when the database charset was set to latin1.

boe@quebec:~$ mysql -h cambridge -u mcw -p test
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 58968 to server version: 5.0.36-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------------------+
| id | hex(name)                                            |
+----+------------------------------------------------------+
|  1 | 486C61766EC383C2AD206DC384E280BA73746F205072616861   |
|  2 | 5374C385E284A265646FC384C28D65736BC383C2BD206B72616A |
|  3 | 4A69686FC384C28D65736BC383C2BD206B72616A             |
|  4 | 506C7A65C385CB86736BC383C2BD206B72616A               |
|  5 | 4B61726C6F766172736BC383C2BD206B72616A               |
+----+------------------------------------------------------+
5 rows in set (0.00 sec)

If I have a default charset of utf8 for the test database the test_utf8 table is inside of, I don't need to set character_set_database=utf8 to get the right data in the table. But even then the slave still has damaged data in the table.

I tested this setup with character_set_server=utf8 and loaded latin1 files into some tables. The slave got wrong data, too.

Suggested fix:
So, LOAD DATA INFILE doesn't seem to look at the charset definition in the databases, tables, columns or whereever.

This is true when running on the master server (I needed to set the character_set_database variable) and especially for the slave, where the manual set of character_set_database was not replicated and wrong data were written into the table.

In 4.1 the bug is not available, due to the fact that I still need to set character_set_database to the same charset as the table. But then the slave receives the data right.
[5 Mar 2007 14:13] Sveta Smirnova
test case

Attachment: rpl_bug26836.test (application/octet-stream, text), 950 bytes.

[5 Mar 2007 14:16] Sveta Smirnova
Thank you for the report.

Verified as described.

It is not repeatable with current 5.0 and 4.1 development sources.
[12 Mar 2007 10:08] Alexander Barkov
This problem was recenlty fixed under terms of the Bug#15126 
"character_set_database is not replicated (LOAD DATA INFILE need it)".

The fix will appear in 5.0.38 and 5.1.17.
A note about this bug has been added into the manual:

http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0-38.html

Marking as duplicate for http://bugs.mysql.com/bug.php?id=15126
[12 Mar 2007 10:08] Alexander Barkov
This problem was recenlty fixed under terms of the Bug#15126 
"character_set_database is not replicated (LOAD DATA INFILE need it)".

The fix will appear in 5.0.38 and 5.1.17.
A note about this bug has been added into the manual:

http://dev.mysql.com/doc/refman/5.0/en/releasenotes-es-5-0-38.html

Marking as duplicate for http://bugs.mysql.com/bug.php?id=15126
[27 May 2008 18:03] Andreas Beintken
Hi,

i have had the same problems like Thomas Boehme. Now i use 5.0.58 (rpm from RHEL) but the error already exist. I use the example from Thomas to repeat/reproduce the error.
[20 Aug 2008 12:22] Christophe Fondacci
Thank for the "set character_set_database=utf8" tip. This is the only way to properly import files with load data infile.

But I've noticed that the "character_set_database" variable seem to be a session variable. For example, if I connect to a MySQL database, change this variable to utf8, exit and reconnect to this same database, the "character_set_database" still has the original value...

I cannot understand why since (to my understanding) a database character set has no sense for a session. Or then it would be redundant with character_set_connection / client.

If someone could explain me why is it working this way...

Christophe Fondacci
http://www.nextep-softwares.com