Description:
I noticed this bug in 5.0.22 first time.
When importing utf8 data into a utf8 table using LOAD DATA LOCAL INFILE statement everything is fine on the master server. But the data are double decoded on the replication slave. This is true when server character set is latin1.
This error also happens when server character set is utf8 and I want to load latin1 data. The result on the slave is different from the one on the master.
How to repeat:
Create file with UTF8 characters. Here is an example hex dump.
mysql> USE test;
mysql> CREATE TABLE `utf8_test` (
> `id` int(11) NOT NULL,
> `name` varchar(255) NOT NULL,
> PRIMARY KEY (`id`)
> ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
mysql> insert into utf8_test values
> (1,0x486C61766EC3AD206DC49B73746F205072616861),
> (2,0x5374C59965646FC48D65736BC3BD206B72616A),
> (3,0x4A69686FC48D65736BC3BD206B72616A),
> (4,0x506C7A65C588736BC3BD206B72616A),
> (5,0x4B61726C6F766172736BC3BD206B72616A);
Create a dump file of this table.
boe@quebec:~$ mysqldump --extended-insert=FALSE -h cambridge -u mcw -p test utf8_test > /tmp/utf8_test.sql
I displayed this file in firefox webbrowser with charcoding = utf8. The result was correct.
Now you need a master-slave-replication where both servers have a default charset=latin1;
mysql> show variables like "%char%";
+--------------------------+--------------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /mysql/server/5.0.36/share/mysql/charsets/ |
+--------------------------+--------------------------------------------+
8 rows in set (0.00 sec)
Both server display the data in utf8_test the same way when running a select statement on it.
mysql> SET NAMES utf8;
mysql> select * from utf8_test;
+----+----------------------+
| id | name |
+----+----------------------+
| 1 | Hlavní město Praha |
| 2 | Středočeský kraj |
| 3 | Jihočeský kraj |
| 4 | Plzeňský kraj |
| 5 | Karlovarský kraj |
+----+----------------------+
5 rows in set (0.00 sec)
Now I created a outfile using:
mysql> select * into outfile '/tmp/test.utf8_test' from utf8_test;
Query OK, 5 rows affected (0.00 sec)
Again Firefox displays the content correctly in utf8 charset.
Now I replaced everything in the table with the data from the file.
mysql> load data local infile '/tmp/test.utf8_test' replace into table utf8_test;
Query OK, 10 rows affected (0.01 sec)
Records: 5 Deleted: 5 Skipped: 0 Warnings: 0
What I get is double coded stuff in the table. Here in HEX
mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------------------+
| id | hex(name) |
+----+------------------------------------------------------+
| 1 | 486C61766EC383C2AD206DC384E280BA73746F205072616861 |
| 2 | 5374C385E284A265646FC384C28D65736BC383C2BD206B72616A |
| 3 | 4A69686FC384C28D65736BC383C2BD206B72616A |
| 4 | 506C7A65C385CB86736BC383C2BD206B72616A |
| 5 | 4B61726C6F766172736BC383C2BD206B72616A |
+----+------------------------------------------------------+
5 rows in set (0.00 sec)
But setting the database charset before solves this issue:
mysql> set character_set_database =utf8;
mysql> load data local infile '/tmp/test.utf8_test' replace into table utf8_test;
Query OK, 10 rows affected (0.01 sec)
Records: 5 Deleted: 5 Skipped: 0 Warnings: 0
The data is loaded in the right way and a select returns the correct output.
mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------+
| id | hex(name) |
+----+------------------------------------------+
| 1 | 486C61766EC3AD206DC49B73746F205072616861 |
| 2 | 5374C59965646FC48D65736BC3BD206B72616A |
| 3 | 4A69686FC48D65736BC3BD206B72616A |
| 4 | 506C7A65C588736BC3BD206B72616A |
| 5 | 4B61726C6F766172736BC3BD206B72616A |
+----+------------------------------------------+
5 rows in set (0.00 sec)
Running the same select on the replication slave returns the output as when the database charset was set to latin1.
boe@quebec:~$ mysql -h cambridge -u mcw -p test
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 58968 to server version: 5.0.36-standard-log
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> select id,hex(name) from utf8_test;
+----+------------------------------------------------------+
| id | hex(name) |
+----+------------------------------------------------------+
| 1 | 486C61766EC383C2AD206DC384E280BA73746F205072616861 |
| 2 | 5374C385E284A265646FC384C28D65736BC383C2BD206B72616A |
| 3 | 4A69686FC384C28D65736BC383C2BD206B72616A |
| 4 | 506C7A65C385CB86736BC383C2BD206B72616A |
| 5 | 4B61726C6F766172736BC383C2BD206B72616A |
+----+------------------------------------------------------+
5 rows in set (0.00 sec)
If I have a default charset of utf8 for the test database the test_utf8 table is inside of, I don't need to set character_set_database=utf8 to get the right data in the table. But even then the slave still has damaged data in the table.
I tested this setup with character_set_server=utf8 and loaded latin1 files into some tables. The slave got wrong data, too.
Suggested fix:
So, LOAD DATA INFILE doesn't seem to look at the charset definition in the databases, tables, columns or whereever.
This is true when running on the master server (I needed to set the character_set_database variable) and especially for the slave, where the manual set of character_set_database was not replicated and wrong data were written into the table.
In 4.1 the bug is not available, due to the fact that I still need to set character_set_database to the same charset as the table. But then the slave receives the data right.