Bug #59523 LOAD DATA (LOCAL) INFILE does not work for UTF8
Submitted: 15 Jan 2011 8:02 Modified: 1 Nov 2012 17:27
Reporter: Ngan Pham Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0, 5.1 OS:Any
Assigned to: CPU Architecture:Any
Tags: import, LOAD DATA INFILE, utf8

[15 Jan 2011 8:02] Ngan Pham
Description:
I am unable to use LOAD DATA LOCAL INFILE with a file with utf8 content.  Non-ascii characters come in the database as ?'s.  My database, table, column charset is all set to utf8.

I noticed that if i changed my database charset to Latin1, the query will work.  Or, if I add the "CHARACTER SET latin1" option to the query, it'll work.  Adding "CHARACTER SET utf8" will not work.

How to repeat:
1. Create a utf8 database.
2. Create a utf8 table with 1 varchar (255) column.
3. Create a utf8 file with some utf8 characters.
4. Use LOAD DATA LOCAL INFILE query to load the file into the table.

Suggested fix:
I think mysql is parsing the file incorrectly.
[15 Jan 2011 9:28] Peter Laursen
see http://dev.mysql.com/doc/refman/5.0/en/load-data.html

note 

1)
LOAD DATA ...
    [CHARACTER SET charset_name]

2)
The character set indicated by the character_set_database system variable is used to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause, which is available as of MySQL 5.0.38. A character set of binary specifies “no conversion.”

Do you specify the "CHARACTER SET" option in the statement?

Peter
(not a MySQL person)
[15 Jan 2011 10:02] Valeriy Kravchuk
Looks like we need more details about OS and MySQL server version used, exact LOAD DATA statement and data you try to load and  then get from the database. This is what I see:

macbook-pro:5.0 openxs$ cat /tmp/utf8.txt
русский
english
אָלֶף 
ألف مقصورة)
macbook-pro:5.0 openxs$ file /tmp/utf8.txt
/tmp/utf8.txt: UTF-8 Unicode text
macbook-pro:5.0 openxs$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.0.93-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show create table tu\G
*************************** 1. row ***************************
       Table: tu
Create Table: CREATE TABLE `tu` (
  `c1` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> load data local infile '/tmp/utf8.txt' into table tu;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tu;
+----------------------+
| c1                   |
+----------------------+
| русский       | 
| english              | 
| אָלֶף           | 
| ألف مقصورة) | 
+----------------------+
4 rows in set (0.00 sec)

mysql> show session variables like 'char%';
+--------------------------+---------------------------------------------+
| Variable_name            | Value                                       |
+--------------------------+---------------------------------------------+
| character_set_client     | latin1                                      | 
| character_set_connection | latin1                                      | 
| character_set_database   | latin1                                      | 
| character_set_filesystem | binary                                      | 
| character_set_results    | latin1                                      | 
| character_set_server     | latin1                                      | 
| character_set_system     | utf8                                        | 
| character_sets_dir       | /Users/openxs/dbs/5.0/share/mysql/charsets/ | 
+--------------------------+---------------------------------------------+
8 rows in set (0.00 sec)
[15 Jan 2011 16:25] Ngan Pham
server1:~ npham$ cat /tmp/utf8.txt 
русский
english
אָלֶף 
ألف مقصورة)

server1:~ npham$ mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3463
Server version: 5.0.91-log MySQL Community Server (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE utf8_test CHARACTER SET utf8;
Query OK, 1 row affected (0.01 sec)

mysql> show create database utf8_test;
+-----------+--------------------------------------------------------------------+
| Database  | Create Database                                                    |
+-----------+--------------------------------------------------------------------+
| utf8_test | CREATE DATABASE `utf8_test` /*!40100 DEFAULT CHARACTER SET utf8 */ | 
+-----------+--------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> use utf8_test;
Database changed

mysql> CREATE TABLE `tu` (`c1` varchar(255) default NULL) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.04 sec)

mysql> SHOW CREATE TABLE `tu`;
+-------+-------------------------------------------------------------------------------------------+
| Table | Create Table                                                                              |
+-------+-------------------------------------------------------------------------------------------+
| tu    | CREATE TABLE `tu` (
  `c1` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 | 
+-------+-------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> load data local infile '/tmp/utf8.txt' into table tu;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tu; 
+-------------+
| c1          |
+-------------+
| ???????     | 
| english     | 
| ?????       | 
| ??? ??????) | 
+-------------+
4 rows in set (0.00 sec)

mysql> show session variables like 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                                 | 
| character_set_connection | latin1                                 | 
| character_set_database   | utf8                                   | 
| character_set_filesystem | binary                                 | 
| character_set_results    | latin1                                 | 
| character_set_server     | latin1                                 | 
| character_set_system     | utf8                                   | 
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ | 
+--------------------------+----------------------------------------+
8 rows in set (0.01 sec)
[15 Jan 2011 16:26] Ngan Pham
The key to duplicating this is creating a database with a default character set of utf8 (not latin1).
[15 Jan 2011 16:35] Ngan Pham
Specifying the CHARACTER SET (correctly) in the query does not make a difference.

mysql> truncate tu;
Query OK, 0 rows affected (0.00 sec)

mysql> load data local infile '/tmp/utf8.txt' into table tu character set utf8;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tu;
+-------------+
| c1          |
+-------------+
| ???????     | 
| english     | 
| ?????       | 
| ??? ??????) | 
+-------------+
4 rows in set (0.00 sec)

HOWEVER, strangely enough, specifying the (incorrect) CHARACTER SET as latin1, things seem to work:

mysql> truncate tu;
Query OK, 0 rows affected (0.00 sec)

mysql> load data local infile '/tmp/utf8.txt' into table tu character set latin1;
Query OK, 4 rows affected (0.00 sec)
Records: 4  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from tu;
+----------------------+
| c1                   |
+----------------------+
| русский       | 
| english              | 
| אָלֶף           | 
| ألف مقصورة) | 
+----------------------+
4 rows in set (0.00 sec)
[17 Jan 2011 12:13] Sveta Smirnova
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

You have to specify proper client character set when retrieve records from the table. See http://dev.mysql.com/doc/refman/5.1/en/set-option.html and http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html for details.
[17 Jan 2011 18:04] Ngan Pham
Sveta, I'm not trying to retrieve records from the table.  I'm trying to load data into the table.  I don't see how this is not a bug?  You can EASILY reproduce it from my comment above.  Unless I'm doing something wrong--at which point it would be a lot more helpful if you can tell what it is I'm doing wrong instead of linking me to documentation and assuming that I haven't gone through it already.
[17 Jan 2011 18:54] Sveta Smirnova
Thank you for the feedback.

Problem is in comment "[15 Jan 17:25] Ngan Pham <nganpham@gmail.com>":

mysql> select * from tu; 
+-------------+
| c1          |
+-------------+
| ???????     | 
| english     | 
| ?????       | 
| ??? ??????) | 
+-------------+
4 rows in set (0.00 sec)

mysql> show session variables like 'char%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | latin1                        
...

To understand how it affects this case, please, read manual links to which I already provided.
[21 Jan 2011 0:13] Ngan Pham
Setting character_set_client and/or character_set_connection to utf8 still does not make it work.
[21 Jan 2011 17:31] Sveta Smirnova
Thank you for the feedback.

Please provide all steps with proper SET NAMES
[22 Feb 2011 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[16 Oct 2012 13:51] Stefan Haubenthal
http://blog.datispars.com/loading-utf8-text-data-into-a-mysql-table/