Bug #10573 BOM(of UTF-8) is not correctly handled by LOAD DATA
Submitted: 12 May 2005 8:08 Modified: 28 Jan 2014 21:38
Reporter: Hartmut Holzgraefe Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DML Severity:S2 (Serious)
Version:4.1.11/5.6 OS:Linux (Linux)
Assigned to: Assigned Account CPU Architecture:Any
Triage: Needs Triage: D5 (Feature request) / R1 (None/Negligible)

[12 May 2005 8:08] Hartmut Holzgraefe
Description:
A BOM(Byte Order Mark) can exist at the begining of
unicode and UTF-8/16 files in order to distinguish
character encoding and endian.

In case of UTF-8, a BOM is 0xEFBBBF in hex.
Without BOM in a UTF-8 file, it is called UTF-8N to
distinguish from a UTF-8 file with BOM.

BOM is considered and loaded as actual data if a
UTF-8 file(with BOM) is loaded by LOAD DATA
statement. This becomes a problem when converting
char set.

This behavoiur doesn't happen if UTF-8N(without BOM).

How to repeat:
Two .txt files are attached as test data.

utf8.txt --- contains BOM
utf8n.txt --- does not contain BOM

In each file, the first line is "abc" and the second
line is "def".

0. Create table

CREATE TABLE utf8_tbl (col1 VARCHAR(50)) CHARACTER SET utf8;
CREATE TABLE sjis_tbl (col1 VARCHAR(50)) CHARACTER SET sjis;

1.Store UTF-8N(without BOM) data into utf8 table

SET character_set_database=utf8;
LOAD DATA LOCAL INFILE "/tmp/utf8n.txt" INTO TABLE utf8_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
SELECT col1, HEX(col1) FROM utf8_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| abc | 616263 |
| def | 646566 |
+------+-----------+

No problem here.

2. Store UTF-8(with BOM) data into utf8 table

TRUNCATE TABLE utf8_tbl;
SET character_set_database=utf8;
LOAD DATA LOCAL INFILE "/tmp/utf8.txt" INTO TABLE utf8_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
SELECT col1, HEX(col1) FROM utf8_tbl;
+------+--------------+
| col1 | HEX(col1) |
+------+--------------+
| ?abc | EFBBBF616263 |
| def | 646566 |
+------+--------------+

Problem.
BOM is handled as a string and original data is
destroyed.

3.Store UTF-8N(without BOM) data into sjis(non-utf8)
table

SET character_set_database=utf8;
LOAD DATA LOCAL INFILE "/tmp/utf8n.txt" INTO TABLE sjis_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 0
SELECT col1, HEX(col1) FROM sjis_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| abc | 616263 |
| def | 646566 |
+------+-----------+

No problem.

4. Store UTF-8(with BOM) data into sjis(non-utf8)
table

TRUNCATE TABLE sjis_tbl;
SET character_set_database=utf8;
LOAD DATA LOCAL INFILE "/tmp/utf8.txt" INTO TABLE sjis_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected, 1 warning (0.01 sec)
Records: 2 Deleted: 0 Skipped: 0 Warnings: 1
SELECT col1, HEX(col1) FROM sjis_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| ?abc | 3F616263 |
| def | 646566 |
+------+-----------+

Problem.
When storing, a conversion(utf8->ucs2->sjis) occurs
and BOM(0xEFBBBF) is converted to 0x3F as
undefined.
[16 May 2005 17:53] Alexander Barkov
I think this is not a bug. This is a feature request
"add Byte Order Mark support into LOAD DATA INFILE".
[19 May 2005 16:34] Sergei Golubchik
see also http://bugs.mysql.com/bug.php?id=4960
[2 Jun 2005 18:07] Alexander Barkov
This is a feature request. We created a TODO item for implement
this feature. It's currently scheduled for MySQL-5.2.
[25 Nov 2010 9:02] Alexander Barkov
The original report in Bug#10573 demonstrates
that LOAD DATA INFILE imports BOM as '?' mark.

In bug#29323 Stanislav demonstrates another example,
when BOM causes data *loss*:

mysql> LOAD DATA LOCAL INFILE 'c:/Work/AgroImpex/stocks.txt'
    -> REPLACE
    -> INTO TABLE chlng.stocks_orig
    -> CHARACTER SET 'utf8'
    -> FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
    -> LINES TERMINATED BY '\r\n' STARTING BY ''
    -> ;
ERROR 1292 (22007): Incorrect date value: 'я╗┐14.10.2010' for column 'sdate' at row
1

So the value of the date column in the first row gets lost.
This is a more serious flaw than extra '?' mark.

I am reopening the report, so it can be fixed soon.
[25 Nov 2010 9:11] Alexander Barkov
A possible workaround:

- open the file in a text editor and insert a new line in the very beginning
- then run LOAD DATA
- then delete the first record from the table using a DELETE statement
  with appropriate WHERE condition.
[25 Nov 2010 9:18] Alexander Barkov
Note, in strict mode the entire LOAD DATA INFILE interrupts on error,
so no import happens at all.
[28 Jan 2014 21:38] Miguel Solorzano
Still present and verified as Feature request.

C:\dbs>net start mysqld56
The MySQLD56 service is starting.
The MySQLD56 service was started successfully.

C:\dbs>56

C:\dbs>c:\dbs\5.6\bin\mysql -uroot --port=3560 --prompt="mysql 5.6 > "
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.6.16-log Source distribution

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql 5.6 > CREATE DATABASE bom;
Query OK, 1 row affected (0.00 sec)

mysql 5.6 > USE bom
Database changed
mysql 5.6 > CREATE TABLE utf8_tbl (col1 VARCHAR(50)) CHARACTER SET utf8;
Query OK, 0 rows affected (0.17 sec)

mysql 5.6 > CREATE TABLE sjis_tbl (col1 VARCHAR(50)) CHARACTER SET sjis;
Query OK, 0 rows affected (0.14 sec)

mysql 5.6 > SET character_set_database=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql 5.6 > LOAD DATA LOCAL INFILE "c:/tmp/utf8n.txt" INTO TABLE utf8_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.17 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

mysql 5.6 > SELECT col1, HEX(col1) FROM utf8_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| abc  | 616263    |
| def  | 646566    |
+------+-----------+
2 rows in set (0.00 sec)

mysql 5.6 > TRUNCATE TABLE utf8_tbl;
Query OK, 0 rows affected (0.27 sec)

mysql 5.6 > SET character_set_database=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql 5.6 > LOAD DATA LOCAL INFILE "c:/tmp/utf8.txt" INTO TABLE utf8_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.05 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

mysql 5.6 > SELECT col1, HEX(col1) FROM utf8_tbl;
+------+--------------+
| col1 | HEX(col1)    |
+------+--------------+
| ?abc | EFBBBF616263 |
| def  | 646566       |
+------+--------------+
2 rows in set (0.00 sec)

mysql 5.6 > SET character_set_database=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql 5.6 > LOAD DATA LOCAL INFILE "c:/tmp/utf8n.txt" INTO TABLE sjis_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected (0.11 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 0

mysql 5.6 > SELECT col1, HEX(col1) FROM sjis_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| abc  | 616263    |
| def  | 646566    |
+------+-----------+
2 rows in set (0.00 sec)

mysql 5.6 > TRUNCATE TABLE sjis_tbl;
Query OK, 0 rows affected (0.19 sec)

mysql 5.6 > SET character_set_database=utf8;
Query OK, 0 rows affected (0.00 sec)

mysql 5.6 > LOAD DATA LOCAL INFILE "c:/tmp/utf8.txt" INTO TABLE sjis_tbl LINES TERMINATED BY '\n';
Query OK, 2 rows affected, 1 warning (0.05 sec)
Records: 2  Deleted: 0  Skipped: 0  Warnings: 1

mysql 5.6 > SELECT col1, HEX(col1) FROM sjis_tbl;
+------+-----------+
| col1 | HEX(col1) |
+------+-----------+
| ?abc | 3F616263  |
| def  | 646566    |
+------+-----------+
2 rows in set (0.00 sec)

mysql 5.6 >
[13 Feb 12:26] muthupandi k
Any progress in this feature or any alternate way to handle BOM