Bug #29890 Loading from utf-8 formatted file using LOAD DATA INFILE
Submitted: 19 Jul 2007 7:55 Modified: 19 Jul 2007 13:25
Reporter: Erek Gokturk Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:5.0.45-community-nt OS:Windows (XP Pro SP2)
Assigned to: CPU Architecture:Any

[19 Jul 2007 7:55] Erek Gokturk
Description:
Cannot load data in utf-8 character set from a file using LOAD DATA INFILE command. It ends up in the following error:

ERROR 1366 (HY000): Incorrect string value: '\xC4\xB1\xC4\x9F\xC3\xBC...' for column 'Z1' at row 1

The file contains the single line given below (formatted as utf-8 web for bug submission, normally it is in utf8 binary in file)
ŊĸĕŸďĀʼnķõŷïà|ŊĸĕŸďĀʼnķõŷïà|ŊĸĕŸďĀʼnķõŷïà|ŊĸĕŸďĀʼnķõŷïà|ŊĸĕŸďĀʼnķõŷïà

The details of character set related variables are (SHOW VARIABLES LIKE 'character%';):

| character_set_client     | utf8
| character_set_connection | utf8
| character_set_database   | utf8
| character_set_filesystem | binary
| character_set_results    | utf8
| character_set_server     | utf8
| character_set_system     | utf8
| character_sets_dir       | J:\Program Files\MySQL\MySQL Server 5.0\share\charsets\

How to repeat:

Create a database using the following commands (note: this uses the default engine innodb):

CREATE SCHEMA ABC;
CREATE TABLE ZXY
  ( Z1 VARCHAR(30) NOT NULL,
    Z2 VARCHAR(150),
    Z3 VARCHAR(250),
    Z4 VARCHAR(250),
    Z5 VARCHAR(30),
    
    PRIMARY KEY (Z1));
ALTER DATABASE ABC CHARACTER SET 'utf8' COLLATE 'utf8_turkish_ci';
ALTER TABLE ZXY CHARACTER SET 'utf8' COLLATE 'utf8_turkish_ci';

Create a utf-8 formatted, unsigned file and put the line above in the description into the file (or I guess any bunch of non-latin1 characters will do). UTF-8 signed files also create error.

Try loading it using the command below. If necessary, modify the line ending. This is tested on windows, and the file is one line anyway.

LOAD DATA INFILE '<<THE PATH TO YOUR FILE>>' INTO TABLE ZXY CHARACTER SET 'utf8' FIELDS TERMINATED BY '|' LINES TERMINATED BY '\r\n';

Then the error in the description is produced.

Suggested fix:
Is there a filter that rejects non-printable part of ascii that gets applied to all char or varchar fields when loading data from file?
[19 Jul 2007 7:58] Erek Gokturk
Apparently could not format for web. Here is another try for the single line in the file:

ığüşöçİĞÜŞÖÇ|ığüşöçİĞÜŞÖÇ|ığüşöçİĞÜŞÖÇ|ığüşöçİĞÜŞÖÇ|ığüşöçİĞÜŞÖÇ
[19 Jul 2007 7:59] Erek Gokturk
OK. It is hard to submit this single line, but you should get the idea :)
[19 Jul 2007 8:01] Erek Gokturk
The data file to be loaded

Attachment: utf8.txt (text/plain), 124 bytes.

[19 Jul 2007 13:25] Sveta Smirnova
Thank you for the report.

We can not repeat described behaviour in our own environment. So I'll close the report as "Can't repeat".

If you have idea what is unusual in your environment, feel free to reopen the report.