Bug #44411 some Unicode text garbled in LOAD DATA INFILE with user variables
Submitted: 22 Apr 2009 15:22 Modified: 13 Nov 2018 13:56
Reporter: Miguel K Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1.34, 5.0, 5.1, 6.0 bzr OS:Any (Linux, MS Windows Server 2008)
Assigned to: Assigned Account CPU Architecture:Any

[22 Apr 2009 15:22] Miguel K
Description:
I found that with some Japanese characters,
LOAD DATA INFILE loads it fine when the column is loaded normally, but garbles the text if the column uses SET = @variable

Here is the output from the "How to repeat" section below:
戦
戦

The first entry is as it should be.  The second one is not.

How to repeat:
-- test1.txt has a single character + CRLF: 戦

CREATE TABLE  test3 (
  `note` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT ''
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- works fine when user variable not used
LOAD DATA INFILE 'C:/temp/test1.txt'
REPLACE
INTO TABLE test3
CHARACTER SET utf8
FIELDS ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(note);

-- does not work when column is loaded using a user variable
LOAD DATA INFILE 'C:/temp/4/test1.txt'
REPLACE
INTO TABLE test3
CHARACTER SET utf8
FIELDS ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@PN2)
SET note = @PN2;

SELECT * FROM test3;
[22 Apr 2009 15:23] Miguel K
file to use with LOAD DATA INFILE

Attachment: test1.txt (text/plain), 8 bytes.

[29 Apr 2009 9:25] Sveta Smirnova
Thank you for the report.

I can not repeat described behavior. Have you run SET NAMES utf8; before LOAD DATA? If so, please also check with current version 5.1.34 and provide copy-paste from the MySQL command line client.
[29 Apr 2009 14:34] Miguel K
I just tested with 5.1.34 and got the same result, both with and without "SET NAMES utf8".

While the MySQL command line client does not show utf8 values correctly, you can see that the two entries in the table are different.  They should be the same.  I need to open them in MySQL Query Browser to see the actual values.

Command line output:
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE  test3 (
    ->   `note` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT ''
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Query OK, 0 rows affected (0.02 sec)

mysql>
mysql> -- works fine when user variable not used
mysql> LOAD DATA INFILE 'C:/temp/test1.txt'
    -> REPLACE
    -> INTO TABLE test3
    -> CHARACTER SET utf8
    -> FIELDS ENCLOSED BY '"'
    -> LINES TERMINATED BY '\r\n'
    -> (note);
Query OK, 1 row affected (0.00 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT * FROM test3;
+--------+
| note   |
+--------+
| 戦 |
+--------+
1 row in set (0.00 sec)

mysql> LOAD DATA INFILE 'C:/temp/test1.txt'
    -> REPLACE
    -> INTO TABLE test3
    -> CHARACTER SET utf8
    -> FIELDS ENCLOSED BY '"'
    -> LINES TERMINATED BY '\r\n'
    -> (@PN2)
    -> SET note = @PN2;
Query OK, 1 row affected (0.00 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 0

mysql>
mysql> SELECT * FROM test3;
+--------------+
| note         |
+--------------+
| 戦       |
| 戦 |
+--------------+
2 rows in set (0.00 sec)

mysql>
[30 Apr 2009 8:17] Susanne Ebrecht
Many thanks for writing a bug report.

Unfortunately, this is not a bug.

The sign "戦" is a 4 byte sign and MySQL only support utf8 signs with a length of 3 bytes.
[30 Apr 2009 11:21] Miguel K
It is a 4-byte character? I would think it shouldn't allow me to put it in the database as I did with the first method.

Cheers
[1 May 2009 15:30] Miguel K
It is not 4-bytes, it is 3-bytes (according to my research).

Unicode value: U+6226
HTML: &#x 6226 ;

Unicode vales in the range of U+0800-U+FFFF are 3-bytes long.  This character is in that range.

Reference:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA
http://en.wikipedia.org/wiki/UTF-8
[4 May 2009 14:46] Susanne Ebrecht
testfile

Attachment: bug44411 (application/octet-stream, text), 1.45 KiB.

[4 May 2009 14:47] Susanne Ebrecht
Sorry, I made a lookup mistake.

I pasted a test file here. Somebody needs to test it on Windows 2008. Not repeatable on Linux.
[4 May 2009 17:52] Sveta Smirnova
Miguel,

thank you for the feedback and sorry for the wrong answer.

We think we found the problem: this is value of character_set_database or character_set_server which should be not utf8 in your case. In case if character_set_server is not utf8 this is not a bug according to http://dev.mysql.com/doc/refman/5.1/en/user-variables.html. But in case if character_set_database is not utf8 while character_set_server this is at least documentation bug.

Please run statement SHOW VARIABLES LIKE 'char%' in your environment to confirm this is the cause of the problems on your side: I want to be sure there is no other bug here.
[4 May 2009 17:56] Sveta Smirnova
Set to verified "Character Set" bug for now as effect of character_set_database on user variables is not listed in the user manual.

Workaround:

create database temp character set utf8;
use temp;
LOAD DATA INFILE XXX REPLACE INTO TABLE realdb.test3 .....
[4 May 2009 18:00] Sveta Smirnova
test case with character_set_database

Attachment: bug44411.test (application/octet-stream, text), 1.10 KiB.

[4 May 2009 18:01] Sveta Smirnova
option file

Attachment: bug44411-master.opt (application/octet-stream, text), 28 bytes.

[4 May 2009 19:05] Miguel K
Thank you.

I tried it with a new database using 'create database temp character set utf8;' and everything worked right.

Here is the output you requested for the database that I used in the bug report:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+---------------------------------------------------------+
| Variable_name            | Value                                                   |
+--------------------------+---------------------------------------------------------+
| character_set_client     | latin1                                                  |
| character_set_connection | latin1                                                  |
| character_set_database   | latin1                                                  |
| character_set_filesystem | binary                                                  |
| character_set_results    | latin1                                                  |
| character_set_server     | latin1                                                  |
| character_set_system     | utf8                                                    |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\ |
+--------------------------+---------------------------------------------------------+
8 rows in set (0.00 sec)

I don't want to ask too much, but it would be nice to:
 - get an error message instead of bad data when LOAD DATA INFILE with 'CHARACTER SET utf8' has a user variable and 'character_set_database' != utf8
 - have a page in the Unicode documentation explaining the effects of setting or not setting 'character_set_database' = utf8

I have long assumed that the database (and even table) character set was irrelevant as long as the column's character set was set correctly.

Cheers
[18 Jan 2018 14:22] Miguel K
If it isn't a bug, please explain why.  Don't just silently change the status to "not a bug" after it was previously verified. That is unprofessional.
[29 Oct 2018 6:08] Xing Zhang
Posted by developer:
 
This has been fixed in 5.1.50.
When using user variables to hold the column data in LOAD DATA the character set of the user variable was set incorrectly to the database charset. Fixed by setting it to the charset specified by LOAD DATA (if any).
[29 Oct 2018 6:08] Xing Zhang
Posted by developer:
 
This has been fixed in 5.1.50.
When using user variables to hold the column data in LOAD DATA the character set of the user variable was set incorrectly to the database charset. Fixed by setting it to the charset specified by LOAD DATA (if any).
[13 Nov 2018 13:56] MySQL Verification Team
According last developer comment, already fixed.