Bug #29323 mysql client only accetps ANSI encoded files
Submitted: 24 Jun 2007 13:11 Modified: 24 Nov 2010 16:42
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0, 5.1, 6.0.0 OS:Windows
Assigned to: Alexander Barkov CPU Architecture:Any

[24 Jun 2007 13:11] Peter Laursen
Description:
mysql client only accepts ANSI encoded files.
 
Not files encoded with the native Windows (UTF16) unicode implementation, and not UTF8 encoded files either.

Now: how do you image that the client can be used with all those languages for which no non-unicode implementation/no ANSI codepage is availabale? That applies to all Indian languages for instance!

(I did not find any 'client' category where to post this)

How to repeat:
create any simple SQL script like
"create database bugbase;"
(the content of the files really does not matter!)

... and save (using Notepad for instance) three copies of the file with Windows Unicode-, Utf8- and ANSI- encoding.  Only the latter can be used by 'mysql client' as this shows:

mysql> source c:\utf8.sql
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near '´╗┐cr
eate database bugbase' at line 1
mysql> source c:\utf16.sql
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near ' ■c'
at line 1
mysql> source c:\ansi.sql
Query OK, 1 row affected (0.01 sec)

Suggested fix:
As a minimum Windows unicode should be supported on Windows! (have MySQL developers been sleeping since Windows 95?)

But I also think UTF8 should be supported.  Almost every simple text editor (like Notepad) knows how to do this (and I believe Microsoft provides the code actually!)
[24 Jun 2007 14:07] MySQL Verification Team
Thank you for the bug report.
[4 Sep 2007 18:34] Alexander Barkov
This is because of the "BOM" (byte order mark), which is written
in the beginning of the file. The parser should be fixed to ignore
the byte order mark, at least in the beginning of a query.
[3 Oct 2007 11:53] Alexander Barkov
Windows notepad puts these BOM characters:

FFFE   - "Unicode" (little endian)
FEFF   - "Unicode big endian"
EFBBBF - "UTF-8"
[3 Oct 2007 11:58] Alexander Barkov
Windows notepad puts these BOM characters:

FFFE   - "Unicode" (little endian)
FEFF   - "Unicode big endian"
EFBBBF - "UTF-8"
[4 Oct 2007 7:11] Alexander Barkov
An unclear thing here is how to combine BOM marks and --default-character-set option?

1. Should mysql skip BOM marks only if --default-character-set is utf8?

2. Or should mysql switch to utf8 when it meets a 0xEFBBBF BOM mark in the
beginning of a file? Should it ignore --default-character-set specified
either in command line or in my.cnf?

The second looks more correct, because files can include
each other using the "source" command. So an ANSI file can include
an UTF8 file, which can include an ANSI file, which can include UTF8 file,
and so on.
[4 Oct 2007 8:09] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34872

ChangeSet@1.2539, 2007-10-04 13:06:01+05:00, bar@mysql.com +3 -0
  Bug#29323 mysql client only accetps ANSI encoded files
  Fix: ignore BOM marker in the first line.
[4 Oct 2007 8:21] Alexander Barkov
This patch make mysql ignore BOM markers.

It does not switch mysql into UTF8 mode yet - 
it will be done under terms of 
"WL#2637 Byte Order Mark for LOAD DATA INFILE and mysqlimport"
[4 Oct 2007 8:56] Alexander Barkov
Pushed into 5.0.50-rpl
[4 Oct 2007 10:29] Alexander Barkov
Pushed into 5.1.23
[5 Oct 2007 21:20] Peter Laursen
If I understand correctly then this means that a non-ascii character in a UTF8 encoded file may be interpreted as a 2 or 3 character long (ASCII/ANSI) character sequence.

This is a BAD PATCH that makes things even worse.  An error message is preferable then!  If data cannot be handled correctly then send an error - but don't ever GARBLE DATA!!

This exposes a weird priority (and even mentality) in my opinion.  I am sorry for saying this, but I see no excuse!
[7 Oct 2007 20:03] Alexander Barkov
> If I understand correctly then this means that a non-ascii character in
> a UTF8 encoded file may be interpreted as a 2 or 3 character long
> (ASCII/ANSI) character sequence.

True. You need to run "mysql --default-character-set=utf8" when loading
a UTF-8 encoded file. 

An extra parameter can be inconvenient, but it is better than 
not being able to load a file at all.

Automatic switch to UTF-8 will be done within this task:
"WL#2637 Byte Order Mark for LOAD DATA INFILE and mysqlimport"
http://forge.mysql.com/worklog/task.php?id=2637
[8 Oct 2007 7:43] Peter Laursen
I was looking for documentation on how to force UTF8 in input and output from/to external files as well.  Could you point to it please?
[8 Oct 2007 7:59] Alexander Barkov
http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
[27 Nov 2007 10:48] Bugs System
Pushed into 5.0.54
[27 Nov 2007 10:50] Bugs System
Pushed into 5.1.23-rc
[27 Nov 2007 10:52] Bugs System
Pushed into 6.0.4-alpha
[19 Dec 2007 5:38] Paul DuBois
Noted in 5.0.54, 5.1.23, 6.0.4 changelogs.

The mysql client program now ignores Unicode byte order mark (BOM)
characters at the beginning of input files. Previously, it read them
and sent them to the server, resulting in a syntax error.

Presence of a BOM does not cause mysql to change its default
character set. To do that, invoke mysql with an option such as
--default-character-set=utf8.
[19 Nov 2010 9:57] Stanislav Arkhipov
strange, but it seems the 'utf8_with_bom' bug still exists:

mysql> LOAD DATA LOCAL INFILE 'c:/Work/AgroImpex/stocks.txt'
    -> REPLACE
    -> INTO TABLE chlng.stocks_orig
    -> CHARACTER SET 'utf8'
    -> FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
    -> LINES TERMINATED BY '\r\n' STARTING BY ''
    -> ;
ERROR 1292 (22007): Incorrect date value: 'я╗┐14.10.2010' for column 'sdate' at row 1

Server version: 5.1.37-community MySQL Community Server (GPL)
Windows XP
[19 Nov 2010 10:13] Alexander Barkov
Stanislav,

can you please attach your file to the "Files" section of the bug report?

Thanks!
[19 Nov 2010 10:14] Alexander Barkov
Or output from this query:

SELECT LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stocks.txt')), 100);
[19 Nov 2010 12:16] Stanislav Arkhipov
well, i've restored original file as stock1 and then

mysql> SELECT LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stock1.txt')), 100);
+-----------------------------------------------------------+
| LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stock1.txt')), 100) |
+-----------------------------------------------------------+
| NULL                                                      |
+-----------------------------------------------------------+
1 row in set, 1 warning (0.03 sec)

i can see in the very beginning of file
EF BB BF
[19 Nov 2010 12:18] Stanislav Arkhipov
and warning is
Warning (Code 1301): Result of load_file() was larger than max_allowed_packet (1048576) - truncated
[19 Nov 2010 12:37] Alexander Barkov
Can you please post HEX dump of a few more bytes after the BOM marker?
Thanks.
[19 Nov 2010 13:17] Stanislav Arkhipov
there's a truncated version of input file

Attachment: stock1.txt (text/plain), 455 bytes.

[24 Nov 2010 16:39] Alexander Barkov
Stanislav,

Thanks for the example file!

BOM marker is currently recognized in this context:

mysql> source c:\utf8.sql

(i.e. it will skip BOM in utf8.sql).

LOAD DATA does not recognize BOM.
I will create a separate bug report for that.
[24 Nov 2010 16:42] Peter Laursen
@Alexander .. please post link to new report here then!
[25 Nov 2010 8:26] Stanislav Arkhipov
thanks for your works guys
[25 Nov 2010 9:05] Alexander Barkov
Hi Stanislav, Peter,

I reopened http://bugs.mysql.com/bug.php?id=10573.
It was considered as feature request originally.

Now with new circumstances (data loss) I'm going escalate it.

Thanks for reporting!