Bug #29323 | mysql client only accepts ANSI encoded files | ||
---|---|---|---|
Submitted: | 24 Jun 2007 13:11 | Modified: | 24 Nov 2010 16:42 |
Reporter: | Peter Laursen (Basic Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S2 (Serious) |
Version: | 5.0, 5.1, 6.0.0 | OS: | Windows |
Assigned to: | Alexander Barkov | CPU Architecture: | Any |
[24 Jun 2007 13:11]
Peter Laursen
[24 Jun 2007 14:07]
MySQL Verification Team
Thank you for the bug report.
[4 Sep 2007 18:34]
Alexander Barkov
This is because of the "BOM" (byte order mark), which is written in the beginning of the file. The parser should be fixed to ignore the byte order mark, at least in the beginning of a query.
[3 Oct 2007 11:53]
Alexander Barkov
Windows Notepad writes these BOM byte sequences at the start of a file, depending on the encoding chosen in the Save dialog:
FFFE - "Unicode" (little endian)
FEFF - "Unicode big endian"
EFBBBF - "UTF-8"
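
As a rough illustration of the three signatures listed above, here is a minimal Python sketch that checks the first bytes of a file against them. The names `BOMS` and `detect_bom` are made up for this example and are not part of the mysql client; the labels are the ones Notepad shows. UTF-32 signatures are deliberately not handled.

```python
import codecs

# BOM byte sequences from the comment above, with Notepad's labels.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                        # EF BB BF
    (codecs.BOM_UTF16_LE, "Unicode (little endian)"),  # FF FE
    (codecs.BOM_UTF16_BE, "Unicode big endian"),       # FE FF
]


def detect_bom(data: bytes):
    """Return (label, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label, len(bom)
    return None, 0
```

For example, `detect_bom(b"\xef\xbb\xbfSELECT 1;")` returns `("UTF-8", 3)`, and a file without a BOM returns `(None, 0)`.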
[4 Oct 2007 7:11]
Alexander Barkov
It is unclear how BOM marks should be combined with the --default-character-set option:
1. Should mysql skip BOM marks only if --default-character-set is utf8?
2. Or should mysql switch to utf8 when it meets a 0xEFBBBF BOM mark at the beginning of a file? Should it then ignore the --default-character-set specified either on the command line or in my.cnf?
The second looks more correct, because files can include each other using the "source" command. So an ANSI file can include a UTF8 file, which can include an ANSI file, which can include a UTF8 file, and so on.
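
To make option 2 concrete, here is a small Python sketch of per-file charset selection for a chain of nested "source" includes. It only models the proposal discussed in this comment, it is not what the mysql client does; the function names and the `latin1` default used in the usage note are assumptions made for illustration.

```python
import codecs


def charset_for_file(first_bytes: bytes, default_charset: str) -> str:
    """Option 2: a UTF-8 BOM overrides the default charset for this file only."""
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf8"
    return default_charset


def charsets_for_include_chain(files, default_charset: str):
    """Resolve the charset of each file in a chain of nested includes.

    files is a list of (name, first_bytes) pairs in include order.
    Each file is decided on its own, so ANSI and UTF-8 files can alternate.
    """
    return [(name, charset_for_file(head, default_charset)) for name, head in files]
```

With a default of `latin1`, a chain of an ANSI file, a UTF-8 file with BOM, and another ANSI file would resolve to `latin1`, `utf8`, `latin1` in that order.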
[4 Oct 2007 8:09]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version.
You can access the patch from: http://lists.mysql.com/commits/34872
ChangeSet@1.2539, 2007-10-04 13:06:01+05:00, bar@mysql.com +3 -0
Bug#29323 mysql client only accetps ANSI encoded files
Fix: ignore BOM marker in the first line.
[4 Oct 2007 8:21]
Alexander Barkov
This patch makes mysql ignore BOM markers. It does not switch mysql into UTF8 mode yet; that will be done under "WL#2637 Byte Order Mark for LOAD DATA INFILE and mysqlimport".
[4 Oct 2007 8:56]
Alexander Barkov
Pushed into 5.0.50-rpl
[4 Oct 2007 10:29]
Alexander Barkov
Pushed into 5.1.23
[5 Oct 2007 21:20]
Peter Laursen
If I understand correctly then this means that a non-ascii character in a UTF8 encoded file may be interpreted as a 2 or 3 character long (ASCII/ANSI) character sequence. This is a BAD PATCH that makes things even worse. An error message is preferable then! If data cannot be handled correctly then send an error - but don't ever GARBLE DATA!! This exposes a weird priority (and even mentality) in my opinion. I am sorry for saying this, but I see no excuse!
[7 Oct 2007 20:03]
Alexander Barkov
> If I understand correctly then this means that a non-ascii character in
> a UTF8 encoded file may be interpreted as a 2 or 3 character long
> (ASCII/ANSI) character sequence.

True. You need to run "mysql --default-character-set=utf8" when loading a UTF-8 encoded file. An extra parameter can be inconvenient, but it is better than not being able to load a file at all.

Automatic switch to UTF-8 will be done within this task: "WL#2637 Byte Order Mark for LOAD DATA INFILE and mysqlimport" http://forge.mysql.com/worklog/task.php?id=2637
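
The "2 or 3 character long sequence" effect can be reproduced outside MySQL. The Python sketch below encodes a character as UTF-8 and then decodes the bytes as Windows-1252, which is assumed here to stand in for "ANSI"; the helper name `misread_as_ansi` is made up for this example.

```python
def misread_as_ansi(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252 ("ANSI")."""
    return text.encode("utf-8").decode("cp1252")


# "é" is 2 bytes in UTF-8 (C3 A9), so it is misread as 2 characters.
print(len(misread_as_ansi("é")))  # 2
# "€" is 3 bytes in UTF-8 (E2 82 AC), so it is misread as 3 characters.
print(len(misread_as_ansi("€")))  # 3
```

This is the garbling that occurs when a UTF-8 file is read under a single-byte connection character set.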
[8 Oct 2007 7:43]
Peter Laursen
I was looking for documentation on how to force UTF8 in input and output from/to external files as well. Could you point to it please?
[8 Oct 2007 7:59]
Alexander Barkov
http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
[27 Nov 2007 10:48]
Bugs System
Pushed into 5.0.54
[27 Nov 2007 10:50]
Bugs System
Pushed into 5.1.23-rc
[27 Nov 2007 10:52]
Bugs System
Pushed into 6.0.4-alpha
[19 Dec 2007 5:38]
Paul DuBois
Noted in 5.0.54, 5.1.23, 6.0.4 changelogs. The mysql client program now ignores Unicode byte order mark (BOM) characters at the beginning of input files. Previously, it read them and sent them to the server, resulting in a syntax error. Presence of a BOM does not cause mysql to change its default character set. To do that, invoke mysql with an option such as --default-character-set=utf8.
[19 Nov 2010 9:57]
Stanislav Arkhipov
strange, but it seems the 'utf8_with_bom' bug still exists:
mysql> LOAD DATA LOCAL INFILE 'c:/Work/AgroImpex/stocks.txt'
    -> REPLACE
    -> INTO TABLE chlng.stocks_orig
    -> CHARACTER SET 'utf8'
    -> FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
    -> LINES TERMINATED BY '\r\n' STARTING BY ''
    -> ;
ERROR 1292 (22007): Incorrect date value: 'я╗┐14.10.2010' for column 'sdate' at row 1
Server version: 5.1.37-community MySQL Community Server (GPL)
Windows XP
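
The three odd characters in front of the date in the error message are consistent with a UTF-8 BOM. The Python sketch below shows this under the assumption that the console was displaying bytes in code page 866 (the usual Cyrillic console code page on Windows); the report itself does not state the code page, and the helper name is made up for this example.

```python
import codecs


def show_in_codepage(data: bytes, codepage: str = "cp866") -> str:
    """Decode raw bytes the way a console using the given code page would display them."""
    return data.decode(codepage)


# First field of the data file: UTF-8 BOM (EF BB BF) followed by the date.
first_field = codecs.BOM_UTF8 + b"14.10.2010"
print(show_in_codepage(first_field))  # я╗┐14.10.2010
```

In other words, the BOM bytes appear to have been kept as part of the first column value, which is why the date could not be parsed.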
[19 Nov 2010 10:13]
Alexander Barkov
Stanislav, can you please attach your file to the "Files" section of the bug report? Thanks!
[19 Nov 2010 10:14]
Alexander Barkov
Or output from this query: SELECT LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stocks.txt')), 100);
[19 Nov 2010 12:16]
Stanislav Arkhipov
well, i've restored original file as stock1 and then
mysql> SELECT LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stock1.txt')), 100);
+-----------------------------------------------------------+
| LEFT(HEX(LOAD_FILE('c:/Work/AgroImpex/stock1.txt')), 100) |
+-----------------------------------------------------------+
| NULL                                                      |
+-----------------------------------------------------------+
1 row in set, 1 warning (0.03 sec)
i can see in the very beginning of file EF BB BF
[19 Nov 2010 12:18]
Stanislav Arkhipov
and warning is Warning (Code 1301): Result of load_file() was larger than max_allowed_packet (1048576) - truncated
[19 Nov 2010 12:37]
Alexander Barkov
Can you please post HEX dump of a few more bytes after the BOM marker? Thanks.
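
Since LOAD_FILE() returned NULL because of max_allowed_packet, a hex dump of the first bytes can also be taken outside the server. Here is a minimal Python sketch; the function name `hex_head` is made up for this example, and the default of 50 bytes matches the 100 hex digits requested by the LEFT(HEX(...), 100) query.

```python
def hex_head(path: str, count: int = 50) -> str:
    """Return the first `count` bytes of a file as an uppercase hex string."""
    with open(path, "rb") as f:
        return f.read(count).hex().upper()
```

Only the first `count` bytes are read, so the size of the file does not matter. A file saved as UTF-8 with a BOM would give a string starting with `EFBBBF`.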
[19 Nov 2010 13:17]
Stanislav Arkhipov
here is a truncated version of the input file
Attachment: stock1.txt (text/plain), 455 bytes.
[24 Nov 2010 16:39]
Alexander Barkov
Stanislav, thanks for the example file!
The BOM marker is currently recognized only in this context:
mysql> source c:\utf8.sql
(i.e. mysql will skip the BOM in utf8.sql).
LOAD DATA does not recognize the BOM. I will create a separate bug report for that.
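
As long as LOAD DATA does not skip the BOM, one possible workaround is to load a BOM-free copy of the data file. The Python sketch below is only an illustration of that idea, not something suggested in this report; the function names and the file paths passed to them are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def write_bom_free_copy(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(strip_utf8_bom(data))
```

The copy can then be passed to LOAD DATA LOCAL INFILE with CHARACTER SET 'utf8' as before. Only a BOM at the very start of the file is removed; the rest of the bytes are written unchanged.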
[24 Nov 2010 16:42]
Peter Laursen
@Alexander .. please post link to new report here then!
[25 Nov 2010 8:26]
Stanislav Arkhipov
thanks for your work, guys
[25 Nov 2010 9:05]
Alexander Barkov
Hi Stanislav, Peter, I reopened http://bugs.mysql.com/bug.php?id=10573. It was originally considered a feature request. Now, with the new circumstances (data loss), I'm going to escalate it. Thanks for reporting!