Bug #43337 SELECT INTO OUTFILE adds false space characters
Submitted: 3 Mar 2009 20:25 Modified: 4 Mar 2009 13:03
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:6.0.9, 5.1.32, 5.0.77 OS:Windows
Assigned to: CPU Architecture:Any

[3 Mar 2009 20:25] Peter Laursen
Description:
SELECT INTO OUTFILE generates false spaces for ucs2, utf16 and utf32 but not utf8.

How to repeat:
create table ucs2test (txt char(5)) charset ucs2;
insert into ucs2test values ('æøå');
select txt into outfile 'c:\\ucs2text.txt' from ucs2test;

create table utf16test (txt char(5)) charset ucs2;
insert into utf16test values ('æøå');
select txt into outfile 'c:\\utf16text.txt' from utf16test;

create table utf32test (txt char(5)) charset ucs2;
insert into utf32test values ('æøå');
select txt into outfile 'c:\\utf32text.txt' from utf32test;

create table utf8test (txt char(5)) charset utf8;
insert into utf8test values ('æøå');
select txt into outfile 'c:\\utf8text.txt' from utf8test;

utf8test.txt reads "æøå" - all others read ' æ ø å' (a SPACE before every character). Checked in Notepad and PSPad editors.

- (and what I do not understand is that opening in a my hex editor (PSPAD hex mode) only utf8test displays anything at all - the rest show empty)

For MySQL 5.1 it is reproducable same way with utf8 and ucs2
[3 Mar 2009 20:46] Peter Laursen
same with 5.0.77.

'æøå' is saved to file with SELECT INTO OUTFILE from utf8 - ' æ ø å' from ucs2.
[3 Mar 2009 20:54] Peter Laursen
Additionally I can see that the file exported from a utf8 table got UTF8-encoded.  The others ANSI-encoded.

If you support writing files with SELECT INTO OUTFILE on Windows it should be more smart than this!

I will attach files!
[3 Mar 2009 20:56] Peter Laursen
files from the experiment on server 6.0.9

Attachment: files.zip (application/zip, text), 502 bytes.

[3 Mar 2009 21:27] Peter Laursen
my advice: when saving files on Windows you should *force* utf8-encoding for the file - ALWAYS! It works with any language, anywhere, all data and all postWin98 applications (and is the best option for compatibility with MySQL).
[3 Mar 2009 21:42] Sveta Smirnova
Thank you for the report.

But files which you provided don't contain spaces:

$hexdump bug43337/ucs2text.txt 
0000000 e600 f800 e500 000a                    
0000007

$hexdump bug43337/utf16text.txt 
0000000 e600 f800 e500 000a                    
0000007

$hexdump bug43337/utf32text.txt 
0000000 e600 f800 e500 000a                    
0000007

$hexdump bug43337/utf8text.txt 
0000000 a6c3 b8c3 a5c3 000a                    
0000007

Most likely text editors show data incorrectly. Please check in hex editor. (For example, I use editor jEdit with plugin Hex Edit on Windows)
[3 Mar 2009 21:57] Peter Laursen
And it does not matter that the standard Windows text editor sees spaces?
[3 Mar 2009 21:58] Peter Laursen
ucs2-file as opened in Notepad

Attachment: notepad.jpg (image/jpeg, text), 6.58 KiB.

[3 Mar 2009 22:28] Peter Laursen
The discussion is about the usability of the output on Windows.  I cannot explain either why Linux 'hexdump' command reveals nothing. To me it looks as if your hexdump fails to return BOMs for the utf8 file. So I do not trust that command much for the other files as well!

We are discussing Windows and things should be tested on Windows!

Opening again of course!
[3 Mar 2009 23:38] Peter Laursen
I think I can explain it.

The output is 'big endian' byte order (as explained in another recent report of mine), but there are no BOMs.   Where there are no BOMs Windows *thinks* this is ANSI. But as the bytes are not an ANSI-pattern Windows cannot handle it.  It does handle generically *somehow* anyway and this is the (imperfect) result of it.

I cannot explain why the utf8 file is identified as such.  utf8 BOMs must be there.  But how were they created?  Maybe Windows is able to recognize the byte pattern as utf8 and will add BOMs.

Now .. it *simply does not make sense* to create such files on Windows.  Server and clients must interact with the OS as the OS requires it! SELECT INTO OUTFILE is only functional with non-unicode, utf8 (and probably utf8mb3) on WIndows.  So *either* this must be solved or an error should be returned with SELECT INTO OUTFILE and 'big endian byte-order' unicode.

The solution is in my opinion NOT to save the file as a binary copy of the datastore for unicode data.  As long as it is unicode, data can safely be converted between *uni-encodings*.  So dependent on the server OS (for the server as here) and the client OS (for the clients when a client saves a file and also if LOCAL option should be added to SELECT INTO OUTFILE) data should be saved as a file *uni-encoded* as the OS supports it.

You can say it is *not a bug* and is as per current server design.  
I would call it a *design bug* :-)

Simplest solution is always to save the output of SELECT INTO OUTFILE (at least for unicode data, but why not for all?) as utf8-encoded on Windows. This is the only *uni-encoding* on Windows compatible with MySQL.

You can do with this as you like (make it *not a bug* or a *feature request* - I will not reopen again!). I only ask that you ensure that the developers for whom this should be most concern (mr. 'bar' and mr. 'glub' I think) sees this and thinks it over!
[4 Mar 2009 0:10] Peter Laursen
... or maybe add a CHARSET option to SELECT INTO OUTFILE (just as it was added to LOAD DATA INFILE not so long ago).
[4 Mar 2009 6:59] Sveta Smirnova
Thank you for the feedback.

> And it does not matter that the standard Windows text editor sees spaces?

If files contains no spaces, but standard Windows text editor sees them probably this is problem of the text editor?

> I cannot explain either why Linux 'hexdump' command reveals nothing.

There are plenty hex editors for Windows. Feel free to use one of them.

> Simplest solution is always to save the output of SELECT INTO OUTFILE
> (at least for unicode data, but why not for all?) as utf8-encoded on
> Windows. This is the only *uni-encoding* on Windows compatible with
> MySQL.

Default collation for SELECT INTO OUTFILE is binary. I see this is reasonable, because file can be used for LOAD DATA INFILE with any character set specified. I see no reason to use utf8.

Regarding to "add BOM" this make no sense for binary character set. Also this will make files incomapatible to ones created on Linux.

> or maybe add a CHARSET option to SELECT INTO OUTFILE (just as it was
> added to LOAD DATA INFILE not so long ago).

I can set this as such a feature request though.
[4 Mar 2009 7:00] Sveta Smirnova
Verified as feature request "add a CHARSET option to SELECT INTO OUTFILE"
[4 Mar 2009 13:03] Peter Laursen
now as we are in the 'feature request' section an option to write BOMs ('BOMWRITE') would be very nice too.  It will improve usability on Windows (at least) a lot.

SELECT .. INTO [LOCAL] OUTFILE [CHARSET 'utf8' [BOMWRITE]] FROM ... 
.. would be perfect!
[12 Mar 2009 17:22] Sergei Golubchik
As a feature request this is mostly a duplicate of Bug#30946 (which adds CHARACTER SET clause to the SELECT ... INTO OUTFILE).

But "BOMMARK" isn't part of the Bug#30946, so I'm not closing this one as a dupe.