Bug #43184 LOAD DATA fails wiht unicode file paths
Submitted: 25 Feb 2009 12:31 Modified: 11 Mar 2009 7:37
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.1.31 OS:Windows (Vista 32 bit)
Assigned to: Assigned Account CPU Architecture:Any
Tags: qc

[25 Feb 2009 12:31] Peter Laursen
Description:
In windows you can have file paths using unicode characters 'outside' the machine LOCALE setting. I have a Danish LOCALE setting and a Chinese file path.

LOAD DATA fails with such path!

How to repeat:
CREATE TABLE `tablename1` (              
              `id` int(11) NOT NULL AUTO_INCREMENT,  
              `bluh` varchar(50) DEFAULT NULL,       
              PRIMARY KEY (`id`),                    
              KEY `blahbluh` (`bluh`)                
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

load data local infile 'C:\\维基百科关于中文维基百科\\test.csv' into table `test`.`tablename1` fields escaped by '\\' terminated by '\t' lines terminated by '\n' (`id`, `bluh`);

returns:

Error Code : 2
File 'C:\维基百科关于中文维基百科\test.csv' not found (Errcode: 2)

File is there and reads:

"1","a"
"2","b"

I can open from Windows explorer, Notepad and other programs without problems!

Suggested fix:
Not sure how to fix (in technical terms).
[25 Feb 2009 12:38] Peter Laursen
Not important for the discussion but anyway the correct statement for this particular file would be instead

load data local infile 'C:\\维基百科关于中文维基百科\\test.csv' into table `test`.`tablename1` fields escaped by '\\' terminated by ',' enclosed by '"' lines terminated by '\r\n' (`id`, `bluh`);
[25 Feb 2009 19:00] Peter Laursen
I originally marked this a a server problem. That may be wrong. Now I think it is rather an issue with the client API.

I also forgot to tell that as I do not see any way to test this with command line on Windows (cmd.exe), I did with libmysql-based clients (Query Browser and SQLyog).
[27 Feb 2009 23:05] Peter Laursen
related:
http://bugs.mysql.com/bug.php?id=43273
[28 Feb 2009 17:39] Sveta Smirnova
Thank you for the report.

Afaik Windows uses UCS-2 for its filenames. Could you please create text document with appropriate encoding, write SET NAMES ucs2; and LOAD DATA statements there and try to load it with help of cmd.exe?
[28 Feb 2009 17:58] Peter Laursen
wrong!

Windows uses 'little endian' UFT16 unicode implementation! UCS2 was used by the 'unicode-layer' of Win98/ME.  But since Win 2K++ it is UTF16 (difference is that characters outside Basic Multilingual Plane' is suppported.

I will do as you suggest later tonight, but I bet it won't help!  It is not the folder/file *content* but the folder/file *name* we are disucssing!  Problem is that no MySQL programs use BOMs (Byte Order Marks) for detecting string encodings.  Doing this is simply required on Windows because Windows uses ANSI/Uncode dynamically for strings handled internally!

Problem here is *file system access* - not *data*!
[28 Feb 2009 18:10] Sveta Smirnova
Peter,

thank you for the feedback. I agree problem is the name, but MySQL should handle it correctly if correct client character set used. Please see also comments "[16 Jan 16:09] Alexander Barkov" and "[16 Jan 16:12] Alexander Barkov" to bug #31688 for details.
[28 Feb 2009 19:44] Peter Laursen
how should I be able to do this?

set character_set_client = ucs2
... returns "Error Code 1231: Variable 'character_set_client' can't be set to the value of 'ucs2'". The support for ucs2 for client character set was removed from the server some time ago! 

Also note that I can have a folder like 'myрусскийvery汉语漢語ownहिंदी' (with latin, cyrilic, chinese and hindi charaters all mixed together). On utf8 systems (Linux etc.) that would be no problem.  But is is with MySQL on Windows (and it is not with native Windows programs - whether from Microsoft or other vendors!)

Also changing server configuration (as I understand Alexanders comments) should not be required! Consider: I have a zipped archive with a hierachy of CSV files from a Chinese customer.  All file and folder names are Chinese! How do I LOAD those DATA files on *my* system? 

Note that this report is mostly related to libmysql (and of course the new C++ connector as well).  How do I write a client program with libmysql that lets me LOAD DATA from all files on a Windows system no matter what language is used for naming files and folders and no matter system LOCALE setting and MySQL configuration? (Tomorrow I will have data from a customer in Indonesia that I will LOAD with same program on another system with other settings and so on!). 

(and FYI: I happen to know that in areas of the world where it is common to 'mix' languages (HongKong = English+Chinese, India = English+Hindi+a local language) this is a serious defect with MySQL)
[28 Feb 2009 19:53] Peter Laursen
So in short the culprit is that on Windows multilingalism should be supported as smooth as it is on utf8 systems (as Linux)!

When it is not the case it is not a problem with Windows.  It is a problem with MySQL softwares!
[28 Feb 2009 19:59] Peter Laursen
To pin it out: I may not be in control of server configuration.  I am using a client!  I do not know where the server is - it may be Brazil, Greenland or the Moon.  I do not care as long as the client does the job for me!
[28 Feb 2009 20:22] Sergei Golubchik
It is possible to work around this problem on the client. You can use mysql_set_local_infile_handler() to implement your handling of filenames in load data (and if the name comes garbled from the server, which can be the case, you can extract it from the SQL on the client too).

It isn't nice, I agree, but doable.
[28 Feb 2009 21:54] Peter Laursen
ok thanks @Sergei .. that I will have to discuss in detail with more 'long-haired' members of our development team!

But please tell: does not Windows return BOM information that could be used by the client automatically?
[2 Mar 2009 13:52] Vladislav Vaintroub
I think the problem cannot be solved as long as MySQL (client in this case) will support Unicode file names (CreateFileW, _wopen, _wfopen and friends). It does not today and a fix sounds like a major revamp of MySQL guts.
[2 Mar 2009 14:10] Peter Laursen
?? 'it cannot be solved' *BECAUSE* 'it is a 'major revamp' ?? 

This is a matter of priorities only!  The word BECAUSE is nonsense here!
[2 Mar 2009 14:26] Sergei Golubchik
Read it again, Nobody said that "it cannot be solved because it is a major revamp" :)
[2 Mar 2009 14:38] Peter Laursen
I am sorry but I still read it like I did in the first place. :-) :-)

But you are doing a *procedural* mistake.  Currently we are not discussing if, how or when this can be fixed.  We are waiting for the report to be *verified* or not - simply!

Once it is verifed (only), the other discussion starts.

So I understand that we agree that MySQL softwares (servers and clients) fail to understand file paths returned by windows if no *codepage* ('ANSI' in Windows terminology) but instead *unicode* is used by Windows?  That is the question now!
[11 Mar 2009 7:37] Sveta Smirnova
Thank you for the feedback.

Verified as described: there is no way to use Unicode file names on Windows.
[2 Apr 2009 0:29] Paul DuBois
I've added a note to http://dev.mysql.com/doc/refman/5.1/en/windows-vs-unix.html:

Directory and file names

On Windows, MySQL Server supports only directory and file names that are compatible with the current ANSI code pages. For example, the following Japanese directory name will not work in the Western locale (code page 1252):

datadir="C:/维基百科关于中文维基百科"
The same limitation applies to directory and file names referred to in SQL statements, such as the data file path name in LOAD DATA INFILE.

See also Bug#43273 and Bug#39449.
[8 Apr 2009 7:12] Alexander Barkov
Bug#43273 has been marked as duplicate for this one.
[8 Apr 2009 7:19] Alexander Barkov
Bug#39449 was marked as duplicate for this one.