MySQL Bugs: #39449: wrong encoding used when naming files and folders

Bug #39449	wrong encoding used when naming files and folders
Submitted:	15 Sep 2008 10:58	Modified:	8 Apr 2009 7:18
Reporter:	Albert Rosenfield	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.0.67	OS:	Windows (any)
Assigned to:		CPU Architecture:	Any

Description:
Created a database by adding a folder 'æøå' to the mysql data directory on Windows Vista.

Using SHOW DATABASES, MySQL 5.0.67 now reports an inaccessible database named '' (blank string) instead of 'æøå'.

Went to Google to look for fixes and research a bit and found out the following, which is hopefully precise.

======== background info begin ========
Filesystem drivers normally have a binary encoding scheme, that is they do not care about or convert between character sets. NTFS stores filenames in arrays of 16-bit values, other filesystems use arrays of 8-bit values.

Above the filesystem is the kernel API. In MS-Windows, the encoding scheme when calling the normal kernel API is UTF-16LE. The Windows kernel API also provides wrappers that convert from a specific ANSI codepage to UTF-16LE.

In Linux, the official encoding scheme when calling the kernel API is binary, but applications generally expect data transferred to/from the kernel to be in the encoding scheme given by the environment variable LC_CTYPE. The default LC_TYPE setting is specific to the Linux distribution used, but in the Western world is often iso-8859-1 or utf-8.

Lack of kernel support in Linux is problematic. For example when a running Linux box has mounted both an NTFS and an ext2 filesystem.

Environment variables are specific to a running process, so the NTFS driver cannot usually see what the process expects from it and automatically convert to/from the UTF-16 characters stored on disk. Therefore the user has to manually switch the LC_CTYPE variable before (s)he runs a program, and restrict running programs to access one of the volumes at any time.

It's hopefully reasonable to expect the Linux stuff to be cleaned up at some point..

MySQL has two variables, character_set_system and character_set_filesystem, that instructs MySQL how it should talk to the underlying operating system.
======== background info end ========

First tried to set character_set_system as it seems the broader of the two. Didn't work, the server refuses to start, apparently the variable can not be set from my.ini.

Then tried setting character_set_filesystem. MySQL does not support UTF-16LE, but a character set named 'ucs2' is supported. The relevant Unicode specifications say that UCS-2 with no endianess specified has host endianess, so being on x64 this would mean that UCS-2 is equivalent to UCS-2LE. And UCS-2LE is equivalent to UTF-16LE for the basic characters, those that can be written using only one 16-bit value.

Setting character_set_filesystem=ucs2 did not help either, the database still shows up as blank.
Also, creating the database using CREATE DATABASE `æøå` causes a folder named 'Ã¦Ã¸Ã¥' to be created.

How to repeat:
CREATE DATABASE `æøå`.

Suggested fix:
Calling the MS-Windows kernel APIs with correctly encoded UCS-2 data.

(Asides from correcting the mapping between filesystem names and SQL names, this could in theory also have security benefits. Allowing SQL users to feed incorrectly encoded data into the kernel could be a security issue.)

Thank you for the report.

Please provide output of SHOW VARIABLES LIKE '%char%' and SHOW VARIABLES LIKE '%col%': I want to check which client collation you use. Also please provide output of `chcp` command from cmd.exe

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

The filesystem and system settings are those as automatically detected by MySQL(not modified by config file).  The database/table/field settings are set by this my.cnf setting:

  default-character-set=latin1
  default-collation=latin1_danish_ci

Here are char% and col%:
  character_set_client;utf8
  character_set_connection;utf8
  character_set_database;latin1
  character_set_filesystem;binary
  character_set_results;utf8
  character_set_server;latin1
  character_set_system;utf8
  collation_connection;utf8_general_ci
  collation_database;latin1_danish_ci
  collation_server;latin1_danish_ci

Many thanks for writing a bug report.

Windows filesystem is using encoding that you can figure out by using a command prompt and type: CHCP

You have to set the given codepage as client encoding for you MySQL Client to get the data displayed.

As example:
CHCP result is codepage 852.

So you first have to use:
SET NAMES CP852;

After that you can use:
SHOW DATABASES;

and all will be displayed correct.

> Many thanks for writing a bug report.

You're welcome.

> Windows filesystem is using encoding that you
> can figure out by using a command prompt and
> type: CHCP

Incorrect.  NTFS filenames are always encoded
using UTF-16 in little endian byte order.

> You have to set the given codepage as client
> encoding for you MySQL Client to get the data
> displayed.

This is a bug.

SET NAMES is supposed to change the character set
that the client and the server uses to communicate.

SET NAMES should _not_ magically cause the underlying
data to change.  Whenever this happens, it's a bug.

> As example:
> CHCP result is codepage 852.
>
> So you first have to use:
> SET NAMES CP852;
>
> After that you can use:
> SHOW DATABASES;
>
> and all will be displayed correct.

This is a bug.  The underlying data from the database
server should not change with SET NAMES.

Also, as you can clearly see by starting Windows Explorer
and going to where MySQL Server keeps it's databases, the
encoding used for storing directories and filenames is
simply wrong.

Thank you for the feedback.

Please provide output of chcp command.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Here you go.

==============
C:\>chcp
Active code page: 1252
==============

Completely irrelevant though, since filenames in the NTFS filesystem are always encoded in the same character set, namely UCS2.

Thank you for the feedback.

Unfortunately this is relevant. This is limitation of Windows command line client. You have to manage MySQL databases using client API and don't create them in filesystem if you want to use any non-English character in database name.

> Unfortunately this is relevant.
> This is limitation of Windows command line client.

Not true, because I wasn't using the command line client to begin with.

The bug is clearly in the server (which creates the databases and tables in the filesystem), not the client.

> You have to manage MySQL databases using client API and don't
> create them in filesystem if you want to use any non-English
> character in database name.

Irrelevant, since the server stores filenames using the wrong character set in either case.

Thank you for the feedback.

> Not true, because I wasn't using the command line client to begin with.

What did you use? Creating databases using Windows Explorer is not proper way anyway.

> Irrelevant, since the server stores filenames using the wrong character
set in either case.

This is not so easy. Windows dynamically changes between ANSI and Unicode. See bug #43273 for details.

> What did you use?

One of the GUI clients.  Can't remember which one, but I'm always using a GUI client.  The CLI stuff is for bug reports ;-).

> Creating databases using Windows Explorer is not proper way anyway.

It's exceedingly easy to make MySQL bork itself with this bug using just SQL.  Using Windows Explorer makes it easier since you're skipping a step, but it's by no means a necessary tool.  The original problem description does not include Windows Explorer.

> This is not so easy. Windows dynamically changes
> between ANSI and Unicode. See bug #43273 for details.

Ah, so the issue is lack of Windows API know-how.  Should be easy to fill that in?  Microsoft has supported UCS2 (= simplified UTF-16) since Windows 95.  In a matter of a few Windows versions, every kernel API function was gutted and replaced with an UCS2 version.

A wrapper was made for each function so that old programs would continue to work.  Sort of.  The wrapper uses an ANSI character set, so all characters that are not representable are basically screwed, which is probably the random effect you're talking about.

The trick to making things work is always calling the native APIs.  For listing files in a directory, the function name to use is FindFirstFileW() rather than FindFirstFile().

FindFirstFile will map to FindFirstFileA if a specific compiler #define is not given, and FindFirstFileA() is just a wrapper that mangles the characters as best it can, and then forwards the call to FindFirstFileW().

So, instead of:
  GetProcAddress("FindFirstFile");

Use:
  GetProcAddress("FindFirstFileW");

because the first variant only works for ASCII (or ANSI if you're having a really lucky day).

All that's left now is to convert the return value, which is always a wchar_t when using the native W versions, to MySQL's internal representation.

Assuming that MySQL uses UTF-8, either an internal UCS2->UTF8 function could be used, or the Windows API one:

WideCharToMultiByte(CP_UTF8, 0, wcFileName, -1, utf8Buffer, ...)

And that's it.

There's also a UTF8->UCS2 function in the Windows API, for converting input parameters from UTF-8:

MultiByteToWideChar(CP_UTF8, 0, uft8FileName, -1, wcBuffer, ...)

--

(As a sidenote: the codepage returned by CHCP is not the ACP (ANSI code page) which the non-native API wrappers use.  It's the OEMCP that is shown by CHCP.  These are different and may very well be incompatible, as an example here's from a machine where I work:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Codepage]
"ACP"="1252"
"OEMCP"="437"
"MACCP"="10000"
)

Thank you for the feedback.

Regarding to GUI Tool we need repeatable test case with ANSI names and appropriate client encoding, so we can repeat the bug.

Regarding second question it is exactly what  bug #43273 is about. If you wish I can mark this one as duplicate of  bug #43273. Anyway I'll add a link to your comment to  bug #43273.

The test case is right here, in the original report.

Thank you for the feedback.

I can verify this as feature request "Please support UTF database names on Windows" or even "Please fully support UTF on Windows". Although technically this is duplicate of bug #43273 I'll left them separate.

> duplicate

This bug is several months older, so technically #43273 is a duplicate of this one ;-).

> feature request

MySQLd is treating the filesystem incorrectly.  With all due respect, that seems to fit the definition of "bug" more than "feature request".

That said, classify it as you see fit.  I can't force you to fix things...

> need repeatable test case

At this point, I'm not sure what would be helpful.  I can write a dozen more test cases that demonstrates in different ways how the server breaks because it doesn't understand that the filesystem names are UCS2.  That won't change the underlying cause, which has already been identified.

I can appreciate that someone spent time writing code that would have to be removed for this to be fixed.  Namely, all the bizarre identifier rewriting documented in the 5.1 manual under "identifier-mapping" would be (mostly? completely?) unnecessary if the server was fixed to name files and folders with the correct encoding used by the filesystem.  It hurts to remove code, but sometimes it's the right thing to do.

I've added a note to http://dev.mysql.com/doc/refman/5.1/en/windows-vs-unix.html:

Directory and file names

On Windows, MySQL Server supports only directory and file names that are compatible with the current ANSI code pages. For example, the following Japanese directory name will not work in the Western locale (code page 1252):

datadir="C:/维基百科关于中文维基百科"
The same limitation applies to directory and file names referred to in SQL statements, such as the data file path name in LOAD DATA INFILE.

See also Bug#43273 and Bug#43184.

The originally reported problem was fixed in MySQL-5.1, with so called
"table name to file name" encoding. 

The other information can be useful when fixing the problem reported in

Bug#43184 LOAD DATA fails wiht unicode file paths

Marking this bug as a duplicate for bug#43184.