Bug #39449 | wrong encoding used when naming files and folders | ||
---|---|---|---|
Submitted: | 15 Sep 2008 10:58 | Modified: | 8 Apr 2009 7:18 |
Reporter: | Albert Rosenfield | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 5.0.67 | OS: | Windows (any) |
Assigned to: | CPU Architecture: | Any |
[15 Sep 2008 10:58]
Albert Rosenfield
[15 Sep 2008 19:46]
Sveta Smirnova
Thank you for the report. Please provide output of SHOW VARIABLES LIKE '%char%' and SHOW VARIABLES LIKE '%col%': I want to check which client collation you use. Also please provide output of `chcp` command from cmd.exe
[15 Oct 2008 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[30 Nov 2008 19:28]
Albert Rosenfield
The filesystem and system settings are those as automatically detected by MySQL(not modified by config file). The database/table/field settings are set by this my.cnf setting: default-character-set=latin1 default-collation=latin1_danish_ci Here are char% and col%: character_set_client;utf8 character_set_connection;utf8 character_set_database;latin1 character_set_filesystem;binary character_set_results;utf8 character_set_server;latin1 character_set_system;utf8 collation_connection;utf8_general_ci collation_database;latin1_danish_ci collation_server;latin1_danish_ci
[12 Dec 2008 14:53]
Susanne Ebrecht
Many thanks for writing a bug report. Windows filesystem is using encoding that you can figure out by using a command prompt and type: CHCP You have to set the given codepage as client encoding for you MySQL Client to get the data displayed. As example: CHCP result is codepage 852. So you first have to use: SET NAMES CP852; After that you can use: SHOW DATABASES; and all will be displayed correct.
[17 Feb 2009 11:49]
Albert Rosenfield
> Many thanks for writing a bug report. You're welcome. > Windows filesystem is using encoding that you > can figure out by using a command prompt and > type: CHCP Incorrect. NTFS filenames are always encoded using UTF-16 in little endian byte order. > You have to set the given codepage as client > encoding for you MySQL Client to get the data > displayed. This is a bug. SET NAMES is supposed to change the character set that the client and the server uses to communicate. SET NAMES should _not_ magically cause the underlying data to change. Whenever this happens, it's a bug. > As example: > CHCP result is codepage 852. > > So you first have to use: > SET NAMES CP852; > > After that you can use: > SHOW DATABASES; > > and all will be displayed correct. This is a bug. The underlying data from the database server should not change with SET NAMES. Also, as you can clearly see by starting Windows Explorer and going to where MySQL Server keeps it's databases, the encoding used for storing directories and filenames is simply wrong.
[17 Feb 2009 14:29]
Sveta Smirnova
Thank you for the feedback. Please provide output of chcp command.
[18 Mar 2009 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[25 Mar 2009 10:13]
Albert Rosenfield
Here you go. ============== C:\>chcp Active code page: 1252 ============== Completely irrelevant though, since filenames in the NTFS filesystem are always encoded in the same character set, namely UCS2.
[25 Mar 2009 10:34]
Sveta Smirnova
Thank you for the feedback. Unfortunately this is relevant. This is limitation of Windows command line client. You have to manage MySQL databases using client API and don't create them in filesystem if you want to use any non-English character in database name.
[25 Mar 2009 15:01]
Albert Rosenfield
> Unfortunately this is relevant. > This is limitation of Windows command line client. Not true, because I wasn't using the command line client to begin with. The bug is clearly in the server (which creates the databases and tables in the filesystem), not the client. > You have to manage MySQL databases using client API and don't > create them in filesystem if you want to use any non-English > character in database name. Irrelevant, since the server stores filenames using the wrong character set in either case.
[25 Mar 2009 18:26]
Sveta Smirnova
Thank you for the feedback. > Not true, because I wasn't using the command line client to begin with. What did you use? Creating databases using Windows Explorer is not proper way anyway. > Irrelevant, since the server stores filenames using the wrong character set in either case. This is not so easy. Windows dynamically changes between ANSI and Unicode. See bug #43273 for details.
[25 Mar 2009 20:44]
Albert Rosenfield
> What did you use? One of the GUI clients. Can't remember which one, but I'm always using a GUI client. The CLI stuff is for bug reports ;-). > Creating databases using Windows Explorer is not proper way anyway. It's exceedingly easy to make MySQL bork itself with this bug using just SQL. Using Windows Explorer makes it easier since you're skipping a step, but it's by no means a necessary tool. The original problem description does not include Windows Explorer. > This is not so easy. Windows dynamically changes > between ANSI and Unicode. See bug #43273 for details. Ah, so the issue is lack of Windows API know-how. Should be easy to fill that in? Microsoft has supported UCS2 (= simplified UTF-16) since Windows 95. In a matter of a few Windows versions, every kernel API function was gutted and replaced with an UCS2 version. A wrapper was made for each function so that old programs would continue to work. Sort of. The wrapper uses an ANSI character set, so all characters that are not representable are basically screwed, which is probably the random effect you're talking about. The trick to making things work is always calling the native APIs. For listing files in a directory, the function name to use is FindFirstFileW() rather than FindFirstFile(). FindFirstFile will map to FindFirstFileA if a specific compiler #define is not given, and FindFirstFileA() is just a wrapper that mangles the characters as best it can, and then forwards the call to FindFirstFileW(). So, instead of: GetProcAddress("FindFirstFile"); Use: GetProcAddress("FindFirstFileW"); because the first variant only works for ASCII (or ANSI if you're having a really lucky day). All that's left now is to convert the return value, which is always a wchar_t when using the native W versions, to MySQL's internal representation. Assuming that MySQL uses UTF-8, either an internal UCS2->UTF8 function could be used, or the Windows API one: WideCharToMultiByte(CP_UTF8, 0, wcFileName, -1, utf8Buffer, ...) And that's it. There's also a UTF8->UCS2 function in the Windows API, for converting input parameters from UTF-8: MultiByteToWideChar(CP_UTF8, 0, uft8FileName, -1, wcBuffer, ...) -- (As a sidenote: the codepage returned by CHCP is not the ACP (ANSI code page) which the non-native API wrappers use. It's the OEMCP that is shown by CHCP. These are different and may very well be incompatible, as an example here's from a machine where I work: [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Codepage] "ACP"="1252" "OEMCP"="437" "MACCP"="10000" )
[25 Mar 2009 21:23]
Sveta Smirnova
Thank you for the feedback. Regarding to GUI Tool we need repeatable test case with ANSI names and appropriate client encoding, so we can repeat the bug. Regarding second question it is exactly what bug #43273 is about. If you wish I can mark this one as duplicate of bug #43273. Anyway I'll add a link to your comment to bug #43273.
[25 Mar 2009 21:45]
Albert Rosenfield
The test case is right here, in the original report.
[25 Mar 2009 22:08]
Sveta Smirnova
Thank you for the feedback. I can verify this as feature request "Please support UTF database names on Windows" or even "Please fully support UTF on Windows". Although technically this is duplicate of bug #43273 I'll left them separate.
[26 Mar 2009 13:43]
Albert Rosenfield
> duplicate This bug is several months older, so technically #43273 is a duplicate of this one ;-). > feature request MySQLd is treating the filesystem incorrectly. With all due respect, that seems to fit the definition of "bug" more than "feature request". That said, classify it as you see fit. I can't force you to fix things... > need repeatable test case At this point, I'm not sure what would be helpful. I can write a dozen more test cases that demonstrates in different ways how the server breaks because it doesn't understand that the filesystem names are UCS2. That won't change the underlying cause, which has already been identified. I can appreciate that someone spent time writing code that would have to be removed for this to be fixed. Namely, all the bizarre identifier rewriting documented in the 5.1 manual under "identifier-mapping" would be (mostly? completely?) unnecessary if the server was fixed to name files and folders with the correct encoding used by the filesystem. It hurts to remove code, but sometimes it's the right thing to do.
[2 Apr 2009 0:29]
Paul DuBois
I've added a note to http://dev.mysql.com/doc/refman/5.1/en/windows-vs-unix.html: Directory and file names On Windows, MySQL Server supports only directory and file names that are compatible with the current ANSI code pages. For example, the following Japanese directory name will not work in the Western locale (code page 1252): datadir="C:/维基百科关于中文维基百科" The same limitation applies to directory and file names referred to in SQL statements, such as the data file path name in LOAD DATA INFILE. See also Bug#43273 and Bug#43184.
[8 Apr 2009 7:18]
Alexander Barkov
The originally reported problem was fixed in MySQL-5.1, with so called "table name to file name" encoding. The other information can be useful when fixing the problem reported in Bug#43184 LOAD DATA fails wiht unicode file paths Marking this bug as a duplicate for bug#43184.