MySQL Bugs: #3906: CREATE TABLE does not save the Japanese table names correctly in UTF8.

Bug #3906	CREATE TABLE does not save the Japanese table names correctly in UTF8.
Submitted:	27 May 2004 4:40	Modified:	26 Jan 2006 2:51
Reporter:	Shuichi Tamagawa	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.0	OS:	Windows (Window XP Professional)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
When Japanese characters are used for table names in a SJIS environment, the table names are saved in UTF8 as described in the manual. However, it seems that some SJIS codes are converted to the same UTF8 code. Thus, in some cases new table creation fails with the "ERROR 1050 (42S01): Table '***' already exists" even though the table names are different in SJIS.

How to repeat:
mysql> CREATE TABLE 0x82A0(a char(1));
Query OK, 0 rows affected (0.16 sec)

mysql> CREATE TABLE 0x82A2(a char(1));
ERROR 1050 (42S01): Table '0xE3818145 already exists
mysql> CREATE TABLE 0x82A4(a char(1));
ERROR 1050 (42S01): Table '0xE3818145 already exists
mysql> CREATE TABLE 0x82A6(a char(1));
ERROR 1050 (42S01): Table '0xE3818145 already exists
mysql> CREATE TABLE 0x82A8(a char(1));
ERROR 1050 (42S01): Table '0xE3818145 already exists

*Replace "0x82A0 - 0x82A8" with the Japanese character of the SJIS code point.
*Replace "0xE3818145" with the Japanese characters of the SJIS code points.

Character set variables are as follows.

mysql> SHOW VARIABLES LIKE 'character_%';
+--------------------------+-------------------------+
| Variable_name            | Value                   |
+--------------------------+-------------------------+
| character_set_server     | sjis                    |
| character_set_system     | utf8                    |
| character_set_database   | sjis                    |
| character_set_client     | sjis                    |
| character_set_connection | sjis                    |
| character-sets-dir       | C:/mysql/share/charsets |
| character_set_results    | sjis                    |
+--------------------------+-------------------------+

Suggested fix:
Appearently after converting the character to UTF8, if the last 2 digits of the code are less than 'A0'in HEX, the character is treated as the same character as others. 

Example:
After the operation in "How to repeat", do the following operation.

mysql> create table 0x82BC(a char(1));	#0x82BC(SJIS)=0xE3819E(UTF8)
ERROR 1050 (42S01): Table '0xE3818145 already exists
mysql> create table 0x82BD(a char(1));	#0x82BD(SJIS)=0xE3819F(UTF8)
ERROR 1050 (42S01): Table '0xE3818145 already exists

mysql> create table 0x82BE(a char(1));	#0x82BE(SJIS)=0xE381A0(UTF8)
Query OK, 0 rows affected (0.15 sec)

mysql> create table 0x82BF(a char(1));	#0x82BF(SJIS)=0xE381A1(UTF8)
Query OK, 0 rows affected (0.17 sec)

mysql> create table 0x82C0(a char(1));  #0x82C0(SJIS)=0xE381A2(UTF8)
Query OK, 0 rows affected (0.14 sec)

mysql> create table 0x82C1(a char(1));	#0x82C1(SJIS)=0xE381A3(UTF8)
Query OK, 0 rows affected (0.15 sec)

*Replace "0x82BC - 0x82C1" with the Japanese character of SJIS code point.
*Replace "0xE3818145" with the Japanese characters of SJIS code points.

It is suggested that these codes are converted to each UTF8 code accordingly.

Shuichi, I could not reproduce this bug, and successfully created all these
tables simultaniously. Can you please try the follwing:
SELECT HEX(COVERT(x USING utf8));

where x in 0x82a0, 0x82a2, 0x82a4, etc. 

Does conversion return the same result for them, or different?

When I use the convert function, it returns the different resuts for these characters. But when I create tables, it looks these are treated as the same characters. See the attached file.

Note that these tables are created successfully in Version 4.1.18.

Test result - 4.1.1a-alpha-nt

Attachment: example4.1.1a.JPG (image/pjpeg, text), 73.52 KiB.

Test result - 4.0.18

Attachment: example4.0.18.JPG (image/pjpeg, text), 35.88 KiB.

Shuichi, It looks like Windows doesn't allow to create files
with these names at the same time. Can you please try to create
files with these names: 

0xE38182
0xE38184
0xE38186
0xE38188
0xE3818A

Does Windows allow to do it? Thank you!
Sorry, I cannot try it, I don't have a Windows box to try on.

Bar,

I think this issue can be explained as follows:

1.Windows File system uses Shift-JIS.
2.When MySQL set the file name '0xE38182' in UTF8, Windows interprets it in Shift-JIS.
3.In Shift-JIS '0xE38182' is interpreted as two characters, '0xE381' and '0x82'.
4.Since there are no corresponding character for code point '0x82' in Shift-JIS, it is expressed in another character(It looks '0x8145').
5.Since there are no corresponding character for code point from '0x80' to '0x9E' these codes are expressed in the same character, which result in file name confliction.

When I tried some other characters like '0xEFBC91', '0xEFBC92', '0xEFBCA1' it behaved in the same way. If my understanding is right, the file name of table data need to be set in Shift-JIS on Windows environment to avoid this problem.

I agree that it is wrong that Japanese names are
stored in UTF8 on a SJIS system. Using SJIS for
file system is not good too, we want table files
to be copied between boxes, including between boxes
with different localization. The only way is to store
names in unicode, but not in UTF8. It could be a 
pseudo character set which encodes ASCII as is, but
all national characters, including latin letters with
accents, Japanes, Cyrillic, etc, in their Unicode code
in textual representation:

For example:
ab12.frm       -> ab12, pure ASCII name
ab12%3062      -> ab12 + Japanese character corresponding to SJIS 0x82C0
ab12%00E5.frm  -> ab12 + LATIN LETTER A WITH RING ABOVE (Swedish letter)
ab12%0430.frm  -> ab12 + CYRILLIC LETTER C

What do you think?

see WL#1324

by the way, you cannot give % a special meaning, as old tables with % in the name will become unreadable (consider, e.g. CREATE TABLE `ab12%3062` (...))

Serge, what is your erstimation of the priority of this task?

Serge, what should I do with this bug?
Should i close it with a reference to the worklog item?

A relative problem happens on MAC OSX.
This indicates that it's currently not possible to use non-ASCII characters for
identifiers on Mac OS X, as the file system would mangle the file name. 

See
http://developer.apple.com/documentation/MacOSX/Conceptual/SystemOverview/FileSy
stem/chapter_8_section_6.html

We won't fix this problem in 4.1.
It's to be fixed in 5.0 or 5.1.

We should add a note about this problem into the manual.

See also bug#5566

This problem was fixed in 5.1 after pushing WL#1324 "table name to file name encoding".

We will not fix this problem in 4.1 and 5.0.

Need version number for the fix.

Fixed in 5.1.6.

P.S.

Please make sure "table name to file name encoding" is documenting
together with this bug. Stefan told he assigned it to Paul.

Documented in 5.1.6 changelog:

   <listitem>
        <para>
          Certain Japanese table names were not properly saved during a
          <literal>CREATE TABLE</literal> statement. (Bug #3906)
        </para>
      </listitem>

Will followup with Paul.