Bug #69021 | Results with character set utf8mb4 do not convert correctly in classic ASP | ||
---|---|---|---|
Submitted: | 21 Apr 2013 11:06 | Modified: | 1 May 2013 9:56 |
Reporter: | S Peschier | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | Connector / ODBC | Severity: | S3 (Non-critical) |
Version: | 5.2.4 | OS: | Windows (7) |
Assigned to: | Bogdan Degtyariov | CPU Architecture: | Any |
Tags: | ASP, utf8mb4 |
[21 Apr 2013 11:06]
S Peschier
[21 Apr 2013 11:09]
S Peschier
After the bug was submitted I saw that the two supplementary characters that were in the bug report under 'This is the output on my machine:' had disappeared. This is the character that should have been there: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=22710
[23 Apr 2013 9:12]
S Peschier
PS: because ASP strings are internally UTF-16 (LE), I also tried other character sets in all kinds of variations. With some of these combinations the 2- and 3-byte UTF-8 characters were shown fine, but the 4-byte UTF-8 character was shown as a question mark. This was an ASCII question mark, not the browser question mark that is shown when an illegal character is encountered. So I guess it is the driver that converts the 4-byte UTF-8 character to a question mark.
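The distinction drawn above between an ASCII question mark and the browser's replacement glyph can be illustrated with a small Python sketch. It is only an analogy: `latin-1` is a stand-in for "any character set that cannot represent the character" and is an assumption, not the conversion Connector/ODBC actually performs.

```python
# Sketch only: Python codecs stand in for whatever conversion the driver does.
CHAR = "\U00022710"  # the supplementary character from this report

# Lossy ENCODING into a charset that lacks the character substitutes
# a plain ASCII question mark (byte 0x3F).
lossy = CHAR.encode("latin-1", errors="replace")

# Lossy DECODING of a truncated UTF-8 sequence instead yields U+FFFD,
# the replacement character that browsers draw as a boxed question mark.
broken = b"\xf0\xa2\x9c".decode("utf-8", errors="replace")

print(lossy.hex())                           # hex of the substituted byte
print(["U+%04X" % ord(c) for c in broken])   # code points after decoding
```

An ASCII `?` in the result therefore suggests that some layer converted the text into a character set without supplementary characters, rather than that the bytes arrived corrupted.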
[24 Apr 2013 15:02]
S Peschier
This is what I get when I change charset in the connection string to utf16 (which probably makes more sense):

Hex | utf8Char | converted
---|---|---
C3A9 | é | é
E5A4A7 | 大 | 大
F0A29C90 | ? | ?

Again: the question marks are ASCII question marks.
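For reference, the three hex values in the output above can be decoded independently of ASP and the driver. The following Python sketch (the function name `describe` is hypothetical) shows the code point behind each byte sequence and how many UTF-16 code units it needs, which is where the 4-byte case differs from the other two.

```python
# Hex values taken from the "Hex" column of the output above.
SAMPLES = ["C3A9", "E5A4A7", "F0A29C90"]

def describe(hex_utf8):
    """Decode one UTF-8 byte sequence and report its UTF-16LE form."""
    raw = bytes.fromhex(hex_utf8)
    text = raw.decode("utf-8")          # each sample is a single character
    utf16 = text.encode("utf-16-le")
    return {
        "code_point": "U+%04X" % ord(text),
        "utf8_len": len(raw),
        "utf16_units": len(utf16) // 2,  # 2 means a surrogate pair
        "utf16le_hex": utf16.hex().upper(),
    }

for sample in SAMPLES:
    print(sample, describe(sample))
```

Only `F0A29C90` lies outside the Basic Multilingual Plane, so it is the only sample that needs a surrogate pair in a UTF-16 string.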
[26 Apr 2013 7:14]
Bogdan Degtyariov
Hello,

Thanks for the detailed explanation of the problem and for the ASP test case.

Please note that under no circumstances should your application set the character set for the connection, the results, etc. The driver always sets it to UTF-8 at connection time. UTF-8 is used as a "transport" character set to communicate with the server. So the data conversion normally goes similar to the following:

          UTF8MB4               UTF8                UTF8MB4
    [ASP]<------->[ODBC Driver]<---->[MySQL Server]<------->[MySQL Table]

As you can see, at both ends ([ASP] and [MySQL Table]) the data is in UTF8MB4. Once again, the application should indicate the intended character set using the special option "...;CHARSET=UTF8MB4;.." and should not attempt to set any of the connection properties, because that confuses the driver's conversion functions.

After commenting out queries like "SET character_set_...." in the test case everything worked perfectly:

    ' dbC.Execute("SET character_set_results = utf8mb4;")
    ' dbC.Execute("SET character_set_client = utf8mb4;")
    ' dbC.Execute("SET character_set_connection = utf8mb4;")

Here is the result I obtained:

Hex | utf8Char | converted
---|---|---
C3A9 | é | é
E5A4A7 | 大 | 大
F0A29C90 | 𢜠 | ?

I am not sure about the last character with the code F0A29C90. None of my applications was able to display it as anything other than the question mark "?".

This is not a bug; please let me know if you think otherwise.
[29 Apr 2013 12:31]
S Peschier
Hello Bogdan,

Thanks for your answer and for clearing up the character set settings. I only added those SET statements to see whether they would make the 4-byte UTF-8 character show up normally. So yes, I still think this is a bug.

The 4-byte character is part of 'CJK Unified Ideographs Extension B'. See here: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=22710 And here: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=F0+A2+9C+90&mode=bytes Normally I have no problem displaying the character on my Windows 7 PC.

Is it correct to assume that when the value returned in ASP is an ASCII question mark, MySQL has at some point decided that something is wrong with the 4-byte character and replaced it with a question mark?

I also found out something else. This 4-byte character is the first 4-byte character in a text file I imported into MySQL (with: LOAD DATA INFILE '...' CHARACTER SET utf8mb4). This works fine. To check, I exported the data to another file (with: SELECT * INTO OUTFILE '...' CHARACTER SET utf8mb4), and when I open this file in Notepad++ all 4-byte characters appear normally. But when I view the imported table in MySQL Workbench (5.2.47), the 4-byte characters are again shown as question marks, while the Latin and 3-byte Chinese characters appear correctly.
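The export check described above (opening the file and looking for the 4-byte characters by eye) could also be done programmatically. This is a minimal Python sketch under stated assumptions: the helper `supplementary_chars` and the sample row are hypothetical, and the sample string merely stands in for one line of the exported file.

```python
def supplementary_chars(text):
    """Return (index, code point) for every character above U+FFFF,
    i.e. every character that takes 4 bytes in UTF-8."""
    return [(i, "U+%X" % ord(ch)) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

# Hypothetical tab-separated row containing the three test characters.
sample = "\u00e9\t\u5927\t\U00022710\n"

# Round-trip through UTF-8 bytes, as writing and re-reading a file would.
round_tripped = sample.encode("utf-8").decode("utf-8")

print(supplementary_chars(round_tripped))
```

Running the same helper on the decoded contents of the real export would list any supplementary characters that survived; an empty list together with ASCII `?` characters in their place would point to a lossy conversion before the data was written.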
[1 May 2013 9:56]
Bogdan Degtyariov
Thank you. The bug is verified.