Bug #69021 Results with character set utf8mb4 do not convert correctly in classic ASP
Submitted: 21 Apr 2013 11:06 Modified: 1 May 2013 9:56
Reporter: S Peschier Email Updates:
Status: Verified Impact on me:
None 
Category:Connector / ODBC Severity:S3 (Non-critical)
Version:5.2.4 OS:Windows (7)
Assigned to: Bogdan Degtyariov CPU Architecture:Any
Tags: ASP, utf8mb4

[21 Apr 2013 11:06] S Peschier
Description:
For a project which uses supplementary characters (that lie outside the Basic Multilingual Plane (BMP)), I am trying to use utf8mb4 with classic ASP. But the supplementary characters (and non-supplementary characters too) are displayed as latin-1 characters *even though* other ways of displaying characters in the same page work just fine.

This is the output on my machine:

Variable_name			Value
character_set_client		utf8mb4
character_set_connection	utf8mb4
character_set_database		utf8mb4
character_set_filesystem	binary
character_set_results		utf8mb4
character_set_server		utf8mb4
character_set_system		utf8
collation_connection		utf8mb4_general_ci
collation_database		utf8mb4_unicode_ci
collation_server		utf8mb4_bin
version				5.5.28
version_comment			MySQL Community Server (GPL)
version_compile_machine		x86
version_compile_os		Win64

Hex		utf8Char	converted
C3A9		é		é
E5A4A7		大		大
F0A29C90	𢜐		𢜐

<hr />

How to repeat:
This code shows the problem.

Note that in the second part (below the <hr />) there is code which uses both UTF-16 and UTF-8 to show a supplementary character, which is displayed just fine.

 	<%
	Response.Charset = "utf-8" 
	Session.CodePage = 65001 ' 65001 = Unicode (UTF-8)

	' ODBC driver: version 5.2.4
	mySqlConnectStr = _
			"Provider=MSDASQL;" &_
			"Driver={MySQL ODBC 5.2w Driver};" &_
			"Server=localhost;" &_
			"Database=....;" &_
			"User=....;" &_
			"Password=....;" &_
			"Option=3;" &_
			"CharSet=utf8mb4"
	' CharSet: The character set to use for the connection
	%>

<!doctype html>
<html>

<head>
	<meta charset="utf-8">
	<style>
	html { font-family: arial; font-size: 13px; }
	table { margin-bottom:10px; }
	.zh { font-size:21px; }
	</style>
</head>

<body>
 	<%
	Set dbC = CreateObject("ADODB.Connection")
	dbC.Open mySqlConnectStr

' ### Setting characters sets:
' character_set_results: The character set used for returning query results such as result sets or error messages to the client.
	dbC.Execute("SET character_set_results = utf8mb4;")
' character_set_client: The character set for statements that arrive from the client.
' ucs2, utf16, and utf32 cannot be used as a client character set
	dbC.Execute("SET character_set_client = utf8mb4;")
' character_set_connection: The character set used for literals that do not have a character set introducer and for number-to-string conversion.
	dbC.Execute("SET character_set_connection = utf8mb4;")

' ### Show mySql settings:
	Set vRS=dbC.Execute("SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%' OR Variable_name LIKE 'version%';")
	Call RStable(vRS)
	vRS.Close

	dbC.Execute("DROP TABLE IF EXISTS test;")
	dbC.Execute("CREATE TEMPORARY TABLE test (x VARCHAR(4) CHARACTER SET utf8mb4 NOT NULL);")
	dbC.Execute("INSERT INTO test SELECT x'c3a9';")
	dbC.Execute("INSERT INTO test SELECT x'e5a4a7';")
	dbC.Execute("INSERT INTO test SELECT x'f0a29c90';")

	Set tRS=dbC.Execute("SELECT HEX(x) AS Hex, x AS utf8Char, CONVERT(x USING utf16) AS converted FROM test;")
	Call RStable(tRS)
	tRS.Close

	dbC.Execute("DROP TABLE test;")

	dbC.Close
	%>
<hr />
	<%
	' Dec: 141072
	' Hex: 0x22710
	' UTF-8: F0 A2 9C 90
	' UTF-16: D849 DF10

' Show a supplementary character using UTF-16
	Response.Write "<span class=zh>" & ChrW( &hD849 ) & ChrW( &hDF10 ) & "</span><br />"

' Show a supplementary character using UTF-8 (Note: in Little Endian order)
	Response.Write "<span class=zh>"
	Response.BinaryWrite ChrW(&HA2F0) & ChrW(&H909C)
	Response.Write "</span><br />"
	%>
</body>
</html>

	<%
Function RStable( byref bRS )
	Response.Write "<table border>"
	Response.Write "<tr>"
	For col=0 To bRS.Fields.Count - 1
		Response.Write "<th>" & bRS.Fields(col).Name & "</th>"
	Next
	Response.Write "</tr>"
	If Not bRS.EOF Then
		Do While Not bRS.EOF
			Response.Write "<tr>"
			For col=0 To bRS.Fields.Count - 1
				Response.Write "<td>" & bRS.Fields.Item(col).Value & "</td>"
			Next
			Response.Write "</tr>"
			bRS.MoveNext
		Loop
	Else
		Response.Write "<tr><td colspan=""" & bRS.Fields.Count & """><i>No records...</i></td></tr>"
	End If
	Response.Write "</table>"
End Function
	%>

Suggested fix:
Speaks for itself
[21 Apr 2013 11:09] S Peschier
After the bug was submitted I saw that the two supplementary characters that where in the bug report under 'This is the output on my machine:' disappeared.

This is the character that should have been there:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=22710
[23 Apr 2013 9:12] S Peschier
PS: because ASP string are internally UTF16(le), I also tried using other character sets in all kinds of variations. With some of these combinations the 2 and 3 byte utf8 character whould be shown fine, but the 4 byte utf8 characters was shown as a question mark. This was a ASCII question mark, not a browser question mark which is shown when a illegal character is encountered. So I guess it is the driver that converts the 4 byte utf8 character to a question mark.
[24 Apr 2013 15:02] S Peschier
This is what I get when I change charset in the connectionstring to utf16 (which probably makes more sense):

Hex		utf8Char	converted
C3A9		é		é
E5A4A7		大		大
F0A29C90	?		?

Again: the question marks are ascii question marks
[26 Apr 2013 7:14] Bogdan Degtyariov
Hello,

Thanks for the detailed explanation of the problem and for the ASP test case.
Please note that under no circumstances your application should set the character set for the connection or the results etc. It is always set by the driver at the connection time to UTF-8.
UTF-8 is used as a "transport" character set to communicate with the server.
So the data conversion normally goes similar to the following:

      UTF8MB4               UTF8                UTF8MB4
[ASP]<------->[ODBC Driver]<---->[MySQL Server]<------->[MySQL Table]

As you can see at both ends [ASP] and [MySQL Table] the data is in UFT8MB4.

Once again, the application should indicate the intended character set using the special option "...;CHARSET=UTF8MB4;.." and should not attempt to set any of the connection properties because it confuses the driver conversion functions.

Afetr commenting out queries like "SET character_set_...." in the test case everything worked perfectly:

' dbC.Execute("SET character_set_results = utf8mb4;")
' dbC.Execute("SET character_set_client = utf8mb4;")
' dbC.Execute("SET character_set_connection = utf8mb4;")

Here is the result I obtained:

Hex		utf8Char	converted
C3A9		é		é
E5A4A7		大		大
F0A29C90	𢜐		?

I am not sure about the last character with the code F0A29C90. None of my applications was able to display it in any other way than the question mark "?".

This is not a bug, please let me know if you think otherwise.
[29 Apr 2013 12:31] S Peschier
Hello Bogdan,

Thanks for your answer and for clearing up the character set settings. But I just added these to see if this would make the 4 byte utf character show up normally.

So yes, I still think this is a bug.

The 4 byte character is part of 'CJK Unified Ideographs Extension B'. See here:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=22710
And here:
http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=F0+A2+9C+90&mode=bytes
Normally I have no problem displaying the character on my Windows 7 PC.

Is it correct to assume that when the returned value in ASP is an ASCII question mark, that MySQL at some moment in time has decided that something is wrong with the 4 byte character and then replaces it with a question mark?

I also found out something else. This 4 byte character is the first 4 byte character in a text file I imported into mySQL (with: LOAD DATA INFILE '...' CHARACTER SET utf8mb4). This works fine. To check I export the data to another file (with: SELECT * INTO OUTFILE '...' CHARACTER SET utf8mb4), and when I open this in Notepad++ all 4 byte characters appear normally. But when I view the imported table in MySQL Workbench (5.2.47), the 4 byte characters are shown again as question marks, while the latin and 3 byte chinese character appear correct.
[1 May 2013 9:56] Bogdan Degtyariov
Thank you.
The bug is verified.