Description:
The ODBC driver receives 4-byte UTF-8 characters from the server and erroneously passes them to my_charset_utf8_handler, which can only convert UTF-8 sequences up to 3 bytes and returns question marks instead of the actual characters.
I do not see any way to change this behavior through configuration or the ODBC connection string, even though the driver code is fully capable of handling these characters correctly; it simply fails to use the right functionality.
At the code level, copy_wchar_result uses this line to convert from a 32-bit character `wc` to a UTF-8 character `u8`:
to_cnvres= (*wc_mb)(utf8_charset_info, wc, u8, u8 + sizeof(u8));
This works for 1, 2 and 3-byte UTF-8 sequences, but fails for 4-byte sequences. Looking closer at utf8_charset_info, its cset data member points to my_charset_utf8_handler, which is not capable of handling 4-byte UTF-8 sequences, instead of my_charset_utf8mb4_handler, which is.
utf8_charset_info is initialized in myodbc_init and appears to be hard-coded for "utf8". At least I don't see any way to make it select the handler based on the actual character set and collation that is coming from the database.
If I replace utf8_charset_info.cset to point to &my_charset_utf8mb4_handler in the debugger, then everything works and a proper UTF-16 surrogate pair is returned from SQLFetch, so the functionality is there and is just not set up correctly to handle data coming from the server.
This prevents any application using ODBC from retrieving 4-byte UTF-8 characters. That wasn't much of a problem a decade ago, but now everyone wants to use emoji, and this bug makes it impossible to read them from or write them to the database.
People resort to all sorts of hacks to work around this, such as fetching the data as binary and decoding the code units themselves, which is just ridiculous. Can you please fix the ODBC driver to return UTF-16 surrogate pairs for supplemental Unicode characters?
How to repeat:
create database test_db character set utf8mb4 collate utf8mb4_general_ci;
create table tmp (txt varchar(128) character set utf8mb4 collate utf8mb4_general_ci);
insert into tmp values (UNHEX('41'));
insert into tmp values (UNHEX('C2A3'));
insert into tmp values (UNHEX('E4B888'));
insert into tmp values (UNHEX('F09F9982'));
This test uses ASP.Net with JScript to make it easier to see the result without the Windows console. However, I debugged both w3wp.exe and a standalone ODBC application in Visual Studio and confirmed that they behave the same way, as described above, so the exact client code doesn't really matter.
<%@ language="jscript" codepage="65001" aspcompat="true" validaterequest="false" debug="false" %>
<%
Response.CacheControl = "no-cache";
Response.Charset = "UTF-8";
var connstr = "driver={MySQL ODBC 5.3 Unicode Driver};server=192.168.4.112;uid=test;pwd=test;database=test_db";
var conn = Server.CreateObject("ADODB.Connection");
conn.Open(connstr);
var adocmd = Server.CreateObject("ADODB.Command");
adocmd.CommandText = "select txt from tmp";
adocmd.CommandType = 1;
adocmd.Prepared = true;
adocmd.ActiveConnection = conn;
var rs = adocmd.Execute();
while(rs && !rs.BOF && !rs.EOF) {
Response.Write("<p>[" + rs.Fields(0).Type + "] " + rs.Fields(0).Value + "</p>");
rs.MoveNext();
}
rs.Close();
conn.Close();
%>
This is the output of this script:
[202] A
[202] £
[202] 丈
[202] ?
The 202 type is adVarWChar, as it should be.
Suggested fix:
Windows characters used to be UCS-2, but even Microsoft now treats them as UTF-16 pretty much everywhere, so the ODBC driver should use my_charset_utf8mb4_handler by default and return UTF-16 surrogate pairs for supplemental Unicode characters.