Description:
The ODBC driver receives 4-byte UTF-8 characters from the server and erroneously passes them to my_charset_utf8_handler, which can only convert UTF-8 sequences up to 3 bytes and returns question marks instead of the actual characters.
I do not see any way to change this behavior through configuration or the ODBC connection string, even though the driver code is fully capable of handling these characters correctly; it simply fails to use the right functionality.
At the code level, copy_wchar_result uses this line to convert from a 32-bit character `wc` to a UTF-8 character `u8`:
to_cnvres= (*wc_mb)(utf8_charset_info, wc, u8, u8 + sizeof(u8));
This works for 1, 2 and 3-byte UTF-8 sequences, but fails for 4-byte sequences. Looking closer at utf8_charset_info, its cset data member points to my_charset_utf8_handler, which is not capable of handling 4-byte UTF-8 sequences, instead of my_charset_utf8mb4_handler, which is.
utf8_charset_info is initialized in myodbc_init and appears to be hard-coded for "utf8". At least I don't see any way to make it select the handler based on the actual character set and collation that is coming from the database.
If I replace utf8_charset_info.cset to point to &my_charset_utf8mb4_handler in the debugger, then everything works and a proper UTF-16 surrogate pair is returned from SQLFetch, so the functionality is there and is just not set up correctly to handle data coming from the server.
This prevents any application using ODBC from retrieving 4-byte UTF-8 characters. That wasn't much of a problem a decade ago, but now everyone wants to use emoji, and this bug makes it impossible to read them from or write them to the database.
People resort to all sorts of hacks to work around this, such as fetching the data as binary and decoding the code units themselves, which is just ridiculous. Can you please fix the ODBC driver to return UTF-16 surrogate pairs for supplemental Unicode characters?
How to repeat:
create database test_db character set utf8mb4 collate utf8mb4_general_ci;
create table tmp (txt varchar(128) character set utf8mb4 collate utf8mb4_general_ci);
insert into tmp values (UNHEX('41'));
insert into tmp values (UNHEX('C2A3'));
insert into tmp values (UNHEX('E4B888'));
insert into tmp values (UNHEX('F09F9982'));
This test uses ASP.Net with JScript to make it easier to see the result without the Windows console. However, I debugged both w3wp.exe and a standalone ODBC application in Visual Studio and confirmed that they behave the same way, as described above, so the exact client code doesn't really matter.
<%@ language="jscript" codepage="65001" aspcompat="true" validaterequest="false" debug="false" %>
<%
Response.CacheControl = "no-cache";
Response.Charset = "UTF-8";
var connstr = "driver={MySQL ODBC 5.3 Unicode Driver};server=192.168.4.112;uid=test;pwd=test;database=test_db";
var conn = Server.CreateObject("ADODB.Connection");
conn.Open(connstr);
var adocmd = Server.CreateObject("ADODB.Command");
adocmd.CommandText = "select txt from tmp";
adocmd.CommandType = 1;
adocmd.Prepared = true;
adocmd.ActiveConnection = conn;
var rs = adocmd.Execute();
while(rs && !rs.BOF && !rs.EOF) {
Response.Write("<p>[" + rs.Fields(0).Type + "] " + rs.Fields(0).Value + "</p>");
rs.MoveNext();
}
rs.Close();
conn.Close();
%>
This is the output of this script:
[202] A
[202] £
[202] 丈
[202] ?
The 202 type is adVarWChar, as it should be.
Suggested fix:
Windows characters used to be UCS-2, but even Microsoft now treats them as UTF-16 pretty much everywhere, so the ODBC driver should use my_charset_utf8mb4_handler by default and return UTF-16 surrogate pairs for supplemental Unicode characters.