Bug #67297 | Unexpected behaviour on INSERT/UPDATE strings with non-BMP unicode characters | ||
---|---|---|---|
Submitted: | 19 Oct 2012 10:57 | Modified: | 1 Jul 2013 9:59 |
Reporter: | Marin Stavrev | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | Connector / ODBC | Severity: | S2 (Serious) |
Version: | 5.2.2 | OS: | Windows (Windows 2008 SP2 x64) |
Assigned to: | Bogdan Degtyariov | CPU Architecture: | Any |
Tags: | ADO, ODBC |
[19 Oct 2012 10:57]
Marin Stavrev
[19 Oct 2012 10:58]
Marin Stavrev
Test project for reproducing the issue
Attachment: MySQLTests.rar (application/x-rar-compressed, text), 4.25 KiB.
[26 Oct 2012 5:01]
Bogdan Degtyariov
Verified. Thank you for the detailed description of the problem and for the test project, which made the problem easier to spot. We are currently working on fixing the issue in the Connector/ODBC driver.
[28 Jun 2013 20:58]
Lawrenty Novitsky
Marin, the problem here is that the penguin symbol requires 4 bytes when encoded in UTF-8, while the 5.1 driver and the Unicode 5.2 driver use 3-byte utf8 internally. You could use the ANSI 5.2 driver with "...;CHARSET=utf8mb4" and encode your query string in UTF-8, as in the following example:

    SQLExecDirect(hstmt1, (SQLCHAR *)"INSERT INTO bug67297(val) VALUES('"
        "\xD0\x90\xD0\x91\xD0\x92"  // U+0410 to U+0412 - the first three Cyrillic letters
        "\xF0\x9f\x90\xA7"          // U+1F427 (penguin symbol)
        "\xD0\x90\xD0\x91\xD0\x92"  // U+0410 to U+0412 - the first three Cyrillic letters
        "\xF0\x9f\x90\xA7"          // U+1F427 (penguin symbol)
        "')", SQL_NTS);

With this, your string will be stored correctly in the table. I am not sure whether this workaround fits your application.
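To make the byte counts above concrete, here is a minimal sketch of UTF-8 encoding for a single code point. The function name `utf8_encode` is hypothetical and is not part of Connector/ODBC; it only illustrates why U+0410 fits in 2 bytes while U+1F427 needs 4, which is more than a 3-byte utf8 implementation can hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: e.g. Cyrillic */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: rest of the BMP */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: non-BMP, needs utf8mb4 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling `utf8_encode(0x1F427, buf)` returns 4 and fills `buf` with F0 9F 90 A7, the same bytes used in the query string above; `utf8_encode(0x0410, buf)` returns 2 and yields D0 90.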
[1 Jul 2013 9:59]
Marin Stavrev
Hello,

We are using ADO, so calling ODBC functions such as SQLExecDirect is not possible without complicated application changes. I tested your suggestion anyway:

- Downloaded and installed the latest 32-bit ODBC driver (mysql-connector-odbc-5.2.5-win32.msi). It added two providers: ANSI and Unicode.
- Modified the example to connect using the ANSI driver, supplying the "charset=utf8mb4" option in the connection string as you suggested.
- Since the "Execute" method is Unicode, I supplied the UTF-8 byte sequence as wide chars:

    /***************************************************/
    /* Test 4 */
    strSQL = L"INSERT INTO t1(name1) VALUES('"
        L"\xD0\x90\xD0\x91\xD0\x92"  // U+0410 to U+0412 - the first three Cyrillic letters
        L"\xF0\x9f\x90\xA7"          // U+1F427 (penguin symbol)
        L"\xD0\x90\xD0\x91\xD0\x92"  // U+0410 to U+0412 - the first three Cyrillic letters
        L"\xF0\x9f\x90\xA7"          // U+1F427 (penguin symbol)
        L"')";
    pCon->Execute(strSQL, NULL, adExecuteNoRecords);
    /***************************************************/

The result, unfortunately, is similar to the previous test D: the application hangs forever at 100% CPU usage (single thread). No record is created.

When do you expect the Unicode driver to be released with a fix for this issue? We have gotten rid of everything ANSI-related and have already switched to Unicode in all aspects of our application. Switching back to the ANSI driver would mean changing all of our query generation to construct UTF-8 query strings instead of the Unicode strings we use at the moment. That is a substantial change that complicates the implementation, requires new testing, and carries the risk of introducing new issues.

Best Regards
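A note for anyone reading Test 4: in a C or C++ wide literal, each `\x` escape produces one wide character, so `L"\xF0\x9f\x90\xA7"` holds four code units (U+00F0, U+009F, U+0090, U+00A7) rather than the four UTF-8 bytes of a single character. In UTF-16, which is what `wchar_t` holds on Windows, U+1F427 is a surrogate pair instead. The sketch below computes that pair. `utf16_encode` is a hypothetical helper, not part of ADO or Connector/ODBC, and whether this difference plays any role in the hang is not established in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp is a surrogate value or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                          /* reserved for surrogates */
    if (cp < 0x10000) {                    /* BMP: one unit, value unchanged */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                  /* non-BMP: split 20 bits into 10 + 10 */
        uint32_t v = cp - 0x10000;
        out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate */
        return 2;
    }
    return 0;
}
```

For U+1F427 this yields the pair 0xD83D 0xDC27, which could be written in a wide literal as `L"\xD83D\xDC27"`.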
[20 Feb 2014 22:02]
Nigel Meachen
Updated UTF conversion routines to address errors in implementation
Attachment: unicode_transcode.c (text/plain), 6.44 KiB.
[20 Feb 2014 22:06]
Nigel Meachen
I have attached updated UTF conversion routines that fix flaws I found in them while creating my own version of the ODBC driver that uses utf8mb4. Hopefully this will help move this issue forward.
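The attached file is not reproduced in this thread. As a rough illustration of what a UTF-16 to UTF-8 conversion has to do for non-BMP characters, here is a minimal sketch. It is not the code from unicode_transcode.c; the names `utf16_to_utf8` and `TRANSCODE_ERROR` are hypothetical, and the error handling (reject unpaired surrogates, reject a too-small buffer) is one possible policy among several.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRANSCODE_ERROR ((size_t)-1)   /* hypothetical error sentinel */

/* Convert in_len UTF-16 code units to UTF-8.
 * Returns the number of bytes written, or TRANSCODE_ERROR on an
 * unpaired surrogate or when the output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                     unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < in_len) {
        uint32_t cp = in[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i >= in_len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return TRANSCODE_ERROR;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i] - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return TRANSCODE_ERROR;        /* low surrogate without a high one */
        }
        /* Bytes needed: 4 for anything above the BMP. */
        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out_cap - o < need)
            return TRANSCODE_ERROR;
        if (need == 1) {
            out[o++] = (unsigned char)cp;
        } else if (need == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

Given the three units 0x0410, 0xD83D, 0xDC27 (a Cyrillic letter followed by the penguin), the function writes the six bytes D0 90 F0 9F 90 A7. A routine that treats each 16-bit unit independently would instead emit two invalid 3-byte sequences for the surrogate pair.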