Bug #67297 Unexpected behaviour on INSERT/UPDATE strings with non-BMP unicode characters
Submitted: 19 Oct 2012 10:57 Modified: 1 Jul 2013 9:59
Reporter: Marin Stavrev Email Updates:
Status: Verified Impact on me:
None 
Category:Connector / ODBC Severity:S2 (Serious)
Version:5.2.2 OS:Windows (Windows 2008 SP2 x64 )
Assigned to: Bogdan Degtyariov CPU Architecture:Any
Tags: ADO, ODBC

[19 Oct 2012 10:57] Marin Stavrev
Description:
I'm using latest GA MySQL build - 5.5.28 x64, running on Windows 2008 SP2 x64.
When a C++ application using ADO is trying to insert string column value that contains Unicode characters outside of plane 0 (BMP) the operation does not complete as expected. Depending on the table/column definition - whether it is using UTF8 or UTF8MB4 storage and depending on the ODBC driver (5.1.11 or 5.2.2) different issues are observed. There isn't a working combination of ODBC driver, MySQL server settings and DB/Table/Column definition that works correctly.

Here are my findings:

A. Using MySQL ODBC 5.1.11 (32-bit) client, table column defined with charset utf8
 Test.1: ok
 Test.2: Fails with: "[MySQL][ODBC 5.1 Driver][mysqld-5.5.28]Server does not support 4-byte encoded UTF8 characters."
 Test.3: Text is truncated to the first penguine symbol. The result string stored is "АБВ"
 
B. Using MySQL ODBC 5.2.2 (32-bit) client, table column defined with charset utf8
 Test.1: ok
 Test.2: Text is truncated to the first penguine symbol. The result string stored is "АБВ"
 Test.3: Hangs out forever in Execute, 100% CPU usage (single thread). No record is created
 
C. Using MySQL ODBC 5.1.11 (32-bit) client, table column defined with charset utf8mb4
 Test.1: ok
 Test.2: Fails with: "[MySQL][ODBC 5.1 Driver][mysqld-5.5.28]Server does not support 4-byte encoded UTF8 characters."
 Test.3: The penguine symbols are stored as 4 asterisk (?) characters. The result string is "АБВ????АБВ????"

D. Using MySQL ODBC 5.2.2 (32-bit) client, table column defined with charset utf8mb4
 Test.1: ok
 Test.2: The penguine symbols are stored as 4 asterisk (?) characters. The result string is "АБВ????АБВ????"
 Test.3: Hangs out forever in Execute, 100% CPU usage (single thread). No record is created

How to repeat:
I'm posting a test C++ application that compiles under VS2008. It performs three different tests. The application can be compiled to use either ODBC 5.1 or 5.2 driver.
The .cpp file contains the schema used for creating the necessary database and table with either UTF8 or UTF8MB4 storage for the column in question.
I've also tried with different "Charset" (utf8mb4, utf16) and "Option" value makes no difference.
I've also tried changing the MySQL server configuration put the following on its 

[mysqld] section:
character-set-client-handshake = FALSE
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

This made no difference either.
Here's the source code of the program that simulates the problem (I'm going to try to attach the full project if that's possible):

#include "stdafx.h"
#include <windows.h>
#include <stdio.h>

#import "C:\Program Files (x86)\Common Files\System\ADO\msado15.dll" \
no_namespace rename("EOF", "EndOfFile")

void BailOnError(bool bError, char * message)
{
	if (bError)
	{
		printf(message);
		exit(1);
	}
}

_variant_t GetVar(char * buffer, int len)
{
	HRESULT hr;
	_variant_t varChunk;
	SAFEARRAY FAR *psa;
	SAFEARRAYBOUND rgsabound[1];
	rgsabound[0].lLbound = 0;
	rgsabound[0].cElements = len; 
	psa = SafeArrayCreate(VT_UI1, 1, rgsabound);
	for (long i = 0; i < len; i++)
	{
       hr = SafeArrayPutElement(psa, &i, &buffer[i]); 
	   BailOnError(FAILED(hr), "Failed preparing new record column value\n");
	}
    varChunk.vt = VT_ARRAY|VT_UI1;
    varChunk.parray = psa;
	return varChunk;
}

/* Database schemas used for the tests
--- UTF-8 4-bytes storage
DROP DATABASE IF EXISTS abc;
CREATE DATABASE abc;
CREATE TABLE abc.t1 (
  `name1` mediumtext COLLATE utf8mb4_unicode_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
 
--- UTF-8 regular storage (3-bytes)
DROP DATABASE IF EXISTS abc;
CREATE DATABASE abc;
CREATE TABLE abc.t1 (
  `name1` mediumtext COLLATE utf8_unicode_ci
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
*/

int _tmain(int argc, _TCHAR* argv[])
{
	HRESULT hr = S_OK;
    try
    {
		CoInitialize(NULL);
		//_bstr_t strCnn("Driver={MySQL ODBC 5.1 Driver};charset=utf8;no_ssps=0;Server=WIN-Z400LPQFLSV;Port=3307;Database=abc;User=admin;Password=XXXXXX;Option=67110962");
		_bstr_t strCnn("Driver={MySQL ODBC 5.2w Driver};charset=utf16;no_ssps=0;Server=WIN-Z400LPQFLSV;Port=3307;Database=abc;User=admin;Password=XXXXXX;Option=67110962");

		_ConnectionPtr pCon = NULL;
		hr = pCon.CreateInstance(__uuidof(Connection));
		BailOnError(FAILED(hr), "Failed creating connection\n");

		hr = pCon->Open(strCnn, "", "", adConnectUnspecified);
		BailOnError(FAILED(hr), "Failed opening connection\n");

		/***************************************************/
		/*  Test 1                                         */
		_bstr_t strSQL = L"INSERT INTO t1 (name1) VALUES('"
			L"\x0410" L"\x0411" L"\x0412" // U+0410 to U+0412 - the first three Cyrillic letters
			L"')";
		pCon->Execute(strSQL, NULL, adExecuteNoRecords);
		/***************************************************/

		/***************************************************/
		/*  Test 2                                         */
		_RecordsetPtr pRec;
		hr = pRec.CreateInstance(__uuidof(Recordset));
		BailOnError(FAILED(hr), "Cannot create recordset\n");
		hr = pRec->Open(L"SELECT * FROM t1 WHERE 0=1", _variant_t((IDispatch*)pCon, true),
                             adOpenKeyset, adLockOptimistic, adCmdText);
		BailOnError(FAILED(hr), "Failed opening recordset\n");
		hr = pRec->AddNew();
		BailOnError(FAILED(hr), "Failed adding new record\n");
		wchar_t utf16data [] =  L"\x0410" L"\x0411" L"\x0412"	// U+0410 to U+0412 - the first three Cyrillic letters
								L"\xD83D" L"\xDC27"				// U+1F427 (penguine symbol)
								L"\x0410" L"\x0411" L"\x0412"	// U+0410 to U+0412 - the first three Cyrillic letters
								L"\xD83D" L"\xDC27";			// U+1F427 (penguine symbol)
	    hr = pRec->Fields->GetItem("name1")->AppendChunk(GetVar((char*)utf16data, 20)); 
		BailOnError(FAILED(hr), "Failed on AppendChunk\n");
		hr = pRec->Update();
		BailOnError(FAILED(hr), "Failed on recordset update\n");
		/***************************************************/

		/***************************************************/
		/*  Test 3                                         */
		strSQL = L"INSERT INTO t1 (name1) VALUES('"
								L"\x0410" L"\x0411" L"\x0412"	// U+0410 to U+0412 - the first three Cyrillic letters
								L"\xD83D" L"\xDC27"				// U+1F427 (penguine symbol)
								L"\x0410" L"\x0411" L"\x0412"	// U+0410 to U+0412 - the first three Cyrillic letters
								L"\xD83D" L"\xDC27"				// U+1F427 (penguine symbol)
			L"')";
		pCon->Execute(strSQL, NULL, adExecuteNoRecords);
		/***************************************************/
		pCon->Close();
	}
	catch(_com_error & ce)
	{
	   	_bstr_t bstrDescription(ce.Description());
	   printf("Error:%s\n", (LPCSTR)bstrDescription);
	}
	CoUninitialize();
	return 0;
}
[19 Oct 2012 10:58] Marin Stavrev
The test project to test the issue

Attachment: MySQLTests.rar (application/x-rar-compressed, text), 4.25 KiB.

[26 Oct 2012 5:01] Bogdan Degtyariov
Verified, thank you for the detailed description of the problem and for the test project, which made the problem easier to spot.

We are currently working on fixing the issue in Connector/ODBC Driver.
[28 Jun 2013 20:58] Lawrenty Novitsky
Marin, the problem here is that the penguin symbol requires 4 bytes for encoding in utf8. And 5.1 and Unicode 5.2 driver use 3-byte utf8 internally. You could use ansi 5.2 driver with "...;CHARSET=utf8mb4". Then you encode your query string in utf8 like in following example
SQLExecDirect(hstmt1, 
   "INSERT INTO bug67297(val) VALUES('"
   "\xD0\x90\xD0\x91\xD0\x92" // U+0410 to U+0412 - the first three Cyrillic letters
   "\xF0\x9f\x90\xA7"           // U+1F427 (penguine symbol)
   "\xD0\x90\xD0\x91\xD0\x92" // U+0410 to U+0412 - the first three Cyrillic letters
   "\xF0\x9f\x90\xA7"           // U+1F427 (penguine symbol)
                  "')"
and you will get your string stored correctly in the table. Not sure if this workaround fits for your application.
[1 Jul 2013 9:59] Marin Stavrev
Hello,

We are using ADO, so calling ODBC functions such as SQLExecDirect is not possible without complicated application changes.
I've anyway tested your suggestion:
 - Downloaded and installed the latest 32-bit ODBC driver (mysql-connector-odbc-5.2.5-win32.msi). It added two providers - ANSI and Unicode
 - I've modified the example to connect using the ANSI driver, supplying as you've suggested the "charset=utf8mb4" option on the connection string.
 - Since the "Execute" method is Unicode I've supplied the UTF-8 sequence bytes as wide chars:

		/***************************************************/
		/*  Test 4                                         */
		strSQL = L"INSERT INTO t1(name1) VALUES('"
			   L"\xD0\x90\xD0\x91\xD0\x92" // U+0410 to U+0412 - the first three Cyrillic letters
			   L"\xF0\x9f\x90\xA7"           // U+1F427 (penguine symbol)
			   L"\xD0\x90\xD0\x91\xD0\x92" // U+0410 to U+0412 - the first three Cyrillic letters
			   L"\xF0\x9f\x90\xA7"           // U+1F427 (penguine symbol)
			L"')";
		pCon->Execute(strSQL, NULL, adExecuteNoRecords);
		/***************************************************/

The result unfortunately is similar to previous test D:  Application hangs out forever, 100% CPU usage (single thread). No record is created.

When do you expect that the Unicode driver shall be released with a fix for that issue?

We've got ridden of any ANSI related stuff and we've already switched to Unicode in all aspects of our application. Switching back to the ANSI driver means that we would need to change all of our query generation and construct UTF-8 query strings instead of Unicode we're using at the moment. It is a substantial change that additionally complicates the implementation, requires new testing and carries the risk of introducing new issues.

Best Regards
[20 Feb 2014 22:02] Nigel Meachen
Updated UTF conversion routines to address errors in implementation

Attachment: unicode_transcode.c (text/plain), 6.44 KiB.

[20 Feb 2014 22:06] Nigel Meachen
I have attached updated UTF conversion routines that address flaws I have found in them while I was creating my own version of the ODBC driver that utilizes utf8mb4. Hopefully this will help progress this issue forward.