MySQL Bugs: #64233: UTF-8 query will not be sent correctly

Bug #64233	UTF-8 query will not be sent correctly
Submitted:	5 Feb 2012 16:56	Modified:	12 Feb 2012 19:15
Reporter:	sadmarvin84 _	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	Connector / ODBC	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Windows
Assigned to:	Bogdan Degtyariov	CPU Architecture:	Any
Tags:	query, UTF-8

Description:
With the use of MyODBC drivers with CHARACTER SET "utf8", the query is not sent as such. MySQL Server Query LOG shows that this string is ISO-8859-1. If the string is converted into ISO-8859-1, before being sent via ODBC, the server gets the correct string in UTF8.
The problem does not occur with ODBC Driver version 3.51 with the same settings.

How to repeat:
Use a PHP-script to insert an UTF-8 encoded string with specialchars like (ä,ö,ü) via ODBC Driver (5.1) into database and watch the result. Then repeat it with Driver 3.51.
PHP version: 5.2.8
MySQL version: 5.5.19
Apache: 2.2.9

For your information, MySQL ODBC Driver version 5.1 always sends data to the server in UTF-8. If the source string is not UTF-8 it gets converted to UTF-8 and then result from the server is converted back to the defined charset.

It is important to set CHARSET option in order to get the correct result and avoid the double conversion. For instance, the double conversion might happen if application is working with Latin-1 (ISO 8859-1) data and then it wants to send UTF-8 data. Each byte in UTF-8 will be treated as a Latin1 character, so the data gets corrupted.

Changing the working character set requires re-connecting with the proper value for CHARSET option.

This is not a bug.

The charset settings are right, with v3.51 and the same configuration it does working fine. I did checked the charset of the string with the php-function md_detect_encoding(). And this shows that the charset of the string is UTF-8. My complete Database and all php-files, php-config etc. are UTF-8!

Here an example, which shows the exact same query once with version 3.51 and once with version 5.1:

v3.51 Driver:

mySQL-Server-LOG:
SELECT COUNT(*) FROM qr_logtable loglist
...
WHERE (namelist.NAME LIKE '%müller%');

ODBC-LOG:
SELECT COUNT(*) FROM qr_logtable loglist
...
WHERE (namelist.NAME LIKE '%müller%');

v5.1 Driver:

mySQL-Server-LOG:
SELECT COUNT(*) FROM qr_logtable loglist
...
WHERE (namelist.NAME LIKE '%mÃ¼ller%');

ODBC-LOG:
SELECT COUNT(*) FROM qr_logtable loglist
...
WHERE (namelist.NAME LIKE '%mÃ¼ller%');

Same problem here with 5.1 and 5.2 on Windows 7 x64 with x86 MyODBC driver:
Apache 2.4.7 PHP 5.5.8 MySQL 5.6.16

driver charset=utf8
all %char% variables in MySQL say utf8
SELECT statemetn involving accented letters such as é work fine.
Only UPDATE and INSERT do double ut8 encoding. This double utf8 encoding shows up in myodbc.log.
Test: The PHP script executes the same update query (different rows) using MyODBC and php_mysql (mysql_query): php_mysql behaves correctly, MyODBC does double encoding.

My problem occurs with the Unicode driver only. The ANSI driver works fine. How come???

Here are common problems and mistakes when working with UTF-8 data using MySQL 
Connector/ODBC

Starting from the version 5.2 the MySQL Connector/ODBC comes in two flavors 
ANSI and UNICODE. The main difference between these two versions is the way of
processing of the string and text data:

 - ANSI or (A) version mainly works with the single-byte character sets such as
   Latin-1 (a.k.a. ISO-8859-1), Windows-1251 (a.k.a. CP1251) and so on.
   This puts a very important requirement of using 0x00 byte as a string
   termination character. Any string returned to the client application is
   a sequence of characters each of them is SQLCHAR type. SQLCHAR is mapped
   to the single-byte char C type. ANSI driver cannot process most of
   UNICODE strings because multi-byte UNICODE characters can contain 0x00
   byte, which will be immediately treated as the end of the string. The
   ANSI-style function can be explicitly called by prepending A to the ODBC
   function name. For instance:
   
   SQLDriverConnectA() for SQLDriverConnect()
   SQLColAttributeA() for SQLColAttribute()
   etc.
   
 - UNICODE or (W) version works with the multi-byte encodings. In this case
   (W) stands for Wide characters that consist of several bytes: UTF-16, UCS-2
   and others. Zero byte 0x00 is no longer a string terminator because it can
   be a part of a multi-byte character. The string is terminated by a character
   which has all its bytes set to zero (such as 0x0000 for UTF-16). The text
   string is a sequence of such wide characters of SQLWCHAR type. 
   
   NOTE: The mapping of SQLWCHAR type is somewhat complicated and it can 
         create hidden pitfalls for programmers porting their code from 
         Windows to Linux. Usually a SQLWCHAR character is a 16-bit unit and
         we will not consider the exotic cases when SQLWCHAR is different.
         Windows uses UTF-16 and maps SQLWCHAR to 16-bit wchar_t type.
         However, many Linux versions such as Ubuntu Linux use UTF-32 as an
         internal character set and therefore their 32-bit wchar_t is not
         compatible with SQLWCHAR, which is always 16-bit.
  
   The UNICODE-style function calls are made through adding W to the function
   name:
   
   SQLDriverConnectW() for SQLDriverConnect()
   SQLColAttributeW() for SQLColAttribute()
   etc.

So far it was all clear and straightforward because we reviewed only fixed-size
character sets that consisted of 1 byte (Latin-1), 2 bytes (UTF-16) or 4 bytes
(UTF-32). The situation is different with UTF-8 because it is a variable-length
character set with characters that take from 1 to 3 bytes per character. UTF-8
possesses all properties of UNICODE character set except that the UTF-8 string
termination character is a single-byte 0x00 (same as for single-byte character
sets).

It is easy to see that UTF-8 data stream has more in common with single-byte
character sets than with multi-byte ones. In fact, the binary representation
of UTF-8 string written with only latin characters ("Hello World") would be
exactly the same as the binary representation of the single-byte ISO-8859-01.
The zero byte 0x00 is the mark for the end of a UTF-8 string, which makes the
single-byte SQLCHAR ODBC type suitable for representing UTF-8 data.

With the wide-character type SQLWCHAR things get more complicated because 0x00
combined with other non-zero byte gives a valid wide character instead of
terminating the string, which can cause the data corruption.

Here is an example with data stream containing two strings "string 1" and 
"string 2":

ANSI ODBC driver and SQLCHAR data (single-byte units)

[s][t][r][i][n][g][ ][1][\0][s][t][r][i][n][g][ ][2][\0]
 ^                           ^
 |__(String #1)              |___(String #2)
 
When ODBC driver is processing String #1 it sees the zero byte and stops 
reading data after it.

UNICODE ODBC driver and SQLWCHAR data (two-byte units)

[st][ri][ng][ 1][\0s][tr][in][g ][2\0]
 ^                 ^
 |__(String #1)    |___(String #2)
 
In the above example two strings collided inside one SQLWCHAR.
The [\0s] is not a terminating character for SQLWCHAR string.
For SQLWCHAR it has to be [\0\0], but two consequent zero bytes in UTF-8 would
mean that the second string is empty.

If the client program requires UTF-8 data it is better to use the ANSI version
of MySQL Connector/ODBC.