Bug #64233 | UTF-8 query will not be sent correctly | ||
---|---|---|---|
Submitted: | 5 Feb 2012 16:56 | Modified: | 12 Feb 2012 19:15 |
Reporter: | sadmarvin84 _ | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | Connector / ODBC | Severity: | S3 (Non-critical) |
Version: | 5.1 | OS: | Windows |
Assigned to: | Bogdan Degtyariov | CPU Architecture: | Any |
Tags: | query, UTF-8 |
[5 Feb 2012 16:56]
sadmarvin84 _
[9 Feb 2012 4:44]
Bogdan Degtyariov
For your information, MySQL ODBC Driver version 5.1 always sends data to the server in UTF-8. If the source string is not UTF-8 it gets converted to UTF-8 and then result from the server is converted back to the defined charset. It is important to set CHARSET option in order to get the correct result and avoid the double conversion. For instance, the double conversion might happen if application is working with Latin-1 (ISO 8859-1) data and then it wants to send UTF-8 data. Each byte in UTF-8 will be treated as a Latin1 character, so the data gets corrupted. Changing the working character set requires re-connecting with the proper value for CHARSET option. This is not a bug.
[12 Feb 2012 19:15]
sadmarvin84 _
The charset settings are right, with v3.51 and the same configuration it does working fine. I did checked the charset of the string with the php-function md_detect_encoding(). And this shows that the charset of the string is UTF-8. My complete Database and all php-files, php-config etc. are UTF-8! Here an example, which shows the exact same query once with version 3.51 and once with version 5.1: v3.51 Driver: mySQL-Server-LOG: SELECT COUNT(*) FROM qr_logtable loglist ... WHERE (namelist.NAME LIKE '%müller%'); ODBC-LOG: SELECT COUNT(*) FROM qr_logtable loglist ... WHERE (namelist.NAME LIKE '%müller%'); v5.1 Driver: mySQL-Server-LOG: SELECT COUNT(*) FROM qr_logtable loglist ... WHERE (namelist.NAME LIKE '%müller%'); ODBC-LOG: SELECT COUNT(*) FROM qr_logtable loglist ... WHERE (namelist.NAME LIKE '%müller%');
[20 Feb 2014 9:24]
Serge Wautier
Same problem here with 5.1 and 5.2 on Windows 7 x64 with x86 MyODBC driver: Apache 2.4.7 PHP 5.5.8 MySQL 5.6.16 driver charset=utf8 all %char% variables in MySQL say utf8 SELECT statemetn involving accented letters such as é work fine. Only UPDATE and INSERT do double ut8 encoding. This double utf8 encoding shows up in myodbc.log. Test: The PHP script executes the same update query (different rows) using MyODBC and php_mysql (mysql_query): php_mysql behaves correctly, MyODBC does double encoding.
[20 Feb 2014 9:54]
Serge Wautier
My problem occurs with the Unicode driver only. The ANSI driver works fine. How come???
[24 Feb 2014 3:47]
Bogdan Degtyariov
Here are common problems and mistakes when working with UTF-8 data using MySQL Connector/ODBC Starting from the version 5.2 the MySQL Connector/ODBC comes in two flavors ANSI and UNICODE. The main difference between these two versions is the way of processing of the string and text data: - ANSI or (A) version mainly works with the single-byte character sets such as Latin-1 (a.k.a. ISO-8859-1), Windows-1251 (a.k.a. CP1251) and so on. This puts a very important requirement of using 0x00 byte as a string termination character. Any string returned to the client application is a sequence of characters each of them is SQLCHAR type. SQLCHAR is mapped to the single-byte char C type. ANSI driver cannot process most of UNICODE strings because multi-byte UNICODE characters can contain 0x00 byte, which will be immediately treated as the end of the string. The ANSI-style function can be explicitly called by prepending A to the ODBC function name. For instance: SQLDriverConnectA() for SQLDriverConnect() SQLColAttributeA() for SQLColAttribute() etc. - UNICODE or (W) version works with the multi-byte encodings. In this case (W) stands for Wide characters that consist of several bytes: UTF-16, UCS-2 and others. Zero byte 0x00 is no longer a string terminator because it can be a part of a multi-byte character. The string is terminated by a character which has all its bytes set to zero (such as 0x0000 for UTF-16). The text string is a sequence of such wide characters of SQLWCHAR type. NOTE: The mapping of SQLWCHAR type is somewhat complicated and it can create hidden pitfalls for programmers porting their code from Windows to Linux. Usually a SQLWCHAR character is a 16-bit unit and we will not consider the exotic cases when SQLWCHAR is different. Windows uses UTF-16 and maps SQLWCHAR to 16-bit wchar_t type. However, many Linux versions such as Ubuntu Linux use UTF-32 as an internal character set and therefore their 32-bit wchar_t is not compatible with SQLWCHAR, which is always 16-bit. The UNICODE-style function calls are made through adding W to the function name: SQLDriverConnectW() for SQLDriverConnect() SQLColAttributeW() for SQLColAttribute() etc. So far it was all clear and straightforward because we reviewed only fixed-size character sets that consisted of 1 byte (Latin-1), 2 bytes (UTF-16) or 4 bytes (UTF-32). The situation is different with UTF-8 because it is a variable-length character set with characters that take from 1 to 3 bytes per character. UTF-8 possesses all properties of UNICODE character set except that the UTF-8 string termination character is a single-byte 0x00 (same as for single-byte character sets). It is easy to see that UTF-8 data stream has more in common with single-byte character sets than with multi-byte ones. In fact, the binary representation of UTF-8 string written with only latin characters ("Hello World") would be exactly the same as the binary representation of the single-byte ISO-8859-01. The zero byte 0x00 is the mark for the end of a UTF-8 string, which makes the single-byte SQLCHAR ODBC type suitable for representing UTF-8 data. With the wide-character type SQLWCHAR things get more complicated because 0x00 combined with other non-zero byte gives a valid wide character instead of terminating the string, which can cause the data corruption. Here is an example with data stream containing two strings "string 1" and "string 2": ANSI ODBC driver and SQLCHAR data (single-byte units) [s][t][r][i][n][g][ ][1][\0][s][t][r][i][n][g][ ][2][\0] ^ ^ |__(String #1) |___(String #2) When ODBC driver is processing String #1 it sees the zero byte and stops reading data after it. UNICODE ODBC driver and SQLWCHAR data (two-byte units) [st][ri][ng][ 1][\0s][tr][in][g ][2\0] ^ ^ |__(String #1) |___(String #2) In the above example two strings collided inside one SQLWCHAR. The [\0s] is not a terminating character for SQLWCHAR string. For SQLWCHAR it has to be [\0\0], but two consequent zero bytes in UTF-8 would mean that the second string is empty. If the client program requires UTF-8 data it is better to use the ANSI version of MySQL Connector/ODBC.