Bug #57389 Do not force UTF8 charset when connection string has charset option set
Submitted: 11 Oct 2010 21:46 Modified: 18 Nov 2010 7:03
Reporter: Milan Crha Email Updates:
Status: Duplicate Impact on me:
None 
Category:Connector / ODBC Severity:S1 (Critical)
Version: OS:Any
Assigned to: CPU Architecture:Any

[11 Oct 2010 21:46] Milan Crha
Description:
I'm using MySQL ODBC connector through ADO components in Borland C++ Builder 2006 environment, and I realized that I cannot force it to use cp1250 charset in SQL commands. I'm setting this to highest severity, because incorrect results.

How to repeat:
a) Create table with cp1250_czech_cs COLLATE with a string field and add one row with letter "ě" in the string field.
b) make sure that connection string contains CHARSET=cp1250
c) using the connection string and ODBC connector, invoke one of these statements:
  1) SELECT * FROM tab WHERE strfield='ě';
  2) SELECT * FROM tab WHERE strfield LIKE '%ě%';
d) both above statements return empty dataset, which is incorrect.

Turning on LOG_QUERY I realized that the SQL statements are automatically changed to UTF8, which breaks things, because I also force cp1250 to the connection with these SETs (in this order, invoked just after connect):
      SET sql_mode='MSSQL';
      SET CHARACTER SET 'cp1250';
      SET character_set_connection='cp1250';
      SET collation_server='cp1250_czech_cs';
      SET collation_connection='cp1250_czech_cs';
these are necessary, because either:
  - string fields are returned as Wide Strings without it, which I do not want; or
  - doing UNION fails with error on incompatible collations, though both tables are created with same DEFAULT COLLATE.

Suggested fix:
I'm attaching simple patch to force utf8 charset only when not set in the connection string (and the unicode support is available). It fixes the issue for me.
[11 Oct 2010 21:47] Milan Crha
proposed odbc connector patch

Attachment: conn.patch (text/plain), 490 bytes.

[12 Oct 2010 9:38] Tonci Grgin
Milan, what is the charset_results? It should be NULL allowing c/ODBC to detect data is in different code-page.

Please paste the output of SHOW VARIABLES LIKE "%charset%" after your program starts (just issue the query from your program after the connection has been established).
[12 Oct 2010 17:49] Milan Crha
The command returns no rows, so I changed it slightly :) and I got:

before SETs:
Variable_name	Value
character_set_client	utf8
character_set_connection	utf8
character_set_database	cp1250
character_set_filesystem	binary
character_set_results	
character_set_server	cp1250
character_set_system	utf8
character_sets_dir	C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\

after SETs:
Variable_name	Value
character_set_client	cp1250
character_set_connection	cp1250
character_set_database	cp1250
character_set_filesystem	binary
character_set_results	cp1250
character_set_server	cp1250
character_set_system	utf8
character_sets_dir	C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\

If I add also SET character_set_results=NULL, then I endup with WideString string columns, which is something I do not want. I want to persuade the driver that there is no need for character set conversions at all, which is the only thing to use cp1250 also as character_set_system.

I realized that I can workaround this issue also by adding _utf8 prefix to my SELECT statements string, even I'm passing it in the cp1250. As an example:
the a) SELECT * FROM tab WHERE strfield='ě';
will be
       SELECT * FROM tab WHERE strfield=_utf8'ě';
and similar for b) from comment #0.
[21 Oct 2010 6:32] Tonci Grgin
Milan, there is no bug here...

Each connector has it's own, internal, charset. For c/ODBC 5.1 it's UTF8. If you do not want wide strings, please use c/ODBC 3.51.

And yes, if you're messing with SET NAMES you definitely need SET character_set_results=NULL. This is a special flag meaning that connector will try deciphering actual charset data is in, metadata queries will be returned in charset_server and so on. In short, SET is a nasty nasty command usually leading to double-encoding errors in data.
[21 Oct 2010 7:04] Bogdan Degtyariov
Hi Milan,

You should not look for "SET NAMES UTF8" statements in the server general query log. The MyODBC driver always sets the character set to UTF-8 for inner data processing. The CHARSET connection string option advises the driver to convert the results from UTF-8 into the character set specified.

So, this is not a bug, but the way it works.
Setting status "Not a Bug.
[21 Oct 2010 9:04] Milan Crha
> You should not look for "SET NAMES UTF8" statements in the server general
> query log.

Sure, I didn't look for it.

> The MyODBC driver always sets the character set to UTF-8 for inner data
> processing. The CHARSET connection string option advises the driver to
> convert the results from UTF-8 into the character set specified.

What do you mean with 'advices' here? To be honest I do not care what does the connector itself behind the scene, I only understood the CHARSET parameter as a way to tell it what charset I expect from it and in what charset I will pass my commands and data to the connector. Being this true I would be more that happy. But it isn't true, and thus a bug (or my misunderstanding?).

When I remove all my SET commands, which I agree are ugly and only to workaround this connector bug, and I set the connection string CHARSET to cp1250, then I expect all table string columns returned from the connector being in cp1250, but they aren't, they are in unicode. And as I said earlier, I do not want unicode columns, I want cp1250 columns, that's why I specify the CHARSET for the connection. Having there all those ugly SETs, or the patch applied in the connector to really honour my CHARSET will, makes it behave as I expect it.
[18 Nov 2010 7:03] Bogdan Degtyariov
Milan, we discovered another bug 58038, with similar problem (just different charset) and made a patch for it. The patch is now under review by other developers.

Please monitor this report when we send the hot-fix driver build:

http://bugs.mysql.com/bug.php?id=58038