MySQL Bugs: #57389: Do not force UTF8 charset when connection string has charset option set

Bug #57389	Do not force UTF8 charset when connection string has charset option set
Submitted:	11 Oct 2010 21:46	Modified:	18 Nov 2010 7:03
Reporter:	Milan Crha	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	Connector / ODBC	Severity:	S1 (Critical)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
I'm using MySQL ODBC connector through ADO components in Borland C++ Builder 2006 environment, and I realized that I cannot force it to use cp1250 charset in SQL commands. I'm setting this to highest severity, because incorrect results.

How to repeat:
a) Create table with cp1250_czech_cs COLLATE with a string field and add one row with letter "ě" in the string field.
b) make sure that connection string contains CHARSET=cp1250
c) using the connection string and ODBC connector, invoke one of these statements:
  1) SELECT * FROM tab WHERE strfield='ě';
  2) SELECT * FROM tab WHERE strfield LIKE '%ě%';
d) both above statements return empty dataset, which is incorrect.

Turning on LOG_QUERY I realized that the SQL statements are automatically changed to UTF8, which breaks things, because I also force cp1250 to the connection with these SETs (in this order, invoked just after connect):
      SET sql_mode='MSSQL';
      SET CHARACTER SET 'cp1250';
      SET character_set_connection='cp1250';
      SET collation_server='cp1250_czech_cs';
      SET collation_connection='cp1250_czech_cs';
these are necessary, because either:
  - string fields are returned as Wide Strings without it, which I do not want; or
  - doing UNION fails with error on incompatible collations, though both tables are created with same DEFAULT COLLATE.

Suggested fix:
I'm attaching simple patch to force utf8 charset only when not set in the connection string (and the unicode support is available). It fixes the issue for me.

proposed odbc connector patch

Attachment: conn.patch (text/plain), 490 bytes.

Milan, what is the charset_results? It should be NULL allowing c/ODBC to detect data is in different code-page.

Please paste the output of SHOW VARIABLES LIKE "%charset%" after your program starts (just issue the query from your program after the connection has been established).

The command returns no rows, so I changed it slightly :) and I got:

before SETs:
Variable_name	Value
character_set_client	utf8
character_set_connection	utf8
character_set_database	cp1250
character_set_filesystem	binary
character_set_results	
character_set_server	cp1250
character_set_system	utf8
character_sets_dir	C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\

after SETs:
Variable_name	Value
character_set_client	cp1250
character_set_connection	cp1250
character_set_database	cp1250
character_set_filesystem	binary
character_set_results	cp1250
character_set_server	cp1250
character_set_system	utf8
character_sets_dir	C:\Program Files\MySQL\MySQL Server 5.1\share\charsets\

If I add also SET character_set_results=NULL, then I endup with WideString string columns, which is something I do not want. I want to persuade the driver that there is no need for character set conversions at all, which is the only thing to use cp1250 also as character_set_system.

I realized that I can workaround this issue also by adding _utf8 prefix to my SELECT statements string, even I'm passing it in the cp1250. As an example:
the a) SELECT * FROM tab WHERE strfield='ě';
will be
       SELECT * FROM tab WHERE strfield=_utf8'ě';
and similar for b) from comment #0.

Milan, there is no bug here...

Each connector has it's own, internal, charset. For c/ODBC 5.1 it's UTF8. If you do not want wide strings, please use c/ODBC 3.51.

And yes, if you're messing with SET NAMES you definitely need SET character_set_results=NULL. This is a special flag meaning that connector will try deciphering actual charset data is in, metadata queries will be returned in charset_server and so on. In short, SET is a nasty nasty command usually leading to double-encoding errors in data.

Hi Milan,

You should not look for "SET NAMES UTF8" statements in the server general query log. The MyODBC driver always sets the character set to UTF-8 for inner data processing. The CHARSET connection string option advises the driver to convert the results from UTF-8 into the character set specified.

So, this is not a bug, but the way it works.
Setting status "Not a Bug.

> You should not look for "SET NAMES UTF8" statements in the server general
> query log.

Sure, I didn't look for it.

> The MyODBC driver always sets the character set to UTF-8 for inner data
> processing. The CHARSET connection string option advises the driver to
> convert the results from UTF-8 into the character set specified.

What do you mean with 'advices' here? To be honest I do not care what does the connector itself behind the scene, I only understood the CHARSET parameter as a way to tell it what charset I expect from it and in what charset I will pass my commands and data to the connector. Being this true I would be more that happy. But it isn't true, and thus a bug (or my misunderstanding?).

When I remove all my SET commands, which I agree are ugly and only to workaround this connector bug, and I set the connection string CHARSET to cp1250, then I expect all table string columns returned from the connector being in cp1250, but they aren't, they are in unicode. And as I said earlier, I do not want unicode columns, I want cp1250 columns, that's why I specify the CHARSET for the connection. Having there all those ugly SETs, or the patch applied in the connector to really honour my CHARSET will, makes it behave as I expect it.

Milan, we discovered another bug 58038, with similar problem (just different charset) and made a patch for it. The patch is now under review by other developers.

Please monitor this report when we send the hot-fix driver build:

http://bugs.mysql.com/bug.php?id=58038