MySQL Bugs: #34760: Character set autodetection appears to fail

Bug #34760	Character set autodetection appears to fail
Submitted:	22 Feb 2008 14:45	Modified:	2 Dec 2008 21:21
Reporter:	Martin Probst	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: General	Severity:	S2 (Serious)
Version:	5.0.67	OS:	Any
Assigned to:	Sergei Glukhov	CPU Architecture:	Any
Tags:	autodetect, characterset, jdbc, Unicode

Description:
It appears that character set autodetection of JDBC connections fails quite nastily.

I'm opening a JDBC connection to a MySQL database (mysql  Ver 14.12 Distrib 5.0.45, for apple-darwin8.5.1 (i686) using readline 5.0) that has been created with UTF-8 support.

Database schema and contents are attached. I have a table foo and a table bam with a one-to-many relationship between foo and bam.

If I now try to select those foos whose bams don't contain a certain value ("SELECT * FROM foo WHERE 'TST123' NOT IN (SELECT baz FROM bam WHERE bam.foo_id = foo.id)") then I get different behaviours depending on whether I force the character set to UTF-8 or not.

I take from the documentation that autodetection of character sets is supposed to work, normally. Also, this example doesn't use any non-ASCII characters.

This can be a real painful puzzler, as the database silently returns wrong results, with no warning whatsoever. I'm marking this as serious as it leads to silent data loss.

How to repeat:
Create database according to schema, and fill it with the default values. Run the attached Java program.

This happens at least on Mac OS X and Linux, I didn't test on other platforms.

Database schema and sample values

Attachment: schema.sql (, text), 2.19 KiB.

Java code that triggers the bug

Attachment: MySQLBug.java (, text), 1.32 KiB.

As an explanation to the Java code: it runs the same query, once connecting to the database with "?useUnicode=true&characterEncoding=UTF-8", once without. It then prints whether the query returned any results.

Martin,

What character set is your *server* set to use? (posting the output of the command "show variables like 'character_%'" would be a big help).

The "autodetection" works at a *server* level, it doesn't look at your database/table/column-level character sets (because it can't know in advance which of those you're going to reference).

  -Mark

I see, my character_set_server is latin1. Where does that get set? I'm trying to keep my whole system UTF-8 in any place possible (locale etc.). 

mysql> show variables like 'character_%';
+--------------------------+------------------------------------------------------------+
| Variable_name            | Value                                                      |
+--------------------------+------------------------------------------------------------+
| character_set_client     | utf8                                                       | 
| character_set_connection | utf8                                                       | 
| character_set_database   | utf8                                                       | 
| character_set_filesystem | binary                                                     | 
| character_set_results    | utf8                                                       | 
| character_set_server     | latin1                                                     | 
| character_set_system     | utf8                                                       | 
| character_sets_dir       | /usr/local/mysql-5.0.45-osx10.4-i686/share/mysql/charsets/ | 
+--------------------------+------------------------------------------------------------+

Even so, shouldn't this kind of thing simply work? If not by autodetecting, then by translating the results from UTF-8 databases at some other point?

I have just run into this bug using a command line session with the mysql CLI. If I connect to mysql using the command line:
> mysql -u root -p cdb_development # character encoding gets set to latin1
I get erroneous results from the query (see java file for the query).

If I use
> mysql -u root -p cdb_development --default-character-set=utf8
everything works as expected.

So this is indeed not a bug within the JDBC MySQL Connector, but rather something with the database itself.

Martin,

I don't think that this is not a bug.

Let me explain:

- Your database/table/row has a character set i.e. utf8

An easy test if your data will be stored at the right way at the database is:

insert signs like '£' or 'å' or 'ä' into your text column.
Then make a SELECT LENGTH(your_textcolumn) FROM your_db;

By using utf8 the length for the signs should be 2. If the length is 1 or 4 or something else different from 2 then your data will be stored wrong in your database.

Usually the system will negotiate the character set. This happens transparent. If your client only is able to speak Latin1 that doesn't matter.

But your client and your environment should use the same encoding. MySQL CLI usually use Latin1, when your terminal is set to UTF8 you will get encoding miss match. You have to set your terminal to ISO-8859-15. Then your environment and MySQL CLI have the same encoding.

The other possibility is: Your terminal is UTF8, then use "SET NAMES UTF8" at MySQL CLI. This effects, that MySQL CLI will get UTF8 as client encoding and also the results of a select statement will be returned as UTF8.

MySQL is able to handle different client/server encodings automatically but it isn't able to guess the encoding from your environment. Your client/result character set has to match to your environment encoding.

So, when you get different results by using JDBC and using CLI then using a wrong environment encoding could be the reason.

mysql> show variables like 'character%';

give you a list of uses character sets.

If character_set_client and/or character_set_result are utf8 your terminal (i.e. gnome or kde terminal) has to use utf8 too. If these variables are set to latin1 your terminal has to be set to ISO-8859-15.
Or if your terminal uses UTF8 you have to do a "SET NAMES UTF8;" before using your database. The same, when your terminal just has ISO-8859-15 then you have to use "SET NAMES LATIN1" at the CLI.

When you have curious output with JAVA then look if your JAVA environment has the right encoding settings.

You can see if your java file is UTF8 by using the command: "file". Just type:
$ file my_java_file.java

Please let us know if all work fine by using this rules. Of course let us know if you still have problems after checking that you really used UTF8 for all (columns, client, environment) and that your data are not stored wrong at the database.

Thanks for your comments Susanne.

> I don't think that this is not a bug.

Uh, as in German double negation (meaning true) or in American double negation (meaning false)? :-)

Anyways, I'm not sure I made it clear enough what my problem is:
> So, when you get different results by using JDBC and using CLI then using a wrong
> environment encoding could be the reason.

The point is, I'm getting different results for queries that do not involve any extended characters (like ö), depending on the character set I've set for the connection. There is no difference in results from JDBC or CLI, both tools show the same - in my opinion erroneous - behaviour.

The query in question:
> SELECT * FROM foo WHERE 
> 'TST123' NOT IN (SELECT baz FROM bam WHERE bam.foo_id = foo.id)
Returns two results when run in CLI with --default-charset=utf8 and no results without that option (and thus latin1 being used). The exact same thing happens with JDBC and the encoding parameters in the JDBC URL.

Curiously, this only happens with this particular "IN" style query. It does not happen with a simple selection, such as:
> SELECT * FROM bam WHERE baz = 'TST123'
This statement always returns correct results, regardless of the encoding connection parameters.

To stress again, this bug is not about wrong display of characters, wrong Strings being passed to MySQL or wrong locale setup. In the JDBC testcase, I can confirm 100% that the String I use for comparison is a validly encoded Java UTF-16 String. In the console setup, in both cases I'm using only simple ASCII characters, so a latin1 encoded string is absolutely identical to a UTF-8 string. You might want to test running the attached example, it should be easy to set up, and exhibits the problem quite clearly.

I've also made sure that content in the database is indeed stored as UTF-8 (using your LENGTH trick), and that my console setup and Java stuff is correctly encoded, too. It's all fine, and I also don't think that the problem is in those things.

The synopsis about character set autodetection might be wrong. It appears to me that this is somehow related to the way IN queries are handled?

>> I don't think that this is not a bug.

> Uh, as in German double negation (meaning true) or in American double negation (meaning false)? :-)

As German I would interpret this as (- * - = +). But it's just a typo. One "not" is too much in that sentense.

Btw. you don't stress.

Back to the topic:
I'll test it with newest version.
MySQL 5.0.45 is quiete old. Please also test with newest Version MySQL 5.0.51a.

Do you still have problems with this by using newer version like MySQL 5.0.51b?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I have tested this with MySQL 5.0.67 (installed via MacPorts), and I can confirm that the problem persists.

Martin,

many thanks for writing a bug report.

Unfortunately, this is not a bug.

The default value for character_set_server is latin1.

You only can change it by changing this value in your my.cnf or during compiling.

When you don't let the system know that you want utf8 as default (by telling it during compiling or at my.cnf) the system will take latin1.

I don't think that this is not a bug.

You have two communicating entities, and both know their respective character set. Why can't they make this out?

Beyond that, I find it highly dangerous that this silent data loss happens, without any errors given. I can't really think that this should be expected behaviour with a missing parameter casing erroneous behaviour for some queries. In particular as this query doesn't even use UTF8.

Thought experiment: what if I wanted to talk to the database in latin1 on purpose, e.g. to reduce the amount of checking in the database driver or something like that? I would always run into this error in some cases.

It's still not a bug. Note, that when you test with a CLI you see completely diferent effects and the explanation is different. Let's start from JDBC:

When you establish your connection, Connector/J does (simplified):

  SELECT @@character_set_server;
  SET NAMES xxxx

where xxxx is the value of character_set_server variable.
So, in your JDBC connection you have

character_set_client=character_set_connection=character_set_server=latin1

And "two communicating entities" perfectly know each other' character set - it's latin1.

See http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html

When you use CLI there's no automatic SET NAMES. And you see character_set_client=character_set_connection=utf8.

By default, though, without --default-character-set=utf8 client sets character_set_client to latin1, unless compiled differently.

So if both parties agreed on a character set - latin1 - then why does the query return incorrect results in one case?

To restate the problem: if a client connects to MySQL using latin1 and issues a query against a table that is stored in UTF-8 format, it will get incorrect results when using subqueries with the IN statement a la 'SELECT ... FROM ... WHERE ('x' IN (SELECT varcharfield FROM ...)' - the IN clause apparently always evaluates as true. And that is NOT using any extended characters.

That cannot be expected behaviour, can it?

Martin,

that is expected behaviour at every software that I know. You always have to tell the server which client encoding you are using before you will do other stuff.

Why don't you just add at my.cnf:

[client]
default_character_set=utf8

Also you just can use SET NAMES UTF8 when you are working with a client.

So you're saying if I prefer to use my client with latin1 I cannot run queries against UTF-8 tables and expect correct results? If that is the case, why doesn't the server consider this an error?

Okay.

Martin, you're right. The important stuff are in your comment from [12 Sep 16:36]. (but understandably, it could be easy to overlook after 16 comments of default charset discussion).

Test case - import the schema.sql (attached to the bugreport) and then:

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM foo WHERE 'TST123' NOT IN (SELECT baz FROM bam WHERE bam.foo_id = foo.id);
Empty set (0.00 sec)

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM foo WHERE 'TST123' NOT IN (SELECT baz FROM bam WHERE bam.foo_id = foo.id);
+--------+----+
| bar    | id |
+--------+----+
| Hello  |  1 | 
| Hello  |  2 | 
+--------+----+
2 rows in set (0.00 sec)

Thanks, I was beginning to think I was on a lost case. I'm sorry that my bug report was a bit erratic, but as you might know it can be difficult to find out what is exactly broken if you're not familiar with the internals of a system.

Sorry Martin, I missunderstood your pre-last comment.

This is not hitting 5.1.

Test case:

DROP TABLE IF EXISTS `bam`;
CREATE TABLE `bam` (
  `id` int(11) NOT NULL auto_increment,
  `foo_id` int(11) default NULL,
  `baz` varchar(6) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=utf8;

INSERT INTO `bam` VALUES (1,1,'TST123'),(2,1,'TST456'),(3,2,'TST123');
DROP TABLE IF EXISTS `foo`;
CREATE TABLE `foo` (
  `bar` varchar(6) default NULL,
  `id` int(11) NOT NULL auto_increment,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;

INSERT INTO `foo` VALUES ('Hello',1),('Hello ',2);

SELECT * FROM foo WHERE 'TST123' NOT IN (SELECT baz FROM bam WHERE bam.foo_id = foo.id);

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/57555

2715 Sergey Glukhov	2008-10-31
      Bug#34760 Character set autodetection appears to fail
      the problem is the same as reported in bug#20835,
      so the fix is backport of bug#20835 patch.

Bug#40519 is duplicate of this bug.

Since http://bugs.mysql.com/40519 is regarded as a dup  (the patch for this bug also fixes that one), please ensure that the testcase *also* contains a BIGINT test so that a future regression is not possible.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/59513

2720 Sergey Glukhov	2008-11-21
      Bug#34760 Character set autodetection appears to fail
      the problem is the same as reported in bug#20835,
      so the fix is backport of bug#20835 patch.

Pushed into 5.0.74  (revid:sergey.glukhov@sun.com-20081121123959-58ffhp2nitg7f40h) (version source revid:ramil@mysql.com-20081121132058-12sfo07z1r1yvelk) (pib:5)

Noted in 5.0.74 changelog.

Pushed into 5.1.31  (revid:sergey.glukhov@sun.com-20081121123959-58ffhp2nitg7f40h) (version source revid:patrick.crews@sun.com-20081126180318-v685u61mpgoc176x) (pib:5)

Pushed into 6.0.9-alpha  (revid:sergey.glukhov@sun.com-20081121123959-58ffhp2nitg7f40h) (version source revid:ingo.struewing@sun.com-20081121151447-dtf2ofz2ys0zqed1) (pib:5)

Pushed into 5.1.31-ndb-6.2.17 (revid:tomas.ulin@sun.com-20090119095303-uwwvxiibtr38djii) (version source revid:tomas.ulin@sun.com-20090108105244-8opp3i85jw0uj5ib) (merge vers: 5.1.31-ndb-6.2.17) (pib:6)

Pushed into 5.1.31-ndb-6.3.21 (revid:tomas.ulin@sun.com-20090119104956-guxz190n2kh31fxl) (version source revid:tomas.ulin@sun.com-20090119104956-guxz190n2kh31fxl) (merge vers: 5.1.31-ndb-6.3.21) (pib:6)

Pushed into 5.1.31-ndb-6.4.1 (revid:tomas.ulin@sun.com-20090119144033-4aylstx5czzz88i5) (version source revid:tomas.ulin@sun.com-20090119144033-4aylstx5czzz88i5) (merge vers: 5.1.31-ndb-6.4.1) (pib:6)

Noted in 5.0.72sp1, 5.1.31, 6.0.9 changelogs.

Correction: This is not in 5.1.31, 6.0.9 changelogs. (5.0.x problem only)