Bug #30462 Character sets: search failures with case sensitive collations
Submitted: 16 Aug 2007 16:57 Modified: 29 Jul 2008 2:21
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:6.0 OS:Linux (SUSE 10 64-bit)
Assigned to: Alexander Barkov CPU Architecture:Any

[16 Aug 2007 16:57] Peter Gulutzan
Description:
I create some tables with character columns.
I always use a _cs (case sensitive) collation.
I insert 'x','y','z','X','Y','Z'.
With latin1_general_cs: 'Y' is between 'x' and 'z'.
With latin2_czech_cs: 'Z' is greater than 'z' but 'Y' is not.
With latin7_estonian_cs: nothing is between 'x' and 'z'.
With utf8_general_cs, 'z' is greater than 'Z' but 'y' is not.

How to repeat:
mysql> create table t1 (s1 varchar(5) character set latin1 collate latin1_general_cs) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 where s1 between 'x' and 'z';
+------+
| s1   |
+------+
| x    |
| y    |
| z    |
| Y    |
| Z    |
+------+
5 rows in set (0.00 sec)

mysql>
mysql> create table t2 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=myisam;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t2 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t2 where s1 > 'z';
+------+
| s1   |
+------+
| Z    |
+------+
1 row in set (0.00 sec)

mysql>
mysql> create table t7 (s1 varchar(5) character set latin7 collate latin7_estonian_cs) engine=myisam;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t7 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 where s1 between 'x' and 'z';
Empty set (0.00 sec)

mysql>
mysql> create table tu (s1 varchar(5) character set utf8 collate utf8_general_cs) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into tu values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from tu where s1 > 'Z';
+------+
| s1   |
+------+
| z    |
+------+
1 row in set (0.00 sec)
[16 Aug 2007 17:15] MySQL Verification Team
Thank you for the bug report. Verified as described.
[4 Sep 2007 17:57] Alexander Barkov
Checking latin2_czech_cs:

mysql> drop table if exists t1;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> create table t1 (s1 varchar(5) character set latin1 collate latin1_general_cs);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| X    |
| x    | 
| Y    | 
| y    | 
| Z    | 
| z    | 
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 where s1 between 'x' and 'z';
+------+
| s1   |
+------+
| x    |
| y    |
| z    |
| Y    |
| Z    |
+------+
5 rows in set (0.00 sec)

Works as expected. This collation is with "upper case preference",
so the value "X" is not returned, because it is smaller than 'x'.

I can't see any problems here.
[4 Sep 2007 17:58] Alexander Barkov
Sorry, it should be "Checking latin1_general_cs" in the previous comment.
[4 Sep 2007 17:59] Alexander Barkov
Step2. Checking latin2_czech_cs.

mysql> drop table t1;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (s1 varchar(5) character set latin2 collate latin2_czech_cs);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 where s1 > 'z';
+------+
| s1   |
+------+
| Z    |
+------+
1 row in set (0.00 sec)

This collation is of "lower case preference".
Only letter 'Z' is greater than 'z'.
There are no problems here.
[4 Sep 2007 18:01] Alexander Barkov
Checking latin7_estonian_cs:

mysql> drop table t1;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (s1 varchar(5) character set latin7 collate latin7_estonian_cs);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| Z    |
| z    |
| X    |
| x    |
| Y    |
| y    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 where s1 between 'x' and 'z';
Empty set (0.00 sec)

In Estonian, letter Z is sorted after S and before T.
http://www.omniglot.com/writing/estonian.htm

The ORDER BY result is correct, and thus nothing is between 'x' and 'z',
and the second SELECT correctly returns empty set.

No problems here.
[4 Sep 2007 18:07] Alexander Barkov
Checking utf_general_cs:

mysql> drop table t1;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (s1 varchar(5) character set utf8 collate utf8_general_cs);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('x'),('y'),('z'),('X'),('Y'),('Z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 where s1 > 'Z';
+------+
| s1   |
+------+
| z    |
+------+
1 row in set (0.00 sec)

According to the ORDER BY, the collation is of "lower case preference".
However the second SELECT returns a wrong result.

Another SELECT demonstrating the same problem:

mysql> select *, s1>'Z' from t1 order by s1;
+------+--------+
| s1   | s1>'Z' |
+------+--------+
| x    |      0 |
| X    |      0 |
| y    |      0 |
| Y    |      0 |
| z    |      1 |
| Z    |      0 |
+------+--------+
6 rows in set (0.00 sec)

utf8_general_cs definitely is broken.

All other collations work as expected.
[4 Sep 2007 18:13] Alexander Barkov
Additional test with utf8_general_cs:

mysql> alter table t1 add key(s1);
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select *, s1>'Z' from t1 order by s1;
+------+--------+
| s1   | s1>'Z' |
+------+--------+
| X    |      0 |
| x    |      0 |
| Y    |      0 |
| y    |      0 |
| Z    |      0 |
| z    |      1 |
+------+--------+
6 rows in set (0.00 sec)

Now ORDER BY is of "upper case preference", and
the order is consistent with s1>'Z' results.
[4 Sep 2007 23:48] Peter Gulutzan
Thanks to the analysis and explanations of Alexander Barkov,
I now think that this summary is true:

The order of the letters XYZxyz differs depending on collation,
on indexing, and on operation. Using variations of this script:
 drop table if exists t7;
 create table t7 (s1 varchar(5) character set latin7
 collate latin7_estonian_cs);
 insert into t7 values ('X'),('x'),('Y'),('y'),('Z'),('z');
 select * from t7 group by s1;
 select * from t7 order by s1;
 create index i7 on t7 (s1);
 select * from t7 group by s1;
 select * from t7 order by s1;
I see this:
 latin1_general_cs, always:      XxYyZz
 latin2_czech_cs, if indexed:    xXyYzZ
 latin2_czech_cs, if unindexed:  XxYyZz
 latin7_estonian_cs, always:     ZzXxYy
 utf8_general_cs, if GROUP BY:   XxYyZz
 utf8_general_cs, if ORDER BY:   xXyYzZ
The inconsistency of latin2_czech_cs doesn't seem to contradict the manual.
The inconsistency of utf8_general_cs doesn't matter, it's not officially supported.

If Mr Barkov confirms the above statements, then I believe he can
declare this "Not a Bug".
[5 Sep 2007 5:05] Alexander Barkov
Hello Peter,

I could not verify the additional inconsistency with latin2_czech_cs.
It always returned the letters in this order: xXyXzZ
No matter, whether an index on the column s1 exists.

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 5.1.21-beta-debug
...

mysql> create table t1 (s1 varchar(5) character set latin2 collate latin2_czech_cs);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values ('X'),('x'),('Y'),('y'),('Z'),('z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 group by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> create index i1 on t1 (s1);
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t1 group by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t1 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

Can you please confirm that you get another behavior.
If so, then this report needs additional attention.

If that was a mistake, and you get consistent order
both with and without index, feel free to close
this report as "not a bug".

Thanks!
[5 Sep 2007 13:39] Peter Gulutzan
No mistake. But I was using MySQL Version 6.0.

Here is the script:

select version();
drop table if exists t7;
create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=myisam;
insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
select * from t7 group by s1;
select * from t7 order by s1;
create index i on t7 (s1);
select * from t7 group by s1;
select * from t7 order by s1;
drop table if exists t7;
create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=innodb;
insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
select * from t7 group by s1;
select * from t7 order by s1;
create index i on t7 (s1);
select * from t7 group by s1;
select * from t7 order by s1;
drop table if exists t7;
create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=falcon;
insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
select * from t7 group by s1;
select * from t7 order by s1;
create index i on t7 (s1);
select * from t7 group by s1;
select * from t7 order by s1;

Summary results:

                        5.1       6.0

MyISAM, unindexed:     xXyYzZ     XxYyZz
MyISAM, indexed:       xXyYzZ     xXyYzZ
InnoDB, unindexed:     xXyYzZ     XxYyZz
InnoDB, indexed:       xXyYzZ     xXyYzZ
Falcon, unindexed:                XxYyZz
Falcon, indexed:                  XxYyZz

5.1: MyISAM are always right. Falcon is not applicable.
6.0: MyISAM and InnoDB are wrong if indexed. Falcon is always wrong.

Detailed resuls:

mysql> select version();
+-------------------+
| version()         |
+-------------------+
| 6.0.3-alpha-debug |
+-------------------+
1 row in set (0.00 sec)

mysql> drop table if exists t7;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=myisam;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.01 sec)

mysql> create index i on t7 (s1);
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.01 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> drop table if exists t7;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=innodb;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
Query OK, 6 rows affected (0.02 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.01 sec)

mysql> create index i on t7 (s1);
Query OK, 6 rows affected (0.03 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| x    |
| X    |
| y    |
| Y    |
| z    |
| Z    |
+------+
6 rows in set (0.00 sec)

mysql> drop table if exists t7;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t7 (s1 varchar(5) character set latin2 collate latin2_czech_cs) engine=falcon;
Query OK, 0 rows affected (0.03 sec)

mysql> insert into t7 values ('X'),('Y'),('Z'),('x'),('y'),('z');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.01 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.00 sec)

mysql> create index i on t7 (s1);
Query OK, 6 rows affected (0.04 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from t7 group by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.00 sec)

mysql> select * from t7 order by s1;
+------+
| s1   |
+------+
| X    |
| x    |
| Y    |
| y    |
| Z    |
| z    |
+------+
6 rows in set (0.00 sec)
[24 Jul 2008 12:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/50407

2745 Alexander Barkov	2008-07-24
      Bug#30462 Character sets: search failures with case sensitive collations
      
      The fix for Bug # 33791 "Wrong ORDER BY with latin2_czech_cs"
      fixed this problem as well. Adding only tests. No code change
      required.
[28 Jul 2008 5:27] Alexander Barkov
Pushed into mysql-6.0.7-bugteam
[28 Jul 2008 14:44] Bugs System
Pushed into 6.0.7-alpha  (revid:alik@mysql.com-20080725172155-fnc73o50e4tgl23k) (version source revid:alik@mysql.com-20080725172155-fnc73o50e4tgl23k) (pib:3)
[29 Jul 2008 2:21] Paul DuBois
Noted in 6.0.7 changelog (same as entry for Bug#33791).

ORDER BY failed to take into account accents and lettercases in
multi-level collations (latin2_czech_cs and cp1250_czech_cs).
[14 Sep 2008 0:15] Bugs System
Pushed into 6.0.6-alpha  (revid:bar@mysql.com-20080724124527-nd973ewfxkitjp6n) (version source revid:lars-erik.bjork@sun.com-20080717135217-z3zlwwy465hmc20e) (pib:3)