Bug #1264 Incorrect behaviour of ORDER BY using indexed (var)char columns in ucs2 tables
Submitted: 12 Sep 2003 10:27 Modified: 16 Sep 2003 4:27
Reporter: Simon Detheridge Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.1alpha (source distribution) OS:Linux (linux (gentoo))
Assigned to: Alexander Barkov CPU Architecture:Any

[12 Sep 2003 10:27] Simon Detheridge
Description:
When using a ucs2 table in MySQL, either with ucs2_general_ci or ucs2_bin collation, words are returned in an incorrect order when using ORDER BY on an _indexed_ CHAR or VARCHAR column. They are sorted with the longest word *first* instead of last. I.E. The word "aardvark" is in the results before the word "a".

If there is no index for the column, the problem does not occur.

When creating the table using the same SQL, but specifying latin1 as the character set, there are no problems in this area.

How to repeat:
Execute the following SQL:
DROP TABLE foo ;
CREATE TABLE foo (
   word VARCHAR(64) ,
   bar INT(11) default 0 ,
   PRIMARY KEY ( word ) )
   TYPE=MyISAM
   CHARSET ucs2
   COLLATE ucs2_general_ci ;
INSERT INTO foo ( word ) VALUES ( "aar" ) ;
INSERT INTO foo ( word ) VALUES ( "a" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvar" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvark" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvara" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvarz" ) ;

The following query shows the results in the wrong order:
mysql> SELECT * FROM foo ORDER BY word ;
+----------+------+
| word     | bar  |
+----------+------+
| aardvara |    0 |
| aardvark |    0 |
| aardvarz |    0 |
| aardvar  |    0 |
| aar      |    0 |
| a        |    0 |
+----------+------+
6 rows in set (0.00 sec)

Interestingly, if there is no second column, the words are returned in the correct order. To demonstrate this, execute the following SQL:
DROP TABLE foo ;
CREATE TABLE foo (
   word VARCHAR(64) ,
   PRIMARY KEY ( word ) )
   TYPE=MyISAM
   CHARSET ucs2
   COLLATE ucs2_general_ci ;
INSERT INTO foo ( word ) VALUES ( "aar" ) ;
INSERT INTO foo ( word ) VALUES ( "a" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvar" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvark" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvara" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvarz" ) ;

And the following query shows that the order now seems correct:
mysql> SELECT * FROM foo ORDER BY word ;
+----------+
| word     |
+----------+
| a        |
| aar      |
| aardvar  |
| aardvara |
| aardvark |
| aardvarz |
+----------+
6 rows in set (0.00 sec)

The following SQL and query show that if the column is *not* indexed, the words are returned in the correct order. The following SQL and query demonstrate this:

DROP TABLE foo ;
CREATE TABLE foo (
   word TEXT ,
   bar INT(11) AUTO_INCREMENT ,
   PRIMARY KEY ( bar ) )
   TYPE=MyISAM
   CHARSET ucs2
   COLLATE ucs2_general_ci ;
INSERT INTO foo ( word ) VALUES ( "aar" ) ;
INSERT INTO foo ( word ) VALUES ( "a" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvar" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvark" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvara" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvarz" ) ;

mysql> SELECT * FROM foo ORDER BY word ;
+----------+-----+
| word     | bar |
+----------+-----+
| a        |   2 |
| aar      |   1 |
| aardvar  |   3 |
| aardvara |   5 |
| aardvark |   4 |
| aardvarz |   6 |
+----------+-----+
6 rows in set (0.00 sec)

For reference, the following SQL and example query show that when the table is created with a latin1 charset, and latin1_swedish_ci collation, the problem does not occur:
DROP TABLE foo ;
CREATE TABLE foo (
   word VARCHAR(64) ,
   bar INT(11) default 0 ,
   PRIMARY KEY ( word ) )
   TYPE=MyISAM
   CHARSET latin1
   COLLATE latin1_swedish_ci ;
INSERT INTO foo ( word ) VALUES ( "aar" ) ;
INSERT INTO foo ( word ) VALUES ( "a" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvar" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvark" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvara" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvarz" ) ;

mysql> SELECT * FROM foo ORDER BY word ;
+----------+------+
| word     | bar  |
+----------+------+
| a        |    0 |
| aar      |    0 |
| aardvar  |    0 |
| aardvara |    0 |
| aardvark |    0 |
| aardvarz |    0 |
+----------+------+
6 rows in set (0.00 sec)

Suggested fix:
Seems like a bug in the code to me. I would have suggested that the defenition of the ucs2 charset or collation was to blame, except for the fact that the buggy behaviour is not displayed when a second column is present. (Incidentally, I've checked adding the second column both before *and* after the "word" column, same results either way.)

Seems a patch is needed to correct the behaviour of ORDER BY on ucs2 (var)char columns.

A (quite unuseful) workaround for now, is to *not* index char and varchar columns in ucs2 tables.
[15 Sep 2003 5:56] Simon Detheridge
Additional info: 
 
This bug does not affect all table types. I have tested it with MyISAM, ISAM, HEAP, and BDB, 
and they all display the buggy behaviour. However, InnoDB tables do not. They display the 
results in the correct order. 
 
Using InnoDB tables therefore seems a better workaround for now, than removing indexes on 
CHAR/VARCHAR columns.
[15 Sep 2003 10:33] Indrek Siitan
It seems to be dependent on the number of columns displayed, not 
actually contained in the db:

DROP TABLE foo ;
CREATE TABLE foo (
   word VARCHAR(64) ,
   bar INT(11) default 0 ,
   PRIMARY KEY ( word ) )
   TYPE=MyISAM
   CHARSET ucs2
   COLLATE ucs2_general_ci ;
INSERT INTO foo ( word ) VALUES ( "aar" ) ;
INSERT INTO foo ( word ) VALUES ( "a" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvar" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvark" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvara" ) ;
INSERT INTO foo ( word ) VALUES ( "aardvarz" ) ;

mysql> select * from foo order by word;
+----------+------+
| word     | bar  |
+----------+------+
| aardvara |    0 |
| aardvark |    0 |
| aardvarz |    0 |
| aardvar  |    0 |
| aar      |    0 |
| a        |    0 |
+----------+------+
6 rows in set (0.00 sec)

mysql> select word from foo order by word;
+----------+
| word     |
+----------+
| a        |
| aar      |
| aardvar  |
| aardvara |
| aardvark |
| aardvarz |
+----------+
6 rows in set (0.00 sec)

According to EXPLAIN, it looks like when the output includes columns that
are not part of the index sorted on, it does a filesort, which fails. Using
a straight index yields correct results.

mysql> explain select * from foo order by word \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: foo
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 6
        Extra: Using filesort
1 row in set (0.00 sec)

mysql> explain select word from foo order by word \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: foo
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 128
          ref: NULL
         rows: 6
        Extra: Using index
1 row in set (0.01 sec)
[16 Sep 2003 1:39] Alexander Barkov
Thank you for your bug report. This issue has already been fixed
in the latest released version of that product, which you can download at 
http://www.mysql.com/downloads/
[16 Sep 2003 4:27] Alexander Barkov
Thank you for your bug report. This issue has already been fixed
in the latest released version of that product, which you can download at 
http://www.mysql.com/downloads/

Ok, it's now really fixed. Test suit is also added.