MySQL Bugs: #10252: max_sort_length causes incorrect results with utf8 char columns

Bug #10252	max_sort_length causes incorrect results with utf8 char columns
Submitted:	29 Apr 2005 0:57	Modified:	25 Apr 2006 0:30
Reporter:	Peter Gulutzan	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.0.6-beta-debug	OS:	Linux (SUSE 9.2)
Assigned to:		CPU Architecture:	Any

Description:
The manual says max_sort_length is "The number of bytes to use when sorting BLOB or
TEXT values. Only the first max_sort_length bytes of each value are used; the rest are
ignored." Fine. But if I'm sorting a CHAR column with UTF8 characters, none of which
are more than one byte long, the number of bytes used is less than the max_sort_length
that I specify. So the result is not in order.

How to repeat:
mysql> create table t354 (s1 char(5) character set utf8);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t354 values ('aaaab'),('aaa'),(null),('aaaz'),('aaac');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> set @@max_sort_length=5;
Query OK, 0 rows affected (0.00 sec)

mysql> select @@max_sort_length;
+-------------------+
| @@max_sort_length |
+-------------------+
|                 5 |
+-------------------+
1 row in set (0.00 sec)

mysql> select s1 from t354 order by s1;
+-------+
| s1    |
+-------+
| NULL  |
| aaaab |
| aaa   |
| aaaz  |
| aaac  |
+-------+
5 rows in set (0.01 sec)

mysql> select s1, octet_length(s1) from t354 order by s1 desc;
+-------+------------------+
| s1    | octet_length(s1) |
+-------+------------------+
| aaaab |                5 |
| aaa   |                3 |
| aaaz  |                4 |
| aaac  |                4 |
| NULL  |             NULL |
+-------+------------------+
5 rows in set (0.00 sec)

This is not a bug.  For UTF8, a sort key uses exactly two bytes per chararacter,
no matter how many bytes are used by the character itself: 1,2,3.

Thus, only two characters weights can fit into 5 bytes.

The manual says
"Only the first max_sort_length bytes of each value are used; the rest are ignored".
This statement should be changed, and written in more details.
To doc team: feel free to ask me if you have any additional questions.

Just to clarify my previous comment:

max_sort_length limits not the length of the original data.
It limits length of the sort key, corresponding to the data.

In latin1_swedish_ci, sort key is just LCASE transformation
of the original string. Lengths of the data itself and sort keys
are the same.

In latin1_german2_ci, there are some tricky characters
whose sort keys use two bytes:  
  key for "ä" is "ae".
  key for "ö" is "oe"
  key for "ü" is "ue"
  key for "ß" is "ss"
Thus if you have max_sort_length=1, the second byte of the keys
will not be used, and these letters will be sorted like "a","o","u" and "s".
All other letters use one byte for sort key in latin1_german2_ci.

utf8_general_ci uses exactly two bytes per one character in sort key.

utf8_unicode_ci uses variable length weights, from 1 to 8 bytes per character.

Each collation can handle it in its own way.

Changed to Not a Bug from Documenting because this bug does not need a changelog entry.

Re-opened and assigned to me.