Bug #31351 provide collations for AaBb... and AB...ab sorting
Submitted: 2 Oct 2007 18:26 Modified: 8 Oct 2007 8:09
Reporter: Maciej Pilichowski Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.45 OS:Any
Assigned to: Alexander Barkov CPU Architecture:Any

[2 Oct 2007 18:26] Maciej Pilichowski
Description:
Take a look at MySql own doc:
http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html

Looks great, doesn't it? There is one issue however -- there are no *_cs collations. I checked utf8_general_cs and utf8_polish_cs for example. 

I am unable to group, search in case sensitive manner.

How to repeat:
Simply run mysql examples.

Suggested fix:
Provide case sensitive collations.
[2 Oct 2007 20:47] Peter Laursen
I support this .. and I think that there are actually need for TWO case sensitive collations for each non-case-sensitive

1) A a B b C c ....
2) A B C ... a b c
[4 Oct 2007 0:15] MySQL Verification Team
Thank you for the bug report. Could you please provide a test case:

[miguel@skybr 5.0]$ bin/mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.0.52-debug Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table t6 (col1 char(1));
Query OK, 0 rows affected (0.00 sec)

mysql> show create table t6\G
*************************** 1. row ***************************
       Table: t6
Create Table: CREATE TABLE `t6` (
  `col1` char(1) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> insert into t6 values ('a'),('A');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t6 where col1 = 'A';
+------+
| col1 |
+------+
| a    | 
| A    | 
+------+
2 rows in set (0.00 sec)

mysql> alter table t6 change col1 col1 char(1) collate latin1_general_cs;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select * from t6 where col1 = 'A';
+------+
| col1 |
+------+
| A    | 
+------+
1 row in set (0.00 sec)

mysql> 

Thanks in advance.
[4 Oct 2007 15:59] Maciej Pilichowski
show create table t6;
+-------+--------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                         |
+-------+--------------------------------------------------------------------------------------------------------------------------------------+
| t6    | CREATE TABLE `t6` (
  `col1` char(1) collate utf8_polish_ci default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_polish_ci |
+-------+--------------------------------------------------------------------------------------------------------------------------------------+

select * from t6 where col1 = 'A';
+------+
| col1 |
+------+
| a    |
| A    |
+------+

select * from t6 where col1 = 'A';
+------+
| col1 |
+------+
| A    | 
+------+
[4 Oct 2007 19:37] Peter Laursen
My concern was another: use in ORDER BY.

But after experimenting al little this seems very weird, doesn't it:

CREATE TABLE `t10` (
  `col1` char(2) collate utf8_danish_ci default NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci;

INSERT INTO `t10` values ('Aa'),('aB'),('BB'),('ba');

select col1 from t10 order by col1;

/*
returns

col1  
------
col1  
------
aB    
ba    
BB    
Aa    
*/

I would have expected 

1) a _ci collation to return Aa, aB, ba, BB -- A and a are equal etc
2) a _cs collation to return Aa, aB, BB, ba -- A before a before B before b etc
3) and actually also requesting a collation that returns all UPPERCASES before LOWERCASES like Aa,BB,Øz,aB,ba,øa ('Ø'/'ø' is a Danish character)

... but maybe this is another report?
[4 Oct 2007 20:10] Maciej Pilichowski
> ... but maybe this is another report?

That's right. This report is about MISSING collations, what you wrote is about NOT WORKING collations (in short).
[4 Oct 2007 20:19] Peter Laursen
I will let the MySQL people decide if this is best handled as one report or two!

But I second you in saying that I am missing an option to ORDER BY in a (or rather two) case sensitive way(s)!
[4 Oct 2007 20:30] Peter Laursen
I think I should tell my server version!  It is 5.1.22.
[8 Oct 2007 5:17] Alexander Barkov
That's true that not every character sets provides a collation with this order: 
AaBbCcDd

Only some character sets do. These collations have "_cs" suffix in their names:

mysql> show collation like '%cs';
+--------------------+---------+----+---------+----------+---------+
| Collation          | Charset | Id | Default | Compiled | Sortlen |
+--------------------+---------+----+---------+----------+---------+
| latin1_general_cs  | latin1  | 49 |         | Yes      |       1 |
| latin2_czech_cs    | latin2  |  2 |         | Yes      |       4 |
| cp1250_czech_cs    | cp1250  | 34 |         | Yes      |       2 |
| latin7_estonian_cs | latin7  | 20 |         |          |       0 |
| latin7_general_cs  | latin7  | 42 |         |          |       0 |
| cp1251_general_cs  | cp1251  | 52 |         |          |       0 |
+--------------------+---------+----+---------+----------+---------+
6 rows in set (0.00 sec)

However, MySQL provides _bin collations for *every* character set.
_bin collations produce this sort order: ABCD...abcd.

There are some ways how to do GROUP BY in case sensitive manner:

1. Just use _bin collation in CREATE TABLE:

mysql> create table t1 (a varchar(10) character set utf8 collate utf8_bin);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1  values ('a'),('A'),('b'),('B'),('c'),('C');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select a from t1 group by a;
+------+
| a    |
+------+
| A    |
| B    |
| C    |
| a    |
| b    |
| c    |
+------+
6 rows in set (0.00 sec)

2. Use case insensitive collation CREATE TABLE,
and specify binary collation when doing GROUP BY:

mysql> create table t1 (a varchar(10) character set utf8);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 ('a'),('A'),('b'),('B'),('c'),('C');

mysql> select a from t1 group by a collate utf8_bin;
+------+
| a    |
+------+
| A    |
| B    |
| C    |
| a    |
| b    |
| c    |
+------+
6 rows in set (0.01 sec)
[8 Oct 2007 5:27] Alexander Barkov
utf8_danish_ci works correctly, how it was designed.

Double letter "aa" in Danish and Norwegian is equal to
"latin letter a with ring above", which is sorted after
letter "z".

See here for details:
http://en.wikipedia.org/wiki/Norwegian_alphabet

This result is fine:

mysql> create table t1 (a char(2) collate utf8_danish_ci);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values ('Aa'),('aB'),('BB'),('ba');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from t1 order by a;
+------+
| a    |
+------+
| aB   |
| ba   |
| BB   |
| Aa   |
+------+
4 rows in set (0.00 sec)
[8 Oct 2007 5:44] Alexander Barkov
And finally, the third thing in this bug report 
that it's impossible to sort "A before a before B before b etc".

This task will be done in MySQL 6.1:
http://forge.mysql.com/worklog/task.php?id=896

There are two very simple workarounds for the current version.

1. Put additional "binary" member into ORDER BY list:

mysql> create table t1 (a varchar(10));
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('Aa'),('aA'),('aB'),('Bb'),('BB'),('ba'),('Ba');
Query OK, 7 rows affected (0.00 sec)
Records: 7  Duplicates: 0  Warnings: 0

-- sort upper case before lower case
mysql> select a from t1 order by a, binary a;
+------+
| a    |
+------+
| Aa   |
| aA   |
| aB   |
| Ba   |
| ba   |
| BB   |
| Bb   |
+------+
7 rows in set (0.00 sec)

-- sort lower case before upper case
mysql> select a from t1 order by a, binary a desc;
+------+
| a    |
+------+
| aA   |
| Aa   |
| aB   |
| ba   |
| Ba   |
| Bb   |
| BB   |
+------+
7 rows in set (0.00 sec)

2. For 8-bit character sets, one can create a user defined case sensitive
collation providing this sort order: AaBbCcDd...
For cp1250, you need to edit these files:
/usr/share/mysql/charsets/Index.xml
/usr/share/mysql/charsets/cp1250.xml

(directory may vary, depending on compilation flags)

The format is straightforward. If you have any questions,
don't hesitate to ask me.
[8 Oct 2007 5:46] Alexander Barkov
There are no bugs. Changing category to "not a bug".
[8 Oct 2007 6:20] Maciej Pilichowski
Jesus, it is that hard to spot that the whole comments are really not the issue __I REPORTED__.

Ok, I will open another report if this is so hard to do it on your own -- thank you Peter for doing great job :-(
[8 Oct 2007 8:00] Peter Laursen
1)
I think the 'binary' keyword will only work with the character range from a to z and not national characters?

To sort like (Danish alphabet) "A B C .. Æ Ø Å a b c ... æ ø å" I think a collation is required and keyword 'binary' won't do?  But as such collation was never promised this is not a bug of course!

2)
Double letter "aa" in Danish and Norwegian is equal to
"latin letter a with ring above", which is sorted after
letter "z".

I overlooked that 'aa' is considered equal to 'å'.

In modern Danish (since language reform of year 1953) the use of 'aa' for 'å' is practically depreciated - but still found in a few city names (example 'Aalborg') as well as person names of elderly people.  No 'ordinary' word uses that form.  And if a person is named 'Claas' it is DEFINITELY not 'Clås'! .. but that would probably be hard to solve without using a dictionary!
[8 Oct 2007 8:09] Maciej Pilichowski
Alexander, could you please put Peter as a reporter? I don't want to get any mail regarded to this (not mine, really, now) report, since I am not interested in multiple CS sorting modes. Thank you in advance.
[8 Oct 2007 8:44] Alexander Barkov
Maciej, I'm sorry - the bug system doesn't allow to change the reporter.
I suggest that you keep posting into the same bug report.

Peter, please open a new bug report if you still think that there are bugs
related to ORDER BY or GROUP BY.

Peter, brief answers to the questions in your last post:

1. That's true that "binary"  won't return this order:

 "A B C .. Æ Ø Å a b c ... æ ø å"

A separate collation is required.

2. It seems that MySQL needs to provide two collations for Danish:
traditional (with this special rules for AA), and modern,
without special rules for AA. Like some other databases do for Danish,
and like MySQL does for some languages, e.g. German and Spanish.