MySQL Bugs: #50909: utf8_general_cs collation is not case sensitive

Bug #50909	utf8_general_cs collation is not case sensitive
Submitted:	4 Feb 2010 12:04	Modified:	12 Feb 2016 6:24
Reporter:	Paul Tobias	Email Updates:
Status:	Won't fix	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.43	OS:	Linux
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
I need a way to sort with regards to accents, and to also do a case sensitive search.
The collation utf8_general_ci sorts really well, but it is not case sensitive.
The collation utf8_bin is case sensitive, but does not sort accents well.
So I need a collation like utf8_general_ci, but case sensitive. This would be utf8_general_cs, which is not avalilable by default, only when mysql is built with --with-experimental-collations configure option.
But when mysql is built with experimental collations, the utf8_general_cs collation is not case sensitive.
In addition to version 5.1.43 I have tested it with version 5.5.1-m2 too, with the same results.
I apologize for my english, I'm not a native english speaker.

How to repeat:
1. build mysql with --with-experimental-collations configure option
2. create a test table:
CREATE TABLE `collationtest` (
  `text` text COLLATE utf8_general_cs NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_general_cs;
INSERT INTO `collationtest` VALUES ('test'),('Test'),('TEST');
3. execute a select which should be case sensitive:
select * from collationtest where text collate utf8_general_cs like 'test';

Actual result are not case sensitive:
select * from collationtest where text collate utf8_general_cs like 'test';
+------+
| text |
+------+
| test |
| Test |
| TEST |
+------+
3 rows in set (0.00 sec)

Expected results should be case sensitive, like the utf8_bin collation:
select * from collationtest where text collate utf8_bin like 'test';
+------+
| text |
+------+
| test |
+------+
1 row in set (0.00 sec)

Many thanks for writing a bug report.

Verified as described.

Hi Paul,

you're right that LIKE operator does not work well in utf8_general_cs.
However, as far as this is an experimental collation, we won't do anything
with it at this point.

Note, the '=' operator works case sensitively as expected.

As a simple workaround you can use this query:

select * from collationtest where text collate utf8_general_cs like binary 'test';

or even you don't need the collate clause:

select * from collationtest where text like binary 'test';

Note, instead of utf8_general_cs, you can similary use
the "standard" collation utf8_general_ci, with the "binary" keyword
in LIKE. It will have the same effect.

It's not recommended to use utf8_general_cs, because there's no 
guarantee it will stay in the code in the future.

Thank you for your reply.

My motivation with this bugreport was to get utf8_general_cs collation out of experimental status. Because without a case sensitive utf8 collation there is no way to be case sensitive _and_ use accents at the same time.

Unfortunately utf8_bin collation or binary keyword workarounds does not work in this case. I'll try to show why.

Let's assume the following table:
INSERT INTO collationtest VALUES ('test'),('Test'),('TEST'),('tést'),('Tést'),('TÉST');
SELECT * FROM collationtest;
+-------+
| text  |
+-------+
| test  |
| Test  |
| TEST  |
| tést  |
| Tést  |
| TÉST  |
+-------+
6 rows in set (0.00 sec)

SELECT * FROM collationtest WHERE text collate utf8_general_ci like 'test';
Actual resuts:
+-------+
| test  |
| Test  |
| TEST  |
| tést  |
| Tést  |
| TÉST  |
+-------+
Expected results:
Same as actual results.

SELECT * FROM collationtest WHERE text collate utf8_general_cs like 'test';
Actual results:
+------+
| test |
| Test |
| TEST |
+------+
Expected results:
+------+
| test |
| tést |
+------+

SELECT * FROM collationtest WHERE text collate utf8_general_ci like binary 'test';
Actual results:
+-------+
| test  |
| Test  |
| TEST  |
| tést  |
| Tést  |
| TÉST  |
+-------+
Expected results:
+-------+
| test  |
+-------+

SELECT * FROM collationtest WHERE text collate utf8_general_cs like binary 'test';
Actual results:
+------+
| test |
| Test |
| TEST |
+------+
Expected results:
+------+
| test |
+------+

SELECT * FROM collationtest WHERE text like binary 'test';
Actual results:
+------+
| text |
+------+
| test |
+------+
Expected results:
Same as actual results.

SELECT * FROM collationtest WHERE text collate utf8_general_ci = 'test';
Actual results:
+-------+
| test  |
| Test  |
| TEST  |
| tést  |
| Tést  |
| TÉST  |
+-------+
Expected results:
Same as actual results.

SELECT * FROM collationtest WHERE text collate utf8_general_cs = 'test';
Actual results:
+------+
| test |
+------+
Expected results:
+-------+
| test  |
| tést  |
+-------+

Currently there is no way to get the following results:
+-------+
| test  |
| tést  |
+-------+
In other words be case sensitive and use accents at the same time;

All the tests were performed with mysql version 5.5.1-m2.

Hi,

my comments on all this:
1/ That would be indeed good if the case sensitive UTF-8 collation went out of experimental.

2/ I don't agree with Paul on the fact that accented should be considered like the non-accented character in a comparison. If I want "test", I don't want "tést" to be returned, that's different!
After obviously we may have another collation which does this kind of comparison, but for the basic utf8_general_cs and utf8_general_ci, I would prefer it to differentiate 'e' and 'é' (as for any other accented character, and any character in general).

Posted by dev manager:

[9 Feb 2016 2:18] Manyi Lu

The experimental code for utf8_general_cs collation will be removed from the
repository. We are cleaning up dead code.

Proper case and accent sensitive collations will be provided once secondary
and tertiary sort is implemented. We will not use the experimental code.