MySQL Bugs: #32404: Cannot obtain accent sensitive czech collation

Bug #32404	Cannot obtain accent sensitive czech collation
Submitted:	15 Nov 2007 11:34	Modified:	18 Jan 2018 13:09
Reporter:	Petr Slavik	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	5.0, 5.1, 6.0	OS:	Linux (CentOS 4.3)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	accent sensitive, accent sensitivity, collation, cp1250, czech, latin2, utf, utf8

Description:
Hi,

I found out that MySQL doesn't support accent sensitivity (or should I call it diacritic sensitivity?) for Czech collations. Well, in our little example let's talk about utf8_czech_ci collation, but note that same thing happens with cp1250_czech_cs and latin2_czech_cs collations too. So that, while executing string based select, 'í' letter is considered similar to 'i', 'á' to 'a' etc. 

And it's definitively wrong, because in Czech language eg. word 'Mí' has very different meaning then word 'Mi'. And I can go on with examples.

Similar behavior is described here http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html
Important is the last select from germanutf8 table and two returned results. I don't know if it's correct in German language, but in Czech it's surely wrong.

How to repeat:
Similar procedure as here - http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html

1, CREATE TABLE Words (`text` CHAR(10)) CHARACTER SET utf8 COLLATE utf8_czech_ci;
2, INSERT INTO Words SET `text` = 'Plní'; (notice i-acute)
3, SELECT `text` FROM Words WHERE `text` = 'Plni'; (notice normal "i" letter)
4, Returned result set contains word 'Plní' but it should return no record -> IMHO bug

Suggested fix:
Create Czech collations with accent sensitivity.

Thank you for the reasonable feature request.

Many thanks for writing a feature request.

Please, can you look for a link where the rules are given.

I found rules here: http://cs.wikipedia.org/wiki/Abecední_řazení and at some other places.
I will get deeper into this topic an analyze if there is a bug or not.

Ok, this comment will get long.

The main rules first:

Sorting rule:

a < á < ä all three treated as 'a'
b
c
č
d < ď  both treated as d
e < é < ě all three treated as e
f
g
h
ch
i < í both treated as i
j
k
l < ĺ < ľ all three treated as l
m
n < ň both treated as n
o < ó < ô < ö all four treated as o
p
q
r < ŕ both treated as r
ř
s
š
t < ť both treated as t
u < ú < ů < ü all for treated as u
v
w
x
y < ý both treated as y
z
ž

Capitalisation that is different to English:

á Á
ä Ä
č Č
ď Ď
é É
ě Ě
í Í
ĺ Ĺ  ... not used in Czech alphabet... this letter is for Slovak alphabet compatibility
ľ Ľ  ... not used in Czech alphabet... this letter is for Slovak alphabet compatibility
ň Ň
ó Ó
ö Ö
ř Ř
š Š
ť Ť
ú Ú
ů Ů
ü Ü
ý Ý
ž Ž 

í == i is true, but in second phase when string are equal then í > i.

See example:

  'as' < 'ás' < 'aš' < 'áš'
  'ea' < 'ěa' < 'er' < 'ěr'

Now the test:

Use Terminal environment UTF8:

SET NAMES UTF8;

CREATE TABLE czechtestlatin2(v VARCHAR(10) CHARACTER SET latin2 COLLATE latin2_czech_cs);

CREATE TABLE czechtestutf8(v VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_czech_ci);

INSERT INTO czechtestlatin2 VALUES
('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('x'),('y'),('z'),('s'),('t'),('u'),('v'),('w'),
('á'),('ä'),('č'),('ď'),('é'),('ě'),('ch'),('í'),('ĺ'),('ľ'),('ň'),('ó'),('ô'),('ö'),('ŕ'),('ř'),('š'),('ť'),('ú'),('ü'),('ů'),('ý'),('ž'),('as'),('ás'),('aš'),('áš'),('ea'),('ěa'),('er'),('ěr');

INSERT INTO czechtestutf8 VALUES
('a'),('b'),('c'),('d'),('e'),('f'),('g'),('h'),('i'),('j'),('k'),('l'),('m'),('n'),('o'),('p'),('q'),('r'),('x'),('y'),('z'),('s'),('t'),('u'),('v'),('w'),
('á'),('ä'),('č'),('ď'),('é'),('ě'),('ch'),('í'),('ĺ'),('ľ'),('ň'),('ó'),('ô'),('ö'),('ŕ'),('ř'),('š'),('ť'),('ú'),('ü'),('ů'),('ý'),('ž'),('as'),('ás'),('aš'),('áš'),('ea'),('ěa'),('er'),('ěr');

SELECT * FROM czechtestlatin2 ORDER BY v;
SELECT * FROM czechtestutf8 ORDER BY v;

Result: correct sorting

SELECT * FROM czechtestlatin2 WHERE v LIKE 'i';
+------+
| v    |
+------+
| i    | 
+------+

SELECT * FROM czechtestutf8 WHERE v LIKE 'i';
+------+
| v    |
+------+
| i    | 
| í   | 
+------+

SELECT * FROM czechtestlatin2 WHERE v='i';
+------+
| v    |
+------+
| i    | 
+------+

SELECT * FROM czechtestutf8 WHERE v='i';
+------+
| v    |
+------+
| i    | 
| í   | 
+------+

SELECT * FROM czechtestlatin2 WHERE v LIKE 'í';
+------+
| v    |
+------+
| í    | 
+------+

SELECT * FROM czechtestutf8 WHERE v LIKE 'í';
+------+
| v    |
+------+
| i    | 
| í   | 
+------+

SELECT * FROM czechtestlatin2 WHERE v='í';
+------+
| v    |
+------+
| í    | 
+------+

SELECT * FROM czechtestutf8 WHERE v='í';
+------+
| v    |
+------+
| i    | 
| í   | 
+------+

Ok, Czech sorting rule says that in first instance i == í.
Also capital for i and í is different (I and Í). Means the different behaviour in latin2_czech_cs and utf8_czech_ci is not because of same capital.

The behaviour in SELECT LIKE or '=' should be the same for latin2_czech_cs and utf8_czech_ci. It should not be different and inconsistent.

Czech folk is more familiar with behaviour of latin2_czech_cs so utf8_czech_ci should be fixed.

Bug #61615 was marked as a duplicate of this one.

[29 Dec 2017 0:22] Xing Z Zhang

From 8.0.1, we have utf8mb4_cs_0900_as_cs which is accent sensitive.