MySQL Bugs: #20247: Incorrect sorting of Lithuanian national chars

Bug #20247	Incorrect sorting of Lithuanian national chars
Submitted:	3 Jun 2006 18:17	Modified:	22 Sep 2006 8:06
Reporter:	Algirdas Brazas
Status:	Not a Bug
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	5.0.21	OS:	Linux (Linux Slackware)
Assigned to:	Domas Mituzas	CPU Architecture:	Any

Description:
Hello,
Standard LST 1285:1993 describes the Lithuanian alphabet and also specifies how its characters should be sorted.
The Lithuanian alphabet and its sort order according to the standard are as follows:
Aa Ąą Bb Cc Čč Dd Ee Ęę Ėė Ff Gg Hh Ii Įį Yy Jj Kk Ll Mm Nn Oo Pp Rr Ss Šš Tt Uu Ųų Ūū Vv Zz Žž
When sorting alphabetically, lowercase and uppercase letters are considered to have the same weight.
At the moment, sorting with the utf8_lithuanian_ci collation is not correct: the letter 'e' comes after 'ę' and 'ė' when it should come before them. Uppercase and lowercase characters are also ordered inconsistently: 'A' sorts before 'a', but 'c' sorts before 'C'.
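
The ordering expected here can be written down as a small sketch. This is only an illustration of the order described in this report, not MySQL's implementation; the names `ALPHABET`, `RANK` and `reporter_key` are made up for the example, and placing unknown characters after all listed letters is an assumption.

```python
# Alphabet order as listed in this report (lowercase forms only).
ALPHABET = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def reporter_key(word):
    """Sort key for the order the report expects.

    Case is folded first, so 'A' and 'a' get the same weight.
    Characters outside the alphabet sort after all listed letters
    (an assumption made only for this sketch).
    """
    return [RANK.get(ch, len(ALPHABET)) for ch in word.lower()]


letters = ["ė", "E", "ę", "a", "Ą", "c", "C"]
print(sorted(letters, key=reporter_key))
```

Under this key 'e' sorts before 'ę' and 'ė', and 'C' and 'c' compare as equal, which is the behaviour requested in the report.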

How to repeat:
Load the attached SQL file;

then:

SELECT * FROM `alphabet` ORDER BY `letter`;
You will see incorrect sorting.

SELECT * FROM `alphabet` ORDER BY `ID`;
will give you the sorting according to standard LST 1285:1993 (once again, lowercase and uppercase letters have the same weight, but they should all be ordered consistently, not as now, where some, like A, sort uppercase first and others, like C, sort lowercase first);

Suggested fix:
Change sorting for the collation utf8_lithuanian_ci so that it matches standard LST 1285:1993.

SQL table and data for the Lithuanian alphabet

Attachment: alphabet.sql (text/plain), 1.01 KiB.

According to VLKK (State Lithuanian Language Bureau), the 'dictionary' ordering specifies that extended vowel forms are sorted together (as a secondary weight).

We will analyze whether there are conflicting standards for Lithuanian sorting and make the collation consistent with them.

Also see bug: http://bugs.mysql.com/bug.php?id=21581

I have LST 1285:1993 in front of me, and it defines the following order:

AĄ
aą
B
b
C
c
Č
č
D
d
EĘĖ
eęė
F
f
G
g
H
h
IĮY
iįy
J
j
K
k
L
l
M
m
N
n
O
o
P
p
Q
q
R
r
S
s
Š
š
T
t
UŲŪ
uųū
V
v
W
w
X
x
Z
z
Ž
ž

So, accented vowels are always grouped and sorted together. Distinct words that differ only in accents should be treated as homographs, and the schema should be adjusted for that. In specialized systems, canonicalized forms can be used together with binary collations or data types.
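
The grouping above can be sketched as a two-level sort key. This is a rough illustration under stated assumptions, not the code behind utf8_lithuanian_ci: the groups are copied from the list quoted above, case is ignored (as the `_ci` suffix suggests), and the names `GROUPS`, `primary_key` and `dictionary_key` are hypothetical.

```python
# Letter groups from the order quoted above; letters in one group
# share a primary weight. Case is folded away (assumption: _ci).
GROUPS = ["aą", "b", "c", "č", "d", "eęė", "f", "g", "h", "iįy", "j", "k",
          "l", "m", "n", "o", "p", "q", "r", "s", "š", "t", "uųū", "v",
          "w", "x", "z", "ž"]
PRIMARY = {ch: i for i, group in enumerate(GROUPS) for ch in group}
SECONDARY = {ch: j for group in GROUPS for j, ch in enumerate(group)}


def primary_key(word):
    """Primary weights only: grouped letters compare as equal."""
    return [PRIMARY.get(ch, len(GROUPS)) for ch in word.lower()]


def dictionary_key(word):
    """Primary weights first; position within a group only breaks ties."""
    w = word.lower()
    return (primary_key(w), [SECONDARY.get(ch, 0) for ch in w])


# Words differing only in a grouped vowel are equal at the primary level.
print(primary_key("kasti") == primary_key("kąsti"))
# Arbitrary test strings: the accent matters only when all else is equal.
print(sorted(["ėda", "edb", "ęda", "eda"], key=dictionary_key))
```

With `primary_key` alone, 'e', 'ę' and 'ė' are indistinguishable, which is why words differing only in those letters behave as homographs. Letters such as 'č', 'š' and 'ž' stay distinct because they have their own entries in the list.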

Not a bug: utf8_lithuanian_ci strictly follows LST 1285:1993 and VLKK recommendations.