Bug #20247 Incorrect sorting of Lithuanian national chars
Submitted: 3 Jun 2006 18:17
Modified: 22 Sep 2006 8:06
Reporter: Algirdas Brazas
Status: Not a Bug
Category: MySQL Server: Charsets
Severity: S4 (Feature request)
Version: 5.0.21
OS: Linux (Linux Slackware)
Assigned to: Domas Mituzas
CPU Architecture: Any

[3 Jun 2006 18:17] Algirdas Brazas
Description:
Hello,
Standard LST 1285:1993 describes the Lithuanian alphabet, and it also describes how the characters should be sorted.
The Lithuanian alphabet and its sort order according to the standard are as follows:
Aa Ąą Bb Cc Čč Dd Ee Ęę Ėė Ff Gg Hh Ii Įį Yy Jj Kk Ll Mm Nn Oo Pp Rr Ss Šš Tt Uu Ųų Ūū Vv Zz Žž
When sorting alphabetically, lower and upper case letters are considered to have the same weight.
At the moment, sorting with the utf8_lithuanian_ci collation is not correct: the letter 'e' comes after 'ę' and 'ė', but it should come before them. Upper and lower case characters are also ordered inconsistently: 'A' takes precedence over 'a', but 'c' comes before 'C'.

How to repeat:
Load the attached SQL file (alphabet.sql);

then:

SELECT * FROM `alphabet` ORDER BY `letter`;
You will see incorrect sorting.

SELECT * FROM `alphabet` ORDER BY `ID`;
will give you the order according to standard LST 1285:1993. (Once again, lower and upper case letters have the same weight, but they should all be ordered the same way, not as they are now, with upper case first for some letters (like A) and lower case first for others (like C).)

Suggested fix:
Change sorting for the collation utf8_lithuanian_ci so that it matches standard LST 1285:1993.
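
For illustration only, the ordering requested here can be sketched outside MySQL as a single-level sort key in which every letter of the alphabet listed above has its own position and case is ignored. This is a small Python sketch, not MySQL collation code; the names ALPHABET, RANK and strict_key are hypothetical, and the fallback for letters outside the alphabet is an assumption.

```python
# Lowercase Lithuanian alphabet in the order listed in this report.
ALPHABET = "aąbcčdeęėfghiįyjklmnoprsštuųūvzž"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def strict_key(word):
    """Case-insensitive key with one distinct rank per letter."""
    # Assumption: characters outside the alphabet sort after all known
    # letters, ordered by code point.
    return [RANK.get(ch, len(ALPHABET) + ord(ch)) for ch in word.lower()]


letters = ["ė", "E", "ę", "e", "Č", "c", "Y", "j", "į", "i"]
print(sorted(letters, key=strict_key))
# ['c', 'Č', 'E', 'e', 'ę', 'ė', 'i', 'į', 'Y', 'j']
```

With this key 'e' sorts before 'ę' and 'ė', and 'E' and 'e' compare as equal, so their relative order is simply whatever the input order was.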
[3 Jun 2006 18:20] Algirdas Brazas
SQL table and data for the Lithuanian alphabet

Attachment: alphabet.sql (text/plain), 1.01 KiB.

[13 Jun 2006 10:31] Domas Mituzas
According to VLKK (State Lithuanian Language Bureau), the 'dictionary' ordering specifies that extended vowel forms are sorted together with their base vowels (distinguished only by a secondary weight).

We will analyze whether there are conflicting standards for Lithuanian sorting and make the collation consistent with them.
[15 Aug 2006 16:38] MySQL Verification Team
Also see bug: http://bugs.mysql.com/bug.php?id=21581
[22 Sep 2006 7:49] Domas Mituzas
I have LST 1285:1993 in front of me, and it defines the following order:

AĄ
aą
B
b
C
c
Č
č
D
d
EĘĖ
eęė
F
f
G
g
H
h
IĮY
iįy
J
j
K
k
L
l
M
m
N
n
O
o
P
p
Q
q
R
r
S
s
Š
š
T
t
UŲŪ
uųū
V
v
W
w
X
x
Z
z
Ž
ž

So, accented vowels are always grouped and sorted together. Distinct words that differ only in accents should be treated as homographs, and the schema should be adjusted for that. In specialized systems, canonicalized forms can be used together with binary collations or data types.
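
To make the grouping concrete, here is a rough Python sketch of a two-level sort key built from the order quoted above: letters on the same line share a primary weight, and the accent only counts as a secondary tie-breaker. It illustrates the idea and is not the actual utf8_lithuanian_ci implementation; GROUPS, primary_key and full_key are hypothetical names, and the handling of characters outside the list is an assumption.

```python
# Primary groups in the order quoted above (case folded). Letters in one
# group share a primary weight and differ only at the secondary level.
GROUPS = ["aą", "b", "c", "č", "d", "eęė", "f", "g", "h", "iįy", "j", "k",
          "l", "m", "n", "o", "p", "q", "r", "s", "š", "t", "uųū", "v",
          "w", "x", "z", "ž"]
WEIGHTS = {ch: (p, s)
           for p, group in enumerate(GROUPS)
           for s, ch in enumerate(group)}


def _weight(ch):
    # Assumption: unknown characters sort after all listed letters.
    return WEIGHTS.get(ch, (len(GROUPS) + ord(ch), 0))


def primary_key(word):
    """Primary weights only: accents and case are ignored."""
    return [_weight(ch)[0] for ch in word.lower()]


def full_key(word):
    """Primary weights first; secondary weights only break ties."""
    return (primary_key(word), [_weight(ch)[1] for ch in word.lower()])


# Arbitrary sample strings, not taken from the report.
words = ["ėda", "edb", "ęda", "eda"]
print(sorted(words, key=full_key))               # ['eda', 'ęda', 'ėda', 'edb']
print(primary_key("Ėda") == primary_key("eda"))  # True
```

At the primary level the three spellings of "eda" are indistinguishable, which is the sense in which such words behave as homographs; comparing the raw strings instead (roughly what a binary collation does) tells them apart again.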
[22 Sep 2006 8:06] Domas Mituzas
Not a bug: utf8_lithuanian_ci strictly follows LST 1285:1993 and VLKK recommendations.