MySQL Bugs: #12519: Incorrect Hungarian collation

Bug #12519	Incorrect Hungarian collation
Submitted:	11 Aug 2005 12:45	Modified:	18 Jan 2018 13:03
Reporter:	Peter Ritzinger	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	4.1.13 and 5.0.10a	OS:	Any (*)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
The latin2_hungarian_ci collation in MySQL is not like the standard hungarian collation, it produces incorrect comparison results and incorrect order of rows.

How to repeat:
First, the comparisons:

SELECT
(CAST('a' AS char character set latin2) COLLATE latin2_hungarian_ci) =
(CAST('á' AS char character set latin2) COLLATE latin2_hungarian_ci)
,
(CAST('ccs' as char character set latin2) COLLATE latin2_hungarian_ci) =
(CAST('cscs' as char character set latin2) COLLATE latin2_hungarian_ci)

The correct result should be: the first comparison should be unequal, the second should be equal. But I get equal and unequal.

Second, the order of rows:

CREATE TABLE `t` (
  `id` int(11) NOT NULL auto_increment,
  `n` varchar(45) collate latin2_hungarian_ci NOT NULL,
  PRIMARY KEY  (`id`)
) DEFAULT CHARSET=latin2 COLLATE=latin2_hungarian_ci;

INSERT INTO `t` VALUES (1,'ab'),(2,'áb'),(3,'az'),(4,'áz'),(5,'comb'),(6,'cukor'),(7,'csak'),(8,'tz'),(9,'tty');

SELECT * FROM `t` ORDER BY `n` ASC, `id` ASC;
SELECT * FROM `t` ORDER BY `n` ASC, `id` DESC;

The result of both select statements should be:
1 'ab'
2 'áb'
3 'az'
4 'áz'
5 'comb'
6 'cukor'
7 'csak'
8 'tz'
9 'tty'

But I get the result for the first query:
1 'ab'
2 'áb'
3 'az'
4 'áz'
5 'comb'
7 'csak'
6 'cukor'
9 'tty'
8 'tz'

The result of the second query:
2 'áb'
1 'ab'
4 'áz'
3 'az'
5 'comb'
7 'csak'
6 'cukor'
9 'tty'
8 'tz'

Suggested fix:
I think that Hungarian collation should be a compiled collation, because of the complex rules. And there should be a Hungarian collation for utf8 and ucs2 character sets, too.

I don't know, how much do you know about the standard Hungarian collation. I summarize the rules here:

First, the alphabet:

a
á
b
c
cs
d
dz
dzs
e
é
f
g
gy
h
i
í
j
k
l
ly
m
n
ny
o
ó
ö
ő
p
q
r
s
sz
t
ty
u
ú
ü
ű
v
w
x
y
z
zs 

Each vowel has a short version and a long version, both a different character.
They are the following pairs:
a á
e é
i í
o ó
ö ő
u ú
ü ű

In general, when ordering or comparing, the short and the long version should be treated as equal, except when the order cannot be decided by the characters after the vowel.
'ab' < 'áb': The order cannot be decided by the characters after the vowel.
'áb' < 'az': The order can be decided by the characters after the vowel.
And 'a'<'á', not 'a'='á'

There is a bigger problem with consonants, because there are consonants written by two characters, and there is a consonant written using three characters.
They are the following:

cs
dz
dzs
gy
ly
ny
sz
ty
zs 

When ordering they should be treated as one letter, which is always greater then the preceding letter in the alphabet.
'cukor' < 'csak': 'cs' is a consonant, following 'c' in the alphabet.
Of course a computer without a builtin dictionary cannot recognize those rare situations, when in a word a 'c' is followed by an 's', but they should be treated as two separate consonants. So in Hungarian sorting algorithms, when there is for example a 'cs' in the string, it is always treated as one consonant.

But we are not at the and! :(
When two same multi-character consonant is following each other, they are written as the following:
cs-cs: ccs
dz-dz: ddz
dzs-dzs: ddzs
gy-gy: ggy
ly-ly: lly
ny-ny: nny
sz-sz: ssz
ty-ty: tty
zs-zs: zzs

When parsing the string, these substrings should be recongized as two separate multi-character consonants.
For example: 'tz'<'tty' because 'tz' is 't' and 'z', 'tty' is 'ty' and 'ty', and the order is decided by the first letters: 't'<'ty'.

It is a question, what should be done with characters that are not in Hungarian alphabet, but they can occur in family names, like 'ë'. I think they should be treated as they would be a vowel in the alpabet following the according Hungarian vowels, for example, ... 'e', 'é', 'ë', 'f' ...

Any answer?
Can we hope, that this thing will be fixed in a future release?

Peter, thank you very much for so detailed description.
I'm sorry, adding a real Hungarian collations is a separated complex task.
I'm changeing severity of this task to "feature request".

latin2_hungarian_ci is just a simplified version which doesn't cover complex
multi-letter rules. It was designed to work this way.

Peter, can you clarify please what should happen in this case:

CREATE TABLE t1 (a char(1));
INSERT INTO t1 VALUES  ('a');
INSERT INTO t1 VALUES  ('á');

What should these queries return:

SELECT * FROM t1 WHERE a='a';
SELECT * FROM t1 WHERE a='á';

Both records for both queries?
Or one record for each query?

What should this query return:
SELECT COUNT(DISTINCT a) FROM t1;
?

Thanks!

I am very sorry for the late answer.

I hope I can help with the following information:

Generally, 'a' and 'á' are two different vowel, they should be treated as NOT equal. They should be only treated equal when the order of the two words can be determined by the characters following the vowels. But in this case, because of the difference in the following characters, the two string cannot be equal.

This rule is true in when the following string is the empty string. In this case the order can NOT be determined by the "following characters" (the empty string), so 'a' < 'á'.

So each example query should return one row.
And the the count distinct query should return 2.

As I review my bug report written in August 2005, I can found the following line:

And 'a'<'á', not 'a'='á'

So the answer for your last querstion was in the original bug report.

I hope very much, that if you have started to implement the sorting code, you do not have to rewrite it because the late information.

MySQL is looking for an authoritative, official statement
which states all the current Hungarian collation rules.
Please let other MySQL-using Hungarians (especially if you
know a user group in Hungary) know about these
questions. Best of all would be a translation of the
Hungarian government standard, if there is one.

MySQL has received several complaints/suggestions about
Hungarian collation. For example these three people
contacted us via a public MySQL mailing list or bugs forum:
RITZINGER Peter (http://bugs.mysql.com/bug.php?id=12519)
BÁRTHÁZI András (http://lists.mysql.com/mysql/191427)
Csongor Fagyal (http://bugs.mysql.com/bug.php?id=22337)
In what follows I will refer to what seems to be agreed,
and what seems to be disputed.

The current latin2_hungarian_ci collation
is a chart in sql/share/charsets/latin2.xml,
and Mr Barkov has provided an easy-to-read web page: 
http://myoffice.izhnet.ru/bar/~bar/charts/latin2_hungarian_ci.html
This collation is unlike the Hungarian dictionaries,
collation descriptions, or products that we've seen.
For example the first letter is:
Latin Capital Letter A
= Latin Small Letter A
= control Single Shift 3
= No-Break Space
= Latin Small Letter L with caron
= Latin Capital Letter A with acute
= Latin Small Letter A with acute
But there is no reason that small L with caron
(which is Slovak not Hungarian) ever sorts with
A, there's some dispute whether A with acute
should sort with A, and all other
accented variants of A should be in this list too.
It is likely that MySQL will deprecate this
collation (which implies that MySQL will eventually
remove it), after introducing a new and more correct
one.

Most people agree that this is the Hungarian alphabet;
a á b c cs d dz dzs e é f g gy h i í j k l ly m n
ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs

(The DOUBLE ACUTE letters ő and ű are sometimes shown
as õ and û but I suspect that is a conversion error.)

Some people also say there's a secondary sort
rule for these short/long vowel pairs:
a á, e é, i í, o ó, ö ő, u ú, ü ű
For these pairs, long = short usually, but long > short
if all else is equal.
I have seen comments showing that Oracle seems to follow this rule:
'BÁ'>'BA' is true
'BÁ'>'BAC' is false
but the commenter, though Hungarian, didn't like what Oracle did.
(thread 'nlssort' on comp.databases.oracle.server 2002-11-10)
One commenter wrote to us about a similar thing, saying
it's a mistake that SELECT 'hal' LIKE 'hál' is true.
Unfortunately, the same person also disagrees, saying
that we should have two collations, one where
long > short, one where long = short.
I have also seen Simonsen's rules:
http://std.dkuug.dk/i18n/locales/hu_HU
They suggest that A-acute > A, etc.
I have also seen argument about the same thing for glibc:
http://sources.redhat.com/ml/libc-locales/2005-q4/msg00002.html

Apparently all Hungarians agree that these digraphs are "letters":
cs dz dzs gy ly ny sz ty zs
That's bad but not very bad. MySQL handles digraphs in Spanish.
There is also one trigraph:
dzs
That's very bad. Luckily dzs is rare, it's mostly for
English words with a "j" sound (bridge is 'briddz',
gin is 'dzsinn') (so I'm told).

There is a special rule when you see the first part of a
digraph followed by the digraph. For example, in 'ggy',
'g' is the first part of 'gy' and it's followed by 'gy'
... and MySQL treats it as a repetition of the digraph, i.e.
as if it's 'gygy'. This applies to all the letters listed
in the previous paragraph, so:
ccs = cscs, ddz = dzdz, ddsz = dzsdzs, ggy = gygy,
lly = lyly, nny = nyny, ssz = szsz, tty = tyty, zzs = zszs.
For example, Mr Ritzinger says that
'tty > tz' because tty is expanded to tyty.
I know that other products handle the situation, but I've
seen them called "double compressions", which worries me --
do some people think that 'cscs sorts with ccs' rather than
'ccs sorts with cscs'?

A collation which follows the single-character rules, but
ignores digraphs and trigraphs, sounds somewhat like what I
see in Kaplan's remarks on Microsoft's Hungarian Technical Sort:
http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx
One of the above-listed people would accept this, he says he doesn't
care about digraphs or trigraphs. But I have no idea whether
Microsoft was following some "technical" standard.

All characters outside the Hungarian alphabet should be done
according to UCA 4.0.0 (until MySQL switches to the newer UCA).

For Unicode support, I suggest names for the new collations
should be: ucs2_hungarian2_ci, utf8_hungarian2_ci. The only
other character sets that may have Hungarian collations are
latin2 and cp1250.

Our concern at this time is only for the "primary sort", the
collation necessary for searches. The "secondary sort" or
"tertiary sort" rules, the ones that affect only ORDER BY,
are of interest but will only be of importance in the future.

....

I have sent the above message to mysql@lists.com
(subject line "Hungarian collation"). And I will
add a note on bug#22337, to refer to bug#12519.

Some clarifications.

1. You are right, the Hungarian alphabet looks like this:
a á b c cs d dz dzs e é f g gy h i í j k l ly m n
ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs

2. There is no Hungarian character û and õ. That is usually a conversion error, mostly because ISO-8859-1 cannot handle ő and ű, so these are the closes equivalents.

3. I think is is very important do distingush sorting order on letters consisting of multiple characters, such as dz, sz, etc. and accented vowels, such as á,é,ű, etc.
Sorting following the order of the alphabet (e.g.: egye > egze) is "academic". While this is the official sorting order, it is almost impossible to create a sorting that knows wether gy is one letter or just g an y place coincidentally one after the other - an algorythm like that should be aware of all Hungarian words, IMHO. Also, even dictionaries often fail to follow that sorting. I believe that most Hungarians themselves are even unsure of the correct ordering regarding these special, multi-character letters - some even find the academic ordering "misleading" while searching for words in dictionaries, for example.
On the other hand, that is not true for accented vowels, though. A < Á, O < Ö and so on, all the time - ant what is even more important, A should NEVER be equal to Á, U to Ű and so on (this make unique char(N) indexes totally unuseable in some MySQL applications).

4. I would propose a collation where accented vowels are handled according to the Hungarian alphabet, while multiple character letters are sorted without considering them being multi-character. This is what most software currenty use anyway - people are used to it already.

I would also like to add that I feel the need for a case-sensitive collation.

The reason: to be able to use distinct *character codes* in a unique index, which includes

Ab <> ab
áá <> ÁÁ

and so on.

We are very grateful for the comments on this bug report, and the email
exchange on lists.mysql.com (http://lists.mysql.com/mysql/202806).
We are still hoping for more responses, especially from anyone who
has seen an official Hungarian standard.

The currently valid and official "Hungarian standard" ("The rules of Hungarian spelling") can be found here:

http://mek.oszk.hu/01500/01547/

http://mek.oszk.hu/01500/01547/01547.pdf

The problem is that it is written in Hungarian, so I think you can not use it.

The rules about ordering words are in chapters 14. and 15. on pages 9-10-11. These rules say the same that I have written in my original bug report.

But if you really want, I can translate this part of the standard for you. Or I can provide an algorithm-like description of the ordering procedure.

Hi,

http://www.tug.org.in/tug2002/presentations/bujdoso.pdf

This presentation is about localizing Tex to Hungarian.
What you exactly need (alphabetical order) starts at page 59.

Thank you

I've made some research and I found out, that there is NO translation to any foreign language of the Hungarian Orthography book(s). I'm afraid that our suggestions should be enough... Sorry.

Thanks to all participants in the discussion
"Hungarian collation" on lists.mysql.com.
I have posted the resulting plan here:
http://lists.mysql.com/mysql/204718

I would appreciate if any of the Hungarian authorities who contributed to or are
monitoring this thread could comment on whether the issue is addressed?  In 5.0?
Also: I've been looking for a Query tool that handles Hungarian properly.  Any
recommendations appreciated.

It is not possible to have a correct Hungarian collation without knowing the meaning of the sorted words, see for example:

CSS < CZAKÓ < CSAK

CSS isn't a word in the Hungarian language, although Czakó is a name, csak means but. More complicated examples are also easy to find. So we won't see a correct Hungarian collation in MySQL (or anywhere else) not involving an AI that knows (at least) Hungarian.

I guess approaches to create 'correct' collations (for example in MSSQL) are misleading, confusing and a nest of bugs for programs relying on sorted results.

[17 Dec 2017 23:06] Xing Z Zhang

Fixed in 8.0.1 by adding utf8mb4_hu_0900_ai_ci and utf8mb4_hu_0900_as_cs.

This guy is not happy with the new collation:
https://stackoverflow.com/questions/57555718/mysql-wrong-order-by-for-hungarian-characters