Bug #12519 | Incorrect Hungarian collation | ||
---|---|---|---|
Submitted: | 11 Aug 2005 12:45 | Modified: | 18 Jan 2018 13:03 |
Reporter: | Peter Ritzinger | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S4 (Feature request) |
Version: | 4.1.13 and 5.0.10a | OS: | Any (*) |
Assigned to: | Assigned Account | CPU Architecture: | Any |
[11 Aug 2005 12:45]
Peter Ritzinger
[27 Aug 2005 11:58]
Peter Ritzinger
Any answer? Can we hope, that this thing will be fixed in a future release?
[29 Aug 2005 6:55]
Alexander Barkov
Peter, thank you very much for so detailed description. I'm sorry, adding a real Hungarian collations is a separated complex task. I'm changeing severity of this task to "feature request". latin2_hungarian_ci is just a simplified version which doesn't cover complex multi-letter rules. It was designed to work this way.
[21 Dec 2005 17:57]
Alexander Barkov
Peter, can you clarify please what should happen in this case: CREATE TABLE t1 (a char(1)); INSERT INTO t1 VALUES ('a'); INSERT INTO t1 VALUES ('á'); What should these queries return: SELECT * FROM t1 WHERE a='a'; SELECT * FROM t1 WHERE a='á'; Both records for both queries? Or one record for each query? What should this query return: SELECT COUNT(DISTINCT a) FROM t1; ? Thanks!
[9 Feb 2006 14:05]
Peter Ritzinger
I am very sorry for the late answer. I hope I can help with the following information: Generally, 'a' and 'á' are two different vowel, they should be treated as NOT equal. They should be only treated equal when the order of the two words can be determined by the characters following the vowels. But in this case, because of the difference in the following characters, the two string cannot be equal. This rule is true in when the following string is the empty string. In this case the order can NOT be determined by the "following characters" (the empty string), so 'a' < 'á'. So each example query should return one row. And the the count distinct query should return 2.
[9 Feb 2006 14:15]
Peter Ritzinger
As I review my bug report written in August 2005, I can found the following line: And 'a'<'á', not 'a'='á' So the answer for your last querstion was in the original bug report. I hope very much, that if you have started to implement the sorting code, you do not have to rewrite it because the late information.
[17 Oct 2006 16:21]
Peter Gulutzan
MySQL is looking for an authoritative, official statement which states all the current Hungarian collation rules. Please let other MySQL-using Hungarians (especially if you know a user group in Hungary) know about these questions. Best of all would be a translation of the Hungarian government standard, if there is one. MySQL has received several complaints/suggestions about Hungarian collation. For example these three people contacted us via a public MySQL mailing list or bugs forum: RITZINGER Peter (http://bugs.mysql.com/bug.php?id=12519) BÁRTHÁZI András (http://lists.mysql.com/mysql/191427) Csongor Fagyal (http://bugs.mysql.com/bug.php?id=22337) In what follows I will refer to what seems to be agreed, and what seems to be disputed. The current latin2_hungarian_ci collation is a chart in sql/share/charsets/latin2.xml, and Mr Barkov has provided an easy-to-read web page: http://myoffice.izhnet.ru/bar/~bar/charts/latin2_hungarian_ci.html This collation is unlike the Hungarian dictionaries, collation descriptions, or products that we've seen. For example the first letter is: Latin Capital Letter A = Latin Small Letter A = control Single Shift 3 = No-Break Space = Latin Small Letter L with caron = Latin Capital Letter A with acute = Latin Small Letter A with acute But there is no reason that small L with caron (which is Slovak not Hungarian) ever sorts with A, there's some dispute whether A with acute should sort with A, and all other accented variants of A should be in this list too. It is likely that MySQL will deprecate this collation (which implies that MySQL will eventually remove it), after introducing a new and more correct one. Most people agree that this is the Hungarian alphabet; a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs (The DOUBLE ACUTE letters ő and ű are sometimes shown as õ and û but I suspect that is a conversion error.) Some people also say there's a secondary sort rule for these short/long vowel pairs: a á, e é, i í, o ó, ö ő, u ú, ü ű For these pairs, long = short usually, but long > short if all else is equal. I have seen comments showing that Oracle seems to follow this rule: 'BÁ'>'BA' is true 'BÁ'>'BAC' is false but the commenter, though Hungarian, didn't like what Oracle did. (thread 'nlssort' on comp.databases.oracle.server 2002-11-10) One commenter wrote to us about a similar thing, saying it's a mistake that SELECT 'hal' LIKE 'hál' is true. Unfortunately, the same person also disagrees, saying that we should have two collations, one where long > short, one where long = short. I have also seen Simonsen's rules: http://std.dkuug.dk/i18n/locales/hu_HU They suggest that A-acute > A, etc. I have also seen argument about the same thing for glibc: http://sources.redhat.com/ml/libc-locales/2005-q4/msg00002.html Apparently all Hungarians agree that these digraphs are "letters": cs dz dzs gy ly ny sz ty zs That's bad but not very bad. MySQL handles digraphs in Spanish. There is also one trigraph: dzs That's very bad. Luckily dzs is rare, it's mostly for English words with a "j" sound (bridge is 'briddz', gin is 'dzsinn') (so I'm told). There is a special rule when you see the first part of a digraph followed by the digraph. For example, in 'ggy', 'g' is the first part of 'gy' and it's followed by 'gy' ... and MySQL treats it as a repetition of the digraph, i.e. as if it's 'gygy'. This applies to all the letters listed in the previous paragraph, so: ccs = cscs, ddz = dzdz, ddsz = dzsdzs, ggy = gygy, lly = lyly, nny = nyny, ssz = szsz, tty = tyty, zzs = zszs. For example, Mr Ritzinger says that 'tty > tz' because tty is expanded to tyty. I know that other products handle the situation, but I've seen them called "double compressions", which worries me -- do some people think that 'cscs sorts with ccs' rather than 'ccs sorts with cscs'? A collation which follows the single-character rules, but ignores digraphs and trigraphs, sounds somewhat like what I see in Kaplan's remarks on Microsoft's Hungarian Technical Sort: http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx One of the above-listed people would accept this, he says he doesn't care about digraphs or trigraphs. But I have no idea whether Microsoft was following some "technical" standard. All characters outside the Hungarian alphabet should be done according to UCA 4.0.0 (until MySQL switches to the newer UCA). For Unicode support, I suggest names for the new collations should be: ucs2_hungarian2_ci, utf8_hungarian2_ci. The only other character sets that may have Hungarian collations are latin2 and cp1250. Our concern at this time is only for the "primary sort", the collation necessary for searches. The "secondary sort" or "tertiary sort" rules, the ones that affect only ORDER BY, are of interest but will only be of importance in the future. .... I have sent the above message to mysql@lists.com (subject line "Hungarian collation"). And I will add a note on bug#22337, to refer to bug#12519.
[17 Oct 2006 16:49]
Csongor Fagyal
Some clarifications. 1. You are right, the Hungarian alphabet looks like this: a á b c cs d dz dzs e é f g gy h i í j k l ly m n ny o ó ö ő p q r s sz t ty u ú ü ű v w x y z zs 2. There is no Hungarian character û and õ. That is usually a conversion error, mostly because ISO-8859-1 cannot handle ő and ű, so these are the closes equivalents. 3. I think is is very important do distingush sorting order on letters consisting of multiple characters, such as dz, sz, etc. and accented vowels, such as á,é,ű, etc. Sorting following the order of the alphabet (e.g.: egye > egze) is "academic". While this is the official sorting order, it is almost impossible to create a sorting that knows wether gy is one letter or just g an y place coincidentally one after the other - an algorythm like that should be aware of all Hungarian words, IMHO. Also, even dictionaries often fail to follow that sorting. I believe that most Hungarians themselves are even unsure of the correct ordering regarding these special, multi-character letters - some even find the academic ordering "misleading" while searching for words in dictionaries, for example. On the other hand, that is not true for accented vowels, though. A < Á, O < Ö and so on, all the time - ant what is even more important, A should NEVER be equal to Á, U to Ű and so on (this make unique char(N) indexes totally unuseable in some MySQL applications). 4. I would propose a collation where accented vowels are handled according to the Hungarian alphabet, while multiple character letters are sorted without considering them being multi-character. This is what most software currenty use anyway - people are used to it already.
[17 Oct 2006 16:53]
Csongor Fagyal
I would also like to add that I feel the need for a case-sensitive collation. The reason: to be able to use distinct *character codes* in a unique index, which includes Ab <> ab áá <> ÁÁ and so on.
[30 Oct 2006 12:45]
Peter Gulutzan
We are very grateful for the comments on this bug report, and the email exchange on lists.mysql.com (http://lists.mysql.com/mysql/202806). We are still hoping for more responses, especially from anyone who has seen an official Hungarian standard.
[2 Nov 2006 23:41]
Peter Ritzinger
The currently valid and official "Hungarian standard" ("The rules of Hungarian spelling") can be found here: http://mek.oszk.hu/01500/01547/ http://mek.oszk.hu/01500/01547/01547.pdf The problem is that it is written in Hungarian, so I think you can not use it. The rules about ordering words are in chapters 14. and 15. on pages 9-10-11. These rules say the same that I have written in my original bug report. But if you really want, I can translate this part of the standard for you. Or I can provide an algorithm-like description of the ordering procedure.
[4 Jan 2007 12:25]
József Kozell
Hi, http://www.tug.org.in/tug2002/presentations/bujdoso.pdf This presentation is about localizing Tex to Hungarian. What you exactly need (alphabetical order) starts at page 59. Thank you
[18 Jan 2007 18:52]
József Kozell
I've made some research and I found out, that there is NO translation to any foreign language of the Hungarian Orthography book(s). I'm afraid that our suggestions should be enough... Sorry.
[31 Jan 2007 18:10]
Peter Gulutzan
Thanks to all participants in the discussion "Hungarian collation" on lists.mysql.com. I have posted the resulting plan here: http://lists.mysql.com/mysql/204718
[2 Feb 2007 21:45]
Tom Ballister
I would appreciate if any of the Hungarian authorities who contributed to or are monitoring this thread could comment on whether the issue is addressed? In 5.0? Also: I've been looking for a Query tool that handles Hungarian properly. Any recommendations appreciated.
[2 Aug 2007 2:42]
Gabor Czigola
It is not possible to have a correct Hungarian collation without knowing the meaning of the sorted words, see for example: CSS < CZAKÓ < CSAK CSS isn't a word in the Hungarian language, although Czakó is a name, csak means but. More complicated examples are also easy to find. So we won't see a correct Hungarian collation in MySQL (or anywhere else) not involving an AI that knows (at least) Hungarian. I guess approaches to create 'correct' collations (for example in MSSQL) are misleading, confusing and a nest of bugs for programs relying on sorted results.
[18 Jan 2018 13:03]
Erlend Dahl
[17 Dec 2017 23:06] Xing Z Zhang Fixed in 8.0.1 by adding utf8mb4_hu_0900_ai_ci and utf8mb4_hu_0900_as_cs.
[22 Aug 2019 5:50]
Rick James
This guy is not happy with the new collation: https://stackoverflow.com/questions/57555718/mysql-wrong-order-by-for-hungarian-characters