Bug #44523 Feature request/proposition: Croatian utf8 collation (utf8_croatian_ci)
Submitted: 28 Apr 2009 15:32 Modified: 11 Nov 2010 17:14
Reporter: Neven Jacmenovic
Status: Closed
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:5.x, 6.x OS:Any
Assigned to: Alexander Barkov CPU Architecture:Any
Tags: collation, croatian, utf8, utf8_croatian_ci

[28 Apr 2009 15:32] Neven Jacmenovic
Description:
The MySQL database server is used for pretty much everything web-related nowadays. It's the de facto standard for everything from small sites to large, horizontally scaled web sites, and it's used by the biggest players in the industry.

But one important feature is lacking, and it is very important for our regional market: proper Croatian collation support for utf8 charsets, based on the Croatian alphabet (http://en.wikipedia.org/wiki/Gajica). Without it, MySQL server can't be considered a choice for e.g. a government migration to an open-source platform.

AFAIK the countries which would benefit from the same implementation (alongside Croatia) are: Bosnia, Serbia (for the Latin script) and Montenegro (for the Latin script).

We tried implementing it on our own a couple of times, but without any luck. The problem lies in the fact that the Croatian language (Serbian and Bosnian too) has digraph characters (http://en.wikipedia.org/wiki/Gajica#Digraphs - single letters consisting of two characters - lj, nj and dž). Without proper support for those, we will never be able to sort things right (a-b-c-č-ć-d-dž-đ-...i-j-k-l-lj-m-n-nj-...u-v-z-ž).

MySQL already has a built-in latin2 Croatian collation (latin2_croatian_ci) and a CP1250 Croatian collation (cp1250_croatian_ci), but those implementations lack digraph support (http://www.collation-charts.org/mysql60/mysql604.latin2_croatian_ci.html).

The closest built-in collation to Croatian is Slovenian (utf8_slovenian_ci), but it also lacks digraphs, so it's not possible to adapt it (http://www.collation-charts.org/mysql60/mysql604.utf8_slovenian_ci.html).

What does it take to implement a Croatian utf8 collation? It takes modifying source code beyond our knowledge. We tried to implement it on our own (with Vietnamese as a base for digraph support), but without much luck. We got stuck at creating digraphs as a pair of a basic Latin letter + an accented Latin letter.

Thank you in advance! 

Best regards
Neven Jacmenovic | nivas.hr

How to repeat:
We have tried creating our own collation, but without modifications to the MySQL engine codebase we will never get digraphs working:
http://forums.mysql.com/read.php?103,192187,216993
[29 Apr 2009 4:42] Tomo Krajina
Yep, I agree, we Croatian MySQL users desperately need a UTF8 collation.
[29 Apr 2009 8:06] Tonci Grgin
Hi Neven, and thanks for your report.

This is a known problem and every so often we do talk about it. There is also a worklog for this, but without much progress so far.

The problem is in our letters Nj, Lj..., as far as I'm informed. This needs a different approach from the one in effect today, so I really don't know how long it will take to implement.

https://intranet.mysql.com/worklog/Server-RawIdeaBin/?tid=3286 (internal):
--<cut>--
The problem is that these collations do not support contractions: 
 
DŽ, LJ and NJ, which must be treated as single letters. 

Sorting order should be: 
A,B,C,Č,Ć,D,DŽ,Đ,E,F,G,H,I,J,K,L,LJ,M,N,NJ,O,P,Q,R,S,Š,T,U,V,W,X,Y,Z,Ž 
 
MySQL is also missing collations utf8_croatian_ci and ucs2_croatian_ci. 
--<cut>--
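
To make the contraction requirement concrete, here is a minimal Python sketch (not MySQL code) of a sort key that follows the order quoted above. The function name `croatian_key` and the sample words are made up for illustration, and the sketch assumes every occurrence of dž, lj or nj is a digraph, which a real collation may need to handle more carefully.

```python
# Alphabet order as quoted in the worklog excerpt; DŽ, LJ and NJ are single letters.
ALPHABET = (
    "A,B,C,Č,Ć,D,DŽ,Đ,E,F,G,H,I,J,K,L,LJ,M,N,NJ,"
    "O,P,Q,R,S,Š,T,U,V,W,X,Y,Z,Ž"
).lower().split(",")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
DIGRAPHS = ("dž", "lj", "nj")


def croatian_key(word):
    """Return a list of letter ranks, treating dž, lj and nj as contractions."""
    s = word.lower()
    key = []
    i = 0
    while i < len(s):
        pair = s[i:i + 2]
        if pair in DIGRAPHS:
            # A contraction consumes two characters but yields one rank.
            key.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort last (arbitrary choice for this sketch).
            key.append(RANK.get(s[i], len(ALPHABET)))
            i += 1
    return key


# Hypothetical sample data.
words = ["ljubav", "luk", "lopta", "maj", "nula", "njuh",
         "dom", "džep", "đak", "čaj", "cilj", "ćup"]
print(sorted(words, key=croatian_key))
```

With a plain code point sort, "ljubav" lands before "lopta" and the accented initials land after "z"; with the key above, lj sorts after every other l-word and before m.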
[29 Apr 2009 8:23] Neven Jacmenovic
Hi Tonci, master!

Croatian is not the only language with contractions. I've been following the experiments with Hungarian and Vietnamese contractions, but I was unable to use the same technique for Croatian utf8:
http://bugs.mysql.com/file.php?id=6814

Best regards
Neven
[29 Apr 2009 8:29] Alexander Barkov
Hi Neven,

Thank you very much for the reasonable request.

We could not add Croatian collation so far because
MySQL didn't support contractions between non-ASCII characters,
so it was not possible to support dž correctly.

Right now we're finishing this task:

http://forge.mysql.com/worklog/task.php?id=2673

This patch (among other features) makes it possible to handle
digraphs like dž correctly.

The patch is already available and it's under code review.
After code review is done, the patch will appear in a
so-called "feature preview" tree.

After that, adding Croatian collation will be very simple - just
a matter of half an hour.

It is very likely that Croatian collation will appear in the same
feature preview tree in May or June 2009, so you'll be able to
download it and give it a try.

At the moment, though, I don't have an estimate of when Croatian
will appear in an official release.
[29 Apr 2009 8:30] Tonci Grgin
Thanks for the info, Bar.
[29 Apr 2009 8:31] Goran Ucpe
Hey Neven!

Yes, I agree with you on the importance of this issue. Lately, more and more government institutions are requiring digital projects, and those projects usually involve large numbers of NAMES in databases (for example: population counts, election ballots, state employees, etc.). These databases cannot be sorted correctly because of the issues with NJ, LJ and DŽ, and that is starting to cause problems at the state level.
[29 Apr 2009 8:32] Neven Jacmenovic
Btw, I just realized that this ticket has been assigned to Mr. Alexander Barkov, who is the original author of the Hungarian experiment mentioned above. Hi Alexander! :)

The list of supporters for this feature is growing in my original forum post: http://forums.mysql.com/read.php?20,260051,260051#msg-260051
[29 Apr 2009 8:44] Neven Jacmenovic
Alexander - thank you for such great news. You certainly made my day! I will be following progress on this like a hawk.

Goran - yes, my point exactly. So far, if sorting was an issue in one table, we could use the old-school hack of creating a new column for the ORDER BY clause which stored alternative names, e.g. Belančić -> belancxicy, Čutura -> cxutura etc. But those hacks slow down development and cause damage in the long run. The hacking logic is done at the application level, and MySQL can't be used for more complex queries.
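
For anyone unfamiliar with that workaround, a rough Python sketch of how such a sort column can be generated follows. Only č -> cx and ć -> cy are taken from the examples above; the remaining mappings and the name `ascii_sort_name` are hypothetical, added just to make the sketch complete.

```python
# Replacement table for the ASCII sort-column hack.
# "č" -> "cx" and "ć" -> "cy" match the examples in the comment above;
# every other entry is a hypothetical extension for illustration.
REPLACEMENTS = [
    ("dž", "dx"),  # digraphs first, so "ž" inside "dž" is not rewritten separately
    ("đ", "dy"),
    ("lj", "lx"),
    ("nj", "nx"),
    ("č", "cx"),
    ("ć", "cy"),
    ("š", "sx"),
    ("ž", "zx"),
]


def ascii_sort_name(name):
    """Return a lowercase ASCII surrogate to store in a separate ORDER BY column."""
    s = name.lower()
    for src, dst in REPLACEMENTS:
        s = s.replace(src, dst)
    return s


print(ascii_sort_name("Belančić"))  # belancxicy
print(ascii_sort_name("Čutura"))    # cxutura
```

The weakness is easy to see: any name where a plain letter is followed by x, y or z (say, "Czerny") lands in the wrong place, which is one more reason a real collation is preferable.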

Please keep me posted guys!
Best regards
Neven
[29 Apr 2009 8:56] Damir Ribaric
We really need that feature!
Thank you!
[7 Aug 2009 8:49] Neven Jacmenovic
Hi guys, any update on the progress of this?

Thanks in advance!
Best regards
Neven
[27 Nov 2009 12:45] Tonci Grgin
Neven, good news!

See http://www.collation-charts.org/articles/croatian.htm and http://forge.mysql.com/worklog/task.php?id=2673.

Feel free to report any problems or thoughts you have while using these patches straight to me (if you can't express them in English) and I'll pass them on to Bar.

These links are also posted to Forum.hr under "mysql".
[30 Nov 2009 15:35] Neven Jacmenovic
Great news, great news indeed my friend. Looking good so far! We even managed to apply the patch to 5.0.51 and we are working with Alexander Barkov on further tests.

Here is the test DB dump: http://www.nivas.hr/pub/mysql_utf8_croatian_ci/test_croatian.sql
And this is the expected ORDER BY output: http://www.nivas.hr/pub/mysql_utf8_croatian_ci/output.txt
[11 Nov 2010 17:14] Alexander Barkov
Croatian collation has been added to mysql-5.6.
It's currently being documented. See here for status updates:
http://forge.mysql.com/worklog/task.php?id=5476