Bug #37898 Please add Bengali (Bangladesh) [bn_BD] language collation
Submitted: 6 Jul 2008 6:16 Modified: 5 Apr 2009 5:23
Reporter: Firoj Alam (OCA)
Status: Verified
Category: MySQL Server: Charsets Severity: S4 (Feature request)
Version: 6.0 OS: Any
Assigned to: Assigned Account CPU Architecture: Any
Tags: collation, Contribution, Localization

[6 Jul 2008 6:16] Firoj Alam
Description:
Please add a Bengali (Bangladesh) [bn_BD] language collation for the ucs2 and utf8 Unicode character sets. The bn_BD collation data was updated in CLDR version 1.6,
available at http://unicode.org/Public/cldr/1.6.0/core.zip
in core/collation/bn.xml, which should appear as a bn_BD collation.

How to repeat:
The Bengali (Bangladesh) collation is currently not supported.
[6 Jul 2008 6:18] Firoj Alam
Bengali (bn) Bangladesh (BD) collation data

Attachment: bn_BD.xml (text/xml), 1.93 KiB.

[17 Jul 2008 5:18] Valeriy Kravchuk
Thank you for a reasonable feature request.
[30 Dec 2008 12:10] Jamil Ahmed
We have done some work on this issue. You can get the required XML files here:
http://www.ankur.org.bd/downloads/sorting/mysql/BN-Index.xml
http://www.ankur.org.bd/downloads/sorting/mysql/Index.xml

The whole article is available here:
http://www.ankur.org.bd/wiki/Bangla_Sorting_in_MySQL
[4 Apr 2009 19:10] Jamil Ahmed
@Alexander Barkov: Any updates on this issue? Please let us know if we missed anything.
[5 Apr 2009 5:23] Firoj Alam
Is there any update? You can finalize it.
[17 Apr 2009 7:59] Alexander Barkov
Hello Firoj,
I'm sorry for the delay.
Thank you very much for your contribution!
In order to accept your contribution into the MySQL tree,
we need you to sign the Sun Contributor Agreement (SCA).

Could you please do so here:
http://forge.mysql.com/wiki/Contributing_Code#Paperwork

Thanks!
[17 Apr 2009 8:01] Alexander Barkov
I'm sorry, the last message was for Jamil.
[17 Apr 2009 11:36] Jamil Ahmed
Hi Alexander,

As a code contributor to OpenOffice.org for Bengali, I have already signed the Sun Contributor Agreement (SCA). You can check the Native Language Confederation Projects of OpenOffice.org [0].

Please let us know the next step.

[0] http://projects.openoffice.org/native-lang.html

Thanks!
[8 Jun 2009 23:33] liz drachnik
Hello Firoj - 

In order for us to continue reviewing your contribution to MySQL, we need you to review and sign the Sun|MySQL contributor agreement (the "SCA").

The process is explained here: 
http://forge.mysql.com/wiki/Sun_Contributor_Agreement

Getting a signed and approved SCA on file will help us facilitate your contribution: this one, and others in the future.

Thank you ! 

Liz Drachnik  - Program Manager - MySQL
[26 Jul 2009 13:33] Jamil Ahmed
Hi Alexander Barkov and Liz Drachnik,

Today I signed my SCA and mailed it to sun_ca@sun.com. Please check and let us know the next step.

Regards,
-Jamil
[16 Oct 2010 8:16] Alexander Barkov
Bengali sorting order by Unicode's CLDR

Attachment: BENGALI-CLDR.short (application/octet-stream, text), 3.98 KiB.

[16 Oct 2010 8:17] Alexander Barkov
Bengali sorting order as in contribution

Attachment: BENGALI-CONTRIB.short (application/octet-stream, text), 4.19 KiB.

[16 Oct 2010 8:24] Alexander Barkov
Hi Firoj,

I compared utf8_bangla_ci with the Bengali collation rules
provided by the Unicode Common Locale Data Repository (CLDR):
http://unicode.org/cldr/trac/browser/trunk/common/collation/bn.xml
and found some differences.

Please find files "BENGALI-CONTRIB.short" and "BENGALI-CLDR.short"
in the "Files" section of the report. I used them for comparison. 
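
A comparison of this kind can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: it assumes each ordering is already available as a list of code points in collation order (the layout of the ".short" files is not reproduced here), and `predecessors` and `order_differences` are hypothetical helper names. The sample data comes from the two sign listings quoted later in this comment.

```python
def predecessors(order):
    """Map each code point to the one sorted immediately before it."""
    return {cp: (order[i - 1] if i else None) for i, cp in enumerate(order)}


def order_differences(a, b):
    """List code points found in both orderings whose predecessor differs.

    Each result is (code_point, predecessor_in_a, predecessor_in_b).
    """
    prev_a, prev_b = predecessors(a), predecessors(b)
    return [
        (cp, prev_a[cp], prev_b[cp])
        for cp in a
        if cp in prev_b and prev_a[cp] != prev_b[cp]
    ]


def fmt(cp):
    """Format a code point as U+XXXX, or '(start)' for no predecessor."""
    return "(start)" if cp is None else f"U+{cp:04X}"


# CLDR: AU, NUKTA, ANUSVARA, VISARGA, CANDRABINDU, KA
cldr = [0x0994, 0x09BC, 0x0982, 0x0983, 0x0981, 0x0995]
# Contribution: KHANDA TA, ANUSVARA, VISARGA, CANDRABINDU, VOWEL SIGN AA
contrib = [0x09CE, 0x0982, 0x0983, 0x0981, 0x09BE]

for cp, in_cldr, in_contrib in order_differences(cldr, contrib):
    print(f"{fmt(cp)}: after {fmt(in_cldr)} in CLDR, "
          f"after {fmt(in_contrib)} in the contribution")
```

Comparing immediate predecessors keeps the report short: a block of characters that moved together shows up once, at its first character, rather than once per character.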

Can you please clarify a few differences between your version and the CLDR version:

- Unicode puts "09BC # BENGALI SIGN NUKTA"
immediately before "0982 # BENGALI SIGN ANUSVARA".
You use the default weight for 09BC, which is ignorable.

- Unicode reorders Bengali characters after
punctuation (comma, dot, dash, etc.),
resetting at "09FA # BENGALI ISSHAR".
You reset to SPACE, which means you put
Bengali characters before the punctuation block.

- Unicode sorts Bengali digits as primary-equal to ASCII digits.
You put Bengali digits separately.
So now, for example, "DIGIT 1" and "BENGALI DIGIT 1"
are neither equal nor sorted near each other.

- Unicode puts a few currency signs immediately after "BENGALI ISSHAR":

09FA  ; [0453] # BENGALI ISSHAR
09F8  ; [0453+01] # BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
09F9  ; [0453+02] # BENGALI CURRENCY DENOMINATOR SIXTEEN
09F2  ; [0453+03] # BENGALI RUPEE MARK
09F3  ; [0453+04] # BENGALI RUPEE SIGN

You leave these signs in their default DUCET positions,
between Devanagari and Gurmukhi.

- Unicode puts "RR" near "R" and "LL" near "L":

098B  ; [0453+0B] # LETTER VOCALIC R
09E0  ; [0453+0C] # LETTER VOCALIC RR
098C  ; [0453+0D] # LETTER VOCALIC L
09E1  ; [0453+0E] # LETTER VOCALIC LL

You don't reorder "RR" and "LL", so they are
sorted in their default DUCET positions,
between Devanagari and Gurmukhi.

- Unicode puts a few sign characters after "0994 # LETTER AU"
and before "0995 # LETTER KA":

0994  ; [0453+12] # LETTER AU
09BC  ; [0453+13] # BENGALI SIGN NUKTA
0982  ; [0453+14] # BENGALI SIGN ANUSVARA
0983  ; [0453+15] # BENGALI SIGN VISARGA
0981  ; [0453+16] # BENGALI SIGN CANDRABINDU
0995  ; [0453+17] # BENGALI LETTER KA

You don't change the order for NUKTA (as mentioned before),
and put the other signs after "KHANDA TA" and before "SIGN AA":

09CE  ; [020A+3A] # BENGALI LETTER KHANDA TA; QQKN
0982  ; [020A+3B] # BENGALI SIGN ANUSVARA
0983  ; [020A+3C] # BENGALI SIGN VISARGA
0981  ; [020A+3D] # BENGALI SIGN CANDRABINDU
09BE  ; [020A+3E] # BENGALI VOWEL SIGN AA

- Unicode puts "RRA" after "DDA", "RHA" after "DDHA",
"KHANDA TA" after "NNA". You put "RRA", "RHA", "KHANDA TA"
in a separate block, which goes after "LETTER HA".
 
- Unicode puts "YYA" after "YA". You put "YYA" 
after "RRA", "RHA", and before "KHANDA TA".

- Unicode puts "SIGN AVAGRAHA", "VOWEL SIGN AA", "VOWEL SIGN I"
after "LETTER HA".

- Unicode puts "SIGN VOCALIC RR", "SIGN VOCALIC L", "SIGN VOCALIC LL"
after "SIGN VOCALIC R". You don't change the order for these
three vocalic signs, so they are sorted in their default DUCET positions,
between Devanagari and Gurmukhi.
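
The weight listings quoted above appear to follow the pattern `code point ; [base+offset] # name`, with all numbers in hexadecimal. Assuming that reading is right, the following Python sketch parses such lines and sorts strings by the resulting weights, using the vocalic letter listing as sample data. The names `parse_weight_line`, `build_weights` and `sort_key`, and the fallback weight for unlisted characters, are hypothetical choices made for this illustration; this is not MySQL's implementation.

```python
import re

# Assumed line layout, inferred from the listings in this report:
#   <code point> ; [<base>] or [<base>+<offset>] # <name>
# All numbers are read as hexadecimal.
WEIGHT_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})\s*;\s*"
    r"\[([0-9A-Fa-f]{4})(?:\+([0-9A-Fa-f]{1,2}))?\]\s*"
    r"#\s*(.*)$"
)


def parse_weight_line(line):
    """Return (code_point, (base, offset), name) for one listing line."""
    match = WEIGHT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized weight line: {line!r}")
    cp, base, offset, name = match.groups()
    return int(cp, 16), (int(base, 16), int(offset or "0", 16)), name


def build_weights(lines):
    """Build a {code_point: (base, offset)} table from listing lines."""
    return {cp: weight for cp, weight, _name in map(parse_weight_line, lines)}


def sort_key(text, weights):
    """Sort key: one (base, offset) pair per character.

    Unlisted characters get a high base and their code point as offset;
    this fallback is an assumption made only for this sketch.
    """
    return [weights.get(ord(ch), (0xFFFF, ord(ch))) for ch in text]


CLDR_LINES = [
    "098B  ; [0453+0B] # LETTER VOCALIC R",
    "09E0  ; [0453+0C] # LETTER VOCALIC RR",
    "098C  ; [0453+0D] # LETTER VOCALIC L",
    "09E1  ; [0453+0E] # LETTER VOCALIC LL",
]

weights = build_weights(CLDR_LINES)
letters = ["\u09E1", "\u098C", "\u09E0", "\u098B"]

by_code_point = sorted(letters)  # R, L, RR, LL
by_weight = sorted(letters, key=lambda s: sort_key(s, weights))  # R, RR, L, LL
```

Sorting by code point leaves "RR" and "LL" after both "R" and "L", while sorting by the listed weights places "RR" next to "R" and "LL" next to "L", which is the difference described in the vocalic letter item above.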

Thank you.

P.S. I am also planning to check dictionary collation.