Bug #28916 LDML doesn't work for utf8 and is not described in the manual
Submitted: 6 Jun 2007 8:10 Modified: 30 May 2008 17:26
Reporter: Alexander Barkov
Status: Closed
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:5.0, 5.1 OS:Any
Assigned to: Paul DuBois CPU Architecture:Any

[6 Jun 2007 8:10] Alexander Barkov
Description:
Some time ago, the ability to add user-defined Unicode
collations was implemented. This feature does not require
mysqld to be recompiled to add a new Unicode collation.
It uses the so-called "Locale Data Markup Language (LDML)", which
can be embedded directly into the character set and collation
index file, Index.xml.

There are two problems with the LDML implementation:

1. It works only for UCS2, but does not work for UTF8

2. It is not documented in the manual

How to repeat:
1. Apply the patch (attached in the "files" section of this bug report)
to the file Index.xml of your MySQL installation
(typically /usr/share/mysql/charsets/Index.xml)

The patch adds the same user-defined collation to both UCS2 and UTF8,
with a rule that makes the letter 'b' compare equal to the letter 'a'.

2. Run this script:

#
# Check if it works with UCS2
#
drop table if exists t1;
create table t1 (c1 char(1) character set ucs2 collate ucs2_test_ci);
insert into t1 values ('a');
select * from t1 where c1='b';

#
# Check that it works with UTF8
#
drop table if exists t1;
create table t1 (c1 char(1) character set utf8 collate utf8_test_ci);
insert into t1 values ('a');
select * from t1 where c1='b';

3. Check its output:

mysql> drop table if exists t1;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> create table t1 (c1 char(1) character set ucs2 collate ucs2_test_ci);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('a');
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1 where c1='b';
+------+
| c1   |
+------+
| a    |
+------+
1 row in set (0.00 sec)

mysql>
mysql> drop table if exists t1;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (c1 char(1) character set utf8 collate utf8_test_ci);
ERROR 1273 (HY000): Unknown collation: 'utf8_test_ci'

So the server correctly loaded the user-defined collation "ucs2_test_ci"
and compared 'a' as equal to 'b',
but it failed to load "utf8_test_ci" and returned an "Unknown collation"
error.

Suggested fix:
1. Fix the collation routines to be able to load Unicode collations
for both UCS2 and UTF8

2. Add a description of LDML to the manual, so that users can
easily add their own collations
[6 Jun 2007 8:11] Alexander Barkov
Diff file to add user defined Unicode collations using LDML

Attachment: Index.xml.diff (text/x-patch), 596 bytes.
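
The attachment itself is not reproduced in this thread. As a rough illustration only, entries of the kind described above might look like the sketch below. Everything in it beyond the collation names ucs2_test_ci and utf8_test_ci is an assumption: the id values are placeholders that must not clash with IDs already in use, and the <s> rule is assumed to make 'b' compare equal to 'a' in a case-insensitive collation on these server versions. In a real Index.xml the <collation> elements would go inside the <charset> elements that already exist in the file.

```xml
<!-- Hypothetical sketch; this is NOT the content of Index.xml.diff. -->
<!-- The id values are placeholders; choose IDs unused in your installation. -->
<charset name="ucs2">
  <collation name="ucs2_test_ci" id="358">
    <rules>
      <!-- Anchor the rule at the letter 'a' -->
      <reset>a</reset>
      <!-- Assumed: 'b' gets the same comparison weight as 'a' -->
      <s>b</s>
    </rules>
  </collation>
</charset>

<charset name="utf8">
  <collation name="utf8_test_ci" id="353">
    <rules>
      <reset>a</reset>
      <s>b</s>
    </rules>
  </collation>
</charset>
```

The server reads Index.xml at startup, so mysqld would need to be restarted before running the script from the "How to repeat" section.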

[6 Jun 2007 12:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28193

ChangeSet@1.2515, 2007-06-06 17:09:59+05:00, bar@mysql.com +9 -0
  Bug#28916 LDML doesn't work for utf8
  and is not described in the manual
  - Adding missing initialization for utf8 collations
  - Minor code clean-ups: renaming variables,
    moving code into a new separate function.
  - Adding test, to check that both ucs2 and utf8 user
    defined collations work (ucs2_test_ci and utf8_test_ci)
  - Adding Vietnamese collation as a complex user defined
    collation example.
[7 Jun 2007 12:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28295

ChangeSet@1.2515, 2007-06-07 17:55:55+05:00, bar@mysql.com +9 -0
  Bug#28916 LDML doesn't work for utf8
  and is not described in the manual
  - Adding missing initialization for utf8 collations
  - Minor code clean-ups: renaming variables,
    moving code into a new separate function.
  - Adding test, to check that both ucs2 and utf8 user
    defined collations work (ucs2_test_ci and utf8_test_ci)
  - Adding Vietnamese collation as a complex user defined
    collation example.
[8 Jun 2007 8:35] Alexander Barkov
Pushed into 5.0.44-rpl
Pushed into 5.1.20-rpl

To documentation team:

This bug can be closed only after we have a new section in the manual,
explaining how to add user defined Unicode collations using LDML.
I'm going to write this section soon.

Please wait for me :)
[21 Jun 2007 20:12] Bugs System
Pushed into 5.0.46
[21 Jun 2007 20:15] Bugs System
Pushed into 5.1.20-beta
[23 Jun 2007 8:16] Jon Stephens
Waiting for info from Bar. :)
[1 Oct 2007 10:50] Alexander Barkov
At the Heidelberg DevConf, Bar gave a session, "How to add a collation".
Paul now has all the information needed to write a manual section on LDML
using Bar's presentation.
[15 Nov 2007 15:21] Paul DuBois
Changing category to Documentation, assigning to myself.
[30 May 2008 17:26] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

The manuals now contain a new section on adding new collations:

http://dev.mysql.com/doc/refman/4.1/en/adding-collation.html
http://dev.mysql.com/doc/refman/5.0/en/adding-collation.html
http://dev.mysql.com/doc/refman/5.1/en/adding-collation.html
http://dev.mysql.com/doc/refman/6.0/en/adding-collation.html

This covers simple collations for 8-bit character sets
and LDML-based collations for Unicode character sets.

The 4.1 manual does not have instructions for adding 
LDML collations because that is not supported in 4.1.