Instructions for adding a new Unicode collation
By Alexander Barkov and Peter Gulutzan
Copyright (c) 2007 by MySQL AB
All rights reserved

MySQL offers 42 built-in collations (sets of rules
for comparison and sorting) for the Unicode character
sets utf8 and ucs2. But you can add your own.

We will describe here how to add Vietnamese Unicode
collations. This is intended as an example, so
that you can try out all the steps and then modify
based on your own rules for some other language.
We chose Vietnamse because many people asked for it
via our "Feature Request" option in bugs.mysql.com,
http://bugs.mysql.com/bug.php?id=4745.

There are two ways to add a new collation:
* Change the source code. This is the trickier way,
but it's the only way for pre-August-2007 versions.
* Change the LDML "markup" file. This hasn't been
tested much, but it's simpler, it requires no
recompilation. (LDML stands for Locale Data Markup
Language, which provides a formal way to specify
ordering.)

We will describe both ways.

ADDING A NEW COLLATION BY CHANGING THE SOURCE CODE

In this section we will describe how to change
MySQL source code to add Vietnamese Unicode collations.

To understand this, you will have to know how to
download source code and how to build the server.
(The MySQL Reference Manual explains all that).
These instructions should work for versions 4.1,
5.0, and 5.1.

"(Note for MySQL 4.1 users only: the first file to edit will be
configure.in rather than config/ac-macros/character_sets.m4,
and in subsequent files the last defined language will probably
be Persian rather than Hungarian, and when editing ctype-uca.c
you shouldn't add the lines that have 'caseinfo' or
'caseup_multiply' or 'casedn_multiply' or pad char' in them.)"

1. Edit config/ac-macros/character_sets.m4

Look for lines approximately like this:
"
      define(UCSC1, ucs2_general_ci ucs2_bin)
      define(UCSC2, ucs2_czech_ci ucs2_danish_ci)
      define(UCSC3, ucs2_esperanto_ci ucs2_estonian_ci ucs2_icelandic_ci)
      define(UCSC4, ucs2_latvian_ci ucs2_lithuanian_ci)
      define(UCSC5, ucs2_persian_ci ucs2_polish_ci ucs2_romanian_ci)
      define(UCSC6, ucs2_slovak_ci ucs2_slovenian_ci)
      define(UCSC7, ucs2_spanish2_ci ucs2_spanish_ci)
      define(UCSC8, ucs2_swedish_ci ucs2_turkish_ci)
      define(UCSC9, ucs2_unicode_ci)
"
Change the last line to
"
      define(UCSC9, ucs2_unicode_ci ucs2_vietnamese_ci)
"
(Collations are in alphabetic order so "vietnamese" comes after
"unicode".)

Look for lines approximately like this:
"
        define(UTFC1, utf8_general_ci utf8_bin)
        define(UTFC2, utf8_czech_ci utf8_danish_ci)
        define(UTFC3, utf8_esperanto_ci utf8_estonian_ci utf8_icelandic_ci)
        define(UTFC4, utf8_latvian_ci utf8_lithuanian_ci)
        define(UTFC5, utf8_persian_ci utf8_polish_ci utf8_romanian_ci)
        define(UTFC6, utf8_slovak_ci utf8_slovenian_ci)
        define(UTFC7, utf8_spanish2_ci utf8_spanish_ci)
        define(UTFC8, utf8_swedish_ci utf8_turkish_ci)
        define(UTFC9, utf8_unicode_ci)
"
Change the last line to
"
        define(UTFC9, utf8_unicode_ci utf8_vietnamese_ci)
"

2. Edit mysys/charset-def.c

Look for lines approximately like this:
"
#ifdef HAVE_CHARSET_ucs2
extern CHARSET_INFO my_charset_ucs2_general_uca;
extern CHARSET_INFO my_charset_ucs2_icelandic_uca_ci;
extern CHARSET_INFO my_charset_ucs2_latvian_uca_ci;
... (etc.)
extern CHARSET_INFO my_charset_ucs2_hungarian_uca_ci;
#endif
"
Add this line before the #endif:
"
extern CHARSET_INFO my_charset_ucs2_vietnamese_uca_ci;
"

Look for lines approximately like this:
"
#ifdef HAVE_CHARSET_utf8
extern CHARSET_INFO my_charset_utf8_general_uca_ci;
extern CHARSET_INFO my_charset_utf8_icelandic_uca_ci;
extern CHARSET_INFO my_charset_utf8_latvian_uca_ci;
... (etc.)
extern CHARSET_INFO my_charset_utf8_hungarian_uca_ci;
#endif
"
Add this line before the #endif:
"
extern CHARSET_INFO my_charset_utf8_vietnamese_uca_ci;
"

Look for lines approximately like this:
"
#ifdef HAVE_UCA_COLLATIONS
  add_compiled_collation(&my_charset_ucs2_general_uca);
  add_compiled_collation(&my_charset_ucs2_icelandic_uca_ci);
  add_compiled_collation(&my_charset_ucs2_latvian_uca_ci);
... (etc.)
  add_compiled_collation(&my_charset_ucs2_hungarian_uca_ci);
#endif
"
Add this line before the #endif:
"
  add_compiled_collation(&my_charset_ucs2_vietnamese_uca_ci);
"

Look for lines approximately like this:
"
#ifdef HAVE_UCA_COLLATIONS
  add_compiled_collation(&my_charset_utf8_general_uca_ci);
  add_compiled_collation(&my_charset_utf8_icelandic_uca_ci);
  add_compiled_collation(&my_charset_utf8_latvian_uca_ci);
... (etc.)
  add_compiled_collation(&my_charset_utf8_hungarian_uca_ci);
#endif
"
Add this line before the #endif:
"
  add_compiled_collation(&my_charset_utf8_vietnamese_uca_ci);
"

3. Edit strings/ctype-uca.c

This is where you have to use your knowledge of Vietnamese.
We know nothing. We just believe what Vietnamese people said
in http://vietunicode.sourceforge.net/charset/vietalphabet.html.

Look for lines approximately like this:
"
/*
  A simplified version of Hungarian, without consonant contractions.
*/
static const char hungarian[]=
    "&O < \\u00F6 <<< \\u00D6 << \\u0151 <<< \\u0150"
    "&U < \\u00FC <<< \\u00DC << \\u0171 <<< \\u0170";
"
This is an example of "tailoring" following a subset of
"ICU Collation Customization (tailorings)". It has rules
for expressing the difference from the usual UCA collation.
Add these lines:
(NB: each quoted string is on a single line, turn word wrapping off.)
"
/*
  Vietnamese experiment. Not for use in production systems.
*/
static const char vietnamese[]=
    "& A < \\u0103 <<< \\u0102 << \\u1EB1 <<< \\u1EB0 << \\u1EB3 <<< \\u1EB2 << \\u1EB5 <<< \\u1EB4 << \\u1EAF <<< \\u1EAE << \\u1EB7 <<< \\u1EB6"
       " < \\u00E2 <<< \\u00C2 << \\u1EA7 <<< \\u1EA6 << \\u1EA9 <<< \\u1EA8 << \\u1EAB <<< \\u1EAA << \\u1EA5 <<< \\u1EA4 << \\u1EAD <<< \\u1EAC"
    "& D < \\u0111 <<< \\u0110"
    "& E < \\u00EA <<< \\u00CA << \\u1EC1 <<< \\u1EC0 << \\u1EC3 <<< \\u1EC2 << \\u1EC5 <<< \\u1EC4 << \\u1EBF <<< \\u1EBE << \\u1EC7 <<< \\u1EC6"
    "& O < \\u00F4 <<< \\u00D4 << \\u1ED3 <<< \\u1ED2 << \\u1ED5 <<< \\u1ED4 << \\u1ED7 <<< \\u1ED6 << \\u1ED1 <<< \\u1ED0 << \\u1ED9 <<< \\u1ED8"
       " < \\u01A1 <<< \\u01A0 << \\u1EDD <<< \\u1EDC << \\u1EDF <<< \\u1EDE << \\u1EE1 <<< \\u1EE0 << \\u1EDB <<< \\u1EDA << \\u1EE3 <<< \\u1EE2"
    "& U < \\u01B0 <<< \\u01AF << \\u1EEB <<< \\u1EEA << \\u1EED <<< \\u1EEC << \\u1EEF <<< \\u1EEE << \\u1EE9 <<< \\u1EE8 << \\u1EF1 <<< \\u1EF0";
"

Look for lines approximately like this:
"
CHARSET_INFO my_charset_ucs2_hungarian_uca_ci=
{
    146,0,0,            /* number       */
    MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
    "ucs2",             /* cs name    */
    "ucs2_hungarian_ci",/* name         */
    "",                 /* comment      */
    hungarian,          /* tailoring    */
    NULL,               /* ctype        */
    NULL,               /* to_lower     */
    NULL,               /* to_upper     */
    NULL,               /* sort_order   */
    NULL,               /* contractions */
    NULL,               /* sort_order_big*/
    NULL,               /* tab_to_uni   */
    NULL,               /* tab_from_uni */
    my_unicase_default, /* caseinfo     */
    NULL,               /* state_map    */
    NULL,               /* ident_map    */
    8,                  /* strxfrm_multiply */
    1,                  /* caseup_multiply  */
    1,                  /* casedn_multiply  */
    2,                  /* mbminlen     */
    2,                  /* mbmaxlen     */
    9,                  /* min_sort_char */
    0xFFFF,             /* max_sort_char */
    ' ',                /* pad char      */
    0,                  /* escape_with_backslash_is_dangerous */
    &my_charset_ucs2_handler,
    &my_collation_ucs2_uca_handler
};
"
Add these lines:
"
CHARSET_INFO my_charset_ucs2_vietnamese_uca_ci=
{
    147,0,0,            /* number       */
    MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
    "ucs2",             /* cs name    */
    "ucs2_vietnamese_ci",/* name         */
    "",                 /* comment      */
    vietnamese,         /* tailoring    */
    NULL,               /* ctype        */
    NULL,               /* to_lower     */
    NULL,               /* to_upper     */
    NULL,               /* sort_order   */
    NULL,               /* contractions */
    NULL,               /* sort_order_big*/
    NULL,               /* tab_to_uni   */
    NULL,               /* tab_from_uni */
    my_unicase_default, /* caseinfo     */
    NULL,               /* state_map    */
    NULL,               /* ident_map    */
    8,                  /* strxfrm_multiply */
    1,                  /* caseup_multiply  */
    1,                  /* casedn_multiply  */
    2,                  /* mbminlen     */
    2,                  /* mbmaxlen     */
    9,                  /* min_sort_char */
    0xFFFF,             /* max_sort_char */
    ' ',                /* pad char      */
    0,                  /* escape_with_backslash_is_dangerous */
    &my_charset_ucs2_handler,
    &my_collation_ucs2_uca_handler
};
"
The "number" must be greater than the last
"number" for ucs2. At the time of writing, the
last ucs2 collation is "Hungarian" and its number is
146 so we're adding 147 here. But this is transient!
You'll see this number with SHOW COLLATION, illustrated later.

Look for lines approximately like this:
"
CHARSET_INFO my_charset_utf8_hungarian_uca_ci=
{
    210,0,0,            /* number       */
    MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
    "utf8",             /* cs name    */
    "utf8_hungarian_ci",/* name         */
    "",                 /* comment      */
    hungarian,          /* tailoring    */
    ctype_utf8,         /* ctype        */
    NULL,               /* to_lower     */
    NULL,               /* to_upper     */
    NULL,               /* sort_order   */
    NULL,               /* contractions */
    NULL,               /* sort_order_big*/
    NULL,               /* tab_to_uni   */
    NULL,               /* tab_from_uni */
    my_unicase_default, /* caseinfo     */
    NULL,               /* state_map    */
    NULL,               /* ident_map    */
    8,                  /* strxfrm_multiply */
    1,                  /* caseup_multiply  */
    1,                  /* casedn_multiply  */
    1,                  /* mbminlen     */
    3,                  /* mbmaxlen     */
    9,                  /* min_sort_char */
    0xFFFF,             /* max_sort_char */
    ' ',                /* pad char      */
    0,                  /* escape_with_backslash_is_dangerous */
    &my_charset_utf8_handler,
    &my_collation_any_uca_handler
};
"
Add these lines:
"
CHARSET_INFO my_charset_utf8_vietnamese_uca_ci=
{
    211,0,0,            /* number       */
    MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE,
    "utf8",             /* cs name    */
    "utf8_vietnamese_ci",/* name         */
    "",                 /* comment      */
    vietnamese,         /* tailoring    */
    ctype_utf8,         /* ctype        */
    NULL,               /* to_lower     */
    NULL,               /* to_upper     */
    NULL,               /* sort_order   */
    NULL,               /* contractions */
    NULL,               /* sort_order_big*/
    NULL,               /* tab_to_uni   */
    NULL,               /* tab_from_uni */
    my_unicase_default, /* caseinfo     */
    NULL,               /* state_map    */
    NULL,               /* ident_map    */
    8,                  /* strxfrm_multiply */
    1,                  /* caseup_multiply  */
    1,                  /* casedn_multiply  */
    1,                  /* mbminlen     */
    3,                  /* mbmaxlen     */
    9,                  /* min_sort_char */
    0xFFFF,             /* max_sort_char */
    ' ',                /* pad char      */
    0,                  /* escape_with_backslash_is_dangerous */
    &my_charset_utf8_handler,
    &my_collation_any_uca_handler
};
"
The "number" must be greater than the last
"number" for utf8. At the time of writing, the
last utf8 collation is "Hungarian" and its number is
210 so we're adding 211 here. But this is transient!
You'll see this number with SHOW COLLATION, illustrated later.

4. Build!

If all went well, after building you'll have two new
collations. They will be in information_schema.collations.

If all went well, after building you'll have two new 
collations. They will be in the metadata. For example:

mysql> show collation like '%vietnamese%';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| utf8_vietnamese_ci | utf8    | 209 |         | Yes      |       8 |
| ucs2_vietnamese_ci | ucs2    | 145 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
2 rows in set (0.00 sec)

CHARACTER SET utf8 COLLATE utf8_vietnamese_ci will work.
CHARACTER SET ucs2 COLLATEE ucs2_vietnamese_ci will work.

ADDING A NEW COLLATION BY CHANGING THE MARKUP FILE

In this section we will describe how to change
an LDML file to add Vietnamese Unicode collations.

To understand this, you will have to know how what
an LDML file is. (The Unicode Consortium explains
that in their Locale Data Markup Language page,
http://unicode.org/reports/tr35/.)

These instructions apply to MySQL versions 5.0.44
or higher and 5.1.20 or higher. They might work
with earlier versions, but we don't recommend that
you try. They should work for most operating
systems including Windows.


1. Open MySQL character set definition file
(usually /usr/share/mysql/charsets/Index.xml,
but its location may vary depending on build flags).


2. Find this code:

<charset name="ucs2">
   <family>Unicode</family>
   <description>UCS-2 Unicode</description>
   <collation name="ucs2_general_ci" id="35">
     <flag>primary</flag>
     <flag>compiled</flag>
   </collation>
   <collation name="ucs2_bin" id="90">
     <flag>binary</flag>
     <flag>compiled</flag>
   </collation>
</charset>


3. Change it to:

<charset name="ucs2">
   <family>Unicode</family>
   <description>UCS-2 Unicode</description>
   <collation name="ucs2_general_ci" id="35">
     <flag>primary</flag>
     <flag>compiled</flag>
   </collation>
   <collation name="ucs2_bin" id="90">
     <flag>binary</flag>
     <flag>compiled</flag>
   </collation>
   <collation name="ucs2_vietnamese_ci" id="147">
     <!-- Vietnamese experimental collation -->
     <rules>
       <reset>A</reset>
       <p>\u0103</p><t>\u0102</t>
       <s>\u1EB1</s><t>\u1EB0</t>
       <s>\u1EB3</s><t>\u1EB2</t>
       <s>\u1EB5</s><t>\u1EB4</t>
       <s>\u1EAF</s><t>\u1EAE</t>
       <s>\u1EB7</s><t>\u1EB6</t>
       <p>\u00E2</p><t>\u00C2</t>
       <s>\u1EA7</s><t>\u1EA6</t>
       <s>\u1EA9</s><t>\u1EA8</t>
       <s>\u1EAB</s><t>\u1EAA</t>
       <s>\u1EA5</s><t>\u1EA4</t>
       <s>\u1EAD</s><t>\u1EAC</t>
       <reset>D</reset>
       <p>\u0111</p><t>\u0110</t>
       <reset>E</reset>
       <p>\u00EA</p><t>\u00CA</t>
       <s>\u1EC1</s><t>\u1EC0</t>
       <s>\u1EC3</s><t>\u1EC2</t>
       <s>\u1EC5</s><t>\u1EC4</t>
       <s>\u1EBF</s><t>\u1EBE</t>
       <s>\u1EC7</s><t>\u1EC6</t>
       <reset>O</reset>
       <p>\u00F4</p><t>\u00D4</t>
       <s>\u1ED3</s><t>\u1ED2</t>
       <s>\u1ED5</s><t>\u1ED4</t>
       <s>\u1ED7</s><t>\u1ED6</t>
       <s>\u1ED1</s><t>\u1ED0</t>
       <s>\u1ED9</s><t>\u1ED8</t>
       <p>\u01A1</p><t>\u01A0</t>
       <s>\u1EDD</s><t>\u1EDC</t>
       <s>\u1EDF</s><t>\u1EDE</t>
       <s>\u1EE1</s><t>\u1EE0</t>
       <s>\u1EDB</s><t>\u1EDA</t>
       <s>\u1EE3</s><t>\u1EE2</t>
       <reset>U</reset>
       <p>\u01B0</p><t>\u01AF</t>
       <s>\u1EEB</s><t>\u1EEA</t>
       <s>\u1EED</s><t>\u1EEC</t>
       <s>\u1EEF</s><t>\u1EEE</t>
       <s>\u1EE9</s><t>\u1EE8</t>
       <s>\u1EF1</s><t>\u1EF0</t>
     </rules>
   </collation>
</charset>


The "id" must be greater than the last ID for ucs2.
At the time of writing, the last ucs2 collation is "Hungarian"
and its number is 146:

mysql> select collation_name, id from information_schema.collations
      > where character_set_name='ucs2' order by id;
+--------------------+-----+
| collation_name     | id  |
+--------------------+-----+
| ucs2_general_ci    |  35 |
  .....
| ucs2_hungarian_ci  | 146 |
+--------------------+-----+
21 rows in set (0.00 sec)

So we're adding 147 here.

4. Find this code:

<charset name="utf8">
   <family>Unicode</family>
   <description>UTF-8 Unicode</description>
   <alias>utf-8</alias>
   <collation name="utf8_general_ci"     id="33">
    <flag>primary</flag>
    <flag>compiled</flag>
   </collation>
   <collation name="utf8_bin"            id="83">
     <flag>binary</flag>
     <flag>compiled</flag>
   </collation>
</charset>


5. Change it to:

<charset name="utf8">
   <family>Unicode</family>
   <description>UTF-8 Unicode</description>
   <alias>utf-8</alias>
   <collation name="utf8_general_ci"     id="33">
    <flag>primary</flag>
    <flag>compiled</flag>
   </collation>
   <collation name="utf8_bin"            id="83">
     <flag>binary</flag>
     <flag>compiled</flag>
   </collation>
   <collation name="utf8_vietnamese_ci" id="211">
     <!-- Vietnamese experimental collation -->
     <rules>
       <reset>A</reset>
       <p>\u0103</p><t>\u0102</t>
       <s>\u1EB1</s><t>\u1EB0</t>
       <s>\u1EB3</s><t>\u1EB2</t>
       <s>\u1EB5</s><t>\u1EB4</t>
       <s>\u1EAF</s><t>\u1EAE</t>
       <s>\u1EB7</s><t>\u1EB6</t>
       <p>\u00E2</p><t>\u00C2</t>
       <s>\u1EA7</s><t>\u1EA6</t>
       <s>\u1EA9</s><t>\u1EA8</t>
       <s>\u1EAB</s><t>\u1EAA</t>
       <s>\u1EA5</s><t>\u1EA4</t>
       <s>\u1EAD</s><t>\u1EAC</t>
       <reset>D</reset>
       <p>\u0111</p><t>\u0110</t>
       <reset>E</reset>
       <p>\u00EA</p><t>\u00CA</t>
       <s>\u1EC1</s><t>\u1EC0</t>
       <s>\u1EC3</s><t>\u1EC2</t>
       <s>\u1EC5</s><t>\u1EC4</t>
       <s>\u1EBF</s><t>\u1EBE</t>
       <s>\u1EC7</s><t>\u1EC6</t>
       <reset>O</reset>
       <p>\u00F4</p><t>\u00D4</t>
       <s>\u1ED3</s><t>\u1ED2</t>
       <s>\u1ED5</s><t>\u1ED4</t>
       <s>\u1ED7</s><t>\u1ED6</t>
       <s>\u1ED1</s><t>\u1ED0</t>
       <s>\u1ED9</s><t>\u1ED8</t>
       <p>\u01A1</p><t>\u01A0</t>
       <s>\u1EDD</s><t>\u1EDC</t>
       <s>\u1EDF</s><t>\u1EDE</t>
       <s>\u1EE1</s><t>\u1EE0</t>
       <s>\u1EDB</s><t>\u1EDA</t>
       <s>\u1EE3</s><t>\u1EE2</t>
       <reset>U</reset>
       <p>\u01B0</p><t>\u01AF</t>
       <s>\u1EEB</s><t>\u1EEA</t>
       <s>\u1EED</s><t>\u1EEC</t>
       <s>\u1EEF</s><t>\u1EEE</t>
       <s>\u1EE9</s><t>\u1EE8</t>
       <s>\u1EF1</s><t>\u1EF0</t>
     </rules>
   </collation>
</charset>


The "id" must be greater than the last ID for ucs2.
At the time of writing, the last utf8 collation is
"Hungarian" and its number is 210:

mysql> select collation_name, id from information_schema.collations
     -> where character_set_name='utf8' order by id;
+--------------------+-----+
| collation_name     | id  |
+--------------------+-----+
| utf8_general_ci    |  33 |
  ...
| utf8_hungarian_ci  | 210 |
+--------------------+-----+
21 rows in set (0.00 sec)

So we're adding 211 here.


4. Restart mysqld

If all went well, after restart you'll have two new
collations. They will be in information_schema.collations.
CHARACTER SET utf8 COLLATE utf8_vietnamese_ci will work.
CHARACTER SET ucs2 COLLATE ucs2_vietnamese_ci will work.


WHAT TO DO IF YOU HAVE A PROBLEM, OR HAVE NO PROBLEM


There is no warranty or liability for MySQL if you
change the product. MySQL will only accept bug reports
for unchanged versions. However, our instructions work
according to our tests at this time. If the instructions
have to change due other changes in the source code in
future MySQL versions, we intend to add notes in the
MySQL Internals Manual,
http://forge.mysql.com/wiki/MySQL_Internals.

Later, someday, after everyone is sure that these
changes are what Vietnamese people really want, somebody
might consider submitting a "patch" to MySQL so
that this might appear in a future version of
MySQL/community server.

ABOUT THE AUTHORS

Alexander Barkov and Peter Gulutzan work for MySQL AB.