Bug #46504 Not regconizing some multibyte chars with user-defined utf8 collation
Submitted: 31 Jul 2009 22:58 Modified: 18 Jan 2018 13:13
Reporter: Donnie To Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.1.36 OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[31 Jul 2009 22:58] Donnie To
Description:
I followed the instruction here to add a new unicode collation by changing the markup file:

http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/

All multibyte chars work well, except the following chars:

        <reset>U</reset>
        <p>\u00F9</p>
        <t>\u00D9</t>
        <p>\u1EE7</p>
        <t>\u1EE6</t>
        <p>\u0169</p>
        <t>\u0168</t>
        <p>\u00FA</p>
        <t>\u00DA</t>
        <p>\u1EE5</p>
        <t>\u1EE4</t>
        <p>\u01B0</p>
        <t>\u01AF</t>
        <p>\u1EEB</p>
        <t>\u1EEA</t>
        <p>\u1EED</p>
        <t>\u1EEC</t>
        <p>\u1EEF</p>
        <t>\u1EEE</t>
        <p>\u1EE9</p>
        <t>\u1EE8</t>
        <p>\u1EF1</p>
        <t>\u1EF0</t>
        <reset>Y</reset>
        <p>\u1EF3</p>
        <t>\u1EF2</t>
        <p>\u1EF7</p>
        <t>\u1EF6</t>
        <p>\u1EF9</p>
        <t>\u1EF8</t>
        <p>\u00FD</p>
        <t>\u00DD</t>
        <p>\u1EF5</p>
        <t>\u1EF4</t>

Kind regards,
Donnie

How to repeat:
    <collation name="utf8_newcollation" id="211">
      <rules>
        <reset>A</reset>
        <p>\u00E0</p>
        <t>\u00C0</t>
        <p>\u1EA3</p>
        <t>\u1EA2</t>
        <p>\u00E3</p>
        <t>\u00C3</t>
        <p>\u00E1</p>
        <t>\u00C1</t>
        <p>\u1EA1</p>
        <t>\u1EA0</t>
        <p>\u0103</p>
        <t>\u0102</t>
        <p>\u1EB1</p>
        <t>\u1EB0</t>
        <p>\u1EB3</p>
        <t>\u1EB2</t>
        <p>\u1EB5</p>
        <t>\u1EB4</t>
        <p>\u1EAF</p>
        <t>\u1EAE</t>
        <p>\u1EB7</p>
        <t>\u1EB6</t>
        <p>\u00E2</p>
        <t>\u00C2</t>
        <p>\u1EA7</p>
        <t>\u1EA6</t>
        <p>\u1EA9</p>
        <t>\u1EA8</t>
        <p>\u1EAB</p>
        <t>\u1EAA</t>
        <p>\u1EA5</p>
        <t>\u1EA4</t>
        <p>\u1EAD</p>
        <t>\u1EAC</t>
        <reset>D</reset>
        <p>\u0111</p>
        <t>\u0110</t>
        <reset>E</reset>
        <p>\u00E8</p>
        <t>\u00C8</t>
        <p>\u1EBB</p>
        <t>\u1EBA</t>
        <p>\u1EBD</p>
        <t>\u1EBC</t>
        <p>\u00E9</p>
        <t>\u00C9</t>
        <p>\u1EB9</p>
        <t>\u1EB8</t>
        <p>\u00EA</p>
        <t>\u00CA</t>
        <p>\u1EC1</p>
        <t>\u1EC0</t>
        <p>\u1EC3</p>
        <t>\u1EC2</t>
        <p>\u1EC5</p>
        <t>\u1EC4</t>
        <p>\u1EBF</p>
        <t>\u1EBE</t>
        <p>\u1EC7</p>
        <t>\u1EC6</t>
        <reset>I</reset>
        <p>\u00EC</p>
        <t>\u00CC</t>
        <p>\u1EC9</p>
        <t>\u1EC8</t>
        <p>\u0129</p>
        <t>\u0128</t>
        <p>\u00ED</p>
        <t>\u00CD</t>
        <p>\u1ECB</p>
        <t>\u1ECA</t>
        <reset>O</reset>
        <p>\u00F2</p>
        <t>\u00D2</t>
        <p>\u1ECF</p>
        <t>\u1ECE</t>
        <p>\u00F5</p>
        <t>\u00D5</t>
        <p>\u00F3</p>
        <t>\u00D3</t>
        <p>\u1ECD</p>
        <t>\u1ECC</t>
        <p>\u00F4</p>
        <t>\u00D4</t>
        <p>\u1ED3</p>
        <t>\u1ED2</t>
        <p>\u1ED5</p>
        <t>\u1ED4</t>
        <p>\u1ED7</p>
        <t>\u1ED6</t>
        <p>\u1ED1</p>
        <t>\u1ED0</t>
        <p>\u1ED9</p>
        <t>\u1ED8</t>
        <p>\u01A1</p>
        <t>\u01A0</t>
        <p>\u1EDD</p>
        <t>\u1EDC</t>
        <p>\u1EDF</p>
        <t>\u1EDE</t>
        <p>\u1EE1</p>
        <t>\u1EE0</t>
        <p>\u1EDB</p>
        <t>\u1EDA</t>
        <p>\u1EE3</p>
        <t>\u1EE2</t>
        <reset>U</reset>
        <p>\u00F9</p>
        <t>\u00D9</t>
        <p>\u1EE7</p>
        <t>\u1EE6</t>
        <p>\u0169</p>
        <t>\u0168</t>
        <p>\u00FA</p>
        <t>\u00DA</t>
        <p>\u1EE5</p>
        <t>\u1EE4</t>
        <p>\u01B0</p>
        <t>\u01AF</t>
        <p>\u1EEB</p>
        <t>\u1EEA</t>
        <p>\u1EED</p>
        <t>\u1EEC</t>
        <p>\u1EEF</p>
        <t>\u1EEE</t>
        <p>\u1EE9</p>
        <t>\u1EE8</t>
        <p>\u1EF1</p>
        <t>\u1EF0</t>
        <reset>Y</reset>
        <p>\u1EF3</p>
        <t>\u1EF2</t>
        <p>\u1EF7</p>
        <t>\u1EF6</t>
        <p>\u1EF9</p>
        <t>\u1EF8</t>
        <p>\u00FD</p>
        <t>\u00DD</t>
        <p>\u1EF5</p>
        <t>\u1EF4</t>
      </rules>
    </collation>
[1 Aug 2009 6:24] Sveta Smirnova
Thank you for the report.

> All multibyte chars work well, except the following chars:

Could you please provide test case showing how "following chars" work incorrectly?
[1 Aug 2009 22:31] Donnie To
For those multibyte chars, I used mysql fulltext search to search for them, but it can't distinguish one from another.

For example, consider the following 2 pairs:
<p>\u00F9</p>
<t>\u00D9</t>
<p>\u1EE7</p>
<t>\u1EE6</t>

u00F9 is the same as u00D9,
u1EE7 is the same as u1EE6,
but u00F9 or u00D9 is NOT the same as u1EE7, u1EE6. However, mysql fulltext search returns results with all those multibyte chars as the same.

Thx!
[3 Aug 2009 7:30] Susanne Ebrecht
What signs do you expect from u00F9, u00D9, u1EE7, u1EE6?

All for codes not seem to be valid utf8 signs.

Please show us here the printed version of the signs.
[3 Aug 2009 7:38] Donnie To
Hi Susanne,

They are all valid. You may check the signs here: http://www.fileformat.info/
eg. For u00F9, it is here http://www.fileformat.info/info/unicode/char/00f9/index.htm

Thank you!
[3 Aug 2009 8:08] Susanne Ebrecht
Many thanks for your fast feedback.

According to your link: http://www.fileformat.info/info/unicode/char/00f9/index.htm

0x00f9 is not a valid utf8 sign. It is utf16 sign.

The right utf8 hex value for ù is 0xc3b9

Same with your other hex values. They all are for utf16 and not for utf8. MySQL don't support utf16. MySQL only support utf8.
[3 Aug 2009 8:25] Donnie To
Are they should be in "C/C++/Java source code" form like "\u00F9" ?

http://blogs.mysql.com/peterg/2008/05/19/instructions-for-adding-a-new-unicode-collation/

According to the above instruction, they also used the "C/C++/Java source code" format.

<reset>U</reset>
       <p>\u01B0</p><t>\u01AF</t>
       <s>\u1EEB</s><t>\u1EEA</t>
       <s>\u1EED</s><t>\u1EEC</t>
       <s>\u1EEF</s><t>\u1EEE</t>
       <s>\u1EE9</s><t>\u1EE8</t>
       <s>\u1EF1</s><t>\u1EF0</t>
[5 Aug 2009 4:43] Donnie To
Sorry, changed the status back to open.
[5 Aug 2009 6:23] Sveta Smirnova
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

According to http://dev.mysql.com/doc/refman/5.1/en/adding-collation-unicode-uca.html:

----<q>----
# Shift rules define primary, secondary, and tertiary differences of a character from another character. They are specified using <p>, <s>, and <t> elements. Either of the following rules specifies a primary shift rule for the 'G' character:

<p>G</p>

<p>\u0047</p>

    *   Use primary differences to distinguish separate letters.
    *      Use secondary differences to distinguish accent variations.
    *      Use tertiary differences to distinguish lettercase variations.

----</q>----

You define these problem letters as:

<p>\u00E0</p>
<t>\u00C0</t>
<p>\u1EA3</p>
<t>\u1EA2</t>

So you specify them as same letters, but with different lettercase. MySQL fulltext search is case insensitive, this is why you get same results.
[5 Aug 2009 6:49] Donnie To
Thank you for explaination! But something seems not correct!

If I specified them as same letters, but with different lettercase as the following case, I understand that they'd be the same.

<s>\u00E0</s>
<t>\u00C0</t>
<s>\u1EA3</s>
<t>\u1EA2</t>

but I specify them like this:
<p>\u00E0</p>
<t>\u00C0</t>
<p>\u1EA3</p>
<t>\u1EA2</t>

so \u00E0 is supposed to be diffent than \u1EA3 (also that's what I want). Is that correct?

FYI, I reported on the first post that all of other cases are working correctly, except the "u" and "y" cases on the bottom. (is that caused by array size or something like that?)

<collation name="utf8_newcollation" id="211">
      <rules>
        <reset>A</reset>
        <p>\u00E0</p>
        <t>\u00C0</t>
        <p>\u1EA3</p>
        <t>\u1EA2</t>
        <p>\u00E3</p>
        <t>\u00C3</t>
        <p>\u00E1</p>
        <t>\u00C1</t>
        <p>\u1EA1</p>
        <t>\u1EA0</t>
        <p>\u0103</p>
        <t>\u0102</t>
        <p>\u1EB1</p>
        <t>\u1EB0</t>
        <p>\u1EB3</p>
        <t>\u1EB2</t>
        <p>\u1EB5</p>
        <t>\u1EB4</t>
        <p>\u1EAF</p>
        <t>\u1EAE</t>
        <p>\u1EB7</p>
        <t>\u1EB6</t>
        <p>\u00E2</p>
        <t>\u00C2</t>
        <p>\u1EA7</p>
        <t>\u1EA6</t>
        <p>\u1EA9</p>
        <t>\u1EA8</t>
        <p>\u1EAB</p>
        <t>\u1EAA</t>
        <p>\u1EA5</p>
        <t>\u1EA4</t>
        <p>\u1EAD</p>
        <t>\u1EAC</t>
        <reset>D</reset>
        <p>\u0111</p>
        <t>\u0110</t>
        <reset>E</reset>
        <p>\u00E8</p>
        <t>\u00C8</t>
        <p>\u1EBB</p>
        <t>\u1EBA</t>
        <p>\u1EBD</p>
        <t>\u1EBC</t>
        <p>\u00E9</p>
        <t>\u00C9</t>
        <p>\u1EB9</p>
        <t>\u1EB8</t>
        <p>\u00EA</p>
        <t>\u00CA</t>
        <p>\u1EC1</p>
        <t>\u1EC0</t>
        <p>\u1EC3</p>
        <t>\u1EC2</t>
        <p>\u1EC5</p>
        <t>\u1EC4</t>
        <p>\u1EBF</p>
        <t>\u1EBE</t>
        <p>\u1EC7</p>
        <t>\u1EC6</t>
        <reset>I</reset>
        <p>\u00EC</p>
        <t>\u00CC</t>
        <p>\u1EC9</p>
        <t>\u1EC8</t>
        <p>\u0129</p>
        <t>\u0128</t>
        <p>\u00ED</p>
        <t>\u00CD</t>
        <p>\u1ECB</p>
        <t>\u1ECA</t>
        <reset>O</reset>
        <p>\u00F2</p>
        <t>\u00D2</t>
        <p>\u1ECF</p>
        <t>\u1ECE</t>
        <p>\u00F5</p>
        <t>\u00D5</t>
        <p>\u00F3</p>
        <t>\u00D3</t>
        <p>\u1ECD</p>
        <t>\u1ECC</t>
        <p>\u00F4</p>
        <t>\u00D4</t>
        <p>\u1ED3</p>
        <t>\u1ED2</t>
        <p>\u1ED5</p>
        <t>\u1ED4</t>
        <p>\u1ED7</p>
        <t>\u1ED6</t>
        <p>\u1ED1</p>
        <t>\u1ED0</t>
        <p>\u1ED9</p>
        <t>\u1ED8</t>
        <p>\u01A1</p>
        <t>\u01A0</t>
        <p>\u1EDD</p>
        <t>\u1EDC</t>
        <p>\u1EDF</p>
        <t>\u1EDE</t>
        <p>\u1EE1</p>
        <t>\u1EE0</t>
        <p>\u1EDB</p>
        <t>\u1EDA</t>
        <p>\u1EE3</p>
        <t>\u1EE2</t>
/* Beginning here, they're not working correctly */
        <reset>U</reset>
        <p>\u00F9</p>
        <t>\u00D9</t>
        <p>\u1EE7</p>
        <t>\u1EE6</t>
        <p>\u0169</p>
        <t>\u0168</t>
        <p>\u00FA</p>
        <t>\u00DA</t>
        <p>\u1EE5</p>
        <t>\u1EE4</t>
        <p>\u01B0</p>
        <t>\u01AF</t>
        <p>\u1EEB</p>
        <t>\u1EEA</t>
        <p>\u1EED</p>
        <t>\u1EEC</t>
        <p>\u1EEF</p>
        <t>\u1EEE</t>
        <p>\u1EE9</p>
        <t>\u1EE8</t>
        <p>\u1EF1</p>
        <t>\u1EF0</t>
        <reset>Y</reset>
        <p>\u1EF3</p>
        <t>\u1EF2</t>
        <p>\u1EF7</p>
        <t>\u1EF6</t>
        <p>\u1EF9</p>
        <t>\u1EF8</t>
        <p>\u00FD</p>
        <t>\u00DD</t>
        <p>\u1EF5</p>
        <t>\u1EF4</t>
      </rules>
    </collation>
[5 Aug 2009 7:13] Donnie To
I forgot to mention the sources.

http://vietunicode.sourceforge.net/howto/vietcollationmysql.html
http://bugs.mysql.com/bug.php?id=4745
[5 Aug 2009 7:25] Sveta Smirnova
Thank you for the feedback.

> FYI, I reported on the first post that all of other cases are working correctly, except
> the "u" and "y" cases on the bottom. (is that caused by array size or something like
> that?)

Sorry, misread. Verified as described:

mysql> create table t1(f1 varchar(100) character set utf8 collate  utf8_newcollation );
Query OK, 0 rows affected (0.31 sec)

mysql> set names utf8;
Query OK, 0 rows affected (0.08 sec)

mysql> insert into t1 values(0xc3a0);
Query OK, 1 row affected (0.04 sec)

mysql> insert into t1 values(0xe1baa3);
Query OK, 1 row affected (0.00 sec)

mysql> select * from t1;
+------+
| f1   |
+------+
| à   |
| ả  |
+------+
2 rows in set (0.06 sec)

mysql> select distinct f1 from t1;
+------+
| f1   |
+------+
| à   |
| ả  |
+------+
2 rows in set (0.12 sec)

mysql> alter table t1 add fulltext(f1);
Query OK, 2 rows affected (0.45 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> insert into t1 values(0xc3bd);
Query OK, 1 row affected (0.00 sec)

mysql> insert into t1 values(0xc3bd);
Query OK, 1 row affected (0.03 sec)

mysql> insert into t1 values(0xc39d);
Query OK, 1 row affected (0.03 sec)

mysql> insert into t1 values(0xc39d);
Query OK, 1 row affected (0.12 sec)

mysql> insert into t1 values(0xe1bbb5);
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 values(0xe1bbb5);
Query OK, 1 row affected (0.06 sec)

mysql> insert into t1 values(0xe1bbb4);
Query OK, 1 row affected (0.01 sec)

mysql> insert into t1 values(0xe1bbb4);
Query OK, 1 row affected (0.00 sec)

mysql> select distinct f1 from t1;
+------+
| f1   |
+------+
| à   |
| ả  |
| ý   |
+------+
3 rows in set (0.01 sec)

mysql> alter table t1 drop key f1;
Query OK, 12 rows affected (0.31 sec)
Records: 12  Duplicates: 0  Warnings: 0

mysql> select distinct f1 from t1;
+------+
| f1   |
+------+
| à   |
| ả  |
| ý   |
+------+
3 rows in set (0.00 sec)
[5 Aug 2009 7:33] Sveta Smirnova
Seems this really caused "by array size or something like that" as problem is not repeatable with collaltion defined as:

<collation name="utf8_newcollation" id="211">
      <rules>
        <reset>Y</reset>
        <p>\u1EF3</p>
        <t>\u1EF2</t>
        <p>\u1EF7</p>
        <t>\u1EF6</t>
        <p>\u1EF9</p>
        <t>\u1EF8</t>
        <p>\u00FD</p>
        <t>\u00DD</t>
        <p>\u1EF5</p>
        <t>\u1EF4</t>
      </rules>
    </collation>
[9 Aug 2009 21:48] Donnie To
Thank you so much for the bug verification, Sveta Smirnova! Hopefully it'll be fixed in the next release.
[28 Sep 2009 5:31] Donnie To
Any updates, please? 

Is there a way to fix it by myself like fixing the array sizes?
[18 Jan 2018 13:13] Erlend Dahl
Fixed in 5.6.1.