Bug #55980 Character sets: supplementary character _bin ordering is wrong
Submitted: 13 Aug 2010 23:05 Modified: 26 Nov 2010 19:19
Reporter: Peter Gulutzan
Status: Not a Bug
Category: MySQL Server: Charsets Severity: S3 (Non-critical)
Version: 5.5.6-m3, 5.6.0-m4 OS: Linux (SUSE 64-bit)
Assigned to: Alexander Barkov
Triage: Triaged: D2 (Serious)

[13 Aug 2010 23:05] Peter Gulutzan
Description:
From the worklog task specification (WL#1213):

"
Here is a chart showing two rare characters. The first character is
in the range E000-FFFF, so it is greater than a surrogate but
less than a supplementary. The second character is a supplementary.

 Code point  Character                    utf16        utf8mb4
 ----------  ---------                    -----        -------
 0FF9D       HALFWIDTH KATAKANA LETTER N  FF 9D        EF BE 9D
 10384       UGARITIC LETTER DELTA        D8 00 DF 84  F0 90 8E 84

The two characters in the chart are in order by code point value,
and they are in order by utf8mb4 value, but they are not in order
by utf16 value, because 0xff > 0xd8.
"
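
The chart can be checked outside the server. The following is a minimal Python sketch, not MySQL code: it assumes Python's "utf-16-be", "utf-32-be" and "utf-8" codecs produce the same bytes that MySQL stores for utf16, utf32 and utf8mb4, and the helper name encodings() is made up for illustration.

```python
# Encode the two characters from the chart and compare the raw bytes.
# Assumption: Python's big-endian codecs match MySQL's stored byte sequences.

def encodings(ch):
    """Return the utf16, utf32 and utf8mb4 bytes of a single character."""
    return {
        "utf16": ch.encode("utf-16-be"),
        "utf32": ch.encode("utf-32-be"),
        "utf8mb4": ch.encode("utf-8"),
    }

katakana = encodings("\uff9d")       # U+FF9D, first row of the chart
ugaritic = encodings("\U00010384")   # U+10384, second row of the chart

for name in ("utf16", "utf32", "utf8mb4"):
    # True means a byte-by-byte comparison agrees with code point order.
    agrees = katakana[name] < ugaritic[name]
    print(name, katakana[name].hex(), ugaritic[name].hex(), agrees)
```

Only the utf16 row should report a disagreement, since its first byte 0xff is greater than the lead surrogate byte 0xd8.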

For a _bin collation, ordering must be byte-by-byte
(although perhaps it will be word-by-word if we implement utf16le).
We considered that utf16_bin could be according to code point,
but the decision was that even for utf16_bin it should be byte-by-byte.

But in mysql-trunk, pulled today, ordering of these characters is
not byte-by-byte for all cases.

How to repeat:
drop table if exists t;
create table t (utf16 char(1) character set utf16 collate utf16_bin,
                utf32 char(1) character set utf32 collate utf32_bin,
                utf8mb4 char(1) character set utf8mb4 collate utf8mb4_bin);
insert into t (utf32) values (0xff9d), (0x10384);
update t set utf16 = utf32, utf8mb4 = utf32;
select hex(utf16),hex(utf32),hex(utf8mb4) from t order by utf16;
select hex(utf16),hex(utf32),hex(utf8mb4) from t order by utf8mb4;
select hex(utf16),hex(utf32),hex(utf8mb4) from t order by utf32;
[15 Aug 2010 11:01] Sveta Smirnova
Thank you for the report.

Verified as described.
[24 Aug 2010 6:34] Alexander Barkov
The same problem is repeatable in 5.5.6-m3

(the original report was about 5.6.0-m4).
[31 Aug 2010 12:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117225

3195 Alexander Barkov	2010-08-31
      Bug#55980 Character sets: supplementary character _bin ordering is wrong
      
      Problem:
      - ORDER BY for utf8mb4_bin, utf16_bin and utf32_bin returned
        results in a wrong order, because old functions
        (supporting only BMP range) were used to handle these collations.
      - Additionally, utf16_bin did not sort supplementary characters
        between U+D700 and U+E000, as WL#1213 specification specified.
      
        mysql-test/include/ctype_filesort2.inc
        Adding a new shared test file
      
        include/m_ctype.h
        Adding prototypes
      
        mysql-test/r/ctype_utf16.result
        mysql-test/r/ctype_utf32.result
        mysql-test/r/ctype_utf8mb4.result
        mysql-test/t/ctype_utf16.test
        mysql-test/t/ctype_utf32.test
        mysql-test/t/ctype_utf8mb4.test
        Adding tests
      
        strings/ctype-ucs2.c
        - Fixing my_strncoll[sp]_utf16_bin to compare
          binary representation instead of code points,
          to make columns with indexes sort correct.
        - Fixing my_collation_handler_utf32_bin and
          my_collation_handler_utf16_bin to use new
          functions
       
        strings/ctype-utf8.c
        - Adding my_strnxfrm[len]_unicode_fill_bin()
          to handle utf8mb4_bin, utf16_bin and utf32_bin,
          using 3 bytes per weight.
          This function also performs special reordering in case of utf16_bin.
        - Fixing my_collation_utf8mb4_bin handler to use the
          new function.
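
To illustrate what "3 bytes per weight" with a utf16_bin reordering could look like, here is a rough Python sketch. It is not the code from strings/ctype-utf8.c: the function names and the particular remapping are assumptions chosen only so that the resulting sort keys order the same way as big-endian UTF-16 bytes.

```python
# Hypothetical sort-key sketch: one fixed 3-byte weight per code point.
# The remapping below is an assumption for illustration, not MySQL's code.

def weight_bin(cp):
    """Weight equal to the code point itself, as 3 big-endian bytes."""
    return cp.to_bytes(3, "big")

def weight_utf16_reordered(cp):
    """Weight that sorts the way big-endian UTF-16 bytes would sort."""
    if cp >= 0x10000:
        # Supplementary characters slot in where surrogates begin (0xD800).
        cp = cp - 0x10000 + 0xD800
    elif cp >= 0xE000:
        # U+E000..U+FFFF move above every supplementary character.
        cp = cp + 0x100000
    return cp.to_bytes(3, "big")

def sort_key(text, weight):
    """Concatenate the per-character weights of a string."""
    return b"".join(weight(ord(ch)) for ch in text)

samples = ["\u0041", "\ud7ff", "\uff9d", "\U00010384", "\U0010ffff"]
by_code_point = sorted(samples, key=lambda s: sort_key(s, weight_bin))
by_utf16_bytes = sorted(samples, key=lambda s: sort_key(s, weight_utf16_reordered))

print([hex(ord(s)) for s in by_code_point])
print([hex(ord(s)) for s in by_utf16_bytes])
```

With these weights the largest remapped value is 0x10FFFF, so every weight still fits in 3 bytes.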
[31 Aug 2010 13:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117256

3197 Alexander Nozdrin	2010-08-31
      Bug#55980 Character sets: supplementary character _bin ordering is wrong
      
      Problem:
      - ORDER BY for utf8mb4_bin, utf16_bin and utf32_bin returned
        results in a wrong order, because old functions
        (supporting only BMP range) were used to handle these collations.
      - Additionally, utf16_bin did not sort supplementary characters
        between U+D700 and U+E000, as WL#1213 specification specified.
     @ include/m_ctype.h
        Adding prototypes.
     @ mysql-test/include/ctype_filesort2.inc
        Adding a new shared test file.
     @ mysql-test/t/ctype_utf8mb4.test
        Adding tests.
     @ strings/ctype-ucs2.c
        - Fixing my_strncoll[sp]_utf16_bin to compare
          binary representation instead of code points,
          to make columns with indexes sort correct.
        - Fixing my_collation_handler_utf32_bin and
          my_collation_handler_utf16_bin to use new
          functions.
     @ strings/ctype-utf8.c
        - Adding my_strnxfrm[len]_unicode_fill_bin()
          to handle utf8mb4_bin, utf16_bin and utf32_bin,
          using 3 bytes per weight.
          This function also performs special reordering in case of utf16_bin.
        - Fixing my_collation_utf8mb4_bin handler to use the
          new function.
[31 Aug 2010 14:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117262

3086 Alexander Nozdrin	2010-08-31
      Cherry-picking patch for Bug#55980.
      Original changeset:
      ------------------------------------------------------------
      revno: 3197
      revision-id: alik@sun.com-20100831135426-h5a4s2w6ih1d8q2x
      parent: magnus.blaudd@sun.com-20100830120632-u3xzy002mdwueli8
      committer: Alexander Nozdrin <alik@sun.com>
      branch nick: mysql-5.5-bugfixing
      timestamp: Tue 2010-08-31 17:54:26 +0400
      message:
        Bug#55980 Character sets: supplementary character _bin ordering is wrong
        
        Problem:
        - ORDER BY for utf8mb4_bin, utf16_bin and utf32_bin returned
          results in a wrong order, because old functions
          (supporting only BMP range) were used to handle these collations.
        - Additionally, utf16_bin did not sort supplementary characters
          between U+D700 and U+E000, as WL#1213 specification specified.
      ------------------------------------------------------------
[1 Sep 2010 6:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/117288

3250 Alexander Barkov	2010-09-01 [merge]
      Merging Bug#55980 from mysql-5.5-bugfixing and
      applying "WL#3664 strnxfrm() changes for prefix keys and NOPAD"
      related changes.
[1 Sep 2010 7:07] Alexander Barkov
See also:

Bug#37244 Character sets: short utf8_bin weight_string value
[1 Sep 2010 7:50] Alexander Barkov
Pushed into mysql-5.5-bugfixing     [5.5.6-m3]
Pushed into mysql-trunk-bugfixing   [5.6.1-m4]
Pushed into mysql-next-mr-bugfixing [5.6.99-m5]
[10 Sep 2010 18:52] Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:joerg@mysql.com-20100910184813-csdto6tk4nlogrsq) (version source revid:davi.arnaut@oracle.com-20100831142822-2qhufn3hho4xqr4p) (merge vers: 5.5.7-m3) (pib:21)
[13 Sep 2010 10:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118059

3194 Dmitry Lenev	2010-09-13 [merge]
      Null-merge fix for bug#55980 from mysql-5.5.6-m3-release
      tree into mysql-trunk tree. Proper version of the fix
      for this tree will come from mysql-trunk-bugfixing.
[13 Sep 2010 13:50] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:dlenev@mysql.com-20100913103627-p2oqplu42x1gv2bd) (version source revid:dlenev@mysql.com-20100913100411-2qdg15bp0qu98ce5) (merge vers: 5.6.1-m4) (pib:21)
[13 Sep 2010 13:52] Bugs System
Pushed into mysql-next-mr (revid:dlenev@mysql.com-20100913121556-sfxqlpj9kbc28kaf) (version source revid:davi.arnaut@oracle.com-20100831142822-2qhufn3hho4xqr4p) (pib:21)
[24 Sep 2010 19:35] Paul Dubois
Noted in 5.5.7, 5.6.1 changelogs.

The ordering for supplementary characters with the utf8mb4_bin,
utf16_bin, and utf32_bin collations was incorrect.
[2 Oct 2010 18:13] Bugs System
Pushed into mysql-trunk 5.6.1-m4 (revid:alexander.nozdrin@oracle.com-20101002180948-852x1cuv7c6i85ea) (version source revid:alexander.nozdrin@oracle.com-20101002180857-an32jpuwzemsp4f2) (merge vers: 5.6.1-m4) (pib:21)
[2 Oct 2010 18:14] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[2 Oct 2010 18:16] Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (version source revid:alexander.nozdrin@oracle.com-20101002180831-590ka2tuit9qoxbb) (merge vers: 5.5.7-rc) (pib:21)
[24 Nov 2010 15:45] Alexander Barkov
The patch reverting the change about supplementary characters in utf16.

Attachment: b55980-revert.diff (text/x-patch), 2.03 KiB.

[24 Nov 2010 15:48] Alexander Barkov
A patch reverting the change about the order of supplementary characters
has been applied. See the patch in the "files" section.
Let utf16_bin be "code point" order, according to this manual section:
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html
[26 Nov 2010 19:19] Peter Gulutzan
It has been decided that the described
behaviour is correct -- with utf16_bin,
ordering should be by code point, not
byte by byte. So this is not a bug.
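
For readers comparing the two rules, a small Python sketch of a code point comparison on UTF-16 data follows. The function names are hypothetical and the sketch ignores padding and other details of the server's collation handlers; it only shows how the two characters from the original report compare under each rule.

```python
# Sketch: compare big-endian UTF-16 byte strings by code point.
# Function names are made up for illustration; this is not server code.

def utf16_code_points(data):
    """Decode big-endian UTF-16 bytes into a list of code points."""
    return [ord(ch) for ch in data.decode("utf-16-be")]

def compare_utf16_bin(a, b):
    """Return -1, 0 or 1, comparing by code point rather than by byte."""
    ka, kb = utf16_code_points(a), utf16_code_points(b)
    return (ka > kb) - (ka < kb)

katakana = bytes.fromhex("ff9d")      # U+FF9D
ugaritic = bytes.fromhex("d800df84")  # U+10384 as a surrogate pair

code_point_result = compare_utf16_bin(katakana, ugaritic)
byte_result = (katakana > ugaritic) - (katakana < ugaritic)
print(code_point_result, byte_result)
```

Under the code point rule U+FF9D sorts before U+10384, while a byte-by-byte comparison of the same data puts it after.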
[3 Dec 2010 9:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/125906

3408 Alexander Barkov	2010-12-03
      Bug#55980 
      
      Reverting the "utf16_bin is byte-by-byte" patch.
      
      This reverting patch is actually already in mysql-5.5-security,
      which will be merged to mysql-trunk-* later this month.
      But I need this reverting patch now, as pre-requisite for WL#4616.
      
      Applying it to mysql-trunk-bugfixing manually, not to wait for merge.
[5 Dec 2010 12:39] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (version source revid:alexander.nozdrin@oracle.com-20101205122447-6x94l4fmslpbttxj) (merge vers: 5.6.1) (pib:23)
[16 Dec 2010 21:47] Bugs System
Pushed into mysql-trunk 5.6.1 (revid:alexander.nozdrin@oracle.com-20101216181820-7afubgk2fmuv9qsb) (version source revid:alexander.nozdrin@oracle.com-20101216173826-ze3y5h450sksotrh) (merge vers: 5.6.1) (pib:23)
[16 Dec 2010 22:28] Bugs System
Pushed into mysql-5.5 5.5.9 (revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (version source revid:jonathan.perkin@oracle.com-20101216101358-fyzr1epq95a3yett) (merge vers: 5.5.9) (pib:24)