Bug #72558 LOWER() returns a wrong result for gb18030_unicode_520_ci
Submitted: 7 May 2014 10:22 Modified: 13 Jun 2014 11:22
Reporter: Alexander Barkov
Status: Closed
Category: MySQL Server: Charsets  Severity: S3 (Non-critical)
Version: 5.7.4-m14-debug  OS: Any
Assigned to:  CPU Architecture: Any

[7 May 2014 10:22] Alexander Barkov
Description:
LOWER() returns a wrong result for some characters when the
gb18030_unicode_520_ci collation is used.

How to repeat:
set names utf8mb4;

select hex(a),
       hex(convert(a using gb18030)) as hn,
       a,
       lower(a) as l_utf32,
       lower(convert(a using gb18030) collate gb18030_chinese_ci) as l_gb18030_chi,
       lower(convert(a using gb18030) collate gb18030_unicode_520_ci) as l_gb18030_uni
from (select _utf32 0x216A as a union all select _utf32 0x10300) as list;
Query OK, 0 rows affected (0.00 sec)

+----------+----------+------+---------+---------------+---------------+
| hex(a)   | hn       | a    | l_utf32 | l_gb18030_chi | l_gb18030_uni |
+----------+----------+------+---------+---------------+---------------+
| 0000216A | A2FB     | Ⅺ   | ⅺ       | ⅺ            |               |
| 00010300 | 9030CD38 |
[7 May 2014 11:24] Alexander Barkov
2 rows in set (0.03 sec)

U+216A  ROMAN NUMERAL ELEVEN (GB+A2FB)
U+10300 OLD ITALIC LETTER A  (GB+9030CD38)
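
The two GB18030 byte sequences above can be cross-checked outside the server. The following is a minimal sketch that assumes Python's built-in gb18030 codec agrees with MySQL's conversion for these two code points; the names REPORTED and gb18030_hex are illustrative and not part of the report.

```python
# Code points and GB18030 byte sequences as quoted in the report.
REPORTED = {
    0x216A: "A2FB",       # ROMAN NUMERAL ELEVEN
    0x10300: "9030CD38",  # OLD ITALIC LETTER A
}


def gb18030_hex(code_point: int) -> str:
    """Return the GB18030 encoding of a code point as uppercase hex."""
    return chr(code_point).encode("gb18030").hex().upper()


for cp, reported in REPORTED.items():
    # Print the locally computed bytes next to the value from the report.
    print(f"U+{cp:04X} -> GB+{gb18030_hex(cp)} (report: GB+{reported})")
```

This only checks the hn column of the query; it says nothing about how LOWER() behaves in the server.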

l_utf32 is correct
l_gb18030_chi is correct

l_gb18030_uni looks wrong for both rows.
The expected result should be the same as in the other two columns.
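
As a reference for what the correct columns should contain, the expected lowercase code points can be derived independently. This is a rough sketch that assumes Python's str.lower() (which follows the Unicode version bundled with the interpreter, not necessarily Unicode 5.2.0) gives the same mapping for these two characters; expected_lower is a hypothetical helper name.

```python
import unicodedata


def expected_lower(code_point: int) -> str:
    """Lowercase a single code point using Python's Unicode case mapping."""
    return chr(code_point).lower()


for cp in (0x216A, 0x10300):
    low = expected_lower(cp)
    # Both inputs lowercase to a single character, so ord() is safe here.
    print(
        f"U+{cp:04X} {unicodedata.name(chr(cp))} -> "
        f"U+{ord(low):04X} {unicodedata.name(low)}"
    )
    # The expected result should survive a GB18030 round trip unchanged.
    assert low.encode("gb18030").decode("gb18030") == low
```

Under that assumption, U+216A maps to U+217A, while U+10300 has no case mapping and stays unchanged, so a correct LOWER() would return a non-empty value for both rows.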
[7 May 2014 11:26] Alexander Barkov
The bug system obviously does not support non-BMP characters.
The original report body was cut.
Anyway, the SQL script in the "how to repeat" section
should be enough to reproduce the problem.
[7 May 2014 14:30] MySQL Verification Team
Thank you for the bug report.
[13 Jun 2014 11:22] Erlend Dahl
[11 Jun 2014 9:31] Paul Dubois

Noted in 5.7.5 changelog.

The code for processing the gb18030 character set had a too-strict
assertion for single-character invalid characters.