Bug #77907 The null character '\0' is ignored by strcmp when collation is utf8_unicode_ci
Submitted: 2 Aug 2015 5:01
Modified: 27 Oct 2015 11:11
Reporter: Jaime Sicam
Status: Not a Bug
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 5.1.76, 5.5.46, 5.6.26, 5.6.27 and 5.7.9
OS: Any
CPU Architecture: Any

[2 Aug 2015 5:01] Jaime Sicam
Description:
It seems that "\0" is ignored by strcmp when the collation is utf8_unicode_ci. I get the same result when the collation is utf8_spanish_ci or utf8_roman_ci.

How to repeat:
mysql> set @s1 = "ab";
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = "a\0\0\0b";
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> set @s1 = _utf8 'ab' collate utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = _utf8 'a\0\0\0b' collate utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> set @s1 = _utf8 'ab' collate utf8_unicode_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = _utf8 'a\0\0\0b' collate utf8_unicode_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               0 |
+-----------------+
1 row in set (0.00 sec)
[3 Aug 2015 6:42] Umesh Shastry
Hello Jaime,

Thank you for the report.
Verified as described on 5.1.76, 5.5.46, 5.6.26, 5.6.27 and 5.7.9 builds.

Thanks,
Umesh
[5 Aug 2015 19:05] Sveta Smirnova
Further study showed that this is not a bug. According to http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html:

"utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters."

\0 is such an ignorable character, as specified at http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[27 Oct 2015 11:11] Erlend Dahl
Posted by developer:

[25 Oct 2015 23:31] Xing Z Zhang

Agree with Sveta Smirnova's comments. This is "by design". Control characters
like '\0' are ignorable in UCA (the Unicode Collation Algorithm), which is
implemented in utf8_unicode_ci and other collations.
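
The difference between the two results in the report can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not MySQL's implementation and does not use the real UCA weight tables. The function names (weights_one_to_one, weights_ignorable, toy_strcmp) are hypothetical, character weights are simply lowercase code points, and '\0' is the only character treated as ignorable.

```python
# Toy model only: NOT MySQL's code and NOT the real UCA weight tables.
# Assumption: a character's weight is just its lowercase code point.

IGNORABLE = {"\0"}  # assumption: only NUL is treated as ignorable here


def weights_one_to_one(s):
    """Every character, including '\\0', contributes exactly one weight
    (loosely mimics a one-to-one collation such as utf8_general_ci)."""
    return [ord(ch.lower()) for ch in s]


def weights_ignorable(s):
    """Ignorable characters contribute no weight at all
    (loosely mimics a UCA-based collation such as utf8_unicode_ci)."""
    return [ord(ch.lower()) for ch in s if ch not in IGNORABLE]


def toy_strcmp(a, b, weights):
    """Return -1, 0 or 1 by comparing the two weight sequences."""
    wa, wb = weights(a), weights(b)
    return (wa > wb) - (wa < wb)


s1 = "ab"
s2 = "a\0\0\0b"
print(toy_strcmp(s1, s2, weights_one_to_one))  # 1: 'b' outweighs '\0'
print(toy_strcmp(s1, s2, weights_ignorable))   # 0: the '\0' characters vanish
```

In this model the NUL characters never reach the comparison when they are ignorable, so "ab" and "a\0\0\0b" produce identical weight sequences. That matches the 0 reported for utf8_unicode_ci and the 1 reported for utf8_general_ci.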