MySQL Bugs: #77907: The null character '\0' is ignored by strcmp when collation is utf8_unicode_ci

Bug #77907	The null character '\0' is ignored by strcmp when collation is utf8_unicode_ci
Submitted:	2 Aug 2015 5:01	Modified:	27 Oct 2015 11:11
Reporter:	Jaime Sicam
Status:	Not a Bug
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.1.76, 5.5.46, 5.6.26, 5.6.27 and 5.7.9	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
It seems that "\0" is ignored by strcmp when the collation is utf8_unicode_ci. I get the same result when the collation is utf8_spanish_ci or utf8_roman_ci.

How to repeat:
mysql> set @s1 = "ab";
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = "a\0\0\0b";
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> set @s1 = _utf8 'ab' collate utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = _utf8 'a\0\0\0b' collate utf8_general_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

mysql> set @s1 = _utf8 'ab' collate utf8_unicode_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> set @s2 = _utf8 'a\0\0\0b' collate utf8_unicode_ci;
Query OK, 0 rows affected (0.00 sec)

mysql> select strcmp(@s1,@s2);
+-----------------+
| strcmp(@s1,@s2) |
+-----------------+
|               0 |
+-----------------+
1 row in set (0.00 sec)

Hello Jaime,

Thank you for the report.
Verified as described on 5.1.76, 5.5.46, 5.6.26, 5.6.27 and 5.7.9 builds.

Thanks,
Umesh

Further investigation showed that this is not a bug. According to http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html:

"utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters."

\0 is such an ignorable character, as specified at http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Posted by developer:

[25 Oct 2015 23:31] Xing Z Zhang

Agree with Sveta Smirnova's comments. This is "by design". Control characters
like '\0' are ignorable in the UCA (Unicode Collation Algorithm), which is
implemented by utf8_unicode_ci and other collations.
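
Since the behavior is by design, a comparison that must treat '\0' as significant needs a collation without ignorable characters. The following is a sketch of one possible approach, not a recommendation from the developers in this thread: it assumes that a code-point comparison (which is also case-sensitive and accent-sensitive) is acceptable for the use case.

```sql
-- Same values as in the report; both variables carry utf8_unicode_ci.
SET @s1 = _utf8 'ab' COLLATE utf8_unicode_ci;
SET @s2 = _utf8 'a\0\0\0b' COLLATE utf8_unicode_ci;

-- An explicit COLLATE clause overrides the variables' own collation
-- for this comparison only. utf8_bin compares by code point, so the
-- NUL characters take part and the result is expected to be non-zero.
SELECT STRCMP(@s1 COLLATE utf8_bin, @s2 COLLATE utf8_bin);
```

The utf8_general_ci session shown in the report behaves the same way with respect to '\0', because that collation does not support ignorable characters.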