MySQL Bugs: #74901: UTF8 LIKE query with wildcard and escape character produce incorrect result

Bug #74901	UTF8 LIKE query with wildcard and escape character produce incorrect result
Submitted:	17 Nov 2014 13:10	Modified:	11 Apr 2018 10:44
Reporter:	A R	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.5.40, 5.6.23	OS:	Linux (Ubuntu 14.10)
Assigned to:		CPU Architecture:	Any

Description:
The utf8_unicode_ci collation handles wildcard queries with escape characters incorrectly. Same thing happens with utf8mb4.

How to repeat:
Sample queries and their output from mysql client (all three show incorrect results):

mysql> select _utf8'10% off!' collate utf8_unicode_ci like '%!%' escape '!';
+---------------------------------------------------------------+
| _utf8'10% off!' collate utf8_unicode_ci like '%!%' escape '!' |
+---------------------------------------------------------------+
|                                                             1 |
+---------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select _utf8'10% off!' collate utf8_unicode_ci like '%!%ff%' escape '!';
+------------------------------------------------------------------+
| _utf8'10% off!' collate utf8_unicode_ci like '%!%ff%' escape '!' |
+------------------------------------------------------------------+
|                                                                1 |
+------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select _utf8'10% off!' collate utf8_unicode_ci like '%!!' escape '!';
+---------------------------------------------------------------+
| _utf8'10% off!' collate utf8_unicode_ci like '%!!' escape '!' |
+---------------------------------------------------------------+
|                                                             0 |
+---------------------------------------------------------------+
1 row in set (0.00 sec)

Suggested fix:

Further investigation shows that the fix already exists, but was not applied here: just diff my_wildcmp_uca_impl (in ctype-uca.c) with my_wildcmp_unicode_impl (in ctype-utf8.c), and you'll see the changes that need to be made. The one in ctype-utf8.c is correct as far as wildcard+escape is concerned, however there is another difference between the two in how they treat a single escape character at the end of the search string. One treats it as invalid, and the other as a literal character (I'd say the first is correct and the second is a bug, but I haven't read the SQL spec).

In any case, both of them should be fixed to make both of these issues work properly, and it would probably be a good idea to check the other collations, make sure they're all consistent in behavior, add unit tests for all, etc.

In addition, these functions could use some TLC. I haven't tested it, but at a glance, for example in my_wildcmp_unicode_impl: the comment above the function explaining the return value is either unclear or incorrect, the first while loop actually never loops (should be an if, or better yet, inverse the condition and move the last line of the function up to the top), the first two assignments to the result variable are never used and can be dropped, the second comparison with w_many (after the first nested while loop) is always true and can be dropped, the for loop after it should be a while loop to be consistent with the other loop styles, the mb_wc call and conditional a few lines under this for loop is redundant (already happened in the loop) and can be dropped.

Thank you for the report.

Verified as described. Workaround: use collation utf8_general_ci.

mysql> show variables like '%col%';
+---------------------------+-------------------+
| Variable_name             | Value             |
+---------------------------+-------------------+
| collation_connection      | utf8_unicode_ci   |
| collation_database        | latin1_swedish_ci |
| collation_server          | latin1_swedish_ci |
| protocol_version          | 10                |
| slave_compressed_protocol | OFF               |
+---------------------------+-------------------+
5 rows in set (0.00 sec)

mysql> select _utf8'10% off!' collate utf8_unicode_ci like '%!%ff%' escape '!';
+------------------------------------------------------------------+
| _utf8'10% off!' collate utf8_unicode_ci like '%!%ff%' escape '!' |
+------------------------------------------------------------------+
|                                                                1 |
+------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select _utf8'10% off!' collate utf8_general_ci like '%!%ff%' escape '!';
+------------------------------------------------------------------+
| _utf8'10% off!' collate utf8_general_ci like '%!%ff%' escape '!' |
+------------------------------------------------------------------+
|                                                                0 |
+------------------------------------------------------------------+
1 row in set (0.00 sec)

This was fixed in 8.0.1.