Bug #68366 regression: >=mysql-5.4 utf8 collations are marked as not ascii compatible
Submitted: 13 Feb 2013 16:03 Modified: 18 Jan 2018 13:19
Reporter: Harm Geerts Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S5 (Performance)
Version: 5.5.29-0ubuntu0.12.10.1-log for debian- OS:Linux (Ubuntu 12.10)
Assigned to: CPU Architecture:Any
Tags: ASCII, performance, regression, utf8

[13 Feb 2013 16:03] Harm Geerts
Description:
Commit http://lists.mysql.com/commits/85677 marks all non primary utf8 collations as not ascii compatible. The original commit was http://bazaar.launchpad.net/~mysql/mysql-server/mysql-next-mr/revision/2884

utf8_general_ci, the primary collation, is not affected by this.

The definition of MY_CS_NONASCII:
#define MY_CS_NONASCII  8192   /* if not ASCII-compatible        */

The function my_charset_is_ascii_compatible() which was added in the same commit says "Check if a 8bit character set is compatible with ascii on the range 0x00..0x7F." which is true for utf8.

This leads to believe marking the utf8 collations as MY_CS_NONASCII is incorrect.

Because of this utf8 collations suffer from performance loss in queries that only use ascii and are optimized to handle the ascii case. For example in http://bugs.mysql.com/bug.php?id=52849

Since utf8_general_ci is not affected and there haven't been any bug reports against it it seems safe to remove MY_CS_NONASCII from the other utf8 collations. However the constant name MY_CS_NONASCII does not describe the meaning very well and may have been used as a literal "collation is not ascii". I'm unable to determine this myself but as I said, I haven't seen any bug reports against utf8_general_ci to indicate this is a problem.

How to repeat:
= utf8_general_ci which is not marked as non ascii =
mysql> show variables like '%coll%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> explain select SQL_NO_CACHE count(*) from test where val >= '2013-01-01 00:00:00';
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|  1 | SIMPLE      | test  | range | val           | val  | 8       | NULL |  230 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

mysql> explain select SQL_NO_CACHE count(*) from test where val >= CAST('2013-01-01 00:00:00' as datetime);
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|  1 | SIMPLE      | test  | range | val           | val  | 8       | NULL |  230 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

= utf8_unicode_ci which is marked as non ascii =
mysql> show variables like '%coll%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_unicode_ci |
| collation_database   | utf8_unicode_ci |
| collation_server     | utf8_unicode_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> explain select SQL_NO_CACHE count(*) from test where val >= '2013-01-01 00:00:00';
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows  | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
|  1 | SIMPLE      | test  | index | NULL          | val  | 8       | NULL | 50378 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+-------+--------------------------+
1 row in set (0.00 sec)

mysql> explain select SQL_NO_CACHE count(*) from test where val >= CAST('2013-01-01 00:00:00' as datetime);
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|  1 | SIMPLE      | test  | range | val           | val  | 8       | NULL |  230 | Using where; Using index |
+----+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

Suggested fix:
remove the MY_CS_NONASCII state from utf8 collations.
[13 Feb 2013 16:37] Tatjana Nuernberg
prelim notes:

(gdb) print character_set_client->csname
$1 = 0x10774d4 "utf8"
(gdb) print character_set_client->name
$2 = 0x10774d9 "utf8_unicode_ci"
(gdb) print all_charsets[character_set_client->number]->state
$3 = 961

vs

#define MY_CS_NONASCII  8192

(gdb) print my_charset_is_ascii_compatible(character_set_client)
$11 = 1 '\001'

;

(gdb) print character_set_client->name
$6 = 0x3034328 "utf8_general_ci"
(gdb) print all_charsets[character_set_client->number]->state
$7 = 993

(the difference being MY_CS_PRIMARY, 32)

(gdb) print my_charset_is_ascii_compatible(character_set_client)
$8 = 1 '\001'

An issue might of course still exist.
[13 Feb 2013 16:43] Tatjana Nuernberg
(tested on trunk)
[13 Feb 2013 17:36] Walter Doekes
You're right Tatjana, it is fixed in trunk (and 5.6), but not in 5.5.

Compare:
http://bazaar.launchpad.net/~mysql/mysql-server/5.5/view/head:/strings/ctype-uca.c
http://bazaar.launchpad.net/~mysql/mysql-server/5.6/view/head:/strings/ctype-uca.c

$ grep MY_CS_NONASCII ctype-uca-5.5.c -A1 | grep utf8 | wc -l
19
$ grep MY_CS_NONASCII ctype-uca-5.6.c -A1 | grep utf8 | wc -l
0

And:

$ grep '"utf8_unicode_ci"' ctype-uca-5.5.c -B2
    MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE|MY_CS_NONASCII,
    "utf8",		/* cs name    */
    "utf8_unicode_ci",	/* name         */
$ grep '"utf8_unicode_ci"' ctype-uca-5.6.c -B2
    MY_CS_UTF8MB3_UCA_FLAGS,/* flags    */
    "utf8",		/* cs name    */
    "utf8_unicode_ci",	/* name         */

Could that fix be backported to 5.5?

See the
#define MY_CS_UTF8MB3_UCA_FLAGS  (MY_CS_COMPILED|MY_CS_STRNXFRM|MY_CS_UNICODE)
changes in ctype-uca.c in
http://bazaar.launchpad.net/~mysql/mysql-server/5.6/revision/2876.192.1

Kind regards,
Walter Doekes
OSSO B.V.
[9 Sep 2013 10:41] Walter Doekes
Apparently http://bugs.mysql.com/bug.php?id=68795 was deemed important enough to fix.

You could clean up the bug db and close this too.

Note that the following message --

> Noted in 5.5.33, 5.6.13, 5.7.2 changelogs.
> Comparison of a DATETIME value and a string did not work correctly
> for the utf8_unicode_ci collation.

-- would be so much cooler if it contained a link to a diff/patch file we could see.
[18 Jan 2018 13:19] Erlend Dahl
Fixed in 5.6 and upwards.