| Bug #85331 | _characterset specifications have lousy performance for some character sets | ||
|---|---|---|---|
| Submitted: | 6 Mar 2017 19:46 | Modified: | 27 Sep 2017 2:00 |
| Reporter: | Domas Mituzas | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
| Version: | 5.6, 5.7, etc, 5.6.35 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[7 Mar 2017 8:38]
MySQL Verification Team
Verified on my non-scientific single-threaded tests on windows 5.7.17. I got: 63K qps for query: select _big5''; 58K qps for query: select _cp932; big5 is at the start of the list and _cp932 near the end of the list. Will let Umesh verify under perf/mysqlslap on linux..
[7 Mar 2017 12:10]
MySQL Verification Team
Hello Domas, Thank you for the report. Thanks, Umesh
[7 Mar 2017 12:12]
MySQL Verification Team
Thank you Shane for quickly verifying and helping me on rest things on this :)
[27 Sep 2017 2:00]
Paul DuBois
Posted by developer: Fixed in 9.0.0. For constructs using _charset character set introducers (for example, _utf8mb4 'abc'), lookup performance was improved for mapping the introducer onto the proper character set.

Description: For queries where literals have character sets supplied, the character set resolution is done by traversing an array until a match is found. So, SELECT _binary'' <--- will walk an array to find 'binary' Unfortunately, on some of our systems these queries happen enough to have non-trivial performance issues at the following stack: dispatch_command mysql_parse MYSQLparse MYSQLlex lex_one_token get_charset_by_csname my_charset_get_by_name my_strcasecmp_8bit If you look at my_charset_get_by_name, you will find calls to get_charset_number_internal, which walks all_charsets: for (cs= all_charsets; cs < all_charsets + array_elements(all_charsets); cs++) { if ( cs[0] && cs[0]->csname && (cs[0]->state & cs_flags) && !my_strcasecmp(&my_charset_latin1, cs[0]->csname, charset_name)) return cs[0]->number; } As all_charsets can grow quite a bit, this gets expensive as you go. Even worse, even if all your literals in the query are using same character set, it is being re-fetched each time!!! How to repeat: Run any query that has lots of _binary or _othercharacterset literals, like: SELECT _binary'', _binary'', ... Suggested fix: Use a hash map to look up character sets passed for query literals. Hash map allows you O(1) lookups, whereas array walk can be O(N), and N keeps growing.