Bug #85331 _characterset specifications have lousy performance for some character sets
Submitted: 6 Mar 2017 19:46 Modified: 27 Sep 2017 2:00
Reporter: Domas Mituzas Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.6, 5.7, etc, 5.6.35 OS:Any
Assigned to: CPU Architecture:Any

[6 Mar 2017 19:46] Domas Mituzas
Description:
For queries where literals have character sets supplied, the character set resolution is done by traversing an array until a match is found.

So, SELECT _binary'' <--- will walk an array to find 'binary'

Unfortunately, on some of our systems these queries happen enough to have non-trivial performance issues at the following stack:

        dispatch_command
        mysql_parse
        MYSQLparse
        MYSQLlex
        lex_one_token
        get_charset_by_csname
        my_charset_get_by_name
        my_strcasecmp_8bit                          

If you look at my_charset_get_by_name, you will find calls to get_charset_number_internal, which walks all_charsets:

  for (cs= all_charsets;
       cs < all_charsets + array_elements(all_charsets);
       cs++)
  {
    if ( cs[0] && cs[0]->csname && (cs[0]->state & cs_flags) &&
         !my_strcasecmp(&my_charset_latin1, cs[0]->csname, charset_name))
      return cs[0]->number;
  }  

As all_charsets can grow quite a bit, this gets expensive as you go. Even worse, even if all your literals in the query are using same character set, it is being re-fetched each time!!!

How to repeat:
Run any query that has lots of _binary or _othercharacterset literals, like:

SELECT _binary'', _binary'', ...

Suggested fix:
Use a hash map to look up character sets passed for query literals.
Hash map allows you O(1) lookups, whereas array walk can be O(N), and N keeps growing.
[7 Mar 2017 8:38] MySQL Verification Team
Verified on my non-scientific single-threaded tests on windows 5.7.17. 
I got:
 63K qps for query:  select _big5'';
 58K qps for query:  select _cp932;

big5 is at the start of the list and _cp932 near the end of the list.
Will let Umesh verify under perf/mysqlslap on linux..
[7 Mar 2017 12:10] MySQL Verification Team
Hello Domas,

Thank you for the report.

Thanks,
Umesh
[7 Mar 2017 12:12] MySQL Verification Team
Thank you Shane for quickly verifying and helping me on rest things on this :)
[27 Sep 2017 2:00] Paul DuBois
Posted by developer:
 
Fixed in 9.0.0.

For constructs using _charset character set introducers (for example,
_utf8mb4 'abc'), lookup performance was improved for mapping the
introducer onto the proper character set.