Bug #101278 Field_string::cmp suboptimal string comparison
Submitted: 22 Oct 2020 17:51 Modified: 23 Oct 2020 7:17
Reporter: Georgy Kirichenko
Status: Verified
Category:MySQL Server: Optimizer Severity:S5 (Performance)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[22 Oct 2020 17:51] Georgy Kirichenko
Description:
Before Field_string::cmp actually compares two strings, it computes the length in bytes of both strings using my_charpos, which can be relatively expensive for variable-length encodings like UTF8. However, if the two strings differ near their heads, decoding the tails of the strings is wasted work.

For instance, if we have two strings like 'aaaaaaa..a' (100 characters in length) and 'bbbb..b' (100 characters in length), then the comparison could stop immediately after comparing the first 'a' and the first 'b', without decoding the subsequent 99*2 characters.

My proposal is to virtually split each string into small 8-character chunks and compare chunk by chunk until the first difference is found. According to my benchmarking of a query like `select count(distinct c) from sbtest1;` using the standard sysbench dataset, there is up to a 4x speedup.

How to repeat:
Initialize MySQL with the standard sysbench dataset, then execute queries like
`select count(distinct c) from sbtest1;` and compare the results for the patched and unpatched versions.

Suggested fix:
Contribution is attached
[23 Oct 2020 7:17] MySQL Verification Team
Hello Georgy Kirichenko,

Thank you for the report and contribution.

regards,
Umesh