MySQL Bugs: #99694: Improving performance of my_hash_sort_utf8(mb4)/my

Bug #99694	Improving performance of my_hash_sort_utf8(mb4)/my_strnncollsp*utf8(mb4)
Submitted:	26 May 2020 13:06	Modified:	1 Jun 2020 14:48
Reporter:	Dmitriy Philimonov	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	strings, utf8

Description:
According to our performance investigations, these functions are hot when we
use utf8 collation:
  1. my_hash_sort_utf8
  2. my_hash_sort_utf8mb4
  3. my_strnncollsp_utf8
  4. my_strnncollsp_utf8mb4

Since the commit https://github.com/mysql/mysql-server/commit/67ce24796584
from 2004 skipping trailing spaces in my_hash_sort_utf8 has been implemented
via naive chopping spaces one after another from the end of the string:

while (e > s && e[-1] == ' ')
    e--;

We know that for CHAR(XXX) utf8 fields all unused space is filled with spaces.
So, using brand new skip_trailing_space for my_hash_sort_utf8 function
significantly increase performance here. See for a reference:
https://github.com/mysql/mysql-server/commit/4fd18025f46e

The performance increase achives by skipping not 1 but 8 spaces for every
CPU cycle.

For my_strnncollsp_utf8 function these approach can't be applied as is.
However, we proved, than aggressive forward space skipping still improves
performance with no degradation in all other cases. Following this idea,
we introduce additional function for forward space skipping:

/**
  Simultaneously skip space for two strings (ASCII spaces only).
  Small special routine function for my_strnncollsp_utf8(mb4) functions
*/
static inline void skip_space(const uchar **sp, const uchar **tp,
                              const uchar *const se, const uchar *const te) {
  while (*sp + 8 < se && *tp + 8 < te) {
    uint64_t s, t;
    memcpy(&s, *sp, 8);
    memcpy(&t, *tp, 8);
    if (s != 0x2020202020202020ULL || t != 0x2020202020202020ULL) break;

    *sp += 8;
    *tp += 8;
  }
  while (*sp < se && *tp < te && **sp == 0x20 && **tp == 0x20) {
    ++*sp;
    ++*tp;
  }
}

After that every time when we run into two simultaneous spaces while comparing
two strings in my_strnncollsp_utf8 function, we invoke skip_space instead of
slow utf8 processing code:

while (s < se && t < te) {
  /* aggressive space skipping improves performance */
  if (*s == ' ' && *t == ' ') {
    skip_space(&s, &t, se, te);
    continue;
  }
  ... // utf8 processing
}

Using this simple approach we achived significant performance improvement in
the case when two strings are equal (more than 7 times faster for CHAR(120)).

If we speak about general database performance, we improve single-threaded
sysbench OLTP RO by up to 2.5%.

How to repeat:
Code analysis, mini-benchmarks for English/Russian/Chinese language (CHAR(120))

Suggested fix:
1. For my_hash_sort_utf8(mb4) - use skip_trailing_space as is.
2. For my_strnncollsp_utf8(mb4) - use aggressive forward space skipping right
   in the main loop which processes utf8 characters one by one.

Suggested fix

Attachment: skip-space.git.patch (application/octet-stream, text), 3.30 KiB.

Hi Mr. PHilimonov,

Thank you for your performance improvement report. Thank you, also, for sharing your code with us.

I do have several questions:

* You state that backward space chopping is far slower then front skipping of the spaces. Why is that and can you prove it ????

* Why do you think that skipping forward is faster then skipping backward ???

* My limited experience, during 10 years of MySQL development, is that sorting with shorter strings is always faster then longer ones, provided length difference is larger then a width of one word

* I hope that you have a knowledge that not all space in Unicode is 0x20 ...

* In your skip_space() function, you first do memcpy() and then compare the strings for eight ASCII spaces. Why don't you just skip it before memcpy's ??? Also, in the same function, if a condition is not met, you again go for naive space chopping. This brings us back to your statement that forward chopping is much faster then backward one.

* When you call your function you have an IF clause, which seems unnecessary to me

* You claim to have seen significant improvements with your code. Can you share your benchmarks with us ??? We would like to repeat them, for which we need the entire patch and all the tests.

Thank you very much, in advance.

I wonder if one could optimize this further, by working on local copies of the pointers, as opposed to modifying the pointers in each loop, and storing the result just once at the end:
https://godbolt.org/z/bV4iP4
(I hope I didn't messed up anything)

The difference I see in opcodes is that the original version tries hard to write back the `*tp += 8`:
.LBB1_2:                                # =>This Inner Loop Header: Depth=1
        lea     rax, [r11 + 8]
        cmp     rax, rcx
        jae     .LBB1_6
        cmp     qword ptr [r10], r8
        jne     .LBB1_6
        cmp     qword ptr [r11], r8
        jne     .LBB1_6
        mov     qword ptr [rdi], r9
        mov     r11, qword ptr [rsi]
        add     r11, 8
        mov     qword ptr [rsi], r11
        mov     r10, qword ptr [rdi]
        lea     r9, [r10 + 8]
        cmp     r9, rdx
        jb      .LBB1_2
and the new version does not:
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     r8, rax
        mov     r10, r11
        add     r11, 8
        cmp     r11, rdx
        jae     .LBB0_5
        lea     rax, [r8 + 8]
        cmp     rax, rcx
        jae     .LBB0_5
        cmp     qword ptr [r10], r9
        jne     .LBB0_5
        cmp     qword ptr [r8], r9
        je      .LBB0_1

Could you please check the impact of that, too?

I fully agree with my colleague, Jakub Lopuszanski.

> * You state that backward space chopping is far slower then front skipping of the spaces. Why is that and can you prove it ????
Actually, it doesn't matter what we choose: backward of forward space chopping, it matters **where** we choose the approach.
As I stated in the bugreport description, backward chopping is good for the my_hash* functions, because we just replace naive chopping with 8byte-a-time chopping.
However, for my_strnncollsp* functions it's a bad idea, because chopping a huge amount of space takes time, so comparing two typical CHAR(120) fields:
'abc   <117 spaces>'
'bca   <117 spaces>'
We will spend some constant time chopping 117 spaces, however original version will get the result much faster after analyzing the first chararter of each string.
That's why **forward** space chopping is proposed here. It has almost the same speed as original one with different strings, and much faster with equal ones.

> * Why do you think that skipping forward is faster then skipping backward ???
The speed is equal, believe me. Look at the previous question for details.

> * My limited experience, during 10 years of MySQL development, is that sorting with shorter strings is always faster then longer ones, provided length difference is larger then a width of one word
Optimization is focused on fields with fixed size. Meanwhile, nothing is changed in other cases. 

> * I hope that you have a knowledge that not all space in Unicode is 0x20 ...
According to mysql server specification, fixed length fields is filled with 0x20 (space) to their full length by current realization. That's why this patch is proposed in the first place.

>* In your skip_space() function, you first do memcpy() and then compare the strings for eight ASCII spaces. Why don't you just skip it before memcpy's ??? Also, in the same function, if a condition is not met, you again go for naive space chopping. This brings us back to your statement that forward chopping is much faster then backward one.
If I understand you correctly, you wonder why I use memcpy. It's all because of aligned/unalined access on different platforms. E.g. SPARC don't have unaligned access.
Then, memcpy to a uint64 word while complied with optimizations(-O2) converted to a single assembler instruction in x86_64/aarch64.

>* When you call your function you have an IF clause, which seems unnecessary to me
The truth is that I wanted to minimize the impact to the main while loop. Every 'if' clause adds additional delay (according to our mini-benchmarks). So one small if which checks only two bytes (one per string) seems like a good trade-off.
And if we're in the space_skip routine, we do process fast.

> * You claim to have seen significant improvements with your code. Can you share your benchmarks with us ??? We would like to repeat them, for which we need the entire patch and all the tests.
Yes, of course. I'll attach a *.cc to the ticket. Some notes:
  1. It uses mini-benchmark engine in unittest/gunit from google/gtest.
  2. It covers only my_strnncollsp, my_hash* patch is relatively easy, you can append it if you like
  3. I simply append this code to the strings_utf8-t.cc, however, you can do it in a more accurate way
  4. I don't propose a benchmark inside this ticket because I suppose that it's going to be very hard to merge in like this (too many changes).

> * I wonder if one could optimize this further, by working on local copies of the pointers, as opposed to modifying the pointers in each loop, and storing the result just once at the end:
https://godbolt.org/z/bV4iP4 (I hope I didn't messed up anything)

To be honest, you omit ```inline``` from skip_space, so it changes the generated code. However, if you preserve ```inline``` before skip_space, you'll see how [si]/[di] access is gone:
https://godbolt.org/z/B62wLX

Benchmark, my_strnncollsp, emulate work with CHAR(120) (fixed field)

Attachment: benchmark.cc (application/octet-stream, text), 3.73 KiB.

Frankly, since moving to C++, you can switch from pointers to references here:
inline void skip_space(const uchar *&sp, const uchar *&tp,
                       const uchar *const se, const uchar *const te);

And you do get the same code: https://godbolt.org/z/7L3NpG

However, doing it this way you drop the ability to backport this patch to 5.7 version, where ctype-utf8.c is a pure C file (no references).

Hi Mr. Philimonov,

I do have two requests.

First, can you work out and send us your two final patches. One for 8.0 and the other for 5.7. Actually, I am not quite sure that performance improvements can still go into 5.7, unless the impact is truly very significant.

Second, we do not use Google benchmarks, but sysbench and mysqlslap. Having a fully repeatable test case would be necessary for us. You can also send us your numbers, meaning the percentage of perceived improvement.

Last, but not least, I analysed your benchmark.cc and I do not truly know what to do with it.

We are truly, very much, interested in analysing and verifying your final patch. Hence, let us know if we are not asking too much.

Thank you in advance.

Dear Mr. Milivojevic,

- just to clarify: the current contributed patch for MySQL 8.0 is final,
  and it is not our intention to contribute the same optimization for
  5.7. Sorry about possible confusion in my previous comments;

- the benchmark program is not based on the Google Benchmark library. It
  is based on the unittest micro-benchmark framework introduced in
  MySQL 8.0. To use it, one has place it to the unittest/gunit/
  directory, then build the server with unittests support;

- as I mentioned previously, we also performed SQL-level benchmarks with
  the patch using the sysbench OLTP read-only test, and got up to 2.5%
  improvement in that particular workload.

Thanks and regards,
Dmitriy.

Hi Mr. Philimonov,

I did manage to see an improvement on OLTP tests with your patch.

Hence, I am verifying this report as a performance improvement in 8.0.

Thank you for your contribution.