Bug #99694 | Improving performance of my_hash_sort_utf8(mb4)/my_strnncollsp*utf8(mb4) | ||
---|---|---|---|
Submitted: | 26 May 2020 13:06 | Modified: | 1 Jun 2020 14:48 |
Reporter: | Dmitriy Philimonov | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S5 (Performance) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | strings, utf8 |
[26 May 2020 13:06]
Dmitriy Philimonov
[26 May 2020 13:08]
Dmitriy Philimonov
Suggested fix
Attachment: skip-space.git.patch (application/octet-stream, text), 3.30 KiB.
[27 May 2020 13:00]
MySQL Verification Team
Hi Mr. PHilimonov, Thank you for your performance improvement report. Thank you, also, for sharing your code with us. I do have several questions: * You state that backward space chopping is far slower then front skipping of the spaces. Why is that and can you prove it ???? * Why do you think that skipping forward is faster then skipping backward ??? * My limited experience, during 10 years of MySQL development, is that sorting with shorter strings is always faster then longer ones, provided length difference is larger then a width of one word * I hope that you have a knowledge that not all space in Unicode is 0x20 ... * In your skip_space() function, you first do memcpy() and then compare the strings for eight ASCII spaces. Why don't you just skip it before memcpy's ??? Also, in the same function, if a condition is not met, you again go for naive space chopping. This brings us back to your statement that forward chopping is much faster then backward one. * When you call your function you have an IF clause, which seems unnecessary to me * You claim to have seen significant improvements with your code. Can you share your benchmarks with us ??? We would like to repeat them, for which we need the entire patch and all the tests. Thank you very much, in advance.
[28 May 2020 12:59]
Jakub Lopuszanski
I wonder if one could optimize this further, by working on local copies of the pointers, as opposed to modifying the pointers in each loop, and storing the result just once at the end: https://godbolt.org/z/bV4iP4 (I hope I didn't messed up anything) The difference I see in opcodes is that the original version tries hard to write back the `*tp += 8`: .LBB1_2: # =>This Inner Loop Header: Depth=1 lea rax, [r11 + 8] cmp rax, rcx jae .LBB1_6 cmp qword ptr [r10], r8 jne .LBB1_6 cmp qword ptr [r11], r8 jne .LBB1_6 mov qword ptr [rdi], r9 mov r11, qword ptr [rsi] add r11, 8 mov qword ptr [rsi], r11 mov r10, qword ptr [rdi] lea r9, [r10 + 8] cmp r9, rdx jb .LBB1_2 and the new version does not: .LBB0_1: # =>This Inner Loop Header: Depth=1 mov r8, rax mov r10, r11 add r11, 8 cmp r11, rdx jae .LBB0_5 lea rax, [r8 + 8] cmp rax, rcx jae .LBB0_5 cmp qword ptr [r10], r9 jne .LBB0_5 cmp qword ptr [r8], r9 je .LBB0_1 Could you please check the impact of that, too?
[28 May 2020 13:24]
MySQL Verification Team
I fully agree with my colleague, Jakub Lopuszanski.
[29 May 2020 10:48]
Dmitriy Philimonov
> * You state that backward space chopping is far slower then front skipping of the spaces. Why is that and can you prove it ???? Actually, it doesn't matter what we choose: backward of forward space chopping, it matters **where** we choose the approach. As I stated in the bugreport description, backward chopping is good for the my_hash* functions, because we just replace naive chopping with 8byte-a-time chopping. However, for my_strnncollsp* functions it's a bad idea, because chopping a huge amount of space takes time, so comparing two typical CHAR(120) fields: 'abc <117 spaces>' 'bca <117 spaces>' We will spend some constant time chopping 117 spaces, however original version will get the result much faster after analyzing the first chararter of each string. That's why **forward** space chopping is proposed here. It has almost the same speed as original one with different strings, and much faster with equal ones. > * Why do you think that skipping forward is faster then skipping backward ??? The speed is equal, believe me. Look at the previous question for details. > * My limited experience, during 10 years of MySQL development, is that sorting with shorter strings is always faster then longer ones, provided length difference is larger then a width of one word Optimization is focused on fields with fixed size. Meanwhile, nothing is changed in other cases. > * I hope that you have a knowledge that not all space in Unicode is 0x20 ... According to mysql server specification, fixed length fields is filled with 0x20 (space) to their full length by current realization. That's why this patch is proposed in the first place. >* In your skip_space() function, you first do memcpy() and then compare the strings for eight ASCII spaces. Why don't you just skip it before memcpy's ??? Also, in the same function, if a condition is not met, you again go for naive space chopping. This brings us back to your statement that forward chopping is much faster then backward one. If I understand you correctly, you wonder why I use memcpy. It's all because of aligned/unalined access on different platforms. E.g. SPARC don't have unaligned access. Then, memcpy to a uint64 word while complied with optimizations(-O2) converted to a single assembler instruction in x86_64/aarch64. >* When you call your function you have an IF clause, which seems unnecessary to me The truth is that I wanted to minimize the impact to the main while loop. Every 'if' clause adds additional delay (according to our mini-benchmarks). So one small if which checks only two bytes (one per string) seems like a good trade-off. And if we're in the space_skip routine, we do process fast. > * You claim to have seen significant improvements with your code. Can you share your benchmarks with us ??? We would like to repeat them, for which we need the entire patch and all the tests. Yes, of course. I'll attach a *.cc to the ticket. Some notes: 1. It uses mini-benchmark engine in unittest/gunit from google/gtest. 2. It covers only my_strnncollsp, my_hash* patch is relatively easy, you can append it if you like 3. I simply append this code to the strings_utf8-t.cc, however, you can do it in a more accurate way 4. I don't propose a benchmark inside this ticket because I suppose that it's going to be very hard to merge in like this (too many changes). > * I wonder if one could optimize this further, by working on local copies of the pointers, as opposed to modifying the pointers in each loop, and storing the result just once at the end: https://godbolt.org/z/bV4iP4 (I hope I didn't messed up anything) To be honest, you omit ```inline``` from skip_space, so it changes the generated code. However, if you preserve ```inline``` before skip_space, you'll see how [si]/[di] access is gone: https://godbolt.org/z/B62wLX
[29 May 2020 10:51]
Dmitriy Philimonov
Benchmark, my_strnncollsp, emulate work with CHAR(120) (fixed field)
Attachment: benchmark.cc (application/octet-stream, text), 3.73 KiB.
[29 May 2020 11:08]
Dmitriy Philimonov
Frankly, since moving to C++, you can switch from pointers to references here: inline void skip_space(const uchar *&sp, const uchar *&tp, const uchar *const se, const uchar *const te); And you do get the same code: https://godbolt.org/z/7L3NpG However, doing it this way you drop the ability to backport this patch to 5.7 version, where ctype-utf8.c is a pure C file (no references).
[29 May 2020 13:02]
MySQL Verification Team
Hi Mr. Philimonov, I do have two requests. First, can you work out and send us your two final patches. One for 8.0 and the other for 5.7. Actually, I am not quite sure that performance improvements can still go into 5.7, unless the impact is truly very significant. Second, we do not use Google benchmarks, but sysbench and mysqlslap. Having a fully repeatable test case would be necessary for us. You can also send us your numbers, meaning the percentage of perceived improvement. Last, but not least, I analysed your benchmark.cc and I do not truly know what to do with it. We are truly, very much, interested in analysing and verifying your final patch. Hence, let us know if we are not asking too much. Thank you in advance.
[1 Jun 2020 13:57]
Dmitriy Philimonov
Dear Mr. Milivojevic, - just to clarify: the current contributed patch for MySQL 8.0 is final, and it is not our intention to contribute the same optimization for 5.7. Sorry about possible confusion in my previous comments; - the benchmark program is not based on the Google Benchmark library. It is based on the unittest micro-benchmark framework introduced in MySQL 8.0. To use it, one has place it to the unittest/gunit/ directory, then build the server with unittests support; - as I mentioned previously, we also performed SQL-level benchmarks with the patch using the sysbench OLTP read-only test, and got up to 2.5% improvement in that particular workload. Thanks and regards, Dmitriy.
[1 Jun 2020 14:48]
MySQL Verification Team
Hi Mr. Philimonov, I did manage to see an improvement on OLTP tests with your patch. Hence, I am verifying this report as a performance improvement in 8.0. Thank you for your contribution.