Bug #98737 | my_convert routine is suboptimal in case of non-x86 platforms | ||
---|---|---|---|
Submitted: | 25 Feb 2020 14:38 | Modified: | 3 Mar 2020 13:27 |
Reporter: | Georgiy Kirichenko | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Compiling | Severity: | S5 (Performance) |
Version: | OS: | Any | |
Assigned to: | CPU Architecture: | ARM |
[25 Feb 2020 14:38]
Georgiy Kirichenko
[25 Feb 2020 14:59]
MySQL Verification Team
Hi Mr. Kirichencko, Thank you for your bug report. We do need truly more info from you in order to proceed. First, you write: " Arm64 (aarch64) performance could be improved using the same approach as x86 optimization was done " Can you provide an example ???? Also, your idea of fixing is not elaborate enough. What function would need to be replaced with non-x86 platforms. Please, let us know how exactly should it be done. You can write a draft of the patch. Also, what improvement would you expect ???
[26 Feb 2020 7:29]
Georgiy Kirichenko
Hi Mr. Milivojevic Excuse me for that I forgot to mention the function name in bug commentary and this made my message less understandable. So please take a look on the my_convert function inside strings/ctype.cc. There is a x86 optimization which allows to process 4 bytes at once. #if defined(__i386__) || defined(_WIN32) || defined(__x86_64__) /* Special loop for i386, it allows to refer to a non-aligned memory block as UINT32, which makes it possible to copy four bytes at once. This gives about 10% performance improvement comparing to byte-by-byte loop. */ for (; length >= 4; length -= 4, from += 4, to += 4) { if (uint4korr(from) & 0x80808080) break; int4store(to, uint4korr(from)); } #endif /* __i386__ */ And I would like to extend such optimization for ARM64 as well but process 8 bytes at once. Additionaly I suggest to replace uint4korr and int4store with a memcpy which is effectively replaced with one asm load/store instruction because there is no reliance on endianess.
[26 Feb 2020 13:09]
MySQL Verification Team
Hi Mr. Kirichenko, Your contribution seems very interesting to me. However, there are couple of details that are still missing. Hence, you would change the following loop: for (; length >= 4; length -= 4, from += 4, to += 4) { if (uint4korr(from) & 0x80808080) break; int4store(to, uint4korr(from)); by replacing uint4korr() and int4store() with memcpy(). Since uint4korr() has only one parameter, int4store() has two and memcpy() has three parameters, how would it be done ????? Also, how would we replace memcpy() with ARM assembly, which you mention ???? If you write a new function, we would need you to sing an OCA agreement. Read about it more in the next comment.
[26 Feb 2020 13:09]
MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it! In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well. Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future. Please let me know, if you have any questions. Thank you for your interest in MySQL.
[28 Feb 2020 15:56]
Alexey Kopytov
Sinisa, my dear friend!!!11 No patch is being contributed here just yet. Here's a quick summary of optimization opportunities being reported here. In the code above: 1. __aarch64__ could be added to the list of architectures under #ifdef, so my_convert() could enjoy the same optimization. 2. my_convert() could be optimized even further on 64-bit architectures by processing 8 bytes at a time, rather than 4 bytes 3. uint4korr() usage doesn't make any sense. A simple load from memory into a register would work on both little- and big-endian architectures where unaligned loads are possible. Agree?
[1 Mar 2020 8:11]
Georgiy Kirichenko
Hi Mr. Milivojevic Thanks for your answer! Unfortunately I could not send an OCA right now because I need an approvement from my employer but I hope I receive it in the near future. So please let me explain suggested changes more detailed. for (; length >= 4; length -= 4, from += 4, to += 4) { if (uint4korr(from) & 0x80808080) break; int4store(to, uint4korr(from)); The uint4korr and int4store functioan are used to load/store 4 bytes regarding endianess and memory access aligment. As endianess has no matter here we could safely replace this functions whith memcpy(&temp_variable, from, sizeof(temp_variable)) and memcpy(to, &temp_variable, sizeof(temp_variable)); Such call takes care about memory access aligment and will be efficitively replaced with one assembler load/store instruction on i386, x86, arm. Additionaly we could use the platform-depended integer and process 4bytes per once on i386 and arm and 8bytes on amd64 and aarm64. With best regards, Georgy
[3 Mar 2020 13:27]
MySQL Verification Team
Hi Mr. Krichenko, Hi Kaamos, my friend, I do agree with both of you, although I would have preferred a patch. Still, this is a good contribution. Verified as reported.
[16 Mar 2020 5:31]
Krunal Bauskar
Patch for the bug
Attachment: bug98737.patch (text/x-patch), 813 bytes.
[16 Mar 2020 5:35]
Krunal Bauskar
I have OCA signed so I thought I will submit the relevant patch to get accelerate its adaption. Sinisa, Can you check with dev-team if they can review it and accept it or comment about it.
[16 Mar 2020 13:16]
MySQL Verification Team
Hi Mr. Bauskar, Thank you for your patch. I have informed our Development that a patch is now available. Thanks again.