Bug #98737 my_convert routine is suboptimal in case of non-x86 platforms
Submitted: 25 Feb 2020 14:38 Modified: 3 Mar 2020 13:27
Reporter: Georgiy Kirichenko Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Compiling Severity:S5 (Performance)
Version: OS:Any
Assigned to: CPU Architecture:ARM

[25 Feb 2020 14:38] Georgiy Kirichenko
Description:
Arm64 (aarch64) performance could be improved using the same approach as x86 optimization was done.

Additionaly it is not necessary to use uint4korr/int4store because we do not handle multi-byte values here.

How to repeat:
make a sysbench performance testing with suggested fix applied under the aarch64 platform.

Suggested fix:
Use memcpy to fetch next 8 bytes of converted string and use bit mask 0x8080808080808080 (similar with x86 optimization)
[25 Feb 2020 14:59] MySQL Verification Team
Hi Mr. Kirichencko,

Thank you for your bug report.

We do need truly more info from you in order to proceed.

First, you write:

"
Arm64 (aarch64) performance could be improved using the same approach as x86 optimization was done
"

Can you provide an example ????

Also, your idea of fixing is not elaborate enough. What function would need to be replaced with non-x86 platforms. Please, let us know how exactly should it be done. You can write a draft of the patch.

Also, what improvement would you expect ???
[26 Feb 2020 7:29] Georgiy Kirichenko
Hi Mr. Milivojevic

Excuse me for that I forgot to mention the function name in bug commentary and this made my message less understandable. So please take a look on the my_convert function inside strings/ctype.cc.
There is a x86 optimization which allows to process 4 bytes at once.

#if defined(__i386__) || defined(_WIN32) || defined(__x86_64__)
  /*
    Special loop for i386, it allows to refer to a
    non-aligned memory block as UINT32, which makes
    it possible to copy four bytes at once. This
    gives about 10% performance improvement comparing
    to byte-by-byte loop.
  */
  for (; length >= 4; length -= 4, from += 4, to += 4) {
    if (uint4korr(from) & 0x80808080) break;
    int4store(to, uint4korr(from));
  }
#endif /* __i386__ */

And I would like to extend such optimization for ARM64 as well but process 8 bytes at once.

Additionaly I suggest to replace uint4korr and int4store with a memcpy which is effectively replaced with one asm load/store instruction because there is no reliance on endianess.
[26 Feb 2020 13:09] MySQL Verification Team
Hi Mr. Kirichenko,

Your contribution seems very interesting to me.

However, there are couple of details that are still missing.

Hence, you would change the following loop:

  for (; length >= 4; length -= 4, from += 4, to += 4) {
    if (uint4korr(from) & 0x80808080) break;
    int4store(to, uint4korr(from));

by replacing uint4korr() and int4store() with memcpy(). Since uint4korr() has only one parameter, int4store() has two and memcpy() has three parameters, how would it be done ?????

Also, how would we replace memcpy() with ARM assembly, which you mention ????

If you write a new function, we would need you to sing an OCA agreement. Read about it more in the next comment.
[26 Feb 2020 13:09] MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it!

In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.
[28 Feb 2020 15:56] Alexey Kopytov
Sinisa, my dear friend!!!11

No patch is being contributed here just yet. Here's a quick summary of
optimization opportunities being reported here.

In the code above:

1. __aarch64__ could be added to the list of architectures under #ifdef,
so my_convert() could enjoy the same optimization.

2. my_convert() could be optimized even further on 64-bit architectures
by processing 8 bytes at a time, rather than 4 bytes

3. uint4korr() usage doesn't make any sense. A simple load from memory
into a register would work on both little- and big-endian architectures
where unaligned loads are possible.

Agree?
[1 Mar 2020 8:11] Georgiy Kirichenko
Hi Mr. Milivojevic

Thanks for your answer!

Unfortunately I could not send an OCA right now because I need an approvement from my employer but I hope I receive it in the near future.

So please let me explain suggested changes more detailed.
 for (; length >= 4; length -= 4, from += 4, to += 4) {
    if (uint4korr(from) & 0x80808080) break;
    int4store(to, uint4korr(from));

The uint4korr and int4store functioan are used to load/store 4 bytes regarding endianess and memory access aligment. As endianess has no matter here we could safely replace this functions whith memcpy(&temp_variable, from, sizeof(temp_variable)) and memcpy(to, &temp_variable, sizeof(temp_variable));
Such call takes care about memory access aligment and will be efficitively replaced with one assembler load/store instruction on i386, x86, arm.

Additionaly we could use the platform-depended integer and process 4bytes per once on i386 and arm and 8bytes on amd64 and aarm64.

With best regards, Georgy
[3 Mar 2020 13:27] MySQL Verification Team
Hi Mr. Krichenko,

Hi Kaamos, my friend,

I do agree with both of you, although I would have preferred a patch. Still, this is a good contribution.

Verified as reported.
[16 Mar 2020 5:31] Krunal Bauskar
Patch for the bug

Attachment: bug98737.patch (text/x-patch), 813 bytes.

[16 Mar 2020 5:35] Krunal Bauskar
I have OCA signed so I thought I will submit the relevant patch to get accelerate its adaption.

Sinisa,

Can you check with dev-team if they can review it and accept it or comment about it.
[16 Mar 2020 13:16] MySQL Verification Team
Hi Mr. Bauskar,

Thank you for your patch.

I have informed our Development that a patch is now available.

Thanks again.