Bug #110606 Improve crc32c code for arm64
Submitted: 4 Apr 2023 10:16 Modified: 17 Apr 2023 5:06
Reporter: Cai Yibo (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine    Severity: S5 (Performance)
Version: 8.0    OS: Linux
Assigned to:    CPU Architecture: ARM (aarch64)

[4 Apr 2023 10:16] Cai Yibo
Description:
Build mysql-8.0 on aarch64 with gcc-10.3, RelWithDebInfo.

Looking into the machine code of the sliced crc32c implementation on Arm, there are some unnecessary instructions in each iteration, as shown below.

18b5e40:       f9400006        ldr     x6, [x0]
18b5e44:       91002000        add     x0, x0, #0x8
18b5e48:       f94a9c05        ldr     x5, [x0, #5432]
18b5e4c:       f9553c03        ldr     x3, [x0, #10872]
18b5e50:       9ac65c21        crc32cx w1, w1, x6
18b5e54:       9ac55c42        crc32cx w2, w2, x5
18b5e58:       2a0103e1        mov     w1, w1       <-- unnecessary !!!
18b5e5c:       2a0203e2        mov     w2, w2       <-- unnecessary !!!
18b5e60:       9ac35c83        crc32cx w3, w4, x3
18b5e64:       2a0303e4        mov     w4, w3       <-- unnecessary !!!
18b5e68:       eb07001f        cmp     x0, x7
18b5e6c:       54fffea1        b.ne    18b5e40

Though a crc32 value is always 32 bits, the SSE _mm_crc32_u64 intrinsic accepts and returns 64-bit crc32 values. As a result, the crc32_u64 update interface uses 64-bit crc32 values:

>> static inline uint64_t update(uint64_t crc, uint64_t data);

The Arm __crc32cd intrinsic accepts and returns 32-bit crc32 values, so type casting is required to stay compatible with this interface. This appears to confuse the compiler and makes it generate sub-optimal code.
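For illustration, here is a minimal standalone sketch of the two overloads in question (the function names are made up for this example; it assumes an AArch64 compiler with the CRC extension enabled, e.g. -march=armv8-a+crc, and mirrors the crc32_impl::update code quoted later in this report):

```
#include <arm_acle.h>
#include <cstdint>

// For comparison, the intrinsic signatures:
//   x86-64 SSE4.2: unsigned long long _mm_crc32_u64(unsigned long long crc, unsigned long long v);
//   AArch64 ACLE:  uint32_t __crc32cd(uint32_t crc, uint64_t data);

// Current shared interface, sized for the SSE intrinsic: the 32-bit result of
// __crc32cd has to be widened back to uint64_t, and gcc-10.3 emits that
// widening as the "mov wN, wN" instructions seen in the disassembly above.
static inline uint64_t update_current(uint64_t crc, uint64_t data) {
  return (uint64_t)__crc32cd((uint32_t)crc, data);
}

// Proposed Arm variant: a 32-bit accumulator that matches the intrinsic, so
// no widening instruction is needed.
static inline uint32_t update_proposed(uint32_t crc, uint64_t data) {
  return __crc32cd(crc, data);
}
```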

This patch changes the crc32_u64 parameter and return types to uint32_t on Arm. It eliminates the redundant instructions, and the resulting machine code is more concise than the original:

18b5e38:       f9400005        ldr     x5, [x0]
18b5e3c:       91002000        add     x0, x0, #0x8
18b5e40:       f94a9c04        ldr     x4, [x0, #5432]
18b5e44:       f9553c03        ldr     x3, [x0, #10872]
18b5e48:       9ac55d08        crc32cx w8, w8, x5
18b5e4c:       9ac45c21        crc32cx w1, w1, x4
18b5e50:       9ac35c42        crc32cx w2, w2, x3
18b5e54:       eb06001f        cmp     x0, x6
18b5e58:       54ffff01        b.ne    18b5e38

A 7% performance uplift is observed on the BM_CRC32_0_508 microbenchmark on Arm Neoverse-N1.

How to repeat:
Build mysql-8.0 on aarch64 with gcc-10.3, RelWithDebInfo.
Use objdump to disassemble crc32.cc.o and search for "crc32cx".
[4 Apr 2023 10:16] Cai Yibo
patch

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-innobase-improve-crc32c-for-arm64.patch (text/x-patch), 4.64 KiB.

[4 Apr 2023 10:21] MySQL Verification Team
Hello Cai Yibo,

Thank you for the report and contribution.

regards,
Umesh
[4 Apr 2023 17:09] Jakub Lopuszanski
Wow, this is almost exactly like:
```
$ git show mysql-trunk-jlopusza-crc32-508
commit 354ce25ef816ab50e12df0b65575238f82fae66e (mysql-trunk-jlopusza-crc32-508)
Author: Jakub Łopuszański <jakub.lopuszanski@oracle.com>
Date:   Fri Sep 10 14:47:25 2021 +0200

    platform-dependent type of state

diff --git a/storage/innobase/ut/crc32.cc b/storage/innobase/ut/crc32.cc
index 8932f007766..ba66b8937d9 100644
--- a/storage/innobase/ut/crc32.cc
+++ b/storage/innobase/ut/crc32.cc
@@ -449,7 +449,12 @@ struct crc32_impl {
   static inline uint32_t update(uint32_t crc, unsigned char data);
   static inline uint32_t update(uint32_t crc, uint16_t data);
   static inline uint32_t update(uint32_t crc, uint32_t data);
-  static inline uint64_t update(uint64_t crc, uint64_t data);
+#ifdef CRC32_ARM64
+  using crc64state_t = uint32_t;
+#else
+  using crc64state_t = uint64_t;
+#endif
+  static inline crc64state_t update(crc64state_t crc, uint64_t data);
 };

 #ifdef CRC32_x86_64
@@ -493,8 +498,8 @@ uint32_t crc32_impl::update(uint32_t crc, uint32_t data) {
 #ifdef CRC32_ARM64_DEFAULT
 MY_ATTRIBUTE((target("+crc")))
 #endif /* CRC32_ARM64_DEFAULT */
-uint64_t crc32_impl::update(uint64_t crc, uint64_t data) {
-  return (uint64_t)__crc32cd((uint32_t)crc, data);
+uint32_t crc32_impl::update(uint32_t crc, uint64_t data) {
+  return __crc32cd(crc, data);
 }
 #endif /* CRC32_ARM64 */

@@ -604,7 +609,8 @@ slice_len bytes. */
 template <typename algo_to_use, size_t slice_len>
 struct Update_step_executor {
   template <size_t i>
-  static void run(uint64_t *crc, const uint64_t *data64) {
+  static void run(typename algo_to_use::crc64state_t *crc,
+                  const uint64_t *data64) {
     crc[i] = algo_to_use::update(crc[i], *(data64 + i * (slice_len / 8)));
   }
 };
@@ -616,7 +622,8 @@ are from the end of the i-th slice to the end of the chunk. */
 template <typename algo_to_use, size_t slice_len, size_t slices_count>
 struct Combination_step_executor {
   template <size_t i>
-  static void run(uint64_t &combined, const uint64_t *crc) {
+  static void run(uint64_t &combined,
+                  const typename algo_to_use::crc64state_t *crc) {
     combined ^= roll<slice_len *(slices_count - 1 - i), algo_to_use>(crc[i]);
   }
 };
@@ -635,7 +642,7 @@ static inline uint32_t consume_chunk(uint32_t crc0, const unsigned char *data) {
   /* crc[i] is the hash for i-th slice, data[i*slice_len...(i+1)*slice_len)
   where the initial value for each crc[i] is zero, except crc[0] for which we
   use the initial value crc0 passed in by the caller. */
-  uint64_t crc[slices_count]{crc0};
+  typename algo_to_use::crc64state_t crc[slices_count]{crc0};
   /* Each iteration of the for() loop will eat 8 bytes (single uint64_t) from
   each slice. */
   static_assert(
```
sitting in my repo. I wonder why I haven't pushed it. Was it because it overcomplicates the code? Or was the gain not worth it?
Let me run performance tests once again.
[4 Apr 2023 18:32] Jakub Lopuszanski
I've run tests on VM.Standard.A1.Flex (8 cores, 128 GB RAM):
```
cmake3 ../mysql -DWITH_BOOST=/usr/global/share  -DMINIMAL_RELWITHDEBINFO=OFF -G Ninja && ninja -j8
./runtime_output_directory/merge_innodb_tests-t --gtest_repeat=100 --gtest_filter=*BM_CRC32* | ack iterations
```
Results are extremely repeatable, so this is just one run, but I did 10 runs and each had exactly the same ns/iter.
```
mysql-trunk :
BM_CRC32_6_16338                            1415608 iterations        698 ns/iter    21.81 GB/sec
BM_CRC32_0_508                             30487805 iterations         30 ns/iter    15.90 GB/sec

contribution patch:
BM_CRC32_6_16338                            1407638 iterations        697 ns/iter    21.84 GB/sec
BM_CRC32_0_508                             32894737 iterations         26 ns/iter    17.92 GB/sec

my old patch:
BM_CRC32_6_16338                            1428551 iterations        697 ns/iter    21.84 GB/sec
BM_CRC32_0_508                             33783784 iterations         26 ns/iter    17.92 GB/sec
```
So it looks like a 13% gain.

I was able to get it to run even a little bit faster (an additional 5%?):
```
BM_CRC32_6_16338                            1426920 iterations        696 ns/iter    21.86 GB/sec
BM_CRC32_0_508                             35714286 iterations         25 ns/iter    18.82 GB/sec
```
by resurrecting another patch from my repo, which basically checks if the buffer alignment and size look like those of a redo log block (0 and 508, respectively) and inlines the method with these values as constants, so that all branches get eliminated:

```
diff --git a/storage/innobase/ut/crc32.cc b/storage/innobase/ut/crc32.cc
index c68bbf955bc..32d7d837027 100644
--- a/storage/innobase/ut/crc32.cc
+++ b/storage/innobase/ut/crc32.cc
@@ -96,6 +96,9 @@ external tools. */
 #include "univ.i"
 #include "ut0crc32.h"

+#include "log0log.h"  // LOG_BLOCK_TRL_SIZE
+#include "os0file.h"  // OS_FILE_LOG_BLOCK_SIZE
+
 /** Pointer to CRC32 calculation function. */
 ut_crc32_func_t ut_crc32;

@@ -689,6 +692,46 @@ static inline void consume_pow2(uint32_t &crc, const byte *&data, size_t len) {
     data += sizeof(Chunk);
   }
 }
+template <typename algo_to_use>
+#ifdef CRC32_x86_64_WIN
+__forceinline
+#else
+inline
+#endif /*CRC32_x86_64_WIN */
+    static uint32_t
+    crc32_opt(uint32_t crc, const byte *data, const size_t prefix_len,
+              const size_t rest_len) {
+  ut_ad(prefix_len < 8);
+  /* The suffix_len will be needed later, but we can compute it here already,
+     as len will only be modified by subtracting multiples of 8, and hopefully,
+     learning suffix_len sooner will make it easier for branch predictor
+     later.*/
+  const size_t suffix_len = rest_len & 7;
+  size_t body_len = rest_len & ~size_t{7};
+  /* data address may be unaligned, so we read just one byte if it's odd */
+  consume_pow2<byte, algo_to_use>(crc, data, prefix_len);
+  /* data address is now aligned mod 2, but may be unaligned mod 4*/
+  consume_pow2<uint16_t, algo_to_use>(crc, data, prefix_len);
+  /* data address is now aligned mod 4, but may be unaligned mod 8 */
+  consume_pow2<uint32_t, algo_to_use>(crc, data, prefix_len);
+  /* data is now aligned mod 8 */
+  /* A typical page is 16kb, but the part for which we compute crc32 is a bit
+  shorter, thus 5440*3 is the largest multiple of 8*3 that fits. For pages
+  larger than 16kb, there's not much gain from handling them specially. */
+  consume_chunks<5440, 3, algo_to_use>(crc, data, body_len);
+  /* A typical redo log block is 0.5kb, but 168*3 is the largest multiple of
+  8*3 which fits in the part for which we compute crc32. */
+  consume_chunks<168, 3, algo_to_use>(crc, data, body_len);
+  /* In general there can be some left-over (smaller than 168*3) which we
+  consume 8*1 bytes at a time. */
+  consume_chunks<8, 1, algo_to_use>(crc, data, body_len);
+  /* Finally, there might be unprocessed suffix_len < 8, which we deal with
+  minimal number of computations caring about proper alignment. */
+  consume_pow2<uint32_t, algo_to_use>(crc, data, suffix_len);
+  consume_pow2<uint16_t, algo_to_use>(crc, data, suffix_len);
+  consume_pow2<byte, algo_to_use>(crc, data, suffix_len);
+  return crc;
+}
 /** The hardware accelerated implementation of CRC32-C exploiting within-core
 parallelism on reordering processors, by consuming the data in large chunks
 split into 3 independent slices each. It's optimized for handling buffers of
@@ -707,33 +750,15 @@ static uint32_t crc32(uint32_t crc, const byte *data, size_t len) {
     On some  platforms unaligned reads are not allowed, on others they are
     slower, so we start by consuming the unaligned prefix of the data. */
     const size_t prefix_len = (8 - reinterpret_cast<uintptr_t>(data)) & 7;
-    /* data address may be unaligned, so we read just one byte if it's odd */
-    consume_pow2<byte, algo_to_use>(crc, data, prefix_len);
-    /* data address is now aligned mod 2, but may be unaligned mod 4*/
-    consume_pow2<uint16_t, algo_to_use>(crc, data, prefix_len);
-    /* data address is now aligned mod 4, but may be unaligned mod 8 */
-    consume_pow2<uint32_t, algo_to_use>(crc, data, prefix_len);
-    /* data is now aligned mod 8 */
-    len -= prefix_len;
-    /* The suffix_len will be needed later, but we can compute it here already,
-    as len will only be modified by subtracting multiples of 8, and hopefully,
-    learning suffix_len sooner will make it easier for branch predictor later.*/
-    const size_t suffix_len = len & 7;
-    /* A typical page is 16kb, but the part for which we compute crc32 is a bit
-    shorter, thus 5440*3 is the largest multiple of 8*3 that fits. For pages
-    larger than 16kb, there's not much gain from handling them specially. */
-    consume_chunks<5440, 3, algo_to_use>(crc, data, len);
-    /* A typical redo log block is 0.5kb, but 168*3 is the largest multiple of
-    8*3 which fits in the part for which we compute crc32. */
-    consume_chunks<168, 3, algo_to_use>(crc, data, len);
-    /* In general there can be some left-over (smaller than 168*3) which we
-    consume 8*1 bytes at a time. */
-    consume_chunks<8, 1, algo_to_use>(crc, data, len);
-    /* Finally, there might be unprocessed suffix_len < 8, which we deal with
-    minimal number of computations caring about proper alignment. */
-    consume_pow2<uint32_t, algo_to_use>(crc, data, suffix_len);
-    consume_pow2<uint16_t, algo_to_use>(crc, data, suffix_len);
-    consume_pow2<byte, algo_to_use>(crc, data, suffix_len);
+    constexpr auto redo_size = OS_FILE_LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE;
+    /* This is intentionally "tautological". The goal here is to hint the
+    compiler to inline the variant for redo, so that all ifs in all consume_pow2
+    in it can be resolved at compile time, leading to branchless code */
+    if (prefix_len == 0 && len == redo_size) {
+      crc = crc32_opt<algo_to_use>(crc, data, 0, redo_size);
+    } else {
+      crc = crc32_opt<algo_to_use>(crc, data, prefix_len, len - prefix_len);
+    }
   } else {
     while (len--) {
       crc = algo_to_use::update(crc, *data++);
```

I am not sure if this is worth it, as it makes the code more and more complicated.
Also, I would need to repeat the tests on Apple M1 and various x64 machines to make sure there's no regression.

I think changing the type of the accumulator to 32-bit on ARM64 - what the contributor proposes - is not risky.
But the change to inline the case for prefix_len == 0 && len == redo_size is a bit risky.

Also, I remember now that I went down a rabbit hole trying to figure out why the layout of the assembly code impacts performance on ellex06 (which is also ARM64) - adding NOPs in various places seemed to change performance greatly. Perhaps this was the reason I didn't push these patches - random asm layout changes caused regressions?
[5 Apr 2023 11:35] Cai Yibo
Thank you, Jakub. It's interesting that you already have an almost identical patch.

From the microbenchmark results, I think this improvement is useful.
"mov w1, w1" clears the upper 32 bits of x1. The compiler has to insert it if the result is expected to be 64 bits.
[13 Apr 2023 12:03] Jakub Lopuszanski
The problem I have is that I've run the baseline vs. a patch which uses the narrower 32-bit state on several ARM64 machines, and the results in GB/s look like this:

                    baseline    narrower
Apple M1:
BM_CRC32_0_508      23.30       23.30
BM_CRC32_6_16338    23.72       23.72
"Cavium":
BM_CRC32_0_508      9.28        7.70 
BM_CRC32_6_16338    12.05       8.87
VM.Standard.A1.Flex:
BM_CRC32_0_508      15.86       17.98
BM_CRC32_6_16338    21.75       21.83

I've disassembled the patched code on Cavium with gdb, and it looks innocent:
```
hardware::crc32_using_pclmul(unsigned char const*, unsigned long):
0xfffff5132048 <+136>  ldr    x5, [x0]
0xfffff513204c <+140>  add    x0, x0, #0x8
0xfffff5132050 <+144>  ldr    x4, [x0,#5432]
0xfffff5132054 <+148>  ldr    x3, [x0,#10872]
0xfffff5132058 <+152>  crc32cx        w8, w8, x5
0xfffff513205c <+156>  crc32cx        w1, w1, x4
0xfffff5132060 <+160>  crc32cx        w2, w2, x3
0xfffff5132064 <+164>  cmp    x0, x6
0xfffff5132068 <+168>  b.ne   0xfffff5132048 <hardware::crc32_using_pclmul(unsigned char const*, unsigned long)+136>
```

I've played with changing:
```
-MY_ATTRIBUTE((target("+crc+crypto"), flatten))
+MY_ATTRIBUTE((target("+crc+crypto"), flatten, optimize("align-loops=$X")))
 #endif /* CRC32_ARM64_DEFAULT */
 uint32_t crc32_using_pclmul(const byte *data, size_t len) {
```

with various values of $X, but that didn't help much:
with align-loops=8:
	BM_CRC32_0_508 		7.70
	BM_CRC32_6_16338	8.88
with align-loops=16:
	BM_CRC32_0_508 		8.00
	BM_CRC32_6_16338	8.86
with align-loops=32:
	BM_CRC32_0_508 		8.00
	BM_CRC32_6_16338	8.85
with align-loops=64:
	BM_CRC32_0_508 		7.87
	BM_CRC32_6_16338	8.85

So, for some reason, the change in the "code layout" impacts performance on Cavium dramatically, even though in theory we are just eliminating some instructions.
This is "Cavium ThunderX2(R) CPU CN9980 v2.1 @ 2.20GHz".
From earlier work on this machine, I know that the timings of tight loops on it are *very* dependent on the exact placement of instructions, so that adding NOP operations in various places can speed up code 2x, etc.

The questions I have:
1. Should we care about performance of this function on Cavium?
2. Is this CPU similar enough to other ARM64 CPUs in the wild, that above performance loss might generalize to other CPUs?
3. Is there a simple way to fix this problem?
4. What is the underlying reason for the slowdown?
[14 Apr 2023 10:43] Cai Yibo
Hi Jakub,

I reproduced the performance regression on ThunderX2. What I observed is that IPC drops from 2.81 to 1.86, with a non-trivial increase in stall_backend. Deeper analysis would require micro-architecture details.
I tested on Arm server CPUs including Neoverse-N1 (Amazon Graviton2) and Neoverse-V1 (Graviton3), and saw a small but stable performance improvement.
But given that the improvement is not significant, I think we can hold off on accepting this change.
I will update if there are new findings.
[14 Apr 2023 17:31] Jakub Lopuszanski
Hi Cai Yibo!

Apart from these micro-benchmarks, did you run any end-to-end tests to see the impact on overall transactions per second?
[17 Apr 2023 2:13] Cai Yibo
I ran the sysbench write-only benchmark with 1 and 32 threads. No obvious performance difference was found.
[17 Apr 2023 5:06] Cai Yibo
Just FYI.
It looks like clang is able to identify the pattern and eliminate the redundant instructions.
Tested with clang-10 on ThunderX2: with or without the patch, the machine code and performance are the same.
Clang unrolls the 508-byte loop and performs better than gcc on that benchmark, but is slower on the 16338-byte benchmark.
[17 Apr 2023 8:24] Jakub Lopuszanski
Thanks for this information!
In general, if we can let the compiler do the hard work it is supposed to be doing, instead of adding #ifdefs, then that's great.
The code already has a lot of #ifdefs, and removing them seems harder than adding them, while compilers get better each year. So one natural approach is to keep the C++ code simple (for humans reading and maintaining it, and for compilers trying to optimize it) and just wait for compilers to get smarter.

Taking into account:
  1. that there is at least one compiler which knows how to produce fast assembly from the existing C++ code,
  2. that there is at least one family of CPUs and compilers for which the patch causes a regression,
  3. that there is no obvious gain in end-to-end TPS,
I think the reasonable thing to do is to not apply any patch now.

Thank you for the patch and analysis.
I hope we will find places where the trade-off is more obvious.