MySQL Bugs: #110644: innobase: optimize hash_uint64 with hardware crc on Arm

Bug #110644	innobase: optimize hash_uint64 with hardware crc on Arm
Submitted:	10 Apr 2023 6:16	Modified:	13 Apr 2023 9:56
Reporter:	Cai Yibo (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0	OS:	Linux
Assigned to:		CPU Architecture:	ARM (aarch64)

Description:
hash_uint64 is listed high in the perf report of sysbench writeonly
tests. Arm SPE(Statistical Profiling Extension) reveals it's suffering
from non-trivial l1d-cache misses when reading the lookup table.

This patch combines hardware crc32c and crc32 instructions to calculate
hash values. With fewer instructions and no memory access, performance
improved significantly. All unit tests are passed.

x86 has no hardware crc32 instruction. It's not supported in this patch.
Using only crc32c doesn't introduce enough entropy. It's possible to
pass the unit tests with four crc32c instructions, but the benefit is
not huge.

Tested on Neoverse-N1, gcc-10.3, RelWithDebInfo, -march=armv8.2-a. Big
performance uplift is observed in related microbenchmarks.

- Microbenchmarks.BM_RND_GEN:          117 -> 433 MB/s
- Microbenchmarks.BM_HASH_UINT64:      121 -> 252 MB/s
- Microbenchmarks.BM_HASH_UINT64_PAIR: 57  -> 165 MB/s

In sysbench profiling result, hash_uint64 is not in the top list anymore
after applying this change.

NB: I also tried some fast universal hashing implementations, like
clhash [1] and google cityhash [2], both failed mysql unit tests.
[1] https://github.com/lemire/clhash
[2] https://github.com/google/cityhash

How to repeat:
Run related micro-benchmarks.

patch (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-innobase-optimize-hash_uint64-with-hardware-crc-on-A.patch (text/x-patch), 4.98 KiB.

Hello Cai Yibo,

Thank you for the report and contribution.

regards,
Umesh

Hello Cai,
Thank you for your report.
We have identified the same problem internally, not only on ARM. In next MySQL Server versions we will use the CRC32 for hash if it is hardware accelerated, and use the current Tabular Hashing only in rare cases it is otherwise.

Thanks Marcin, this is great!

FWIW, I find with 4 crc32c instructions can easily pass the unit test. Might be useful on x86 without hardware crc32.

// Remix bits
value = (value ^ (value >> 30)) * 0xbf58476d1ce4e5b9ULL;
value = value ^ (value >> 31);
// Mix results from 4 crc32c_u16
uint64_t r3 = _mm_crc32_u16(0x01234567, uint16_t(value >> 48));
uint64_t r2 = _mm_crc32_u16(0x89abcdef, uint16_t(value >> 32));
uint64_t r1 = _mm_crc32_u16(0xfedcba98, uint16_t(value >> 16));
uint64_t r0 = _mm_crc32_u16(0x76543210, uint16_t(value));
return ((r3 ^ r1) << 32) | (r2 ^ r0);

Hi Cai Yibo.

The solution we've ended up with, is using two crc32 instructions:

```
static inline uint64_t crc32_update_uint64(uint64_t crc, uint64_t data) {
#ifdef CRC32_x86_64
  return _mm_crc32_u64(crc, data);
#elif defined(CRC32_ARM64)
  return (uint64_t)__crc32cd((uint32_t)crc, data);
#endif /* CRC32_x86_64 */
}

static inline uint64_t crc32_hash_uint64(uint64_t w) {
  ut_ad(ut_crc32_cpu_enabled);
  w *= 0x1a2be5a5fa110e23;
  return (crc32_update_uint64(0, w) ^ w) << 32 |
         crc32_update_uint64(0, (w>> 32 | w<< 32));
}
```
The `w *= 0x1a2be5a5fa110e23` part is a bit ugly and not needed in theory, but in practice was needed to pass the stringent unit tests. It's a bijection, so doesn't lose any bit of info.

The idea here is to treat the 64-bit input w as consisting of two 32-bit parts ( w == u << 32 ^ d ) representing two polynomials of degree 31 over GF(2) (also known as Z_2):
W(x) = U(x)*x^32 + D(x)

and the result of my proposed hash function consists of two parts as well:
H(x) = (D(x)*x^32+U(x))*x^32 mod CRC32c + D(x) 
L(x) = (U(x)*x^32+D(x))*x^32 mod CRC32c

Note: the names are short for Up, Down, High and Low.
Note: the asymmetry between H(x) 's and L(x) 's definitions (the " +D(x) " part) is crucial and intentional, as I'll explain later.
(Basically we need odd number of polynomials to avoid divisibility by (x+1) )
Note: the _mm_crc32_u64(0,w) is meant to process 8 bytes of data in little endian, so the lower 4 bytes of w are (conceptually) treated as "higher coefficients" of the polynomial, and vice-versa, which is why the math formula has D(x)*x^32+U(x) where the code has w.
Note: technically, the _mm_crc32_u64(0,w)  *does not* compute W(x) * x^32 mod CRC32c . Due to "historical" reasons it computes an "equivalent" of it: you'd have to xor the result with _mm_crc32_u64(0,0)  to get W(x)*x^32 mod CRC32c . However xoring with a constant doesn't change any properties of this solution, and just wastes CPU cycles, so in what follows, I'll analyse the mathematical expression, instead of the real expression computed by CPU.

Above definition has very nice properties:

1. Both halves of the result depend on both halves of input.
(Actually, we could make stronger claims, like: each bit of L(x) depends on each bit of the input with ppb equal to fractions on ones in CRC32-C polynomial, which is 17/32 ~ 53%. For H(x) there is a slight loss: because D(x) appears in it twice, the result depends on D(x)*(x^64+1)  not D(x) , and unfortunately (x^64+1)  is divisible by (x+1)  which is a factor of CRC32c, too, which means that (x^64+1)  and CRC32c are not co-prime and thus multiplying by it is not reversible. The (x^64+1)/(x+1)  is co-prime with CRC32c/(x+1) , though, so we still get 31 bits of entropy.)
So, even if the caller is only interested in the lower half, or upper half of result, it will take advantage of (most of) entropy available in the input.

2. The proposed hash_uint64  is a bijection, i.e. no two different inputs map to the same output, so there are no "collisions" in this sense.
(A hand-waving way to convince yourself of this claim: the H(x) can be used to compute the checksum of [D,U,D,0] sequence, and L(x) can be used to compute the checksum of [0,U,D,0] , and xoring these two checksums would let you recover D(x), and from that you can figure out U(x), as well.)
A formal proof : observe that you can reconstruct D(x) and U(x) from the H(x) and L(x) in the following way.
H(x)*x^32 mod CRC32c + L(x) = 
= ((D(x)*x^32+U(x))*x^32 mod CRC32c + D(x))*x^32 mod CRC32c + L(x)
= ((D(x)*x^32+U(x))*x^32 + D(x))*x^32 mod CRC32c + L(x)
= D(x)*x^64+(U(x)*x^32 + D(x))*x^32 mod CRC32c + L(x)
= D(x)*x^64+L(x) mod CRC32c + L(x)
= D(x)*x^64 mod CRC32c

The CRC32c polynomial - the only one available in SEE4.2 - unfortunately, is not a primitive polynomial - it has two factors, one of them is (x+1) .
This complicates matters in our proof, as it means multiplication by some polynomials is not reversible mod CRC32c - and to recover D(x)  from above expression, we want to "divide both sides by x^64". Thus we have to argue this is possible in the specific case of x^64  polynomial.
Thankfully, x^64  has no common divisors with CRC32c, and thus multiplying by it mod CRC32c is reversible.
There are many ways to convince yourself of it: for starters the factorization of x^64  is just x*x*...*x . Also x^64  is "odd" (has odd number of ones) so can't be divisible by (x+1) . Another intuition is that CRC32c can correct a single bit errors, so it can't divide any x^k . Finally, it is a known fact that "rolling crc32c forward by 32 bits" can be reversed, i.e. you can recover a 32-bit value of s, from s(x)*x^32 mod crc32c  - and thus by doing it twice we can recover D(x) .
Finally, you can see for yourself that `` x^32*(1+x+x^3+x^5+x^6+x^11+x^16+x^17+x^19+x^21+x^22+x^25+x^26+x^27+x^28+x^29+x^30) mod (x^32+x^28+x^27+x^26+x^25+x^23+x^22+x^20+x^19+x^18+x^14+x^13+x^11+x^10+x^9+x^8+x^6+1) is 1, which means multiplication by x^32 mod CRC32c is reversible, and therefore, so is multiplying by it twice as we do when multiplying by x^64 . At any rate, you can recover D(x) from H(x)*x^32 mod CRC32c + L(x) . Once we have D(x) , we can compute D(x)*x^32 and add it to L(x)`, yielding:
L(x)+D(x)*x^32 = 
= (U(x)*x^32+D(x))*x^32 mod CRC32c + D(x)*x^32 
= U(x)*x^64 mod CRC32c

which can be used to reconstruct U(x)  in similar fashion.

3. Very fast to compute on modern machines as it uses two independent crc32 instructions.
  return  ((_mm_crc32_u64(0,w)^w)<<32)|
           _mm_crc32_u64(0,w>>32|w<<32);

translates to (gcc12.2 -std=c++17 -O2 -march=native ):
        xor     eax, eax
        mov     rcx, rax
        rorx    rsi, rdi, 32
        crc32   rax, rsi
        mov     rdx, rax
        crc32   rcx, rdi
        mov     rax, rdi
        xor     rax, rcx
        sal     rax, 32
        or      rax, rdx
        ret

Note, how the inputs to second crc32 ( rcx  and rdi ) do not depend on output of the first one ( rax ), so can be executed in parallel. In practice, there is just one ALU capable of computing crc32 which requires 3 cycles to complete but it works like a pipeline which can accept a new job in every cycle, as up to 3 independent computations can be at various stages of computation. So the second crc32 will start one cycle after the first one, for a total of 1+3 cycles for both crc32 computations in total.

4. The upper and lower halves of the result (H(x) and L(x)) differ in non-trivial way from each other, even in edge cases such as 
a) D(x)=0  : lower half of the input being always 0
b) U(x)=0  : upper half of the input being always 0
c) D(x)=U(x)  : lower half being equal to upper half
For these edge cases we will still get pretty decent results:
a)
H(x) = U(x)*x^32 mod CRC32c
L(x) = U(x)*x^64 mod CRC32c

b)
H(x) = D(x)*x^64+D(x) mod CRC32c
L(x) = D(x)*x^32 mod CRC32c

c)
H(x) = D(x)*(x^64+x^32+1) mod CRC32c
L(x) = D(x)*(x^64+x^32) mod CRC32c

Contrast this to other ad-hoc "simpler ideas" like having H(x)=U(x)+D(x) , or H(x) = ((D(x)+U(x))*x^32+U(x))*x^32 mod CRC32c , which seem to have the properties 1-3, but not 4.

5. It seems to perform well in sysbench OLTP-RW tests on tetra02, which I've conducted. I used distribution in {pareto,uniform}  and clients number in {64,1024}

Thanks Jakub. This is cool! Very solid analysis.

One thing I'm not very sure is how to compute U(x) per D(x).

In your equation:
L(x)+D(x)*x^32 = 
= (U(x)*x^32+D(x))*x^32 mod CRC32c + D(x)*x^32 
= U(x)*x^64 mod CRC32c

Looks to me it leads to "(U(x)*x^64 mod CRC32c) + (D(x) * x32 mod CRC32c) + D(x) * x^32". The latter two factors don't cancel?

Shall we compute "L(x) + (D(x)*x^32 mod CRC32c)" instead?