| Bug #110644 | innobase: optimize hash_uint64 with hardware crc on Arm | | |
|---|---|---|---|
| Submitted: | 10 Apr 2023 6:16 | Modified: | 13 Apr 2023 9:56 |
| Reporter: | Cai Yibo (OCA) | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
| Version: | 8.0 | OS: | Linux |
| Assigned to: | | CPU Architecture: | ARM (aarch64) |

[10 Apr 2023 6:16]
Cai Yibo

patch (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-innobase-optimize-hash_uint64-with-hardware-crc-on-A.patch (text/x-patch), 4.98 KiB.

[10 Apr 2023 7:27]
MySQL Verification Team

Hello Cai Yibo,

Thank you for the report and contribution.

Regards,
Umesh

[11 Apr 2023 8:56]
Marcin Babij

Hello Cai, thank you for your report. We have identified the same problem internally, and not only on ARM. In upcoming MySQL Server versions we will use CRC32 for hashing whenever it is hardware accelerated, and fall back to the current Tabular Hashing only in the rare cases where it is not.

[11 Apr 2023 9:08]
Cai Yibo

Thanks Marcin, this is great! FWIW, I find that four crc32c instructions can easily pass the unit test. Might be useful on x86 without hardware crc32.

```
// Remix bits
value = (value ^ (value >> 30)) * 0xbf58476d1ce4e5b9ULL;
value = value ^ (value >> 31);
// Mix results from 4 crc32c_u16
uint64_t r3 = _mm_crc32_u16(0x01234567, uint16_t(value >> 48));
uint64_t r2 = _mm_crc32_u16(0x89abcdef, uint16_t(value >> 32));
uint64_t r1 = _mm_crc32_u16(0xfedcba98, uint16_t(value >> 16));
uint64_t r0 = _mm_crc32_u16(0x76543210, uint16_t(value));
return ((r3 ^ r1) << 32) | (r2 ^ r0);
```

[11 Apr 2023 16:57]
Jakub Lopuszanski

Hi Cai Yibo. The solution we've ended up with uses two crc32 instructions:

```
static inline uint64_t crc32_update_uint64(uint64_t crc, uint64_t data) {
#ifdef CRC32_x86_64
  return _mm_crc32_u64(crc, data);
#elif defined(CRC32_ARM64)
  return (uint64_t)__crc32cd((uint32_t)crc, data);
#endif /* CRC32_x86_64 */
}

static inline uint64_t crc32_hash_uint64(uint64_t w) {
  ut_ad(ut_crc32_cpu_enabled);
  w *= 0x1a2be5a5fa110e23;
  return (crc32_update_uint64(0, w) ^ w) << 32 |
         crc32_update_uint64(0, (w >> 32 | w << 32));
}
```

The `w *= 0x1a2be5a5fa110e23` part is a bit ugly and not needed in theory, but in practice it was needed to pass the stringent unit tests. It's a bijection, so it doesn't lose any bits of information.

The idea here is to treat the 64-bit input w as consisting of two 32-bit parts (w == u << 32 ^ d) representing two polynomials of degree 31 over GF(2) (also known as Z_2):

    W(x) = U(x)*x^32 + D(x)

and the result of my proposed hash function consists of two parts as well:

    H(x) = (D(x)*x^32 + U(x))*x^32 mod CRC32c + D(x)
    L(x) = (U(x)*x^32 + D(x))*x^32 mod CRC32c

Note: the names are short for Up, Down, High and Low.

Note: the asymmetry between H(x)'s and L(x)'s definitions (the "+ D(x)" part) is crucial and intentional, as I'll explain later. (Basically, we need an odd number of polynomials to avoid divisibility by (x+1).)

Note: _mm_crc32_u64(0, w) is meant to process the 8 bytes of data in little-endian order, so the lower 4 bytes of w are (conceptually) treated as the "higher coefficients" of the polynomial, and vice versa, which is why the math formula has D(x)*x^32 + U(x) where the code has w.

Note: technically, _mm_crc32_u64(0, w) *does not* compute W(x)*x^32 mod CRC32c. Due to "historical" reasons it computes an "equivalent" of it: you'd have to xor the result with _mm_crc32_u64(0, 0) to get W(x)*x^32 mod CRC32c.
However, xoring with a constant doesn't change any properties of this solution and just wastes CPU cycles, so in what follows I'll analyse the mathematical expression instead of the real expression computed by the CPU.

The above definition has very nice properties:

1. Both halves of the result depend on both halves of the input. (Actually, we could make stronger claims, like: each bit of L(x) depends on each bit of the input with probability equal to the fraction of ones in the CRC32-C polynomial, which is 17/32 ~ 53%. For H(x) there is a slight loss: because D(x) appears in it twice, the result depends on D(x)*(x^64+1), not D(x), and unfortunately (x^64+1) is divisible by (x+1), which is a factor of CRC32c too, which means that (x^64+1) and CRC32c are not co-prime and thus multiplying by it is not reversible. The (x^64+1)/(x+1) is co-prime with CRC32c/(x+1), though, so we still get 31 bits of entropy.) So, even if the caller is only interested in the lower half, or the upper half, of the result, it will take advantage of (most of) the entropy available in the input.

2. The proposed hash_uint64 is a bijection, i.e. no two different inputs map to the same output, so there are no "collisions" in this sense. (A hand-waving way to convince yourself of this claim: H(x) can be used to compute the checksum of the [D,U,D,0] sequence, and L(x) can be used to compute the checksum of [0,U,D,0]; xoring these two checksums would let you recover D(x), and from that you can figure out U(x) as well.) A formal proof: observe that you can reconstruct D(x) and U(x) from H(x) and L(x) in the following way.
    H(x)*x^32 mod CRC32c + L(x)
    = ((D(x)*x^32 + U(x))*x^32 mod CRC32c + D(x))*x^32 mod CRC32c + L(x)
    = ((D(x)*x^32 + U(x))*x^32 + D(x))*x^32 mod CRC32c + L(x)
    = D(x)*x^96 + (U(x)*x^32 + D(x))*x^32 mod CRC32c + L(x)
    = D(x)*x^96 + L(x) mod CRC32c + L(x)
    = D(x)*x^96 mod CRC32c

The CRC32c polynomial - the only one available in SSE4.2 - is unfortunately not a primitive polynomial - it has two factors, one of them being (x+1). This complicates matters in our proof, as it means multiplication by some polynomials is not reversible mod CRC32c - and to recover D(x) from the above expression, we want to "divide both sides by x^96". Thus we have to argue this is possible in the specific case of the x^96 polynomial. Thankfully, x^96 has no common divisors with CRC32c, and thus multiplying by it mod CRC32c is reversible. There are many ways to convince yourself of this: for starters, the factorization of x^96 is just x*x*...*x. Also, x^96 is "odd" (has an odd number of ones), so it can't be divisible by (x+1). Another intuition is that CRC32c can correct single-bit errors, so it can't divide any x^k. Finally, it is a known fact that "rolling crc32c forward by 32 bits" can be reversed, i.e. you can recover a 32-bit value s from s(x)*x^32 mod CRC32c - and thus by doing it three times we can recover D(x). You can also check for yourself that

    x^32*(1+x+x^3+x^5+x^6+x^11+x^16+x^17+x^19+x^21+x^22+x^25+x^26+x^27+x^28+x^29+x^30)
    mod (x^32+x^28+x^27+x^26+x^25+x^23+x^22+x^20+x^19+x^18+x^14+x^13+x^11+x^10+x^9+x^8+x^6+1)

is 1, which means multiplication by x^32 mod CRC32c is reversible, and therefore so is multiplying by it repeatedly, as we do when multiplying by x^96. At any rate, you can recover D(x) from H(x)*x^32 mod CRC32c + L(x).

Once we have D(x), we can compute D(x)*x^32 and add it to L(x), yielding:

    L(x) + D(x)*x^32
    = (U(x)*x^32 + D(x))*x^32 mod CRC32c + D(x)*x^32
    = U(x)*x^64 mod CRC32c

which can be used to reconstruct U(x) in a similar fashion.

3.
Very fast to compute on modern machines, as it uses two independent crc32 instructions.

```
return ((_mm_crc32_u64(0, w) ^ w) << 32) | _mm_crc32_u64(0, w >> 32 | w << 32);
```

translates to (gcc 12.2, -std=c++17 -O2 -march=native):

```
xor   eax, eax
mov   rcx, rax
rorx  rsi, rdi, 32
crc32 rax, rsi
mov   rdx, rax
crc32 rcx, rdi
mov   rax, rdi
xor   rax, rcx
sal   rax, 32
or    rax, rdx
ret
```

Note how the inputs to the second crc32 (rcx and rdi) do not depend on the output of the first one (rax), so the two can be executed in parallel. In practice there is just one ALU capable of computing crc32; it requires 3 cycles to complete, but it works like a pipeline which can accept a new job every cycle, so up to 3 independent computations can be at various stages at once. The second crc32 will therefore start one cycle after the first one, for a total of 1+3 cycles for both crc32 computations.

4. The upper and lower halves of the result (H(x) and L(x)) differ in a non-trivial way from each other, even in edge cases such as:

a) D(x) = 0: the lower half of the input always being 0
b) U(x) = 0: the upper half of the input always being 0
c) D(x) = U(x): the lower half being equal to the upper half

For these edge cases we still get pretty decent results:

a) H(x) = U(x)*x^32 mod CRC32c, L(x) = U(x)*x^64 mod CRC32c
b) H(x) = D(x)*x^64 + D(x) mod CRC32c, L(x) = D(x)*x^32 mod CRC32c
c) H(x) = D(x)*(x^64+x^32+1) mod CRC32c, L(x) = D(x)*(x^64+x^32) mod CRC32c

Contrast this with other ad-hoc "simpler ideas" like having H(x) = U(x)+D(x), or H(x) = ((D(x)+U(x))*x^32+U(x))*x^32 mod CRC32c, which seem to have properties 1-3 but not 4.

5. It seems to perform well in the sysbench OLTP-RW tests I've conducted on tetra02, using distribution in {pareto, uniform} and client counts in {64, 1024}.

[13 Apr 2023 9:56]
Cai Yibo

Thanks Jakub, this is cool! Very solid analysis. One thing I'm not so sure about is how to compute U(x) from D(x). In your equation:

    L(x) + D(x)*x^32
    = (U(x)*x^32 + D(x))*x^32 mod CRC32c + D(x)*x^32
    = U(x)*x^64 mod CRC32c

it looks to me like it leads to

    (U(x)*x^64 mod CRC32c) + (D(x)*x^32 mod CRC32c) + D(x)*x^32

The latter two terms don't cancel. Shall we compute "L(x) + (D(x)*x^32 mod CRC32c)" instead?