Bug #85819 Optimize AARCH64 CRC32c implementation
Submitted: 6 Apr 2017 3:04
Modified: 7 Apr 2017 8:22
Reporter: Yuqi Gu (OCA)
Status: Verified
Category: MySQL Server: InnoDB Plugin storage engine
Severity: S5 (Performance)
Version: 5.7
OS: Linux
Assigned to:
CPU Architecture: ARM
Tags: Contribution

[6 Apr 2017 3:04] Yuqi Gu
Description:
ARMv8 defines a set of optional CRC32/CRC32C instructions.

A CRC32 function for AArch64 that uses these instructions performs better than the existing table-based lookup implementation.
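
For illustration only, here is a minimal sketch of what using these instructions can look like. It is not the code from the attached patch. The helper name crc32c_sketch is hypothetical, the hardware path assumes a little-endian target and a compiler that defines __ARM_FEATURE_CRC32 (for example GCC or Clang with -march=armv8-a+crc), and the fallback is a simple bit-at-a-time loop rather than InnoDB's slice-by-8 tables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // ACLE intrinsics __crc32cb and __crc32cd
#endif

// Hypothetical helper (not from the patch): CRC-32C (Castagnoli) of a buffer.
static uint32_t crc32c_sketch(const uint8_t* buf, size_t len) {
    assert(buf != nullptr || len == 0);
    uint32_t crc = 0xFFFFFFFFu;  // standard CRC-32C initial value
#if defined(__ARM_FEATURE_CRC32)
    // Hardware path: one instruction consumes 8 bytes.
    // Assumes a little-endian target, so the first byte in memory is
    // the least significant byte of the loaded word.
    while (len >= 8) {
        uint64_t word;
        std::memcpy(&word, buf, sizeof word);  // alignment-safe load
        crc = __crc32cd(crc, word);
        buf += 8;
        len -= 8;
    }
    while (len > 0) {  // remaining tail bytes, one at a time
        crc = __crc32cb(crc, *buf++);
        len--;
    }
#else
    // Portable fallback: bit-at-a-time update with the reflected
    // CRC-32C polynomial 0x82F63B78.
    while (len > 0) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        len--;
    }
#endif
    return ~crc;  // standard final inversion
}
```

Both paths can be cross-checked against the standard CRC-32C check value: the nine ASCII bytes "123456789" should produce 0xE3069283.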

How to repeat:
The benchmark App source: 
/***********************************************************************
 *
 */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>
 
#ifdef ARMV8_CRC32
extern bool ut_crc32_sse2_enabled;
extern uint32_t ut_crc32_aarch64(const uint8_t* buf, uint64_t len);
#else
extern uint32_t ut_crc32_sw(const uint8_t* buf, uint64_t len);
extern void ut_crc32_slice8_table_init(void);
#endif
 
/* Returns the current wall-clock time in microseconds. */
long int GetTickCount() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000000L + tv.tv_usec;
}
 
int main() {
        static const uint64_t kSize = 1024 * 1024 + 29;
        uint8_t* buf = (uint8_t *)malloc(sizeof(uint8_t) * kSize);
        if (buf == NULL) {
                fprintf(stderr, "malloc failed\n");
                return 1;
        }
        uint32_t i;
 
#ifdef ARMV8_CRC32
        ut_crc32_sse2_enabled = true;
#else
        ut_crc32_slice8_table_init();
#endif
 
        srand(0);
        for (i = 0; i < kSize; i++) {
                buf[i] = (uint8_t)(rand() % 256u);
        }
 
        uint32_t kLoop = 1024;
        long int start, end;
 
        uint32_t crc = 0;
 
        start = GetTickCount();
        for (i = 0; i < kLoop; i++) {
#ifdef ARMV8_CRC32
                crc = ut_crc32_aarch64(buf, kSize);
#else
                crc = ut_crc32_sw(buf, kSize);
#endif
        }
        end = GetTickCount();
 
        if (kSize < 20) {
                for (i = 0; i < kSize; i++) {
                        printf("%3u,", (uint32_t)buf[i]);
                }
                printf("\n");
        }
 
        /* GetTickCount() returns microseconds, so this is microseconds per loop. */
        printf("crc result = %x, time cost per loop: %f us\n", crc, (double)(end - start) / kLoop);
 
        free(buf);
        return 0;
}
/*
 *
 ****************************************************************************/
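
As a side note on the timing in the benchmark above: gettimeofday() reads the wall clock, which can jump if the system time is adjusted during a run. A possible alternative, sketched here under the assumption that C++11 is available, is to time the loop with std::chrono::steady_clock. The helper name avg_us_per_loop is hypothetical and not part of the benchmark or the patch.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper: average time in microseconds for one call of fn,
// measured over `loops` back-to-back calls with a monotonic clock.
template <typename Fn>
static double avg_us_per_loop(Fn fn, uint32_t loops) {
    assert(loops > 0);
    const auto start = std::chrono::steady_clock::now();
    for (uint32_t i = 0; i < loops; i++) {
        fn();
    }
    const auto end = std::chrono::steady_clock::now();
    // Convert the elapsed time to fractional microseconds.
    const std::chrono::duration<double, std::micro> elapsed = end - start;
    return elapsed.count() / loops;
}
```

In the benchmark this could be called as avg_us_per_loop([&] { crc = ut_crc32_sw(buf, kSize); }, kLoop), with the result printed in place of (end - start) / kLoop.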

Build benchmark App on AArch64:

/***********************************************************************
 *
 */
linux@wls-ci-arm:~/mysql-server$ git diff extra/CMakeLists.txt
diff --git a/extra/CMakeLists.txt b/extra/CMakeLists.txt
index 3adf988..4fa1004 100644
--- a/extra/CMakeLists.txt
+++ b/extra/CMakeLists.txt
@@ -141,6 +141,11 @@ IF(WITH_INNOBASE_STORAGE_ENGINE)
   MYSQL_ADD_EXECUTABLE(innochecksum innochecksum.cc ${INNOBASE_SOURCES})
   TARGET_LINK_LIBRARIES(innochecksum mysys mysys_ssl ${LZ4_LIBRARY})
   ADD_DEPENDENCIES(innochecksum GenError)
+
+  SET(CRC_TEST_SOURCES
+      ../storage/innobase/ut/ut0crc32.cc
+     )
+  MYSQL_ADD_EXECUTABLE(mycrctest mycrctest.cc ${CRC_TEST_SOURCES})
 ENDIF()
/*
 *
 **************************************************************************/

AArch64 Platform:
Platform \ Case (microseconds per loop)	| Software CRC	| AArch64 CRC Intrinsics
AMD Seattle (SoftIron)	|1101.783	|200.535
Cavium ThunderX	|1504.497	|479.690
HiSilicon Taishan (Huawei)	|1035.202	|232.984
[6 Apr 2017 5:41] Yuqi Gu
Contribution submitted via Github - Bug #85819 Add AArch64 optimized crc32c implementation #136

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug85819-5.7.txt (text/plain), 5.40 KiB.

[6 Apr 2017 5:59] Yuqi Gu
ARMv8 defines a set of optional CRC32/CRC32C instructions.
A CRC32 function for AArch64 that uses these instructions performs better than one that uses table-based lookup.
[6 Apr 2017 6:13] Alexey Kopytov
Duplicate of bug #79144 ?
[7 Apr 2017 5:22] Yuqi Gu
ARMv8 defines the PMULL crypto instruction. The new patch optimizes the crc32c calculation with this instruction when available rather than the original linear crc32 instructions.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug85819-02-5.7.txt (text/plain), 8.38 KiB.

[7 Apr 2017 5:22] Yuqi Gu
I updated the crc32 optimization code. 

ARMv8 defines the PMULL crypto instruction. The new patch computes crc32c with the PMULL instruction when available, rather than with the original linear crc32 instructions.
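
The patch is attached rather than quoted in this thread, so the following is only a sketch of the general idea that carry-less-multiply CRC implementations commonly rely on, and it is an assumption that the patch follows the same outline. A buffer can be split into independent streams whose CRCs are computed separately and then merged, because the CRC register update is linear over GF(2). The merge needs the first stream's register advanced over as many zero bytes as the second stream is long, which is a multiplication by a fixed power of x modulo the polynomial. That multiplication is the kind of step a PMULL with a precomputed constant can perform. The portable code below does the advance the slow way by feeding zero bytes, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Raw CRC-32C register update (reflected polynomial 0x82F63B78) with no
// initial or final inversion, so the GF(2) linearity is easy to see.
static uint32_t crc32c_raw(uint32_t state, const uint8_t* buf, size_t len) {
    assert(buf != nullptr || len == 0);
    for (size_t i = 0; i < len; i++) {
        state ^= buf[i];
        for (int k = 0; k < 8; k++)
            state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
    }
    return state;
}

// Advance a register over n zero bytes. This loop is the slow stand-in
// for the step a carry-less multiply by a precomputed constant replaces.
static uint32_t crc32c_shift(uint32_t state, size_t n) {
    const uint8_t zero = 0;
    for (size_t i = 0; i < n; i++)
        state = crc32c_raw(state, &zero, 1);
    return state;
}

// Reference: the whole buffer processed as a single linear stream.
static uint32_t crc32c_one_stream(const uint8_t* buf, size_t len) {
    return ~crc32c_raw(0xFFFFFFFFu, buf, len);
}

// Same CRC computed as two independent halves that are then merged.
static uint32_t crc32c_two_streams(const uint8_t* buf, size_t len) {
    const size_t half = len / 2;
    // Stream A carries the initial value; stream B starts from zero.
    const uint32_t a = crc32c_raw(0xFFFFFFFFu, buf, half);
    const uint32_t b = crc32c_raw(0u, buf + half, len - half);
    // Merge: shift A past B's length, XOR in B, apply the final inversion.
    return ~(crc32c_shift(a, len - half) ^ b);
}
```

On real hardware the two calls to crc32c_raw would be interleaved crc32c instructions on separate registers, so the streams overlap in the pipeline instead of waiting on one dependency chain.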

Benchmark results:
Platform \ Case (microseconds per loop)	| Software CRC	| AArch64 CRC Intrinsics	| AArch64 Crypto instruction
AMD Seattle (SoftIron)	|1101.783	|200.535	|114.509
Cavium ThunderX	|1504.497	|479.690	|286.274
HiSilicon Taishan (Huawei)	|1035.202	|232.984	|115.580

The results show that the PMULL-based CRC32 for InnoDB on AArch64 outperforms the linear crc32 instructions, by roughly 1.7x to 2.0x on the three platforms tested.
[7 Apr 2017 5:31] Yuqi Gu
GH PR: https://github.com/mysql/mysql-server/pull/136
[7 Apr 2017 5:34] Yuqi Gu
ARMv8 defines the PMULL crypto instruction. This patch optimizes the crc32c calculation with that instruction.
[7 Apr 2017 8:22] Umesh Shastry
Hello Yuqi Gu,

Thank you for the report and contribution.

Thanks,
Umesh
[28 Apr 2017 16:10] OCA Admin
Contribution submitted via Github - Bug #85819 Add AArch64 optimized crc32c implementation 
(*) Contribution by Yuqi Gu (Github guyuqi, mysql-server/pull/136#issuecomment-292444031): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_114523745.txt (text/plain), 11.05 KiB.