Bug #98520 Suboptimal implementations for some load/store functions for little-endian arch
Submitted: 8 Feb 2020 9:37
Modified: 5 May 2020 14:51
Reporter: Alexey Kopytov
Status: Verified
Category: MySQL Server: Optimizer
Severity: S5 (Performance)
Version: 8.0
OS: Any
Assigned to:
CPU Architecture: ARM

[8 Feb 2020 9:37] Alexey Kopytov
The following commit in MySQL 8.0 has converted most legacy *int*korr()
and int*store() functions to memcpy() calls on little-endian
architectures, which is generally more efficient:


However, some functions from that family still use platform-independent
implementations. These are correct on both little-endian and big-endian
architectures, but are less efficient than they could be on
little-endian ones, where no byte-order conversion is required:


static inline int32 sint3korr(const uchar *A);

static inline uint32 uint3korr(const uchar *A);

static inline ulonglong uint5korr(const uchar *A);

static inline ulonglong uint6korr(const uchar *A);

static inline void int3store(uchar *T, uint A);

static inline void int5store(uchar *T, ulonglong A);

static inline void int6store(uchar *T, ulonglong A);
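
As a rough illustration of the idea for the load functions listed above, here is a minimal sketch. It is not the code behind the Godbolt link and not MySQL's actual implementation. The _portable and _le suffixes are hypothetical names used only here, and <stdint.h> types stand in for MySQL's uchar / ulonglong typedefs. The _le variants assume a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* The _le variants below are only valid on little-endian hosts.
   The guard relies on GCC/Clang predefined macros and is skipped
   on compilers that do not define them. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    __BYTE_ORDER__ != __ORDER_LITTLE_ENDIAN__
#error "little-endian sketch compiled for a non-little-endian target"
#endif

/* Portable form: assembles the value byte by byte, so it is correct
   on any byte order, at the cost of extra shifts and ORs. */
static inline uint32_t uint3korr_portable(const unsigned char *A) {
  return (uint32_t)A[0] | ((uint32_t)A[1] << 8) | ((uint32_t)A[2] << 16);
}

/* Little-endian form: the 3 stored bytes are already the low-order
   bytes of the result, so copy them into a zeroed 32-bit value. */
static inline uint32_t uint3korr_le(const unsigned char *A) {
  uint32_t ret = 0;
  memcpy(&ret, A, 3);
  return ret;
}

/* Signed 3-byte load: read as unsigned, then sign-extend from bit 23
   using the xor/subtract idiom, which avoids implementation-defined
   conversions and shifts. */
static inline int32_t sint3korr_le(const unsigned char *A) {
  uint32_t u = uint3korr_le(A);
  return (int32_t)(u ^ 0x800000u) - 0x800000;
}

/* 5-byte and 6-byte loads follow the same pattern with a 64-bit result. */
static inline uint64_t uint5korr_le(const unsigned char *A) {
  uint64_t ret = 0;
  memcpy(&ret, A, 5);
  return ret;
}

static inline uint64_t uint6korr_le(const unsigned char *A) {
  uint64_t ret = 0;
  memcpy(&ret, A, 6);
  return ret;
}
```

With a constant size argument, compilers typically expand such memcpy() calls inline into a few plain load instructions rather than emitting a function call.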

The following Godbolt link demonstrates how said functions can be
modified to provide more efficient implementations for little-endian
architectures such as x86-64 and ARM64. In some cases the size of code
in terms of CPU instructions is cut by half:

Even though those functions are not used as frequently in the MySQL code
as their power-of-two versions that have been optimized with memcpy(),
they are still used in a few places in the binlog code, Performance
Schema, the client library and SQL-level code. Since they are inlined,
they contribute to code bloat in the mysqld binary.

This is a request to optimize those functions for little-endian
architectures by providing specialized implementations, as has already
been done for other functions in the same family.
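
For the store side, a specialized implementation could look roughly like the sketch below. Again, this is only an illustration under stated assumptions, not MySQL's code: the _portable and _le suffixes are hypothetical, <stdint.h> types replace MySQL's uchar / uint / ulonglong, and the _le variants are valid only on a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Portable form: writes each byte explicitly, lowest-order byte first,
   so the stored layout is the same on any host byte order. */
static inline void int3store_portable(unsigned char *T, uint32_t A) {
  T[0] = (unsigned char)(A);
  T[1] = (unsigned char)(A >> 8);
  T[2] = (unsigned char)(A >> 16);
}

/* Little-endian forms: the low-order bytes of A already sit first in
   memory, so copying the leading N bytes yields the same layout. */
static inline void int3store_le(unsigned char *T, uint32_t A) {
  memcpy(T, &A, 3);
}

static inline void int5store_le(unsigned char *T, uint64_t A) {
  memcpy(T, &A, 5);
}

static inline void int6store_le(unsigned char *T, uint64_t A) {
  memcpy(T, &A, 6);
}
```

Bits above the stored width are simply dropped in both forms, so int3store_le(T, 0x12345678) writes the bytes 0x78, 0x56, 0x34.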

How to repeat:
Code analysis + https://godbolt.org/z/shNgg8

[11 Feb 2020 7:02] MySQL Verification Team
Hello Alexey,

Thank you for the report and feedback.

[16 Mar 2020 1:20] Daniel Black
Very nice, Alexey. Even on x86_64 with gcc-4.1.2, this correctly inlines memcpy and generates less code than the original.

Save our CPU L1 instruction caches for other stuff. Keep up the good work.

[5 May 2020 14:51] Alexey Kopytov
I noticed that the CPU architecture for this issue has been updated from "Any" to "ARM". I'd like to emphasize that the optimization proposed here applies to any little-endian architecture, such as x86, amd64, arm64, ppc64le and likely others.