MySQL Bugs: #98520: Suboptimal implementations for some load/store functions for little-endian arch

Bug #98520	Suboptimal implementations for some load/store functions for little-endian arch
Submitted:	8 Feb 2020 9:37	Modified:	5 May 2020 14:51
Reporter:	Alexey Kopytov
Status:	Verified
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	ARM

Description:
The following commit in MySQL 8.0 converted most of the legacy *int*korr()
and int*store() functions to memcpy() calls on little-endian
architectures, which is generally more efficient:

https://github.com/mysql/mysql-server/commit/536ea313a6a71f9ed87f14d95e03e04e40ff5605

However, some functions from that family still use platform-independent
implementations. These are correct on both little-endian and big-endian
architectures, but are less efficient than they could be on
little-endian ones, where no byte-order conversion is required:

include/my_byteorder.h:

static inline int32 sint3korr(const uchar *A);

static inline uint32 uint3korr(const uchar *A);

static inline ulonglong uint5korr(const uchar *A);

static inline ulonglong uint6korr(const uchar *A);

static inline void int3store(uchar *T, uint A);

static inline void int5store(uchar *T, ulonglong A);

static inline void int6store(uchar *T, ulonglong A);

The following Godbolt link demonstrates how said functions can be
modified to provide more efficient implementations for little-endian
architectures such as x86-64 and ARM64. In some cases the size of code
in terms of CPU instructions is cut by half:
https://godbolt.org/z/shNgg8
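
For readers who cannot open the link, here is a minimal sketch of the general idea. It is not the code from the Godbolt link or from MySQL. The *_sketch names, the SKETCH_LITTLE_ENDIAN macro, and the use of the GCC/Clang __BYTE_ORDER__ macro are assumptions made for illustration; MySQL has its own build-time endianness check. On little-endian targets the load or store becomes a fixed-size memcpy() into or out of a zeroed integer, and other targets fall back to byte-by-byte shifts.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang-style predefined byte-order macros. If they are
   missing, the portable byte-by-byte path is used. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define SKETCH_LITTLE_ENDIAN 1
#else
#define SKETCH_LITTLE_ENDIAN 0
#endif

/* Load n (1..8) bytes stored in little-endian order, zero-extended. */
static inline uint64_t load_le_sketch(const unsigned char *A, size_t n) {
  uint64_t ret = 0;
#if SKETCH_LITTLE_ENDIAN
  /* In-memory layout already matches: copy the low n bytes directly. */
  memcpy(&ret, A, n);
#else
  /* Portable path: assemble the value one byte at a time. */
  for (size_t i = 0; i < n; i++) ret |= (uint64_t)A[i] << (8 * i);
#endif
  return ret;
}

/* Store the low n (1..8) bytes of A in little-endian order. */
static inline void store_le_sketch(unsigned char *T, uint64_t A, size_t n) {
#if SKETCH_LITTLE_ENDIAN
  memcpy(T, &A, n);
#else
  for (size_t i = 0; i < n; i++) T[i] = (unsigned char)(A >> (8 * i));
#endif
}

static inline uint32_t uint3korr_sketch(const unsigned char *A) {
  return (uint32_t)load_le_sketch(A, 3);
}

static inline int32_t sint3korr_sketch(const unsigned char *A) {
  uint32_t v = (uint32_t)load_le_sketch(A, 3);
  /* Sign-extend from 24 bits. The final cast assumes the usual
     two's-complement conversion. */
  return (int32_t)((v ^ 0x800000u) - 0x800000u);
}

static inline uint64_t uint5korr_sketch(const unsigned char *A) {
  return load_le_sketch(A, 5);
}

static inline uint64_t uint6korr_sketch(const unsigned char *A) {
  return load_le_sketch(A, 6);
}

static inline void int3store_sketch(unsigned char *T, uint32_t A) {
  store_le_sketch(T, A, 3);
}

static inline void int5store_sketch(unsigned char *T, uint64_t A) {
  store_le_sketch(T, A, 5);
}

static inline void int6store_sketch(unsigned char *T, uint64_t A) {
  store_le_sketch(T, A, 6);
}
```

Because the length passed to memcpy() is a compile-time constant after inlining, compilers can typically lower the copy to a couple of plain load or store instructions rather than a library call.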

Even though those functions are not used as frequently in the MySQL code
as their power-of-two versions that have been optimized with memcpy(),
they are still used in a few places in the binlog code, Performance
Schema, the client library and SQL-level code. Since they are inlined,
they contribute to code bloat in the mysqld binary.

This is a request to optimize those functions for little-endian
architectures by providing specialized implementations, as has already
been done for other functions in the same family.

How to repeat:
Code analysis + https://godbolt.org/z/shNgg8

Hello Alexey,

Thank you for the report and feedback.

regards,
Umesh

Very nice, Alexey. Even on x86_64 with gcc-4.1.2, this correctly inlines memcpy and generates less code than the original.

Save our CPU L1 instruction caches for other stuff. Keep up the good work.

I noticed that the CPU architecture for this issue has been changed from "Any" to "ARM". I'd like to emphasize that the optimization proposed here applies to any little-endian architecture, such as x86, amd64, arm64, ppc64le and likely others.