Bug #101369 Remapping .text and .data application segments to huge pages
Submitted: 29 Oct 2020 8:12 Modified: 18 Dec 2020 12:05
Reporter: Dmitriy Philimonov
Status: Verified
Category:MySQL Server: Compiling Severity:S5 (Performance)
Version:8.0 OS:Linux
Assigned to: CPU Architecture:Any
Tags: huge pages

[29 Oct 2020 8:12] Dmitriy Philimonov
Description:
Applications usually benefit from remapping the .text and .data ELF sections to huge pages. The speedup comes from a significant reduction in iTLB and dTLB misses. The approach isn't new; example implementations at the moment are:
  * libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs/blob/master/elflink.c ('remap_segments' function)
  * Google: https://chromium.googlesource.com/chromium/src/+/refs/heads/master/chromeos/hugepage_text/... ('RemapHugetlbText*' functions)
  * Facebook: https://github.com/facebook/hhvm/blob/master/hphp/runtime/base/program-functions.cpp ('HugifyText' function)

libhugetlbfs uses explicit huge pages, whereas Google and Facebook rely on transparent huge pages. We decided to follow the libhugetlbfs approach, since it depends less on the particular kernel allocation/defragmentation algorithm and therefore gives more consistent results.
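
To make the distinction concrete, here is a minimal sketch of the two kernel interfaces on Linux. It is only an illustration: it uses anonymous memory, whereas the patch maps a file (the diff in this report passes an fd to mmap). The function names and the kHugePageSize constant are hypothetical and are not taken from the patch.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>

// Assumption: 2 MB huge pages, the only size this report covers.
constexpr std::size_t kHugePageSize = 2UL * 1024 * 1024;

// Explicit huge pages: the kernel either backs the mapping from the
// preallocated huge page pool or the call fails with MAP_FAILED.
void *map_explicit_huge(std::size_t len) {
  assert(len % kHugePageSize == 0);
  return mmap(nullptr, len, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

// Transparent huge pages: an ordinary mapping plus a hint. Whether huge
// pages are really used is left to the kernel.
void *map_transparent_huge(std::size_t len) {
  assert(len % kHugePageSize == 0);
  void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return p;
  // The hint may be rejected (e.g. THP disabled); the mapping stays usable.
  (void)madvise(p, len, MADV_HUGEPAGE);
  return p;
}
```

The explicit variant fails visibly when no huge page is free, while the transparent variant always succeeds and silently falls back to small pages. That difference is the reason the explicit approach was described above as less dependent on kernel behaviour.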

We tried libhugetlbfs; however, it currently has four major drawbacks:
  1. A bug with position independent executables (linked with the '--pie' parameter): https://github.com/libhugetlbfs/libhugetlbfs/issues/49
  2. It may unmap the heap segment, which immediately follows the data segment on popular operating systems (e.g. Linux).
  3. It supports remapping of at most 3 ELF segments.
  4. It has no integration with the target application: it works silently during startup.

So a custom implementation is provided, adapted to the MySQL code base:
  1. No issues with position independent code / additional virtual memory randomization.
  2. Tested with the lld/gold/bfd linkers.
  3. Preserves the heap segment from unmapping; tested with the standard glibc and jemalloc allocators.
  4. Since it is now part of the mysqld code, any number of segments can be supported (currently 16).
  5. Integration with the MySQL code base: a configuration variable turns the functionality on, and the existing logging system is used for error/notification messages.
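
For readers unfamiliar with how a process can discover its own loadable segments, here is a minimal sketch using glibc's dl_iterate_phdr. It is an assumption that this API fits the purpose; the patch may obtain the segment list in a different way. The Segment struct, the function names and the kMaxSegments constant (mirroring the limit of 16 mentioned above) are hypothetical.

```cpp
#include <link.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one PT_LOAD segment of the running binary.
struct Segment {
  std::uintptr_t start;  // runtime address (load bias already applied)
  std::size_t size;      // size in memory
  bool executable;
  bool writable;
};

constexpr std::size_t kMaxSegments = 16;  // assumption, mirrors the limit above

static int collect_segments(struct dl_phdr_info *info, std::size_t, void *data) {
  assert(data != nullptr);
  auto *out = static_cast<std::vector<Segment> *>(data);
  for (int i = 0; i < info->dlpi_phnum; ++i) {
    const ElfW(Phdr) &ph = info->dlpi_phdr[i];
    if (ph.p_type != PT_LOAD) continue;
    if (out->size() == kMaxSegments) break;
    out->push_back({static_cast<std::uintptr_t>(info->dlpi_addr + ph.p_vaddr),
                    static_cast<std::size_t>(ph.p_memsz),
                    (ph.p_flags & PF_X) != 0, (ph.p_flags & PF_W) != 0});
  }
  // The first object visited is the main program; a non-zero return value
  // stops the iteration before any shared library is reported.
  return 1;
}

std::vector<Segment> main_program_segments() {
  std::vector<Segment> segments;
  dl_iterate_phdr(collect_segments, &segments);
  return segments;
}
```

Because dlpi_addr already contains the load bias, the reported addresses stay valid for position independent executables, which is the case item 1 refers to.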

Performance increase is up to 9% in sysbench OLTP_PS.

Restrictions:
  1. Currently works with 2 MB huge pages only.
  2. Needs to be linked in a specific way (additional alignment for ELF segments).
  3. Support is provided only for Linux systems (tested for kernels >= 3.10).
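
The alignment restriction follows from simple address arithmetic, sketched below. This is not the code from the patch: the helper names, the HugeRange struct and the choice to round the start down and the end up are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 2 MB huge pages, as stated in restriction 1.
constexpr std::uintptr_t kHugePageSize = 2UL * 1024 * 1024;
static_assert((kHugePageSize & (kHugePageSize - 1)) == 0,
              "page size must be a power of two");

// Round an address down to the nearest huge page boundary.
constexpr std::uintptr_t align_down(std::uintptr_t addr) {
  return addr & ~(kHugePageSize - 1);
}

// Round an address up to the nearest huge page boundary.
constexpr std::uintptr_t align_up(std::uintptr_t addr) {
  return align_down(addr + kHugePageSize - 1);
}

// Hypothetical result type: the huge page range covering a segment.
struct HugeRange {
  std::uintptr_t start;
  std::uintptr_t end;  // exclusive
  std::size_t pages;
};

inline HugeRange covering_range(std::uintptr_t seg_start, std::uintptr_t seg_end) {
  assert(seg_start <= seg_end);
  const std::uintptr_t start = align_down(seg_start);
  const std::uintptr_t end = align_up(seg_end);
  return {start, end, static_cast<std::size_t>((end - start) / kHugePageSize)};
}
```

If two neighbouring segments fell into the same 2 MB page, their covering ranges would overlap and remapping one would clobber the other. Aligning the segments at link time avoids this.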

For more information, refer to the documentation inside sql/huge.cc (included in the patch).

How to repeat:
Run the sysbench OLTP benchmark and compare the results before and after the patch is applied.

Suggested fix:
Remap the .text and .data ELF segments to huge pages by applying the patch.
[29 Oct 2020 13:12] MySQL Verification Team
Hi Mr. Philimonov,

Thank you for your performance improvement report.

We would also like to thank you for your exemplary contribution to our project. We are truly grateful for your efforts and for sharing your work.

Verified as reported.
[18 Dec 2020 12:05] Dmitriy Philimonov
We provide a bug fix for the current contribution. It changes the huge page mapping flag from MAP_SHARED to MAP_PRIVATE.

With this change, once the remapping to huge pages is done:
  * sporadic SIGSEGVs in child and parent processes no longer occur
    when fork() is called after the remapping;
  * gdb attach services breakpoints again;
  * gdb works correctly with the produced core dumps.
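
The difference between the two flags across fork() can be observed with a small experiment. The sketch below uses an ordinary 4 KiB anonymous page rather than a hugetlbfs file, so it only illustrates the sharing semantics and does not reproduce the crashes. The function name is hypothetical.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>

// Maps one page with the given sharing flag, writes 1, forks, lets the child
// overwrite the value with 42, and returns what the parent reads afterwards.
// Returns -1 if mmap or fork fails.
int parent_view_after_child_write(int sharing_flag) {
  assert(sharing_flag == MAP_PRIVATE || sharing_flag == MAP_SHARED);
  void *mem = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                   sharing_flag | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return -1;
  volatile int *value = static_cast<int *>(mem);
  *value = 1;

  const pid_t pid = fork();
  if (pid < 0) {
    munmap(mem, 4096);
    return -1;
  }
  if (pid == 0) {
    *value = 42;  // child modifies its view of the page
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  const int seen = *value;
  munmap(mem, 4096);
  return seen;
}
```

With MAP_PRIVATE the parent still reads 1, because the child's write goes to its own copy-on-write page. With MAP_SHARED the parent reads 42, because both processes keep using the same page. For writable program data the second behaviour means parent and child modify each other's state.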

Change:

--- a/sql/huge.cc
+++ b/sql/huge.cc
@@ -675,12 +675,27 @@ class Huge_page_remapper {
       Final mmap with MAP_FIXED to the existing virtual address,
       any already existing and overlapping virtual memory regions
       are going to be unmapped by kernel silently.
-      MAP_SHARED is used to prevent reserving additional huge pages
-      by kernel (MAP_PRIVATE reserve it due to copy-on-write strategy).
+
+      Using MAP_PRIVATE is the correct classic expected by kernel usage of
+      code/data mappings. The side effect is the additional reservation of
+      huge pages made by kernel. That's why number of used pages reported by the
+      application log is smaller than the number of pages reported by the kernel
+      by about 20% (depends on the exact Linux kernel version). We can use
+      MAP_PRIVATE | MAP_NORESERVE to prevent reservation, however, if kernel
+      can't find a free huge page (e.g. after fork()), the application might get
+      SIGSEGV.
+
+      Using MAP_SHARED prevents the additional huge pages reservation too,
+      however, you must know that sharing the code/data mapping results in:
+        * fork() doesn't copy these mappings, so both child and parent continue
+        to use the same mappings leading you to sporadic SIGSEGV crashes;
+        * gdb attach stops serving break points;
+        * produced core dumps are corrupted, gdb postmortem analysis is broken;
+      These effects are the only observed by us, but there could be more.
     */
     haddr = static_cast<char *>(
         mmap(reinterpret_cast<void *>(vaddr_start_aligned), hsize,
-             mem_protection, MAP_SHARED | MAP_FIXED, fd, 0));
+             mem_protection, MAP_PRIVATE | MAP_FIXED, fd, 0));
     if (haddr == MAP_FAILED) {
       PRINT_ERROR_ERRNO("Can't mmap '%s' -> [%#lx %#lx)", mp,
                         vaddr_start_aligned, vaddr_end_aligned);
[18 Dec 2020 12:07] Dmitriy Philimonov
Updated version. The patch was tested against commit "f8cdce86448a211511e8a039c62580ae16cb96f5".

Attachment: remapping_text_and_data_to_huge_pages.v8_0_21.git.diff.v1 (application/octet-stream, text), 40.47 KiB.

[18 Dec 2020 14:14] MySQL Verification Team
Thank you very much for your patch contribution, we appreciate it!

In order for us to continue the process of reviewing your contribution to MySQL, please send us a signed copy of the Oracle Contributor Agreement (OCA) as outlined in http://www.oracle.com/technetwork/community/oca-486395.html

Signing an OCA needs to be done only once and it's valid for all other Oracle governed Open Source projects as well.

Getting a signed/approved OCA on file will help us facilitate your contribution - this one, and others in the future.  

Please let me know, if you have any questions.

Thank you for your interest in MySQL.