Bug #94747 4GB Limit on large_pages shared memory set-up
Submitted: 22 Mar 2019 10:06
Modified: 27 Mar 2019 16:00
Reporter: Nikolai Ikhalainen
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S3 (Non-critical)
Version: 5.7.25
OS: Any
Assigned to:
CPU Architecture: Any

[22 Mar 2019 10:06] Nikolai Ikhalainen
Description:
Similar to https://bugs.mysql.com/bug.php?id=43606

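If large pages are enabled in mysqld, a shared memory segment for the buffer pool cannot reach 4GB. With innodb_buffer_pool_chunk_size=3G each segment is allocated correctly (3298820096 bytes), but with innodb_buffer_pool_chunk_size=4G each segment is only 102760448 bytes.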

How to repeat:
ulimit -l unlimited
echo 22000 > /proc/sys/vm/nr_hugepages
wget https://dev.mysql.com/get/Downloads/MySQL-5.7/mysql-5.7.25-linux-glibc2.12-x86_64.tar.gz
tar xzf mysql-5.7.25-linux-glibc2.12-x86_64.tar.gz
mv mysql-5.7.25-linux-glibc2.12-x86_64 m57
cd m57
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --initialize-insecure --skip-networking

Works fine:
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=3G --innodb_buffer_pool_size=36G --innodb_numa_interleave=1

ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 0          zabbix     600        1825056    6          dest         
0x00000000 3670017    root       600        8388608    1          dest         
0x00000000 3702786    root       600        3298820096 1          dest         
0x00000000 3735555    root       600        3298820096 1          dest         
0x00000000 3768324    root       600        3298820096 1          dest         
0x00000000 3801093    root       600        3298820096 1          dest         
0x00000000 3833862    root       600        3298820096 1          dest         
0x00000000 3866631    root       600        3298820096 1          dest         
0x00000000 3899400    root       600        3298820096 1          dest         
0x00000000 3932169    root       600        3298820096 1          dest         
0x00000000 3964938    root       600        3298820096 1          dest         
0x00000000 3997707    root       600        3298820096 1          dest         
0x00000000 4030476    root       600        3298820096 1          dest         
0x00000000 4063245    root       600        3298820096 1          dest

cat /proc/meminfo | grep -i huge
AnonHugePages:  14602240 kB
ShmemHugePages:        0 kB
HugePages_Total:   22000
HugePages_Free:    21552
HugePages_Rsvd:    18432
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        45056000 kB

Incorrect allocation:
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=4G --innodb_buffer_pool_size=40G --innodb_numa_interleave=1

ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 0          zabbix     600        1825056    6          dest         
0x00000000 4096001    root       600        8388608    1          dest         
0x00000000 4128770    root       600        102760448  1          dest         
0x00000000 4161539    root       600        102760448  1          dest         
0x00000000 4194308    root       600        102760448  1          dest         
0x00000000 4227077    root       600        102760448  1          dest         
0x00000000 4259846    root       600        102760448  1          dest         
0x00000000 4292615    root       600        102760448  1          dest         
0x00000000 4325384    root       600        102760448  1          dest         
0x00000000 4358153    root       600        102760448  1          dest         
0x00000000 4390922    root       600        102760448  1          dest         
0x00000000 4423691    root       600        102760448  1          dest
cat /proc/meminfo | grep -i huge
AnonHugePages:  13846528 kB
ShmemHugePages:        0 kB
HugePages_Total:   22000
HugePages_Free:    21977
HugePages_Rsvd:      471
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        45056000 kB

Error log:
2019-03-22T10:00:24.829785Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2019-03-22T10:00:24.829909Z 0 [Note] --secure-file-priv is set to NULL. Operations related to importing and exporting data are disabled
2019-03-22T10:00:24.829937Z 0 [Note] bin/mysqld (mysqld 5.7.25) starting as process 1660 ...
2019-03-22T10:00:24.831610Z 0 [Warning] Using pre 5.5 semantics to load error messages from /home/nickolay.ihalainen/m57/share/english/.
2019-03-22T10:00:24.831617Z 0 [Warning] If this is not intended, refer to the documentation for valid usage of --lc-messages-dir and --language parameters.
2019-03-22T10:00:24.834553Z 0 [Note] InnoDB: PUNCH HOLE support available
2019-03-22T10:00:24.834582Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2019-03-22T10:00:24.834589Z 0 [Note] InnoDB: Uses event mutexes
2019-03-22T10:00:24.834595Z 0 [Note] InnoDB: GCC builtin __sync_synchronize() is used for memory barrier
2019-03-22T10:00:24.834600Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2019-03-22T10:00:24.834606Z 0 [Note] InnoDB: Using Linux native AIO
2019-03-22T10:00:24.835360Z 0 [Note] InnoDB: Number of pools: 1
2019-03-22T10:00:24.835498Z 0 [Note] InnoDB: Using CPU crc32 instructions
2019-03-22T10:00:24.838270Z 0 [Note] InnoDB: Initializing buffer pool, total size = 40G, instances = 2, chunk size = 4G
2019-03-22T10:00:24.838305Z 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_INTERLEAVE
2019-03-22T10:00:24.908792Z 0 [Note] InnoDB: Setting NUMA memory policy to MPOL_DEFAULT
2019-03-22T10:00:24.908861Z 0 [Note] InnoDB: Completed initialization of buffer pool
2019-03-22T10:00:25.002287Z 0 [Note] InnoDB: page_cleaner worker priority: -20
2019-03-22T10:00:25.005050Z 0 [Note] InnoDB: page_cleaner coordinator priority: -20
2019-03-22T10:00:25.014526Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2019-03-22T10:00:25.037173Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2019-03-22T10:00:25.037363Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2019-03-22T10:00:25.052100Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2019-03-22T10:00:25.053657Z 0 [Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
2019-03-22T10:00:25.053675Z 0 [Note] InnoDB: 32 non-redo rollback segment(s) are active.
2019-03-22T10:00:25.054769Z 0 [Note] InnoDB: 5.7.25 started; log sequence number 2524213
2019-03-22T10:00:25.055011Z 0 [Note] InnoDB: Loading buffer pool(s) from /home/nickolay.ihalainen/m57/data/ib_buffer_pool
2019-03-22T10:00:25.055179Z 0 [Note] Plugin 'FEDERATED' is disabled.
2019-03-22T10:00:25.056394Z 0 [Note] InnoDB: Buffer pool(s) load completed at 190322  6:00:25
2019-03-22T10:00:25.057553Z 0 [Warning] Failed to set up SSL because of the following SSL library error: SSL context is not usable without certificate and private key
2019-03-22T10:00:25.073135Z 0 [Note] Event Scheduler: Loaded 0 events
2019-03-22T10:00:25.074048Z 0 [Note] bin/mysqld: ready for connections.
[22 Mar 2019 14:13] MySQL Verification Team
Hi,

Thank you for your report.

Your report is very sparse on details. Just enabling large pages in MySQL is not enough.

Please read the following page and confirm that you have done EVERYTHING that is recommended there:

https://dev.mysql.com/doc/refman/8.0/en/large-page-support.html
[22 Mar 2019 15:59] Nikolai Ikhalainen
Hi Sinisa,

The problem could be reproduced on different hosts.
CentOS 7.6 with kernel 4.17.4-1.el7.elrepo.x86_64
Limits for shm are even bigger:
cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall
18446744073692774399
18446744073692774399

large_pages allocation works fine with 
--innodb_buffer_pool_chunk_size=1G, --innodb_buffer_pool_chunk_size=2G and --innodb_buffer_pool_chunk_size=3G (correctly listed in ipcs -m and /proc/meminfo)

8.0.15 shows the same issue:
# bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=2G --innodb_buffer_pool_size=8G
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 0          zabbix     600        1825056    6          dest         
0x00000000 5013505    root       600        2199912448 1          dest         
0x00000000 5046274    root       600        2199912448 1          dest         
0x00000000 5079043    root       600        2199912448 1          dest         
0x00000000 5111812    root       600        2199912448 1          dest 

# bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=4G --innodb_buffer_pool_size=8G
ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 0          zabbix     600        1825056    6          dest         
0x00000000 5144577    root       600        102760448  1          dest         
0x00000000 5177346    root       600        102760448  1          dest

The problem is also easy to reproduce on Ubuntu 18.04 4.15.0-46-generic (20GB RAM):

as root:
sync; echo 3 > /proc/sys/vm/drop_caches
echo 5120 > /proc/sys/vm/nr_hugepages # 10GB
ulimit -l unlimited
wget https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.15-linux-glibc2.12-x86_64.tar.xz
tar xaf mysql-8.0.15-linux-glibc2.12-x86_64.tar.xz
mv mysql-8.0.15-linux-glibc2.12-x86_64 m80
cd m80
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --initialize-insecure --skip-networking
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=2G --innodb_buffer_pool_size=8G
# innodb_buffer_pool_chunk_size=2G allocates memory correctly:
ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 42598402   root       600        2199912448 1          dest         
0x00000000 42631172   root       600        2199912448 1          dest         
0x00000000 42663943   root       600        2199912448 1          dest         
0x00000000 42696712   root       600        2199912448 1          dest 

# The issue happens with 4G:
bin/mysqld --no-defaults --user=root --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=2 --large-pages --innodb_buffer_pool_chunk_size=4G --innodb_buffer_pool_size=8G
ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 42729474   root       600        102760448  1          dest         
0x00000000 42762244   root       600        102760448  1          dest

Instead of two segments of slightly more than 4GB each, mysqld creates two 98MB segments.

shm limits are the same as on CentOS:
cat /proc/sys/kernel/shmmax /proc/sys/kernel/shmall
18446744073692774399
18446744073692774399
[26 Mar 2019 13:50] MySQL Verification Team
Hi,

First of all, you did not answer the question in my previous comment.

Second, this seems to be strictly a Linux OS internal issue with pages and not related to MySQL at all!

Please, prove otherwise.
[27 Mar 2019 3:35] Nikolai Ikhalainen
Hi Sinisa,

I've followed https://dev.mysql.com/doc/refman/8.0/en/large-page-support.html . Large page support works as long as innodb_buffer_pool_chunk_size does not produce shared memory segments larger than 4GB.

I'm also able to create shared memory segments larger than 4GB on Linux:
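// gcc test_large_pages.c -o test_large_pages
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
	int shmid;
	void *ptr;

	/* Other sizes tried: 1073741824UL (1GB) and 8589934592UL (8GB). */
	/* 8692695040 = 8GB + 98MB, well above the 4GB boundary. */
	shmid = shmget(IPC_PRIVATE, 8692695040UL, SHM_HUGETLB | SHM_R | SHM_W);
	if (shmid < 0) {
		perror("shmget");
		return EXIT_FAILURE;
	}

	ptr = shmat(shmid, NULL, 0);
	if (ptr == (void *) -1) {
		perror("shmat");
		shmctl(shmid, IPC_RMID, NULL);
		return EXIT_FAILURE;
	}

	/* Mark the segment for removal; it stays attached and shows as "dest" in ipcs -m. */
	shmctl(shmid, IPC_RMID, NULL);

	/* Keep the process alive long enough to inspect the segment with ipcs -m. */
	sleep(30);

	return 0;
}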

$ ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 6029313    nickolay.i 600        8692695040 1          dest

If I check mysql with strace:
strace -o large_pages_2G.strace.log bin/mysqld --no-defaults --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=1 --large-pages --innodb_buffer_pool_chunk_size=2G --innodb_buffer_pool_size=8G
strace -o large_pages.strace.log bin/mysqld --no-defaults --datadir=$PWD/data --lc-messages-dir=$PWD/share/english --socket=$PWD/data/mysqld.sock --skip-networking --innodb_buffer_pool_instances=1 --large-pages --innodb_buffer_pool_chunk_size=4G --innodb_buffer_pool_size=8G
strace -o large_pages_test.strace.log ../test_large_pages

$ grep shmget large_pages*log
large_pages_2G.strace.log:shmget(IPC_PRIVATE, 8388608, SHM_HUGETLB|0600) = 5865473
large_pages_2G.strace.log:shmget(IPC_PRIVATE, 2199912448, SHM_HUGETLB|0600) = 5898242
large_pages_2G.strace.log:shmget(IPC_PRIVATE, 2199912448, SHM_HUGETLB|0600) = 5931011
large_pages_2G.strace.log:shmget(IPC_PRIVATE, 2199912448, SHM_HUGETLB|0600) = 5963780
large_pages_2G.strace.log:shmget(IPC_PRIVATE, 2199912448, SHM_HUGETLB|0600) = 5996549
large_pages.strace.log:shmget(IPC_PRIVATE, 8388608, SHM_HUGETLB|0600) = 5767169
large_pages.strace.log:shmget(IPC_PRIVATE, 102760448, SHM_HUGETLB|0600) = 5799938
large_pages.strace.log:shmget(IPC_PRIVATE, 102760448, SHM_HUGETLB|0600) = 5832707
large_pages_test.strace.log:shmget(IPC_PRIVATE, 8692695040, SHM_HUGETLB|0600) = 6062081

As you can see, mysqld itself asks shmget for only 102760448 bytes. At the same time, Linux 4.17.4-1.el7.elrepo.x86_64 accepts a shm size of 8GB+98MB from my standalone program.

If I set a breakpoint at the shmget call:
https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/os/os0proc.cc#L91

(gdb) p *n
$4 = 4397727744
(gdb) p size
$5 = 102760448

size is calculated as:
size = ut_2pow_round(*n + (os_large_page_size - 1),
			     os_large_page_size);

Round is defined as:
#define ut_2pow_round(n, m) ((n) & ~((m) - 1))

(gdb) p (( *n + (os_large_page_size - 1)  ) & ~(( os_large_page_size  ) - 1))
$6 = 102760448

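The result becomes correct once the mask ~((m) - 1) is computed as a 64-bit value (here by using 1UL). As a 32-bit value the mask is zero-extended before the AND, which clears every bit of n above bit 31: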
(gdb) p (( *n + (os_large_page_size - 1)  ) & ~(( os_large_page_size  ) - 1UL))
$7 = 4397727744
[27 Mar 2019 13:34] MySQL Verification Team
Hi Nikolai,

Thank you for your insight.

That actually means that our documentation is not complete and that we must add that the chunk size should be smaller than 4 GB.

Verified as a documentation bug.
[27 Mar 2019 16:00] Nikolai Ikhalainen
Hi Sinisa,

The same macro is used in another place inside InnoDB, in the buf_chunk_init function:
https://github.com/mysql/mysql-server/blob/5.7/storage/innobase/buf/buf0buf.cc#L1498

mem_size = ut_2pow_round(mem_size, UNIV_PAGE_SIZE);

It works correctly there, because UNIV_PAGE_SIZE is intentionally defined in storage/innobase/include/univ.i as a 64-bit unsigned integer:
#define UNIV_PAGE_SIZE		((ulint) srv_page_size)

The 4GB limit on large-page segments in mysqld does not look intentional, especially because https://bugs.mysql.com/bug.php?id=43606 reported exactly the same issue (large memory segments bigger than 4GB could not be created), and that report was treated as a defect and fixed.

Systems with several TB of RAM are no longer unusual, and an artificial limit at 4GB (in practice only 3GB chunks can be used, because the buffer pool allocation is slightly bigger than the chunk size) forces MySQL users to create many chunks with large pages (e.g. about 1300 chunks for a 4TB buffer pool). At the same time, the manual suggests keeping the number of chunks below 1000 (see https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html ).
[28 Mar 2019 13:20] MySQL Verification Team
Thank you.

Your request will be taken into consideration when this bug is processed internally.

I have also left a comment in the internal bug so that your feature request can be considered as well.
[4 Apr 2019 19:55] Mark Callaghan
I don't understand how this is a doc bug. AFAIK the code does the wrong thing and changing it to do the right thing doesn't seem like a major effort.
[5 Apr 2019 12:46] MySQL Verification Team
Hi Mark,

I agree. I am changing the category of this bug.
[8 Apr 2019 2:19] Daniel Black
This was broken in https://github.com/mysql/mysql-server/commit/e5d9961b637f871b34d7741b9f3db336c59ddec4

when the type of os_large_page_size changed from ulint to uint.

Changing the type back also fixes it (this is 8.0.14):
diff --git a/storage/innobase/include/os0proc.h b/storage/innobase/include/os0proc.h
index 13633bb12d3..8a12f58e68f 100644
--- a/storage/innobase/include/os0proc.h
+++ b/storage/innobase/include/os0proc.h
@@ -52,7 +52,7 @@ extern ulint os_total_large_mem_allocated;
 extern bool os_use_large_pages;
 
 /** Large page size. This may be a boot-time option on some platforms */
-extern uint os_large_page_size;
+extern ulint os_large_page_size;
 
 /** Converts the current process id to a number.
 @return process id as a number */
diff --git a/storage/innobase/os/os0proc.cc b/storage/innobase/os/os0proc.cc
index ed466028590..9e1a3d2aea1 100644
--- a/storage/innobase/os/os0proc.cc
+++ b/storage/innobase/os/os0proc.cc
@@ -58,7 +58,7 @@ ulint os_total_large_mem_allocated = 0;
 bool os_use_large_pages;
 
 /** Large page size. This may be a boot-time option on some platforms */
-uint os_large_page_size;
+ulint os_large_page_size;
 
 /** Converts the current process id to a number.
 @return process id as a number */

$ runtime_output_directory/mysqld --version
/home/dan/repos/build-mysql-8.0/runtime_output_directory/mysqld  Ver 8.0.15 for Linux on x86_64 (Source distribution)

 gdb --args ./runtime_output_directory/mysqld --no-defaults  --datadir=/tmp/mysqldata  --innodb_buffer_pool_instances=1 --large-pages --innodb_buffer_pool_chunk_size=4G --innodb_buffer_pool_size=8G
(gdb) break os_mem_alloc_large(unsigned long*)
Breakpoint 1 at 0x1e54a20: file /home/dan/repos/mysql-server/storage/innobase/os/os0proc.cc, line 83.
(gdb) run
(gdb) p *n
$1 = 4397727744
(gdb) p os_large_page_size 
$2 = 2097152
(gdb) n
89	  size = ut_2pow_round(*n + (os_large_page_size - 1), os_large_page_size);
(gdb) 
91	  shmid = shmget(IPC_PRIVATE, (size_t)size, SHM_HUGETLB | SHM_R | SHM_W);
(gdb) p size
$3 = 4397727744
[8 Apr 2019 12:59] MySQL Verification Team
Hi Daniel,

Thank you very much for your contribution.

I have added your comment to our internal bug database.
[18 Jul 2019 13:03] MySQL Verification Team
The following bug is a duplicate of this one:

 https://bugs.mysql.com/bug.php?id=96197