Bug #72811 Set NUMA mempolicy for optimum mysqld performance
Submitted: 30 May 2014 7:49 Modified: 13 Jul 2015 12:08
Reporter: Stewart Smith (OCA)
Status: Closed
Category: MySQL Server: Compiling  Severity: S5 (Performance)
Version: 5.6, 5.7  OS: Linux
Assigned to:  CPU Architecture: Any
Tags: numa, PowerPC

[30 May 2014 7:49] Stewart Smith
Description:
Historically, the suggested way to run MySQL on NUMA machines has been to run with "numactl --interleave=all" so that things like the InnoDB buffer pool are spread across all NUMA nodes.

However, for thread-local allocations it makes little sense to interleave across NUMA nodes, as there is only one thread per connection; it would make more sense to allocate this memory node local. Remote NUMA node memory access can be 25-100% slower (or more) than node-local access.

I have a proof-of-concept patch (no CMake foo to detect that the correct headers are present, no error logging, no configuration option) that does this. If needed/wanted I can polish up this patch with the appropriate CMake foo, but I wanted to show it here first; I'd be interested to hear performance characteristics on non-POWER8 systems.

This has been written and tested on POWER8, but applies to any NUMA system.
It appears to be fairly solid and I've gotten my best benchmark results ever with this patch applied.

How to repeat:
Code analysis; benchmarking on NUMA systems.

Suggested fix:
Set the memory allocation policy to interleaved for all initial server setup, then set it back to the default policy (node local) once the server has started.

Patch attached:

--- mysql-5.7.4-m14.orig/sql/mysqld.cc
+++ mysql-5.7.4-m14/sql/mysqld.cc
@@ -21,6 +21,7 @@
 #include <list>
 #include <set>
 #include <string>
+#include <numaif.h>
 
 #include "sql_priv.h"
 #include "unireg.h"
@@ -4246,6 +4247,8 @@ int mysqld_main(int argc, char **argv)
   */
   my_progname= argv[0];
 
+  set_mempolicy(MPOL_INTERLEAVE, NULL, 0);
+
 #ifndef _WIN32
 #ifdef WITH_PERFSCHEMA_STORAGE_ENGINE
   pre_initialize_performance_schema();
@@ -4687,6 +4690,7 @@ int mysqld_main(int argc, char **argv)
   Service.SetRunning();
 #endif
 
+  set_mempolicy(MPOL_DEFAULT, NULL, 0);
 
   /* Signal threads waiting for server to be started */
   mysql_mutex_lock(&LOCK_server_started);
[30 May 2014 7:51] Stewart Smith
set memory policy interleaved for global structures, node local for runtime allocations

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: mempolicy.patch (text/x-patch), 789 bytes.

[30 May 2014 8:20] Umesh Shastry
Hello Stewart,

Thank you for the report and contribution.

Thanks,
Umesh
[2 Jun 2014 1:44] Stewart Smith
Here's a little microbenchmark program that measures the actual latency of memory operations from one NUMA node to another versus node-local access:

wget http://ozlabs.org/~anton/junkcode/latency2001.c
gcc -O2 -o latency2001 latency2001.c

sudo -s
echo 2048 > /proc/sys/vm/nr_hugepages
echo 1000000000000 > /proc/sys/kernel/shmmax 

# Node local allocation
for i in `seq 0 4 127`
do
./latency2001 -a $i -c $i -l 128M
done

# Allocate memory on CPU 0
for i in `seq 0 4 127`
do
./latency2001 -a 0 -c $i -l 128M
done

Output looks like the following (these are made-up numbers, just to ensure I don't divulge anything I'm not meant to):

134217728    500.00 cycles     500.00 ns
134217728    600.00 cycles     600.00 ns
134217728    800.00 cycles     800.00 ns
134217728    1000.00 cycles    1000.00 ns

(i.e. it's easy to see that as you access memory further away it gets more expensive)

You'll want to vary the parameters to seq for the hardware you're running on. On Intel, for example, the step of 4 will probably be 2 (rather than 4 for POWER7 or 8 for POWER8); this is the number of hardware threads per CPU core, and there's no use in running the benchmark on each SMT thread.
[4 Jun 2014 1:13] Stewart Smith
Explicitly set NUMA memory allocation policy. Now for 5.7 with CMake foo.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: numa-mempolicy-v2.patch (text/x-patch), 2.54 KiB.

[4 Jun 2014 1:16] Stewart Smith
My updated patch is for 5.7.4-m14 (although it should work on 5.6 too).

If it finds numactl.h and libnuma, it will build with support for setting the NUMA memory policy and will print information to this effect on mysqld startup.

If it doesn't find them, it will compile anyway and display an appropriate notice on mysqld startup.

Possible enhancements that people may want:
1) Find out how to do this on Windows - I have no idea and don't currently have a Windows build environment for mysqld.
2) A server variable?
[4 Jun 2014 1:49] Stewart Smith
On further reading: using the MAP_POPULATE flag to mmap() when allocating the buffer pool may be needed to help force the allocation of the pages at the right time.
[16 Sep 2014 11:32] Daniël van Eeden
This could be a duplicate of Bug #57241.
[17 Sep 2014 0:56] Stewart Smith
It possibly is. My patch is under the OCA though, so it can be merged as-is or with any amendments needed.
[4 Jun 2015 21:40] Stewart Smith
This is still something that should be resolved.

There are two approaches I see as possible:
1) Buffer pool instances map to NUMA nodes.
   This likely gives better latency for a number of workloads, as a whole BP page will be node local, and if a query has locality to a BP instance, that thread is likely to be scheduled on the node where the data is.

2) A patch like this: interleave everything except connection-specific memory.
   This could be an interim solution while (1) is sorted out.

On larger NUMA machines, the links between nodes can be asymmetrical too, with lower bandwidth over some links, so simply interleaving everything (especially connection-specific memory) is unlikely to be the best approach.
[13 Jul 2015 12:08] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.6.27, 5.7.9, 5.8.0 releases, and here's the changelog entry:

The new innodb_numa_interleave read-only configuration option allows you
to enable the NUMA interleave memory policy for allocation of the InnoDB
buffer pool. When innodb_numa_interleave is enabled, the NUMA memory
policy is set to MPOL_INTERLEAVE for the mysqld process. After the InnoDB
buffer pool is allocated, the NUMA memory policy is set back to
MPOL_DEFAULT. This option is only available on NUMA-enabled systems.
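For reference, enabling the new option is a one-line my.cnf change; a sketch, where the variable name comes from the changelog above and the rest is illustrative:

```ini
[mysqld]
# Interleave the InnoDB buffer pool allocation across NUMA nodes.
# Read-only option: requires a server restart, and is only present
# on NUMA-enabled builds.
innodb_numa_interleave = ON
```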

Thanks to Stewart Smith for the patch.
[3 Jan 2016 6:18] Tim Smith
I installed the latest version of MySQL, 5.6.28 (x64), on Windows.

 mysql> show global variables like '%version%';
 +-------------------------+------------------------------+
 | Variable_name           | Value                        |
 +-------------------------+------------------------------+
 | innodb_version          | 5.6.28                       |
 | protocol_version        | 10                           |
 | slave_type_conversions  |                              |
 | version                 | 5.6.28-log                   |
 | version_comment         | MySQL Community Server (GPL) |
 | version_compile_machine | x86_64                       |
 | version_compile_os      | Win64                        |
 +-------------------------+------------------------------+
 7 rows in set (0.00 sec)

 mysql> show global variables like '%numa%'; 
 Empty set (0.00 sec) 

 The manual (http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_numa_interleav...) says: 

 "For the innodb_numa_interleave option to be available, MySQL must be compiled on a NUMA-enabled system." 

 Does this mean:
 1. MySQL 5.6.28 (x64) on Windows is not compiled on a NUMA-enabled system by Oracle?
 2. MySQL 5.6.28 (x64) on Windows cannot get optimal performance on a server with more than one CPU socket?
[5 Jan 2016 17:22] Joe Grasse
I went to try this out and ran into Bug #78953.