Bug #97777 | separate global variables (from hot variables) using linker script (ELF) | ||
---|---|---|---|
Submitted: | 26 Nov 2019 0:44 | Modified: | 29 Nov 2019 1:55 |
Reporter: | Daniel Black | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Compiling | Severity: | S5 (Performance) |
Version: | 5.7.28 | OS: | Linux (or any ELF platform) |
Assigned to: | | CPU Architecture: | Any |
[26 Nov 2019 0:44]
Daniel Black
[26 Nov 2019 0:45]
Daniel Black
define linker section for global variables
Attachment: 0001-define-linker-section-for-global-variables-POC.patch (text/x-patch), 2.94 KiB.
[26 Nov 2019 13:56]
MySQL Verification Team
Hello Mr. Black,

Thank you for your effort aimed at improving the performance of our server. I must admit that I like the ideas that you are proposing VERY much. I cannot, however, verify this report immediately, since we need some additional information from you.

First of all, your proof of concept is very rudimentary. Can you send us one that contains all three groups of global variables and shows how you would group them? To start with, it would be enough to provide the title of each group and a list of the variables that would go into each one. We need that in order to have it implemented faster.

Second, what kind of CPUs are we discussing here? Each CPU has its own way of organising caches. I used to work on some IBM mainframes, so if you had those in mind, besides ARM64, you could cite them. Have you considered the specific organisation for each distinct type of CPU?

Third, I understood from your opening comment that not every global variable would end up in one of the groups that are meant for better usage of the CPU cache. Can you help us by indicating how many should go into each group? It would be nice to list them, but if that is too much, then giving the number and/or size would help us.

Thanks in advance.
[26 Nov 2019 13:58]
MySQL Verification Team
Hi, As a further proof of concept, have you tried to measure the speed benefits of those changes?
[26 Nov 2019 23:48]
Daniel Black
Sinisa, thanks for your interest.

I was testing on a POWER9, two sockets, 20 CPUs per socket, 4 threads per CPU.

    $ numactl --hardware
    available: 2 nodes (0,8)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
    node 0 size: 257742 MB
    node 0 free: 489 MB
    node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
    node 8 size: 130797 MB
    node 8 free: 84377 MB
    node distances:
    node   0   8
      0:  10  40
      8:  40  10

perf report -g --no-children for a REPEATABLE-READ run:

    + 7.70% mysqld mysqld [.] btr_cur_search_to_nth_level
    + 4.75% mysqld mysqld [.] buf_page_get_gen
    + 4.03% mysqld mysqld [.] rec_get_offsets_func
    + 3.90% mysqld mysqld [.] MVCC::view_open
    + 3.39% mysqld mysqld [.] PolicyMutex<TTASEventMutex<GenericPolicy> >::enter
    + 2.56% mysqld mysqld [.] MYSQLparse
    + 2.28% mysqld mysqld [.] mtr_t::release_block_at_savepoint
    + 2.07% mysqld mysqld [.] pfs_rw_lock_s_lock_func
    + 2.03% mysqld mysqld [.] cmp_dtuple_rec_with_match_low
    + 1.92% mysqld mysqld [.] row_search_mvcc
    + 1.91% mysqld mysqld [.] page_cur_search_with_match
    + 1.04% mysqld mysqld [.] pfs_rw_lock_s_unlock_func
    + 0.99% mysqld mysqld [.] pfs_rw_lock_s_lock_func
    + 0.93% mysqld [kernel.kallsyms] [k] _raw_spin_lock
    + 0.90% mysqld libc-2.26.so [.] __memcmp_power8

The generated asm for btr_cur_search_to_nth_level is:

     0.01 │      ld     r8,3464(r31)
          │    cursor->low_match = low_match;
     0.05 │      std    r10,96(r25)
          │    cursor->up_bytes = up_bytes;
     0.00 │      ld     r10,3456(r31)
          │    if (UNIV_LIKELY(btr_search_enabled) && !index->disable_ahi) {
    24.08 │      lbz    r9,0(r9)
          │    cursor->low_bytes = low_bytes;
     0.01 │      std    r7,104(r25)
          │    cursor->up_match = up_match;
     0.00 │      std    r8,80(r25)
          │    cursor->up_bytes = up_bytes;
     0.01 │      std    r10,88(r25)
          │    if (UNIV_LIKELY(btr_search_enabled) && !index->disable_ahi) {
     0.00 │      cmpwi  cr7,r9,0
     0.00 │      ld     r9,48(r29)
     0.01 │    ↓ beq    cr7,10eb88a8 <btr_cur_search_to_nth_level(dict_index_t*, 2348

There is high contention on the load of btr_search_enabled, which is odd: as the C++ variable behind the SQL global variable adaptive_hash_index it isn't changed, and being in such a hot path it should be in the L1 cache on all of the cores.

Note: I probably added UNIV_LIKELY in a previous misguided attempt to solve this.

Looking at the address it was given:

    $ readelf -a bin/mysqld | grep btr_search_enabled
      8522: 0000000011aa1b40 1 OBJECT GLOBAL DEFAULT 24 btr_search_enabled
     17719: 0000000011aa1b40 1 OBJECT GLOBAL DEFAULT 24 btr_search_enabled

Now looking at other variables within the same 256 byte area:

    $ readelf -a bin/mysqld | grep 0000000011aa1b
      1312: 0000000011aa1be0 296 OBJECT GLOBAL DEFAULT 24 fts_default_stopword
      8522: 0000000011aa1b40 1 OBJECT GLOBAL DEFAULT 24 btr_search_enabled
      9580: 0000000011aa1b98 16 OBJECT GLOBAL DEFAULT 24 fil_addr_null
     11434: 0000000011aa1b60 8 OBJECT GLOBAL DEFAULT 24 zip_failure_threshold_pct
     12665: 0000000011aa1b70 40 OBJECT GLOBAL DEFAULT 24 dot_ext
     13042: 0000000011aa1b30 8 OBJECT GLOBAL DEFAULT 24 ut_rnd_ulint_counter
     13810: 0000000011aa1b48 8 OBJECT GLOBAL DEFAULT 24 srv_checksum_algorithm
     18831: 0000000011aa1bb0 48 OBJECT GLOBAL DEFAULT 24 fts_common_tables
     27713: 0000000011aa1b38 8 OBJECT GLOBAL DEFAULT 24 btr_ahi_parts
     33183: 0000000011aa1b50 8 OBJECT GLOBAL DEFAULT 24 zip_pad_max
      2386: 0000000011aa1b58 7 OBJECT LOCAL DEFAULT 24 _ZL9dict_ibfk
      5961: 0000000011aa1b68 8 OBJECT LOCAL DEFAULT 24 _ZL8eval_rnd
     10509: 0000000011aa1be0 296 OBJECT GLOBAL DEFAULT 24 fts_default_stopword
     17719: 0000000011aa1b40 1 OBJECT GLOBAL DEFAULT 24 btr_search_enabled
     18777: 0000000011aa1b98 16 OBJECT GLOBAL DEFAULT 24 fil_addr_null
     20631: 0000000011aa1b60 8 OBJECT GLOBAL DEFAULT 24 zip_failure_threshold_pct
     21862: 0000000011aa1b70 40 OBJECT GLOBAL DEFAULT 24 dot_ext
     22239: 0000000011aa1b30 8 OBJECT GLOBAL DEFAULT 24 ut_rnd_ulint_counter
     23007: 0000000011aa1b48 8 OBJECT GLOBAL DEFAULT 24 srv_checksum_algorithm
     28028: 0000000011aa1bb0 48 OBJECT GLOBAL DEFAULT 24 fts_common_tables
     36910: 0000000011aa1b38 8 OBJECT GLOBAL DEFAULT 24 btr_ahi_parts
     42380: 0000000011aa1b50 8 OBJECT GLOBAL DEFAULT 24 zip_pad_max

ut_rnd_ulint_counter is actually stored 16 bytes away. On POWER the cache line size is 128 bytes; however, even on x86 with 64-byte cache lines there would have been contention, which is why I've avoided a CPU-specific option.

MVCC::view_open and PolicyMutex<TTASEventMutex<GenericPolicy> >::enter are fourth and fifth in the profile above.

MVCC::view_open has code:

          │    _ZN14TTASEventMutexI13GenericPolicyE17spin_and_try_lockEjjPKcj():
          │    os_rmb;
          │      lwsync
          │    uint32_t n_waits = 0;
     0.02 │      std    r9,160(r31)
     0.00 │      addis  r9,r2,-98
     0.00 │      addi   r9,r9,22944
     0.00 │      std    r9,144(r31)
     0.01 │988: ↓ bne    cr4,10d8a560 <MVCC::view_open(ReadView*&, a10
          │    ↓ b      10d8a5e0 <MVCC::view_open(ReadView*&, trx_t*)+0xa90>
          │    ut_rnd_gen_ulint():
          │    ut_rnd_ulint_counter = UT_RND1 * ut_rnd_ulint_counter + UT_RND2;
     0.06 │990:   addis  r7,r2,3
     0.02 │      addi   r7,r7,-14704
    77.86 │      ld     r8,0(r7)
     0.01 │      mulld  r8,r28,r8
     0.02 │      addis  r8,r8,1828

PolicyMutex<TTASEventMutex<GenericPolicy> >::enter has code:

          │    _ZN14TTASEventMutexI13GenericPolicyE11set_waitersEv():
          │    m_waiters = 1;
     0.00 │      li     r24,1
     0.01 │170: ↓ bne    cr4,10cbc180 <PolicyMutex<TTASEventMutex<GenericPolicy> >::enter(unsigned int, 200
          │    ↓ b      10cbc230 <PolicyMutex<TTASEventMutex<GenericPolicy> >::enter(unsigned int, unsigned int, 2b0
          │      nop
          │      ori    r2,r2,0
          │    ut_rnd_gen_ulint():
          │    ut_rnd_ulint_counter = UT_RND1 * ut_rnd_ulint_counter + UT_RND2;
     0.07 │180:   addis  r8,r2,3
     0.02 │      addi   r8,r8,-14704
    69.35 │      ld     r10,0(r8)
     0.01 │      mulld  r10,r28,r10
     0.01 │      addis  r10,r10,1828
     0.01 │      addi   r10,r10,-14435

Both show the high contention in the `ld` (load) of ut_rnd_ulint_counter (which has been nicely fixed in 8.0 by making this a thread local). The load is contended because a few instructions later there is a store:

          │    ut_rnd_ulint_counter = UT_RND1 * ut_rnd_ulint_counter + UT_RND2;
     0.05 │      std    r10,0(r8)

The store is not high in the CPU profile, as we've already obtained the cache line, but it requires exclusive access to ut_rnd_ulint_counter (and to btr_search_enabled and everything else that shares its 128-byte cache line, i.e. every address that is equal once the low 7 bits, 0x7F, are masked off).

Note: the MVCC::view_open contention on trx_sys->mutex is worthy of a separate bug (that is coming soon, after x86 tests).
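The cache line sharing described in this comment can be checked with a little address arithmetic. What follows is only a sketch and not part of any attached patch: the helper names `cache_line_base` and `same_cache_line` are made up for illustration, the addresses are copied from the readelf output in this comment, and the 128-byte and 64-byte line sizes are the ones assumed in the text.

```cpp
#include <cstdint>

// Start address of the cache line that contains `addr`.
// `line_size` must be a power of two (128 assumed for POWER9, 64 for x86).
constexpr std::uint64_t cache_line_base(std::uint64_t addr,
                                        std::uint64_t line_size) {
    return addr & ~(line_size - 1);
}

// True when both addresses fall into the same cache line.
constexpr bool same_cache_line(std::uint64_t a, std::uint64_t b,
                               std::uint64_t line_size) {
    return cache_line_base(a, line_size) == cache_line_base(b, line_size);
}

// Symbol addresses as reported by readelf for bin/mysqld.
constexpr std::uint64_t kBtrSearchEnabled  = 0x11aa1b40;  // read on a hot path
constexpr std::uint64_t kUtRndUlintCounter = 0x11aa1b30;  // written frequently
constexpr std::uint64_t kBtrAhiParts       = 0x11aa1b38;  // read-mostly

// 128-byte lines: the flag and the counter live in the same line.
static_assert(same_cache_line(kBtrSearchEnabled, kUtRndUlintCounter, 128),
              "flag and counter share a 128-byte line");

// 64-byte lines: this particular pair is split across two lines...
static_assert(!same_cache_line(kBtrSearchEnabled, kUtRndUlintCounter, 64),
              "flag and counter are in different 64-byte lines");

// ...but btr_ahi_parts still sits in the counter's line.
static_assert(same_cache_line(kBtrAhiParts, kUtRndUlintCounter, 64),
              "btr_ahi_parts and counter share a 64-byte line");
```

Under the 64-byte assumption btr_search_enabled and ut_rnd_ulint_counter land in different lines, while btr_ahi_parts still shares one with the counter, so some read-mostly variable is affected at either line size.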
[26 Nov 2019 23:49]
Daniel Black
Applying this patch (seems cmake-2.8.12.2 on RHEL7 doesn't know about TARGET_LINK_OPTIONS - fairly sure there's a more portable way to write it):

    $ readelf -a sql/mysqld | grep btr_search_enabled
      1631: 000000001167e6e8 1 OBJECT GLOBAL DEFAULT 16 btr_search_enabled
     18046: 000000001167e6e8 1 OBJECT GLOBAL DEFAULT 16 btr_search_enabled

    $ readelf -a sql/mysqld | grep 000000001167e6
      [16] .data.read_mostly PROGBITS 000000001167e6e0 0167e6e0
      1631: 000000001167e6e8 1 OBJECT GLOBAL DEFAULT 16 btr_search_enabled
     15962: 000000001167e6e0 8 OBJECT GLOBAL DEFAULT 16 btr_ahi_parts
        16: 000000001167e6e0 0 SECTION LOCAL DEFAULT 16
     18046: 000000001167e6e8 1 OBJECT GLOBAL DEFAULT 16 btr_search_enabled
     37177: 000000001167e6e0 8 OBJECT GLOBAL DEFAULT 16 btr_ahi_parts

    $ readelf -S sql/mysqld | more
    There are 40 section headers, starting at offset 0x1271c370:

    Section Headers:
    [Nr] Name              Type           Address          Offset   Size             EntSize          Flags Link Info Align
    [ 0]                   NULL           0000000000000000 00000000 0000000000000000 0000000000000000          0    0     0
    [ 1] .interp           PROGBITS       00000000100001c8 000001c8 000000000000001c 0000000000000000 A        0    0     1
    [ 2] .note.ABI-tag     NOTE           00000000100001e4 000001e4 0000000000000020 0000000000000000 A        0    0     4
    [ 3] .note.gnu.build-i NOTE           0000000010000204 00000204 0000000000000024 0000000000000000 A        0    0     4
    [ 4] .hash             HASH           0000000010000228 00000228 0000000000041bac 0000000000000004 A        6    0     8
    [ 5] .gnu.hash         GNU_HASH       0000000010041dd8 00041dd8 00000000000492a8 0000000000000000 A        6    0     8
    [ 6] .dynsym           DYNSYM         000000001008b080 0008b080 00000000000ca590 0000000000000018 A        7    1     8
    [ 7] .dynstr           STRTAB         0000000010155610 00155610 00000000001f46d8 0000000000000000 A        0    0     1
    [ 8] .gnu.version      VERSYM         0000000010349ce8 00349ce8 0000000000010dcc 0000000000000002 A        6    0     2
    [ 9] .gnu.version_r    VERNEED        000000001035aab8 0035aab8 0000000000000220 0000000000000000 A        7   12     8
    [10] .rela.dyn         RELA           000000001035acd8 0035acd8 00000000000141f0 0000000000000018 A        6    0     8
    [11] .rela.plt         RELA           000000001036eec8 0036eec8 00000000000031e0 0000000000000018 AI       6   25     8
    [12] .init             PROGBITS       00000000103720c0 003720c0 000000000000005c 0000000000000000 AX       0    0    32
    [13] .text             PROGBITS       0000000010372120 00372120 0000000000e3f0d0 0000000000000000 AX       0    0    32
    [14] .fini             PROGBITS       00000000111b11f0 011b11f0 0000000000000024 0000000000000000 AX       0    0     4
    [15] .rodata           PROGBITS       00000000111b1220 011b1220 00000000004cd4c0 0000000000000000 A        0    0    16
    [16] .data.read_mostly PROGBITS       000000001167e6e0 0167e6e0 0000000000000020 0000000000000000 WA       0    0     8
    [17] .eh_frame_hdr     PROGBITS       000000001167e700 0167e700 0000000000033b3c 0000000000000000 A        0    0     4
    [18] .eh_frame         PROGBITS       00000000116b223c 016b223c 00000000001e6db0 0000000000000000 A        0    0     4
    [19] .gcc_except_table PROGBITS       0000000011898ff0 01898ff0 00000000000410c5 0000000000000000 A        0    0     8
    [20] .init_array       INIT_ARRAY     00000000118ed438 018ed438 0000000000000e60 0000000000000008 WA       0    0     8
    [21] .fini_array       FINI_ARRAY     00000000118ee298 018ee298 0000000000000008 0000000000000008 WA       0    0     8
    [22] .data.rel.ro      PROGBITS       00000000118ee2a0 018ee2a0 00000000000cf388 0000000000000000 WA       0    0    16
    [23] .dynamic          DYNAMIC        00000000119bd628 019bd628 00000000000002d0 0000000000000010 WA       7    0     8
    [24] .got              PROGBITS       00000000119bd900 019bd900 0000000000002610 0000000000000008 WA       0    0   256
    [25] .plt              NOBITS         00000000119c0000 019bff10 00000000000010b0 0000000000000008 WA       0    0     8
    [26] .data             PROGBITS       00000000119c10b0 019c10b0 00000000000a1428 0000000000000000 WA       0    0    16
    [27] .bss              NOBITS         0000000011a62500 01a624d8 00000000000ca1a8 0000000000000000 WA       0    0    64
    [28] .comment          PROGBITS       0000000000000000 01a624d8 0000000000000047 0000000000000001 MS       0    0     1
    [29] .debug_aranges    PROGBITS       0000000000000000 01a62520 000000000004b9e0 0000000000000000          0    0    16
    [30] .debug_info       PROGBITS       0000000000000000 01aadf00 00000000087e42f3 0000000000000000          0    0     1
    [31] .debug_abbrev     PROGBITS       0000000000000000 0a2921f3 0000000000281115 0000000000000000          0    0     1
    [32] .debug_line       PROGBITS       0000000000000000 0a513308 0000000001210da4 0000000000000000          0    0     1
    [33] .debug_str        PROGBITS       0000000000000000 0b7240ac 00000000014cdf48 0000000000000001 MS       0    0     1
    [34] .debug_loc        PROGBITS       0000000000000000 0cbf1ff4 00000000044c33c5 0000000000000000          0    0     1
    [35] .debug_ranges     PROGBITS       0000000000000000 110b53c0 00000000012e3cc0 0000000000000000          0    0    16
    [36] .gnu.attributes   LOOS+0xffffff5 0000000000000000 12399080 0000000000000010 0000000000000000          0    0     1
    [37] .symtab           SYMTAB         0000000000000000 12399090 00000000001027b0 0000000000000018         38 9581     8
    [38] .strtab           STRTAB         0000000000000000 1249b840 0000000000280998 0000000000000000          0    0     1
    [39] .shstrtab         STRTAB         0000000000000000 1271c1d8 0000000000000194 0000000000000000          0    0     1
    Key to Flags:
      W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
      L (link order), O (extra OS processing required), G (group), T (TLS),
      C (compressed), x (unknown), o (OS specific), E (exclude),
      p (processor specific)

So we end up with a .data.read_mostly section after .rodata.
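The patch is attached rather than quoted, so the following is only a sketch of the general technique it appears to rely on, assuming GCC or Clang on an ELF target. `DEMO_READ_MOSTLY` is a hypothetical stand-in for the patch's MY_GLOBAL macro, and the `demo_*` names are invented for illustration. As far as I know, GNU ld's default script folds `.data.*` input sections into `.data`, which is presumably why the bug title mentions a linker script: an explicit output section is needed for `.data.read_mostly` to show up on its own as in the listing above.

```cpp
#include <cassert>

// Hypothetical stand-in for MY_GLOBAL: place a variable in a dedicated
// input section. On non-ELF or non-GNU toolchains it expands to nothing.
#if defined(__ELF__) && (defined(__GNUC__) || defined(__clang__))
#define DEMO_READ_MOSTLY __attribute__((section(".data.read_mostly")))
#else
#define DEMO_READ_MOSTLY
#endif

// Read-mostly globals: set at startup or on a configuration change,
// then only read on hot paths. The attribute follows the declarator,
// matching the "name MY_GLOBAL = value" shape used later in this thread.
bool demo_search_enabled DEMO_READ_MOSTLY = true;
unsigned long demo_ahi_parts DEMO_READ_MOSTLY = 8;

// Frequently written global: deliberately left in the default .data
// section so its stores cannot invalidate the read-mostly variables.
unsigned long demo_rnd_counter = 65654363;

// Rare writer, e.g. a configuration change.
void demo_set_ahi_parts(unsigned long n) {
    assert(n > 0);
    demo_ahi_parts = n;
}

// Hot-path reader: touches only read-mostly data.
bool demo_hot_path_check() {
    return demo_search_enabled && demo_ahi_parts > 0;
}
```

Compiling this with `g++ -c` and running `readelf -S` on the object file should list a `.data.read_mostly` input section; whether it survives as a separate output section in the final binary depends on the linker script in use.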
[27 Nov 2019 0:18]
Daniel Black
On variables that are both session and global, no special attention is needed, as they are all in the `typedef struct system_variables` and therefore in adjacent memory locations. There's a small chance of contention with a hot variable placed before or after the global_system_variables, but for the most part it looks safe.

    $ readelf -a sql/mysqld | grep global_system_variables
     17128: 0000000011a658c8 816 OBJECT GLOBAL DEFAULT 27 global_system_variables
     31599: 0000000011a658c8 816 OBJECT GLOBAL DEFAULT 27 global_system_variables

    $ readelf -a sql/mysqld | grep 0000000011a658
     17128: 0000000011a658c8 816 OBJECT GLOBAL DEFAULT 27 global_system_variables
     31599: 0000000011a658c8 816 OBJECT GLOBAL DEFAULT 27 global_system_variables

The second patch attached (extends the first) puts global variables into the same section:

    $ readelf -a sql/mysqld | grep 000000001167e
      [16] .data.read_mostly PROGBITS 000000001167e6e0 0167e6e0
      [17] .eh_frame_hdr PROGBITS 000000001167ee00 0167ee00
      GNU_EH_FRAME 0x000000000167ee00 0x000000001167ee00 0x000000001167ee00
      1631: 000000001167e6e8 1 OBJECT GLOBAL DEFAULT 16 btr_search_enabled
     12556: 000000001167e6f0 816 OBJECT GLOBAL DEFAULT 16 max_system_variables
     15962: 000000001167e6e0 8 OBJECT GLOBAL DEFAULT 16 btr_ahi_parts
     17128: 000000001167ea20 816 OBJECT GLOBAL DEFAULT 16 global_system_variables
        16: 000000001167e6e0 0 SECTION LOCAL DEFAULT 16
        17: 000000001167ee00 0 SECTION LOCAL DEFAULT 17

So from the initial patch, there just needs to be a proliferation of MY_GLOBAL to the other global variables and they will get into the .data.read_mostly section. Other variables will end up in the .data section, as is the default linker behaviour.
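The "small chance of contention" at either end of global_system_variables can be made concrete from the addresses in this comment. This is a sketch under the 128-byte cache line assumption used earlier in the thread; the helper names `bytes_shared_before` and `bytes_shared_after` are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kLine = 128;  // assumed POWER9 cache line size

// Bytes in the object's first cache line that precede the object,
// i.e. room for unrelated variables in the same line.
constexpr std::uint64_t bytes_shared_before(std::uint64_t addr) {
    return addr % kLine;
}

// Bytes in the object's last cache line that follow the object.
constexpr std::uint64_t bytes_shared_after(std::uint64_t addr,
                                           std::uint64_t size) {
    return (kLine - (addr + size) % kLine) % kLine;
}

// global_system_variables before the second patch (816 bytes, in .bss).
constexpr std::uint64_t kGsvAddr = 0x11a658c8;
constexpr std::uint64_t kGsvSize = 816;

// 72 bytes ahead of the struct and 8 bytes behind it share its lines.
static_assert(bytes_shared_before(kGsvAddr) == 72, "leading overlap");
static_assert(bytes_shared_after(kGsvAddr, kGsvSize) == 8, "trailing overlap");

// After the second patch max_system_variables ends exactly where
// global_system_variables begins, so the two structs are back to back.
constexpr std::uint64_t kMaxSvAddr   = 0x1167e6f0;
constexpr std::uint64_t kGsvAddrNew  = 0x1167ea20;
static_assert(kMaxSvAddr + kGsvSize == kGsvAddrNew, "structs are adjacent");
```

So only the 72 bytes in front of the struct and the 8 bytes behind it can hold a neighbouring hot variable, which fits the "for the most part it looks safe" assessment.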
[27 Nov 2019 0:19]
Daniel Black
extension to put global/session vars into the same section
Attachment: global_vars.patch (text/x-patch), 847 bytes.
[27 Nov 2019 6:20]
Daniel Black
Extracting global variables:

    git grep MYSQL_SYSVAR_ | fgrep -v .h: | cut -f 2 -d , | grep '[a-z]' | tee /tmp/l1.txt | grep : | cut -f 1 -d : | sort -u

Edit displayed file list to create l2.txt (attached).

Cleanup l1.txt:
* strip down to variable names
* remove xpl::

    sort /tmp/l[12].txt | sed -e 's/ //g' | uniq > /tmp/lall.txt

Replace statics with assigned values:

    for a in $(< /tmp/lall.txt) ; do echo looking for $a; file=$(git grep -l -E "^[[:space:]]*(static)?[[:space:]]+[a-z*]+[[:space:]]+\\**$a[^a-zA-Z0-9_][^M][^Y][^_][^G]" plugin/ storage/ sql | head -n 1); [ -n "${file}" ] && sed -i -e "/^ *static[ \t]*[a-z]*\\**[ \t]*\\**$a[^a-zA-Z0-9_]/s:\($a[ \t]*\)=:\1MY_GLOBAL =:" $file && echo replaced in $file; git diff $file | cat; done

    for a in $(< /tmp/lall.txt) ; do echo looking for $a; file=$(git grep -E "^[[:space:]]*(static)?[[:space:]]*[a-z_*]+[[:space:]]+\\**$a[[:space:]]*=?[^a-zA-Z_][[:space:]]*[^M]?[^Y]?[^_]?[^G]?" rapid/ plugin/ storage/ sql | egrep -v '\.(h|ic):' | fgrep -v MY_GLOBAL | cut -f 1 -d : | head -n 1); [ -n "${file}" ] && echo attempting change on $file pattern && git grep $a $file && sed -i -e "/^static[ \t]*[a-z][a-z_]*[ \t]*[^a-z_]$a[,;= \\t]/s:$a:$a MY_GLOBAL:" $file && echo replaced in $file; git diff $file | cat; done

Lots more hacking and manual editing. And check:

    for a in $(< /tmp/lall.txt) ; do echo; echo looking for $a; git grep "\\**$a[^a-zA-Z_]" rapid/ plugin/ storage/ sql | grep MY_GLOBAL ; done
[27 Nov 2019 6:20]
Daniel Black
global variable list
Attachment: lall.txt (text/plain), 6.58 KiB.
[27 Nov 2019 6:21]
Daniel Black
patch to make all global variables with MY_GLOBAL
Attachment: globals.patch (text/x-patch), 65.54 KiB.
[27 Nov 2019 13:15]
MySQL Verification Team
Hello Mr. Black, Thank you so much for your very valuable contribution. You leave me no option but to verify this very important set of ideas for our Development team. Verified as a very important contribution to enhance the performance of our server. Thanks again.
[28 Nov 2019 13:26]
MySQL Verification Team
Thank you Mr. Black.
[29 Nov 2019 1:55]
Daniel Black
-DMUTEXTYPE=event achieved only ~1320000 tpm.

5.7.28 compiled with -DMUTEXTYPE=sys:

    timestamp  tpm         avg_rt  max_rt  avg_db_rt  max_db_rt
    average    1719652.00  59.36   819     59.35

With patch and -DMUTEXTYPE=sys:

    timestamp  tpm         avg_rt  max_rt  avg_db_rt  max_db_rt
    average    1810618.80  61.51   838     61.50
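To put the two -DMUTEXTYPE=sys results side by side, the relative change can be worked out from the reported averages. This is a small sketch using only the numbers in this comment; the helper `percent_change` and the constant names are hypothetical.

```cpp
// Relative change from `before` to `after`, in percent.
constexpr double percent_change(double before, double after) {
    return (after - before) / before * 100.0;
}

// Averages reported above for 5.7.28 with -DMUTEXTYPE=sys.
constexpr double kTpmBaseline   = 1719652.00;
constexpr double kTpmPatched    = 1810618.80;
constexpr double kAvgRtBaseline = 59.36;
constexpr double kAvgRtPatched  = 61.51;

// Throughput rises by roughly 5.3% with the patch.
static_assert(percent_change(kTpmBaseline, kTpmPatched) > 5.28 &&
              percent_change(kTpmBaseline, kTpmPatched) < 5.30,
              "tpm gain is about 5.3%");

// The reported average response time is about 3.6% higher.
static_assert(percent_change(kAvgRtBaseline, kAvgRtPatched) > 3.6 &&
              percent_change(kAvgRtBaseline, kAvgRtPatched) < 3.7,
              "avg_rt change is about 3.6%");
```

These are single reported averages, so the percentages say nothing about run-to-run variance.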
[29 Nov 2019 13:51]
MySQL Verification Team
Thank you. I will copy your comment to our internal database.