Bug #79454 Inefficient InnoDB row stats implementation
Submitted: 30 Nov 2015 11:22 Modified: 1 Mar 2017 14:06
Reporter: Alexey Kopytov Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:5.6, 5.7 OS:Any
Assigned to: CPU Architecture:Any

[30 Nov 2015 11:22] Alexey Kopytov
Description:
InnoDB maintains statistics for the total number of
read/updated/deleted/inserted rows as a number of objects of the
ib_counter_t class. Those objects are not protected by mutexes and do
not use atomics, so they use multiple slots to minimize cache line
contention.

The slot choosing policy (aka indexer) can be defined per counter. For
row counters it is thread_id_indexer_t (i.e. slot_number =
current_thread_id % 64), but it can also be overridden per
increment/decrement operation, which is what all row counter users
actually do and choose the slot based on the current transaction id
(i.e. slot_number = trx->id % 64)

It turns out in some cases it's not the most efficient implementation
from the cache coherency traffic perspective. Suppose your workload does
a lot of concurrent cached index scans, i.e. increments
srv_stats.n_rows_read lots of times within a single transaction. If the
transaction rate is also high, the chance of multiple cores choosing the
same slot and thus thrashing each other's caches is also high.

Using the default thread id based policy is not much better. Once the
number of concurrent threads exceeds the number of slots, the chance of
multiple threads using the same slot concurrently also becomes high,
especially on systems with many cores.

There is the get_sched_indexer_t policy which uses sched_getcpu() to get
the current core number and then picks the slot based on that value. But
that policy is not used anywhere in the code.

I could reproduce the negative impact from the trx id based policy for
row counters in sysbench OLTP "simple ranges" and "sum ranges"
benchmarks on a NUMA machine with many cores.

I got a ~40% speedup in those benchmarks by changing the slot choosing
policy to get_sched_indexer_t and modifying callers to use the default
policy rather than override it with trx id.

All of this is for 5.6. I see that in 5.7 get_sched_indexer_t has been
removed completely. I think that was a mistake, and I'm going to report
it separately.

How to repeat:
Look at the code. Run sysbench OLTP in the "simple ranges" or "sum
ranges" mode on a multicore NUMA machine.
[1 Dec 2015 2:49] Alexey Kopytov
The problem with missing get_sched_indexer_t in 5.7 has been reported as bug #79455.
[1 Dec 2015 15:23] MySQL Verification Team
Although, there is no any proof submitted, this seems to be a valid feature request.

Verified.
[1 Dec 2015 16:29] Alexey Kopytov
Sinisa, my dear friend!!!11

I can submit all kinds of proof, but it will all basically say what I
mentioned in the bug report: a 40% performance improvement by replacing
the trx ID based indexes with the CPU cored based one!
[19 Oct 2016 20:51] Mark Callaghan
thread_id_indexer_t is broken for me on 5.6.26 and x86_64 with Linux 4.x. From what I see pthread_self() % 64 is always 0. I am OK if this doesn't get fixed in 5.6 as that makes MyRocks look better in comparison. I am still trying to catch up to multiple changes to this code in 5.7.

We are sharing this code with MyRocks in the 5.6 branch and by switching to sched_getcpu() I almost double QPS for sysbench read-only/read-write for concurrent long range-scans-- 40 clients and --oltp-range-size=10000.

I asked local Linux gurus about this and was told:

pthread_t is the pointer to struct pthread, which is likely cacheline aligned, so that would explain %64 always being zero. Maybe use a fancier hash function or gettid(2)? Please note that gettid() isn't defined by glibc and you need to do something like the following.

rdtsc is another option. https://lwn.net/Articles/567055/
[1 Mar 2017 14:06] Alexey Kopytov
After experimenting with various approaches we found PRNG-based indexes
to result in lowest contention and highest single-thread performance on
many-core AArch64 machines.

Using RDTSC- (aka counter-based) indexes result in higher contention on
AArch64 due to relatively low timer resolution (100 MHz). Using
sched_getcpu()-based indexes leads to lower single-threaded
performance, because sched_getcpu() is expensive on AArch64 (it is not
implemented via VDSO like it is on x86).

Using PRNG for counter indexes also requires a better quality PRNG and
properly seeding it for each thread. I'm going to submit a few PRs on
GitHub to address this issues in separate patches.
[2 Mar 2017 15:56] OCA Admin
Contribution submitted via Github - Bug #79454: Inefficient InnoDB row stats implementation 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/123#issuecomment-283647778): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_108543124.txt (text/plain), 2.74 KiB.

[16 Mar 2017 13:11] OCA Admin
Contribution submitted via Github - Bug #79454: Inefficient InnoDB row stats implementation 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/127#issuecomment-287046977): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_110950249.txt (text/plain), 1.23 KiB.

[16 Mar 2017 22:17] Sunny Bains
Changes look ok, we need to run some tests ourselves before we decide to push them.
[20 Mar 2017 12:57] OCA Admin
Contribution submitted via Github - Bug #79454: Inefficient InnoDB row stats implementation 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/129#issuecomment-287547507): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_111362477.txt (text/plain), 9.35 KiB.

[29 May 2018 16:07] OCA Admin
Contribution submitted via Github - Bug #79454: Inefficient InnoDB row stats implementation 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/211#issuecomment-392573405): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_190863405.txt (text/plain), 9.67 KiB.