MySQL Bugs: #79455: Restore get_sched_indexer

Bug #79455	Restore get_sched_indexer_t in 5.7
Submitted:	30 Nov 2015 11:35	Modified:	20 Nov 2016 8:25
Reporter:	Alexey Kopytov	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.7	OS:	Any
Assigned to:		CPU Architecture:	ARM
Tags:	Contribution

Description:
This is related to bug #79454.

The sched_getcpu() based policy for choosing counter slots has been
removed in 5.7 with the following commit:

    Bug#20709391 SCHED_GETCPU() IN DEFAULT GET_RND_INDEX() COULD NOT
    GENERATE RANDOM COUNTERS
    
    The sched_getcpu() in default get_rnd_index() doesn't generate a
    random number as we expected, so if some threads occupy the same
    CPU, they would get the same counter. After the push of WL#8316
    which makes default purge threads from 1 to 4, the 4 purge threads
    would try to request the same slot in the sync_wait_array due to
    the not random generator.
    
    We replace existing get_rnd_index() with the RDTSC one, which
    generates more random numbers, to reduce collision. With RDTSC,
    we also make the for-loop in sync_array_get_and_reserve_cell()
    work correctly, instead of trying the same slot over and over.
    And we abandon sched_getcpu() totally, since it just generates
    limited numbers.
    
    RB: 8362
    Reviewed-by: Marko <marko.makela@oracle.com>

The reasoning was that array indexes provided by get_sched_indexer_t
were not random enough for some purposes, in particular for sync wait
arrays.

Which sounds reasonable, but as explained in bug #79454 in some cases
(such as row counters) you actually need quite specific array indexes
rather than random ones.

It looks like 5.7 uses thread IDs to pick counter slots for row
statistics. As explained in bug #79454, using CPU core numbers looks
much more efficient for this specific purpose.

This is a request to restore get_sched_indexer_t in 5.7 and use it for
row statistic counters instead of thread IDs.

How to repeat:
Inspect the code.

Contribution submitted via Github - Bug #79455: Restore get_sched_indexer_t in 5.7 (*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/52#issuecomment-181393955): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_58601311.txt (text/plain), 5.17 KiB.

Hello Alexey,

Thank you for the report and contribution.

Thanks,
Umesh

We borrowed the 5.6 version of this code for row counters in MyRocks. In 5.6 it was using the address of the value returned by pthread_self() mod 64 which always equals 0. I doubled QPS for sysbench read-only with --oltp-range-size=10000 after fixing that code to use an array per CPU. I agree with Alexey that for row counters we want an array entry per CPU, not random CPUs. I hope this code settles down and I wonder how many sysbench results for InnoDB are hurt by it today.

http://smalldatum.blogspot.com/2016/10/make-myrocks-2x-less-slow.html

array entry per CPU, not array per CPU
see https://github.com/facebook/mysql-5.6/commit/d83fe82c8f42023cda6a97021361d90bae39805c

Alexey,

All this makes sense (restore the per-core counter where appropriate).

I just wonder:

In the commit message in the provided patch (http://bugs.mysql.com/file.php?id=23673&bug_id=79455) you mention:

"This in itself does not introduce any measurable performance
improvements, but is rather a groundwork for further changes."

but in Bug#79454 you say:

"I got a ~40% speedup in those benchmarks by changing the slot choosing
policy to get_sched_indexer_t and modifying callers to use the default
policy rather than override it with trx id."

What do I miss?

Vasil,

Yes, on my hardware (a double-socket ARM server with 96 cores) simply
moving to per-core counters was only a part of the solution. I also had
to increase the number of slots from 64 to 96.

However, in further experiments I discovered that sched_getcpu() is
expensive on ARM, unlike x86, where it is implemented via VDSO. So even
though the per-core indexer improved scalability in certain workloads,
it also decreased single-threaded performance for us.

We are currently experimenting with other approaches to implement
scalable "update frequently, read infrequently" counters. Using
hardware-specific implementations (like Large System Extension
instructions on ARM) looks promising.

Anyway, I still think restoring get_sched_indexer_t is a good idea for
x86 which is also proved by Mark's experiments. The indexer framework
just need to be generalized to allow hardware-specific indexers.

Having the number of slots calculated automatically on startup based on
the number of available CPUs, rather than being hard-coded to 64 would
also be nice.

Contribution submitted via Github - Bug #79455: Restore get_sched_indexer_t (*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/209#issuecomment-392573345): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_190787766.txt (text/plain), 5.00 KiB.