Bug #99412 Threads_running becomes scalability bottleneck on multi-node NUMA topologies
Submitted: 30 Apr 2020 13:24 Modified: 6 May 2020 6:29
Reporter: Sergey Glushchenko
Status: Verified
Category: MySQL Server: Compiling Severity: S5 (Performance)
Version: 8.0.19 OS: Any
Assigned to: CPU Architecture: Any

[30 Apr 2020 13:24] Sergey Glushchenko
Description:
The Threads_running counter is a hotspot in sysbench tests at high concurrency. The counter is modified twice for every SQL command in dispatch_command(): once before command execution begins, and once after it finishes.
Naturally, modifying a global variable at that rate can easily become a problem with many cores, complex NUMA topologies and short queries like those in sysbench Point Select.
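The access pattern described above can be sketched roughly as follows. This is an illustration only, not the server code: `dispatch_command_sketch` and `run_workload` are hypothetical names, and the counter is a plain `std::atomic` standing in for the server-wide variable.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the server-wide running-threads counter.
static std::atomic<int32_t> num_thread_running{0};

// Hypothetical stand-in for dispatch_command(): every command performs two
// sequentially consistent read-modify-write operations on the same location,
// so all cores compete for the same cache line.
template <typename Command>
void dispatch_command_sketch(Command &&command) {
  num_thread_running.fetch_add(1);  // before command execution begins
  command();
  num_thread_running.fetch_sub(1);  // after command execution finishes
}

// Runs `commands_per_thread` empty commands on each of `threads` threads and
// returns the counter value once all threads have finished.
int32_t run_workload(int threads, int commands_per_thread) {
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([commands_per_thread] {
      for (int i = 0; i < commands_per_thread; ++i) {
        dispatch_command_sketch([] {});
      }
    });
  }
  for (auto &worker : pool) worker.join();
  return num_thread_running.load();
}
```

The shorter the command body, the larger the share of time spent on the two atomic operations, which is why short queries such as Point Select expose the problem most clearly.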

The problem manifests itself as dispatch_command() being high in perf reports, for example:

     8.28%  mysqld           [kernel.kallsyms]          [k] __wake_up_common_lock
     4.89%  sysbench         [kernel.kallsyms]          [k] finish_task_switch
     4.75%  mysqld           [kernel.kallsyms]          [k] finish_task_switch
     3.09%  mysqld           mysqld                     [.] dispatch_command
     1.93%  sysbench         [kernel.kallsyms]          [k] prepare_to_wait
     1.85%  mysqld           [kernel.kallsyms]          [k] __sys_recvfrom

with perf annotate showing increments/decrements as a bottleneck:

         :              /**
         :                Increments thread running statistic variable.
         :              */
         :              void inc_thread_running()
         :              {
         :                my_atomic_add32(&num_thread_running, 1);
    0.00 :   c37b24:       mov     x10, #0x2060                    // #8288
    0.00 :   c37b28:       add     x24, x21, x10
         :            my_atomic_add32():
         :              return __atomic_fetch_add(a, v, __ATOMIC_SEQ_CST);
   10.53 :   c37b2c:       ldaxr   w0, [x24]
    0.00 :   c37b30:       add     w0, w0, #0x1
   22.99 :   c37b34:       stlxr   w1, w0, [x24]
    0.59 :   c37b38:       cbnz    w1, c37b2c <dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x20c>
         :            _Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command():

How to repeat:
Run in-memory sysbench oltp_ps and use perf to find bottlenecks.
[30 Apr 2020 13:32] Sergey Glushchenko
The attached patch removes the global atomic_num_thread_running variable. Instead, the number of running threads is counted when p_s.global_status is populated. This changes behavior: the session status value of threads_running is now always 1.
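The general idea of counting on read rather than on every command could look something like the sketch below. It is not taken from the attached patch: `SessionSketch`, `SessionRegistrySketch` and `run_command_sketch` are hypothetical names, and the registry with a mutex is an assumption made only to keep the example self-contained.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical per-session state. Each session only ever writes its own flag,
// so the command path no longer touches a location shared by all threads.
struct SessionSketch {
  std::atomic<bool> running{false};
};

// Hypothetical registry of sessions; the running count is computed on demand.
class SessionRegistrySketch {
 public:
  SessionSketch *add_session() {
    std::lock_guard<std::mutex> guard(mutex_);
    sessions_.push_back(std::make_unique<SessionSketch>());
    return sessions_.back().get();
  }

  // Called only when the status value is requested, not once per command.
  int count_running() {
    std::lock_guard<std::mutex> guard(mutex_);
    int running = 0;
    for (const auto &session : sessions_) {
      if (session->running.load(std::memory_order_relaxed)) ++running;
    }
    return running;
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<SessionSketch>> sessions_;
};

// Hypothetical command wrapper: marks the session as running for the
// duration of the command using plain stores to session-local state.
template <typename Command>
void run_command_sketch(SessionSketch &session, Command &&command) {
  session.running.store(true, std::memory_order_relaxed);
  command();
  session.running.store(false, std::memory_order_relaxed);
}
```

The trade-off in this sketch is that reading the value becomes more expensive (a walk over all sessions), while the per-command cost drops to two uncontended stores. That suits a counter that is updated on every query but read only occasionally.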
[4 May 2020 12:50] MySQL Verification Team
Hello Mr. Glushchenko,

Thank you for your performance improvement report.

I have analysed your patch and, in my opinion, it makes a lot of sense.

Verified as reported.

Thank you so much for your contribution.
[6 May 2020 6:29] Sergey Glushchenko
cleaner version of the patch

Attachment: bug99412.patch (application/octet-stream, text), 8.10 KiB.

[6 May 2020 6:29] Sergey Glushchenko
Thank you very much Sinisa!

I've attached a cleaner version of the patch against MySQL 8.0.20.
[6 May 2020 12:44] MySQL Verification Team
Thank you Mr. Glushchenko !!!!