Bug #116005 | With NUMA and Read Committed, performance worsens under high concurrency. | |
---|---|---|---|
Submitted: | 4 Sep 2024 23:43 | Modified: | 9 Feb 14:10 |
Reporter: | Bin Wang (OCA) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
Version: | all versions | OS: | Any |
Assigned to: | | CPU Architecture: | Any
[4 Sep 2024 23:43]
Bin Wang
[6 Sep 2024 7:34]
Jakub Lopuszanski
Hello Bin Wang! Great insights, in particular:

* that spending too long in the critical section risks the holder's time slice running out, after which every thread waiting for the latch has to hope the scheduler guesses that the only way to unblock them is to schedule the thread holding the latch again, which can be very hard to do by chance when there are thousands of spinning threads;
* that the chance of running out of time is greater if you have to access many memory locations on a distant NUMA node.

It so happens that I am working on the same topic/area at the moment and have a similar, yet different (in theory: faster) solution. Both approaches seem to take inspiration from Paweł Olchawa's idea of separating "sparse" and "dense" regions of active_trx_ids.

One problem I am facing with my patch (which takes these ideas further) is that on REPEATABLE READ (which does not create/copy read views as often as READ COMMITTED, so it does not benefit much from improvements here), on the UPDATE KEY and UPDATE NO KEY workloads (which, unlike OLTP RW, perform no SELECTs, so they need read views rarely or not at all, yet commit and therefore update the set of active trx ids often), on a machine with 2 sockets and many CPUs (which pays a higher price for any form of communication, be it writes, cache misses, or atomic operations), in a scenario with many clients... I see a slowdown. I am still investigating the exact culprit (it looks like the bottleneck shifts to some other place which handles congestion even worse?).

I see you've mostly tested on TPC-C. Have you tried BMK/sb_exec/sb11-OLTP_RW_10M_8tab-uniform-upd_idx1-notrx.sh 1024? In particular on a config like:

--user=root --log_error_verbosity=3 --back-log=0 --core-file --disable-log-bin --innodb-adaptive-hash-index=OFF --innodb-buffer-pool-instances=8 --innodb-flush-method=O_DIRECT --innodb-io-capacity=10000 --innodb-io-capacity-max=12000 --innodb-page-cleaners=8 --innodb-purge-threads=4 --innodb-read-io-threads=4 --innodb-change-buffering=none --innodb-numa-interleave=ON --innodb-undo-log-truncate=OFF --performance-schema=ON --max_connections=2000 --max_prepared_stmt_count=50000 --datadir=/nvm/jlopusza/data --innodb-redo-log-capacity=90G --innodb-write-io-threads=4 --innodb-log-group-home-dir=/ssd/jlopusza --innodb-undo-directory=/ssd/jlopusza --innodb-buffer-pool-size=128G --thread_cache_size=1200 --performance_schema=ON --innodb_monitor_enable=% --range_alloc_block_size=16384 --loose_temptable_use_mmap=OFF --loose_temptable_max_ram=4294967296 --tls-version= --require_secure_transport=OFF --tmpdir=/tmp/
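The latch-hold-time argument above can be made concrete with a minimal sketch. This is not InnoDB's actual code; all type and member names are invented for illustration. Building a read view means copying the set of active transaction IDs while holding the trx_sys latch, so the critical section grows with concurrency, whereas a "dense range plus sparse exceptions" layout in the spirit of the idea mentioned above shrinks the memory touched under the latch:

```cpp
// Minimal sketch only (assumed names, not InnoDB's real structures).
#include <cstdint>
#include <mutex>
#include <vector>

using trx_id_t = std::uint64_t;

// Naive snapshot: copy the whole list of active ids while holding the latch.
// With thousands of clients the copy is long; if the holder's time slice runs
// out (more likely when the ids live on a distant NUMA node), every spinning
// waiter stalls until the scheduler happens to run the holder again.
struct TrxSysNaive {
  std::mutex latch;
  std::vector<trx_id_t> active_ids;  // one entry per open transaction
  trx_id_t next_id = 1;              // one past the newest assigned id

  std::vector<trx_id_t> snapshot_active_ids() {
    std::lock_guard<std::mutex> g(latch);  // -- critical section --
    return active_ids;                     // O(#active) copy under the latch
  }
};

// "Dense + sparse" flavour: ids in [dense_min, next_id) are treated as active
// unless listed as an exception, so a snapshot copies only two bounds plus the
// (hopefully short) exception list.
struct TrxSysDense {
  std::mutex latch;
  trx_id_t dense_min = 1;                   // oldest id that may still be active
  trx_id_t next_id = 1;                     // one past the newest assigned id
  std::vector<trx_id_t> finished_in_range;  // sparse exceptions inside the range

  struct Snapshot {
    trx_id_t dense_min;
    trx_id_t next_id;
    std::vector<trx_id_t> finished_in_range;
    // id is invisible to this snapshot iff id >= next_id, or
    // (id >= dense_min and id is not in finished_in_range).
  };

  Snapshot take_snapshot() {
    std::lock_guard<std::mutex> g(latch);  // -- much shorter critical section --
    return {dense_min, next_id, finished_in_range};
  }
};
```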
[6 Sep 2024 8:16]
Bin Wang
I'll test it out when I get the chance. Unusual issues are valuable to us because they are interesting. Regarding the transaction system, our strategy is to limit the number of threads interacting with it, which is why we avoided more complex solutions. Our current fix is only about 200 lines of code, is easy to validate, and works well together with transaction throttling mechanisms. The approach was inspired by related research papers. For proof-of-concept purposes we prefer BenchmarkSQL TPC-C tests.
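As a rough illustration of the "limit the number of threads interacting with the transaction system" strategy described above: this is only a generic sketch, not the actual ~200-line patch, and the class name, the limit of 8, and the use of a C++20 counting semaphore are assumptions. An admission gate caps how many threads contend on the latch at once, so the holder is less likely to be preempted while a large crowd spins:

```cpp
// Generic throttling sketch (requires C++20 for <semaphore>); assumed names.
#include <cstdint>
#include <mutex>
#include <semaphore>
#include <vector>

using trx_id_t = std::uint64_t;

class ThrottledTrxSys {
 public:
  std::vector<trx_id_t> snapshot_active_ids() {
    gate_.acquire();  // sleep here instead of spinning on the latch
    std::vector<trx_id_t> copy;
    {
      std::lock_guard<std::mutex> g(latch_);
      copy = active_ids_;  // short, lightly contended critical section
    }
    gate_.release();
    return copy;
  }

 private:
  static constexpr int kMaxConcurrent = 8;  // assumed tuning knob
  std::counting_semaphore<kMaxConcurrent> gate_{kMaxConcurrent};
  std::mutex latch_;
  std::vector<trx_id_t> active_ids_;
};
```

Threads blocked on the semaphore sleep rather than spin, which leaves CPU time for the latch holder to finish its critical section before its time slice runs out.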
[7 Feb 7:36]
Rahul Sisondia
The link shared in the bug report is broken. Could you please share an updated link? Just curious.
[9 Feb 14:10]
Bin Wang
https://enhancedformysql.github.io/blogs/innodb_storage.html