Bug #116198 Addressing the compatibility issues of the CATS algorithm in NUMA scenarios.
Submitted: 23 Sep 2024 10:51 Modified: 5 Jul 6:35
Reporter: Bin Wang (OCA) Email Updates:
Status: No Feedback Impact on me:
None 
Category: MySQL Server: InnoDB storage engine    Severity: S5 (Performance)
Version: 8.0    OS: Any
Assigned to:    CPU Architecture: Any
Tags: performance interference

[23 Sep 2024 10:51] Bin Wang
Description:
When evaluating performance gains in MySQL lock scheduling algorithms, we discovered that the excessive log output in Deadlock_notifier::notify had skewed the evaluation of the CATS algorithm.

In NUMA scenarios, during highly conflicting BenchmarkSQL tests, the CATS algorithm performs worse than the FIFO algorithm once the log interference is removed.

For detailed explanations, refer to: 
https://advancedmysql.github.io/blogs/cats.html

How to repeat:
In scenarios with severe contention, both sysbench tests and the improved BenchmarkSQL tests reveal that the excessive logging output interferes with performance evaluation.

Suggested fix:
Remove the excessive logs from Deadlock_notifier::notify to prevent performance interference, and address the incompatibility issue of CATS in NUMA scenarios.
[23 Sep 2024 11:35] MySQL Verification Team
Hi Mr. Wang,

Thank you for your feature request.

We consider it a valid one and therefore, we are verifying it.

Thank you, very much.
[23 Sep 2024 16:02] Jakub Lopuszanski
Hello Bin Wang!

What was the `innodb_print_all_deadlocks` setting that you have used?

When it is "OFF" (which is the default), InnoDB writes the information to `lock_latest_err_file`, which is a temporary file that AFAIR is never fsync()ed and typically resides in tmpfs/a RAM disk/the OS cache, so I am not sure how it becomes a big problem. This file is meant to store the description of the deadlock which happens now, so that later you can see it in the "LATEST DETECTED DEADLOCK" section of `SHOW ENGINE INNODB STATUS`, as emitted in `lock_print_info_summary(..)`. We have to store it somewhere. Perhaps we could store it in a string in RAM, but in general we can't tell how many bytes will be needed to keep the full description of a whole deadlock cycle, so we prefer to use a tmp file for that.

With ON, in addition to the above, we also emit it to the error log. This might be costly in various ways. This is why it is configurable.

As for your article, I think there's some confusion of various WLs and versions.
In one place you say:

> This time, testing on the improved MySQL 8.0.27 revealed significant bottlenecks under severe conflicts. 
> The perf screenshot is shown in the following figure:

yet show the perf chart which is for 8.0.17 or earlier, but certainly not 8.0.27, because DeadlockChecker does not exist in 8.0.18 and later.
But, then when you present the source code, you show Deadlock_notifier::notify which was introduced in 8.0.18.

And the charts from "[3] Sunny Bains. 2017. Contention-Aware Transaction Scheduling Arriving in InnoDB to Boost Performance. https://dev.mysql.com/blog-archive/", are obviously much older than all of that, as the blog post is from 2017, yet first GA of 8.0.11 was in 2018, so these charts are for a much older, and frankly, very buggy, implementation of CATS.

What is definitely interesting in your article is the claim that CATS is slower than FIFO in 8.0.27 for TPC-C, once we remove reporting of deadlocks.
Here, it would be nice to know:
1. What is the exact patch you have applied to force FIFO behaviour?
2. What is the number of deadlocks you were observing for FIFO vs CATS?

Note that it was decided that lowering the number of deadlocks is (almost) always better than making TPS a little higher, because of the impact a deadlock has on a typical application.
We are well aware that there are workloads for which one can achieve higher TPS at the expense of introducing more deadlocks. One reason this "trick" works is that a higher number of deadlocks correlates with a higher number of waits on locks, which in turn means more threads are in a wait state off the CPU, which in turn means lower congestion on various parts of the system. Thus, perhaps counter-intuitively, sacrificing a lot of transactions by letting them wait or deadlock can help others complete faster - which is just a symptom of a completely real, but different, problem, namely that InnoDB has scalability issues (global trx_sys mutex, serialization_mutex, undo log space mutexes, etc.).
There were even more contrived situations I've analysed, in which deadlocks happening during a sysbench test were actually helping TPS: upon becoming a deadlock victim, the sysbench client retried the transaction but rand()omly selected different rows to update on retry, which with high probability meant it would not have to wait in the queue for the same row this time. This is "cheating", as a usual application would rather retry a transaction on exactly the same, congested, rows.
There are other reasons why deadlocks can "help", but we've decided they are all "eventually bad" for our users.
This is why I'd like to know answers to question 2.
[24 Sep 2024 9:21] MySQL Verification Team
Thank you, Jakub.
[24 Sep 2024 13:39] Bin Wang
A lot of good questions, asked with a keen eye that spotted the issues in the perf data.

I double-checked, and the perf graph captured performance data for version 5.7.39, which was a mistake in the screenshot. However, this does not affect the results.

The innodb_print_all_deadlocks configuration was set to ON, but we removed all the logging code, so there are no logs to affect the results.

As for the questions below, I will answer them one by one.

1. What is the exact patch you have applied to force FIFO behaviour?
Based on hints from the git log, the weight settings are all the same, which matches the behavior observed in MySQL 5.7.39.

2. What is the number of deadlocks you were observing for FIFO vs CATS?
CATS indeed has fewer deadlocks than FIFO, but the CATS algorithm requires sorting, which is not favorable in NUMA scenarios.

We also apply our scalability patch mainly to prevent interference; otherwise, even strong PGO optimizations can lead to a decline in peak throughput in NUMA scenarios.
[24 Sep 2024 13:47] Bin Wang
You can consider our tests as having innodb_print_all_deadlocks enabled, but the logs were removed. In this scenario, we found that it does not necessarily improve performance.

In the SysBench Pareto tests, there was a slight improvement, but in the modified TPC-C tests (where BenchmarkSQL supports one warehouse with 100 concurrent users), there was a decline. This has been confirmed repeatedly.
[24 Sep 2024 13:58] Bin Wang
To accurately assess MySQL performance, it's crucial to address scalability issues, as they can easily skew test results. The MySQL blog can be misleading; FIFO cannot cause a performance collapse. I've read the papers and still can't understand why such a significant performance improvement is claimed—this issue has puzzled me for four years.
[24 Sep 2024 14:21] Bin Wang
To clarify and avoid misunderstanding, the behavior of version 5.7.39 and the FIFO version of MySQL 8.0 is similar due to log interference, creating a false impression. After removing the logs, the results are quite different.
[27 Sep 2024 15:38] Jakub Lopuszanski
Hello Bin Wang!

Your bug report spans several topics, in particular:
1. Logging information about deadlock is slow.
2. FIFO achieves faster TPS than CATS for some workloads if logging is completely removed.
The title of this bug report, and "Suggested fix:" section, look like we should concentrate on 1. here.

To assess the severity of the problem, I would need you to demonstrate that the latest version of MySQL server, run with innodb_print_all_deadlocks=OFF, has lower TPS (and the same or higher deadlocks/s) than the same version of MySQL recompiled with logging removed.

Once this is demonstrated in a reproducible scenario, there's a separate issue of what to do with it.
As you know `SHOW ENGINE INNODB STATUS` has a section titled `LATEST DETECTED DEADLOCK` which is needed for backward compatibility and diagnosing problems in production.
In order to be able to populate it, InnoDB writes information about latest deadlock to a temporary file.
In theory, this should be a purely in-memory operation, provided that the std library's buffers and OS buffers are never really written out, and even if they are, the tmpfs is often also in memory.
If you think we can do better than this, in backward-compatible way, please provide a patch.
[28 Sep 2024 2:50] Bin Wang
We conducted a sysbench Pareto comparison test in a different environment, using MySQL version 8.0.39, with innodb_print_all_deadlocks set to OFF. The test results were similar to those obtained by removing the log, as shown below:

Concurrency    CATS      FIFO
1             16597     16523
2             30910     31486
4             56148     56073
8             87409     86971
16           110993    110298
32           102639    106889
64           104910    102393
128          107254     97804
256          104158     99321
512           95952     90774
[28 Sep 2024 3:01] Bin Wang
We have never been able to reproduce the results of the original CATS algorithm paper. We have also analyzed theoretically that the FIFO algorithm alone cannot cause a performance collapse unless other bottlenecks interfere. The effectiveness of the CATS algorithm, as promoted by the official blog and the paper, is not as substantial as claimed. This discussion has indeed strayed from the main topic.
[28 Sep 2024 3:10] Bin Wang
Any global sort operation is NUMA-unfriendly under high concurrency, requiring a redesign of this part of the algorithm to mitigate the performance decline of the CATS algorithm caused by NUMA compatibility issues.
[28 Sep 2024 3:14] Bin Wang
If the issues with CATS in NUMA scenarios are not addressed, we can stick with the FIFO algorithm, which is simple and doesn't require sorting - or sorts quickly, since all weights are the same and the queue is inherently ordered.
[30 Sep 2024 8:57] MySQL Verification Team
Thank you, Mr. Wang.
[3 Apr 13:56] Jakub Lopuszanski
Posted by developer:
 
In 8.0.20, WL#13468 "Improved CATS implementation" improved the performance of CATS so much that it always outperformed or equalled FIFO (except in some test cases in which FIFO caused more deadlocks and, due to the way sysbench test clients work, achieved higher TPS by some clients bypassing others many times, etc.).

You are right that earlier versions had a lot of bugs and performance issues - I know, as I was fixing them.

However, for a bug report to be actionable, you need to demonstrate a reproducible problem on a most recent version of code.
On 28 Sep 2024 07:05:28 you've shared some results for 8.0.39, which is recent enough as for 8.0.x, but I have trouble understanding what you think these results mean.
To me they show that CATS is faster than FIFO for high concurrency, which is good and its goal.
Also, it's not clear to me what patch you applied to 8.0.39 to achieve "FIFO" behaviour.
At any rate, there's no way we could remove FIFO from 8.0.x LTS release without risking stability issues.
We could perhaps, theoretically consider it for innovation release: 9.x

So, to move forward with this bug report, I think I need from you:
- a patch on top of MySQL 9.2, which replaces CATS with FIFO
- repro steps for a scenario where the patched version at the same time has higher Transactions Per Second and lower-or-equal Deadlocks Per Second than the un-patched one

I know this is a lot to ask, as reintroducing FIFO to 9.2 might be difficult.
One way to do that might be to remove the call to `lock_wait_compute_and_publish_weights_except_cycles()` from `lock_wait_update_schedule_and_check_for_deadlocks()`.
I have not tried this, but in theory this should stop updating `trx->schedule_weight`, so all transactions will have `trx->schedule_weight == 0`, in which case `lock_rec_grant_by_heap_no()` will simply push all waiters onto the `low_priority_light` vector in FIFO order, AFAIR.
The `low_priority_heavier` vector will stay empty, so the `std::stable_sort` will be a no-op, but you can remove it too to be sure.
At minimum please test what happens for 64, 128, 256, 512, 1024 clients in sysbench OLTP_RW pareto scenario.
For results to be reasonably useful, use at least 60 sec warmup, 300 sec duration, and at least 3 repeats for each condition and version, so we have any idea how repeatable it is.
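The requested matrix (5 thread counts x 3 repeats, 60 s warmup, 300 s duration) can be sketched as a dry-run command generator; option names such as `--warmup-time` assume sysbench >= 1.1 and connection flags are omitted, so adjust for your build and environment. The script only prints the commands, one per repeat:

```shell
# Dry-run generator for the sysbench OLTP_RW pareto test matrix:
# thread counts 64..1024, 3 repeats each, 60 s warmup + 300 s run.
gen_matrix() {
  for threads in 64 128 256 512 1024; do
    for repeat in 1 2 3; do
      echo "sysbench oltp_read_write --rand-type=pareto" \
           "--threads=$threads --warmup-time=60 --time=300 run" \
           "# repeat $repeat"
    done
  done
}
gen_matrix
```

Running the printed commands against both the patched and un-patched server, then comparing TPS and deadlocks/s per condition, would give the repeatability data requested above.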
[5 Jun 6:35] MySQL Verification Team
Hello Bin Wang ,

Could you please provide the requested details (in the previous note from Kuba) so we can investigate the issue further? Thank you.

regards,
Umesh
[6 Jul 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".