Bug #115729: Sometimes rw_lock_s_lock_func takes too much CPU by analysing perf reports

| Submitted: | 31 Jul 2024 1:43 | Modified: | 14 Aug 2024 5:03 |
|---|---|---|---|
| Reporter: | Cheng Zhou | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
| Version: | 8.0.25 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[31 Jul 2024 1:43]
Cheng Zhou
[31 Jul 2024 15:00]
Cheng Zhou
[1 Aug 2024 7:59]
Jakub Lopuszanski
Hello Cheng Zhou! Thanks for conducting the experiment and sharing the results.

I find these kinds of perf reports, which just list the time spent in each individual function, not very useful, because the same low-level function might be called in various ways from various places; what we really need is a call stack. Please obtain https://github.com/brendangregg/FlameGraph, then execute something like:

```
perf record -p $(pgrep mysqld) --call-graph=fp -o perf.${version}.data -- sleep 10
```

where `--call-graph=fp` is the crucial part, as it captures the call stack for each sample. The `-- sleep 10` part can be different depending on how long you want to sample. Then please do:

```
perf script -i perf.${version}.data | FlameGraph/stackcollapse-perf.pl > perf.${version}.perf-folded
```

You can then share the resulting perf.${version}.perf-folded file with us, or better, turn it into an SVG:

```
FlameGraph/flamegraph.pl < perf.${version}.perf-folded > ${version}.svg
```

If you have multiple versions of the code you want to compare with each other, you can also create flame diff graphs:

```
FlameGraph/difffolded.pl -n perf.{$a,$b}.perf-folded | FlameGraph/flamegraph.pl > $b-$a.diff.svg
FlameGraph/difffolded.pl -n perf.{$b,$a}.perf-folded | FlameGraph/flamegraph.pl > $a-$b.diff.svg
```
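One caveat, depending on how the mysqld binary was built: `--call-graph=fp` relies on frame pointers, and builds compiled without them (e.g. with `-fomit-frame-pointer`) produce truncated stacks. If the flame graph looks suspiciously shallow, perf's DWARF-based unwinding is a heavier but more robust alternative:

```
perf record -p $(pgrep mysqld) --call-graph=dwarf -o perf.${version}.data -- sleep 10
```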
[2 Aug 2024 13:32]
Jakub Lopuszanski
Hi Cheng Zhou, thank you for providing the flame graphs.

It looks like the s-latches are being acquired on buffer pool pages when starting the scan over the index (row_search_mvcc -> btr_cur_search_to_nth_level): to perform a scan we need to first descend the B-tree to the first leaf matching the query, and then continue horizontally, following links to siblings. Another possibility is that the main scan is over a secondary index, and btr_cur_search_to_nth_level is used to fetch each row's primary index record. From this I deduce that the problem is in s-latching the root page of the primary index.

IIUC in your scenario there are no "writes" (UPDATE, DELETE, INSERT), just "reads" (SELECT). Thus, there should be no thread trying to acquire an x-latch on pages of the index, and indeed the `perf annotate` output you've shown suggests the problem is the interaction between readers, as opposed to readers having to wait for writers.

Conceptually, each s-reader tries to decrement lock->lock_word by one to mark its presence (so that in case another thread wants to acquire the x-latch, it would see from lock->lock_word that there are some readers). Alas, we can't use a simple fetch_sub for that. Instead we use a CAS loop:

```
while (local_lock_word > threshold) {
  if (lock->lock_word.compare_exchange_strong(local_lock_word,
                                              local_lock_word - amount)) {
    return (true);
  }
}
```

The reason we have to do it this way is that if there *is* a writer trying to obtain the x-latch in parallel, our decrement must not interfere with the changes the x-latcher makes to lock->lock_word. The whole latching protocol is rather complicated, but one important rule is that we must not decrement the value if it is already below or equal to threshold=0 (this has to do with preventing starvation of the x-latcher). This leads to the phenomenon you've seen as:

```
10.76 │12:   test   %eax,%eax
      │    ↓ jle    27ed730 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x50>
 0.16 │      lea    -0x1(%rax),%edx
 0.11 │      lock   cmpxchg %edx,(%r12)
41.29 │    ↑ jne    27ed6f2 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x12>
```

which corresponds to the `while` loop above: several threads trying to obtain the same s-latch interfere with each other, because whenever one of them succeeds in the compare_exchange_strong, it makes the compare_exchange_strong fail for someone else. I am afraid this is "by design", and there is not much we can do without completely rewriting the rw_lock_t implementation, which would be a very risky change. (The standalone sketch at the end of this comment demonstrates the effect in isolation.)

You might be tempted to try something like:

```
while (lock->lock_word.fetch_sub(amount) <= threshold) {
  lock->lock_word.fetch_add(amount);
}
return (true);
```

but I warn you that this would almost certainly cause trouble in other parts of the implementation, which assume that lock->lock_word cannot change once it is <= 0 (a plain fetch_sub would transiently push the lock word below the threshold even while a writer holds the latch, breaking that invariant).

What I'd try instead is to see if the adaptive hash index (https://dev.mysql.com/doc/refman/8.4/en/innodb-adaptive-hash.html, controlled by the innodb_adaptive_hash_index system variable) can help you. It could eliminate the need for traversing from the root, instead letting InnoDB jump directly to the leaf.
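To see this reader-reader interference in isolation, here is a minimal standalone sketch (my own toy code, not the InnoDB implementation; the initial value mimics InnoDB's X_LOCK_DECR, and all other names are invented for illustration) in which several threads repeatedly take and release the same s-latch while counting failed CAS attempts:

```
// Toy reproduction of the rw_lock_s_lock_func contention pattern -- NOT the
// real InnoDB code. Several "readers" s-latch by CAS-decrementing a shared
// lock word, and we count how often the CAS fails because another reader
// modified the word first.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreshold = 0;  // never decrement at or below this
constexpr int kThreads = 8;
constexpr int kItersPerThread = 1'000'000;

std::atomic<int> lock_word{0x20000000};  // mimics InnoDB's X_LOCK_DECR
std::atomic<long> cas_failures{0};

// The CAS loop from the comment above: retry until the observed value is
// above the threshold and the compare-exchange succeeds.
bool s_latch() {
  int local = lock_word.load();
  while (local > kThreshold) {
    if (lock_word.compare_exchange_strong(local, local - 1)) {
      return true;  // registered as a reader
    }
    // On failure, compare_exchange_strong reloads `local` with the current
    // value; each failed attempt is one more contended cache-line access.
    cas_failures.fetch_add(1, std::memory_order_relaxed);
  }
  return false;  // a writer holds or is acquiring the latch
}

void s_unlatch() { lock_word.fetch_add(1); }

int main() {
  std::vector<std::thread> readers;
  for (int t = 0; t < kThreads; ++t) {
    readers.emplace_back([] {
      for (int i = 0; i < kItersPerThread; ++i) {
        if (s_latch()) {
          s_unlatch();
        }
      }
    });
  }
  for (auto &r : readers) {
    r.join();
  }
  std::printf("%ld CAS failures across %d latch acquisitions\n",
              cas_failures.load(), kThreads * kItersPerThread);
  return 0;
}
```

Every failed compare_exchange_strong is another trip around the retry loop and another contended access to the same cache line, which is what `perf annotate` attributes to the `jne` instruction in the listing above.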
[14 Aug 2024 5:03]
MySQL Verification Team
Hello Cheng Zhou,

Thank you for the report and feedback. As our Kuba nicely explained, this behaviour is "by design" and there is not much we can do about it. Closing this report for now.

Regards,
Umesh