MySQL Bugs: #115729: Sometimes rw_lock_s_lock_func takes too much CPU by analysing perf reports

Bug #115729	Sometimes rw_lock_s_lock_func takes too much CPU by analysing perf reports
Submitted:	31 Jul 2024 1:43	Modified:	14 Aug 2024 5:03
Reporter:	Cheng Zhou	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0.25	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Using sysbench and perf to find thid performance problem.

--num-threads=1:
  10.28%  mysqld              [.] cmp_dtuple_rec_with_match_low                                                                         
   9.26%  mysqld              [.] rec_init_offsets                                                                                      
   7.93%  mysqld              [.] page_cur_search_with_match                                                                            
   5.68%  mysqld              [.] cmp_whole_field                                                                                       
   5.22%  mysqld              [.] rec_get_offsets_func                                                                                  
   4.72%  mysqld              [.] row_search_mvcc                                                                                      
   4.54%  mysqld              [.] my_strnncollsp_mb_bin                                                                                 
   4.21%  mysqld              [.] btr_cur_search_to_nth_level                                                                           
   3.73%  mysqld              [.] rw_lock_s_lock_func                                                                                   
   2.95%  mysqld              [.] Buf_fetch_normal::get                                                                                 
   2.55%  mysqld              [.] buf_page_get_gen                                                                                      
   2.04%  mysqld              [.] get_charset    

--num-threads=3:
  14.90%  mysqld              [.] rw_lock_s_lock_func                                                                                   
   9.44%  mysqld              [.] Buf_fetch_normal::get                                                                                 
   6.95%  mysqld              [.] btr_cur_search_to_nth_level                                                                           
   6.65%  mysqld              [.] Buf_fetch<Buf_fetch_normal>::single_page                                                              
   6.01%  mysqld              [.] rec_init_offsets                                                                                      
   5.90%  mysqld              [.] cmp_dtuple_rec_with_match_low                                                                         
   4.52%  mysqld              [.] page_cur_search_with_match                                                                            
   3.47%  mysqld              [.] mtr_t::s_lock                                                                                         
   3.18%  mysqld              [.] rec_get_offsets_func                                                                                  
   3.04%  mysqld              [.] row_search_mvcc                                                                                       
   2.97%  mysqld              [.] cmp_whole_field                                                                                       
   2.83%  mysqld              [.] mtr_t::release_block_at_savepoint                                                                     
   2.41%  mysqld              [.] my_strnncollsp_mb_bin                                                                                 
   2.03%  mysqld              [.] memo_slot_release                                                                                     
   1.92%  mysqld              [.] Buf_fetch<Buf_fetch_normal>::lookup                                                                   
   1.72%  mysqld              [.] 
                                                    
   1.34%  mysqld              [.] buf_page_get_gen                                                                                      
   1.14%  mysqld              [.] get_charset                      

--num-threads=5:
  18.20%  mysqld               [.] rw_lock_s_lock_func                                                                                                       
  11.81%  mysqld               [.] Buf_fetch_normal::get                                                                                                     
   7.33%  mysqld               [.] Buf_fetch<Buf_fetch_normal>::single_page                                                                                  
   7.12%  mysqld               [.] btr_cur_search_to_nth_level                                                                                               
   5.21%  mysqld               [.] rec_init_offsets                                                                                                          
   4.49%  mysqld               [.] cmp_dtuple_rec_with_match_low                                                                                             
   4.09%  mysqld               [.] mtr_t::s_lock                                                                                                             
   4.07%  mysqld               [.] mtr_t::release_block_at_savepoint                                                                                         
   3.27%  mysqld               [.] page_cur_search_with_match                                                                                                
   2.69%  mysqld               [.] memo_slot_release                                                                                                         
   2.45%  mysqld               [.] rec_get_offsets_func                                                                                                      
   2.30%  mysqld               [.] row_search_mvcc                                                                                                           
   2.09%  mysqld               [.] Buf_fetch<Buf_fetch_normal>::lookup                                                                                       
   1.96%  mysqld               [.] cmp_whole_field                                                                                                           
   1.90%  mysqld               [.] buf::Block_hint::buffer_fix_block_if_still_valid                                                                          
   1.61%  mysqld               [.] my_strnncollsp_mb_bin                                                                                                     
   1.42%  mysqld               [.] buf_page_t::was_stale                                                                                                     
   1.29%  mysqld               [.] buf_page_hash_get_low                                                                                                     
   1.06%  mysqld               [.] Buf_fetch<Buf_fetch_normal>::mtr_add_page                                                                                 
   0.94%  mysqld               [.] buf_page_get_gen                                                                                                          
   0.90%  mysqld               [.] construct_lookup_ref                                                                                                      
   0.74%  mysqld               [.] get_charset                                                                                                               
   0.68%  mysqld               [.] buf::Block_hint::store                                                                                                    
   0.58%  mysqld               [.] ha_innobase::index_read                                                                                                   
   0.55%  mysqld               [.] buf_page_make_young_if_needed 

How to repeat:
Use sysbench to run the single SQL repeatly like below, and increase the number of concurrent threads. Observe it using perf:
select  /*+ set_var(optimizer_switch='duplicateweedout=off') */ 1 from  ptest.a, ptest.u,(select k.pprcid AS pprcid,k.cid AS cid from  ptest.k, ptest.m where ((k.cid = m.cid) and ((m.cid = 'XXXXX') or (m.parent_classid = 'XXXXX')))) g where ((a.pprcid = g.pprcid) and (a.pid = u.pid) and (u.user_range like '__1%') and exists(select 1 from  ptest.x where ((a.pid = x.pid) and ((a.pprcid = x.pprcid) or (x.pprcid = 'F')) and (x.power_right <= '14') and ((not(exists(select 1 from  ptest.t1 where ((t1.pt_type = '1') and (t1.ftype = 'AAAA') and (x.rid = t1.rid))))) or exists(select 1 from  ptest.t2 where ((t2.pt_type = '1') and (t2.FACTOR_VALUE = 'SSSSSS') and (t2.ftype = 'BBBB') and (x.rid = t2.rid)))) and exists(select 1 from  ptest.y where ((y.parent_group_id = x.gid) and (y.gid = '12345678'))) and ((cc_limit = 'Z') or exists(select 1 from  ptest.z where ((z.parent_class_code = x.cc_limit) and (z.class_code = '256')))) and (sdate <= sysdate()) and (saleexp_date > sysdate()))));

As 

The physical machine's resources:
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             4

--num-threads=1:
  3.73%  mysqld              [.] rw_lock_s_lock_func 

--num-threads=3:
  14.90%  mysqld              [.] rw_lock_s_lock_func

--num-threads=5:
  18.20%  mysqld               [.] rw_lock_s_lock_func

Annotate rw_lock_s_lock_func:
       │    _ZL19rw_lock_s_lock_funcP9rw_lock_tmPKcm.constprop.0():
  0.21 │      push   %rbp
  0.33 │      mov    %rdx,%rcx
  0.00 │      mov    %rsp,%rbp
  0.20 │      push   %r12
  0.02 │      mov    %rdi,%r12
  0.00 │      sub    $0x8,%rsp
  0.21 │      mov    (%rdi),%eax
 10.76 │12:   test   %eax,%eax
       │    ↓ jle    27ed730 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x50>
  0.16 │      lea    -0x1(%rax),%edx
  0.11 │      lock   cmpxchg %edx,(%r12)
 41.29 │    ↑ jne    27ed6f2 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x12>
  0.03 │      mov    $0x1,%eax
  0.18 │      mov    %rsi,0x38(%r12)
  0.00 │      mov    %cx,0x4c(%r12)
       │      cmpq   $0x0,_DYNAMIC+0x378
  0.21 │    ↓ je     27ed720 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x40>
       │    → callq  pthread_self@plt
  0.22 │40:   lock   xor %rax,0x18(%r12)
 45.85 │      add    $0x8,%rsp
  0.00 │      pop    %r12
  0.04 │      pop    %rbp
  0.17 │    ← retq
       │      xchg   %ax,%ax
       │50:   add    $0x8,%rsp
       │      mov    %rsi,%rdx
       │      mov    %r12,%rdi
       │      xor    %esi,%esi
       │      pop    %r12
       │      pop    %rbp
       │    ↑ jmpq   272eac0 <rw_lock_s_lock_spin(rw_lock_t*, unsigned long, char const*, unsigned long)>

select a version

Hello Cheng Zhou!
Thanks for conducting the experiment and sharing results.

I find this kinds of reports from perf where it just lists time spent in each individual function to not be very useful, because the same low-level function might be used in various ways in various places, so what we really need is a callstack.

Please obtain  https://github.com/brendangregg/FlameGraph
Then execute something like
perf record -p $(pgrep mysqld) --call-graph=fp -o perf.${version}.data -- sleep 10

where --call-graph=fp is the crucial thing as it will capture call stacks for each sample.
The "-- sleep 10" part can be different depending on how long you want to sample.

Then please do:

perf script -i perf.${version}.data | FlameGraph/stackcollapse-perf.pl > perf.${version}.perf-folded

You can then share the resulting perf.${version}.perf-folded file with us, or better change it into svg:

FlameGraph/flamegraph.pl <  ${version}.perf.folded >  ${version}.svg

If you have multiple versions of the code you want to compare with each other, you can also try:

FlameGraph/difffolded.pl -n perf.{$a,$b}.perf-folded | FG/flamegraph.pl >$b-$a.diff.svg

FlameGraph/difffolded.pl -n perf.{$b,$a}.perf-folded | FG/flamegraph.pl > $a-$b.diff.svg

to create flame diff graphs.

Hi Cheng Zhou,
thank you for providing the flame graphs.
Looks like the s-latches are being acquired on BP pages when starting the scan over index (row_search_mvcc -> btr_cur_search_to_nth_level) - to perform a scan w need to first descend the b-tree to the first leaf matching the query, and then continue horizontally following links to siblings.
Another possibility is that the main scan is over secondary index, and the btr_cur_search_to_nth_level is used to fetch the row's primary index record.
From this I can deduce the problem is when s-latching the root page of primary index.

IIUC in your scenario there are no "writes"  (UPDATE, DELETE, INSERT) just "reads" (SELECT). Thus, there should be no thread trying to acquire x-latch on pages of the index, and indeed the `perf annotate` you've shown looks as if the problem is more about the interaction between readers, as opposed to readers having to wait for writers.
Conceptually, each s-reader is trying to decrement the lock->lock_word by one, to mark its presence (so that in case another thread wants to acquire x-latch they would see the lock->lock_word indicating that there are some readers).
Alas, we can't use a simple fetch_sub for that.
Instead we use a CAS-loop:
```
  while (local_lock_word > threshold) {
    if (lock->lock_word.compare_exchange_strong(local_lock_word,
                                                local_lock_word - amount)) {
      return (true);
    }
  }
```
The reason we have to do it this way, is because if *there is* a writer trying to obtain x-latch in parallel, then we should not let our decrement interfere with the changes to lock->lock_word made by x-latcher. The whole latching protocol is rather complicated, but one important thing is that we should not decrement the value if it is already below or equal to threshold=0. (This has to do with preventing starvation of x-latcher).
This leads to the phenomenon you've seen as
```
 10.76 │12:   test   %eax,%eax
       │    ↓ jle    27ed730 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x50>
  0.16 │      lea    -0x1(%rax),%edx
  0.11 │      lock   cmpxchg %edx,(%r12)
 41.29 │    ↑ jne    27ed6f2 <rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long) [clone .constprop.0]+0x12>
```
(which corresponds to the `while` loop above), in which several threads trying obtain same s-latch are interfering with each other, as whenever one of them succeeds the compare_exchange_strong, it makes compare_exchange_strong fail for someone else.

I am afraid this is "by design" and not much we can do, without completely rewriting the rw_lock_t implementation which would be very risky change.
You might be tempted to try something like:
```
  while (lock->lock_word.fetch_sub(amount) <= threshold) {
    lock->lock_word.fetch_add(amount);
  }
  return (false);
```
but I warn you that this will almost certainly cause troubles in other parts of the implementation, which assume that lock->lock_word can not change once it <=0.

What I'd try instead, is to see if https://dev.mysql.com/doc/refman/8.4/en/innodb-adaptive-hash.html can help you.
This could eliminate the need for traversing from the root, instead letting InnoDB jump directly to the leaf.

Hello Cheng Zhou,

Thank you for the report and feedback.
As our Kuba nicely explained, this is "by design" and not much we can do.
Closing this report for now.

regards,
Umesh