MySQL Bugs: #104630: no need to make two conditional judgments

Bug #104630	no need to make two conditional judgments
Submitted:	16 Aug 2021 3:47	Modified:	17 Aug 2021 13:33
Reporter:	alex xing (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S5 (Performance)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution

Description:
no need to make two conditional judgments in buf_flush_page_and_try_neighbors,
which may reduce CPU execution efficiency.
And just one conditional judgment can satisfy the requirement.

How to repeat:
just read the code

Suggested fix:
just one conditional judgment can satisfy the requirement.

a simple patch to describe the optimization (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: commit.patch (text/plain), 793 bytes.

Hello Alex Xing,

Thank you for the report and contribution.

regards,
Umesh

Hello Alex Xing,

The transformation you propose seems correct, and simplifies the code for people reading the code. There are however some non-trivial procedures needed to include such a change to our product, so I wonder if it is worth it.
Could you share tests demonstrating the improvement?

I've performed a simplified experiment at https://godbolt.org/z/bMnnh6zTz where I compare 
void trunk(...) which is supposed to look like the code currently in mysql's trunk, to 
void patched(...) which is supposed to imitate your patched version, 
and looking at the assembly code generated by Clang, GCC and Visual Studio, I see that all of them use just a single `test` instruction in both trunk() and patched(). 
There *is* a difference in generated assembly between trunk(...) and patched(...), but it is not in eliminating the conditional branches, but rather in the layout of the code, and also in case of Clang the compiler has figured out that the do_something_with_id can be done just once.
OTOH, when I do a more direct experiment, and compile RelWithDebInfo build of mysqld with Visual Studio, then I indeed see two `test` instructions:
```
    if (flush_type == BUF_FLUSH_LRU) {
00007FF7369C5AB1  test        sil,sil  
00007FF7369C5AB4  jne         buf_flush_page_and_try_neighbors+0DFh (07FF7369C5ABFh)  
      mutex_exit(&buf_pool->LRU_list_mutex);
00007FF7369C5AB6  lea         rcx,[rbp+30h]  
00007FF7369C5ABA  call        PolicyMutex<TTASEventMutex<GenericPolicy> >::exit (07FF7368C6260h)  
    }

    const page_id_t page_id = bpage->id;
00007FF7369C5ABF  mov         rax,qword ptr [rbx]  
00007FF7369C5AC2  mov         qword ptr [page_id],rax  

    if (flush_type == BUF_FLUSH_LRU) {
00007FF7369C5AC7  test        sil,sil  
00007FF7369C5ACA  jne         buf_flush_page_and_try_neighbors+0F6h (07FF7369C5AD6h)  
      mutex_exit(block_mutex);
00007FF7369C5ACC  mov         rcx,rdi  
00007FF7369C5ACF  call        PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit (07FF73690F2F0h)  
    } else {
00007FF7369C5AD4  jmp         buf_flush_page_and_try_neighbors+14Bh (07FF7369C5B2Bh)  
      buf_flush_list_mutex_exit(buf_pool);
00007FF7369C5AD6  mov         rcx,qword ptr [rbp+2340h]  
```
OTOH, when I look at the assembly generated by GCC for release build (sorry, it's too long to paste here, because it has inlined some of mutex_exit calls along with PFS bookkeeping) it contains just a single `test`:
   0x000000000238d664 <+212>:   test   %r15b,%r15b
   0x000000000238d667 <+215>:   je     0x238dd20 <buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, ulint, ulint*)+1936>
which chooses one of two variants of the code, where none of them rechecks the flush_type (stored in register r15, and on stack at -0x84(%rbp)).

Taken together, results of these experiments seem to imply that modern compilers are capable of optimizing this code themselves, but they sometimes don't bother to do that.
Thus, I wonder is there really some provable performance gain from this change? Aren't CPUs smart enough to realize the second branch always goes the same way as the previous?

So, it would be nice if you could provide some performance tests which support the claim this change is important.

Hello Jakub Lopuszansk,
  Thank you for your analysis. I have just test the below demo ,and find there is no difference before and after optimization, so your conclusion should be correct 

  
os:Linux innosql2.dg.163.org 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux
gcc (GCC) 6.1.0
g++ (GCC) 6.1.0

//g++ -O3 -DTRUNK test.cpp -o test_trunck
//g++ -O3 -DPATCHED test.cpp -o test_patched
#include<stdio.h>
#include<sys/time.h>
#include<stdlib.h>
enum buf_flush_t  {
  /** Flush via the LRU list */
  BUF_FLUSH_LRU = 0,

  /** Flush via the flush list of dirty blocks */
  BUF_FLUSH_LIST,

  /** Flush via the LRU list but only a single page */
  BUF_FLUSH_SINGLE_PAGE,

  /** Index of last element + 1  */
  BUF_FLUSH_N_TYPES
};
typedef int space_id_t;
typedef int page_no_t;
class page_id_t{
  public:
    page_id_t(space_id_t space, page_no_t page_no)
      : m_space(space), m_page_no(page_no){};
  /** Tablespace id. */
  space_id_t m_space;

  /** Page number. */
  page_no_t m_page_no;
};
struct buf_page_t{
    //...
    page_id_t id;
	buf_page_t(int a,int b):id(a,b){};
    //...
};
void mutex_exit_a(){};
void mutex_exit_b(){};
void mutex_exit_c(){};
void do_something_with_id(const page_id_t & id){};
void trunk(buf_page_t *bpage, buf_flush_t flush_type){
    if (flush_type == BUF_FLUSH_LRU) {
      mutex_exit_a();
    }

    const page_id_t page_id = bpage->id;

    if (flush_type == BUF_FLUSH_LRU) {
      mutex_exit_b();
    } else {
      mutex_exit_c();
    }
    do_something_with_id(page_id);
}
void patched(buf_page_t *bpage, buf_flush_t flush_type){
    const page_id_t page_id = bpage->id;

    if (flush_type == BUF_FLUSH_LRU) {
      mutex_exit_a();
      mutex_exit_b();
    } else {
      mutex_exit_c();
    }
    do_something_with_id(page_id);
}

int main(){

 unsigned long long int count=ULLONG_MAX;

 float time_use=0;
 struct timeval start;
 struct timeval end;

 gettimeofday(&start,NULL);
 printf("start.tc_sec:%d\n",start.tv_sec);
 printf("start.tc_usec:%d\n",start.tv_usec);

 buf_page_t buf_page(1,1);
 
 #ifdef TRUNK
 while(count>0){
   --count;
   if(count%3==0){
    trunk(&buf_page,BUF_FLUSH_LRU);
   }else if(count%3==1){
    trunk(&buf_page,BUF_FLUSH_LIST);
   }else if(count%3==2){
    trunk(&buf_page,BUF_FLUSH_SINGLE_PAGE);
   }
 }
 #elif defined(PATCHED)
  while(count>0){
   --count;
   if(count%3==0){
    patched(&buf_page,BUF_FLUSH_LRU);
   }else if(count%3==1){
    patched(&buf_page,BUF_FLUSH_LIST);
   }else if(count%3==2){
    patched(&buf_page,BUF_FLUSH_SINGLE_PAGE);
   }
 }
#endif

 gettimeofday(&end,NULL);
 printf("end.tc_sec:%d \n",end.tv_sec);
 printf("end.tc_usec:%d \n",end.tv_usec);

 time_use=(end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec);

 printf("time_use is %.10f \n",time_use);
 
 return 0;
}

Hello Alex Xing,
thanks for taking time to perform experiment, and for sharing the code you've used for it.
Unfortunately, your approach to testing has a subtle flaw: you've provided the bodies of the functions
void mutex_exit_a(){};
void mutex_exit_b(){};
void mutex_exit_c(){};
void do_something_with_id(const page_id_t & id){};
in the same compilation unit as the one which defines trunk() and patched(), so a compiler can "see trough" their implementation, notice that they are empty and optimize the trunk() and patched() entirely to "no-op" as well.
See https://godbolt.org/z/MvodTxaf1 for demonstration of the problem - on the left I've pasted your code, on the right you can see that all three compilers were smart enough to entirely optimize out the code being tested.
(Note that some compilers are smart enough to "see trough" the definition even if it is in a separate compilation unit, during the global optimization step)
In other words your tests were just testing the spinning loop with empty body, or perhaps not even that (some compilers should be smart enough to remove the loop as well, if its body is empty).

In order to test it in a more realistic way, I think you need to put some atomic operations into mutex_exit_a/b/c, to mimic the behaviour of TTASEventMutex, which is:

  void exit() UNIV_NOTHROW {
    m_lock_word.store(false);

    if (m_waiters.load()) {
      signal();
    }
  }

Without such an atomic operations, I think a compiler can still decide to reorder/simplify/optimize the code entirely.
For example, see https://godbolt.org/z/8sTv4GEnK where I've declared mutex_exit_a/b/c to perform *non*-atomic writes, and Visual Studio was smart enough to rearrange the code so that do_something_with_id is actually the first thing it does in the function, even before it performs any tests.
OTOH, if I use std::atomic<bool> for a,b,c stored by mutex_exit_a/b/c (see https://godbolt.org/z/8rjGPWsMP) then compilers do various things: 
gcc tries to test the condition just once, and have two specialized variants of the function for each case, 
Visual Studio tests the condition twice, with the second test leading to conditional move, as opposed to conditional jump,
clang tries a different trick, where it does one conditional jump, and puts either the address of b, or c in the register just before the store, clever.

As you can see, it's worth looking into the generated assembly before running a synthetic benchmark, otherwise you risk running a tests which doesn't really test the relevant hypothesis.