Bug #104630 | no need to make two conditional judgments | ||
---|---|---|---|
Submitted: | 16 Aug 2021 3:47 | Modified: | 17 Aug 2021 13:33 |
Reporter: | alex xing (OCA) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S5 (Performance) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Contribution |
[16 Aug 2021 3:47]
alex xing
[16 Aug 2021 3:48]
alex xing
a simple patch to describe the optimization (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: commit.patch (text/plain), 793 bytes.
[16 Aug 2021 5:36]
MySQL Verification Team
Hello Alex Xing, Thank you for the report and contribution. regards, Umesh
[17 Aug 2021 8:38]
Jakub Lopuszanski
Hello Alex Xing, The transformation you propose seems correct, and simplifies the code for people reading the code. There are however some non-trivial procedures needed to include such a change to our product, so I wonder if it is worth it. Could you share tests demonstrating the improvement? I've performed a simplified experiment at https://godbolt.org/z/bMnnh6zTz where I compare void trunk(...) which is supposed to look like the code currently in mysql's trunk, to void patched(...) which is supposed to imitate your patched version, and looking at the assembly code generated by Clang, GCC and Visual Studio, I see that all of them use just a single `test` instruction in both trunk() and patched(). There *is* a difference in generated assembly between trunk(...) and patched(...), but it is not in eliminating the conditional branches, but rather in the layout of the code, and also in case of Clang the compiler has figured out that the do_something_with_id can be done just once. OTOH, when I do a more direct experiment, and compile RelWithDebInfo build of mysqld with Visual Studio, then I indeed see two `test` instructions: ``` if (flush_type == BUF_FLUSH_LRU) { 00007FF7369C5AB1 test sil,sil 00007FF7369C5AB4 jne buf_flush_page_and_try_neighbors+0DFh (07FF7369C5ABFh) mutex_exit(&buf_pool->LRU_list_mutex); 00007FF7369C5AB6 lea rcx,[rbp+30h] 00007FF7369C5ABA call PolicyMutex<TTASEventMutex<GenericPolicy> >::exit (07FF7368C6260h) } const page_id_t page_id = bpage->id; 00007FF7369C5ABF mov rax,qword ptr [rbx] 00007FF7369C5AC2 mov qword ptr [page_id],rax if (flush_type == BUF_FLUSH_LRU) { 00007FF7369C5AC7 test sil,sil 00007FF7369C5ACA jne buf_flush_page_and_try_neighbors+0F6h (07FF7369C5AD6h) mutex_exit(block_mutex); 00007FF7369C5ACC mov rcx,rdi 00007FF7369C5ACF call PolicyMutex<TTASEventMutex<BlockMutexPolicy> >::exit (07FF73690F2F0h) } else { 00007FF7369C5AD4 jmp buf_flush_page_and_try_neighbors+14Bh (07FF7369C5B2Bh) buf_flush_list_mutex_exit(buf_pool); 00007FF7369C5AD6 mov rcx,qword ptr [rbp+2340h] ``` OTOH, when I look at the assembly generated by GCC for release build (sorry, it's too long to paste here, because it has inlined some of mutex_exit calls along with PFS bookkeeping) it contains just a single `test`: 0x000000000238d664 <+212>: test %r15b,%r15b 0x000000000238d667 <+215>: je 0x238dd20 <buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, ulint, ulint*)+1936> which chooses one of two variants of the code, where none of them rechecks the flush_type (stored in register r15, and on stack at -0x84(%rbp)). Taken together, results of these experiments seem to imply that modern compilers are capable of optimizing this code themselves, but they sometimes don't bother to do that. Thus, I wonder is there really some provable performance gain from this change? Aren't CPUs smart enough to realize the second branch always goes the same way as the previous? So, it would be nice if you could provide some performance tests which support the claim this change is important.
[17 Aug 2021 13:33]
alex xing
Hello Jakub Lopuszansk, Thank you for your analysis. I have just test the below demo ,and find there is no difference before and after optimization, so your conclusion should be correct os:Linux innosql2.dg.163.org 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1+deb8u2 (2017-03-07) x86_64 GNU/Linux gcc (GCC) 6.1.0 g++ (GCC) 6.1.0 //g++ -O3 -DTRUNK test.cpp -o test_trunck //g++ -O3 -DPATCHED test.cpp -o test_patched #include<stdio.h> #include<sys/time.h> #include<stdlib.h> enum buf_flush_t { /** Flush via the LRU list */ BUF_FLUSH_LRU = 0, /** Flush via the flush list of dirty blocks */ BUF_FLUSH_LIST, /** Flush via the LRU list but only a single page */ BUF_FLUSH_SINGLE_PAGE, /** Index of last element + 1 */ BUF_FLUSH_N_TYPES }; typedef int space_id_t; typedef int page_no_t; class page_id_t{ public: page_id_t(space_id_t space, page_no_t page_no) : m_space(space), m_page_no(page_no){}; /** Tablespace id. */ space_id_t m_space; /** Page number. */ page_no_t m_page_no; }; struct buf_page_t{ //... page_id_t id; buf_page_t(int a,int b):id(a,b){}; //... }; void mutex_exit_a(){}; void mutex_exit_b(){}; void mutex_exit_c(){}; void do_something_with_id(const page_id_t & id){}; void trunk(buf_page_t *bpage, buf_flush_t flush_type){ if (flush_type == BUF_FLUSH_LRU) { mutex_exit_a(); } const page_id_t page_id = bpage->id; if (flush_type == BUF_FLUSH_LRU) { mutex_exit_b(); } else { mutex_exit_c(); } do_something_with_id(page_id); } void patched(buf_page_t *bpage, buf_flush_t flush_type){ const page_id_t page_id = bpage->id; if (flush_type == BUF_FLUSH_LRU) { mutex_exit_a(); mutex_exit_b(); } else { mutex_exit_c(); } do_something_with_id(page_id); } int main(){ unsigned long long int count=ULLONG_MAX; float time_use=0; struct timeval start; struct timeval end; gettimeofday(&start,NULL); printf("start.tc_sec:%d\n",start.tv_sec); printf("start.tc_usec:%d\n",start.tv_usec); buf_page_t buf_page(1,1); #ifdef TRUNK while(count>0){ --count; if(count%3==0){ trunk(&buf_page,BUF_FLUSH_LRU); }else if(count%3==1){ trunk(&buf_page,BUF_FLUSH_LIST); }else if(count%3==2){ trunk(&buf_page,BUF_FLUSH_SINGLE_PAGE); } } #elif defined(PATCHED) while(count>0){ --count; if(count%3==0){ patched(&buf_page,BUF_FLUSH_LRU); }else if(count%3==1){ patched(&buf_page,BUF_FLUSH_LIST); }else if(count%3==2){ patched(&buf_page,BUF_FLUSH_SINGLE_PAGE); } } #endif gettimeofday(&end,NULL); printf("end.tc_sec:%d \n",end.tv_sec); printf("end.tc_usec:%d \n",end.tv_usec); time_use=(end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec); printf("time_use is %.10f \n",time_use); return 0; }
[17 Aug 2021 14:26]
Jakub Lopuszanski
Hello Alex Xing, thanks for taking time to perform experiment, and for sharing the code you've used for it. Unfortunately, your approach to testing has a subtle flaw: you've provided the bodies of the functions void mutex_exit_a(){}; void mutex_exit_b(){}; void mutex_exit_c(){}; void do_something_with_id(const page_id_t & id){}; in the same compilation unit as the one which defines trunk() and patched(), so a compiler can "see trough" their implementation, notice that they are empty and optimize the trunk() and patched() entirely to "no-op" as well. See https://godbolt.org/z/MvodTxaf1 for demonstration of the problem - on the left I've pasted your code, on the right you can see that all three compilers were smart enough to entirely optimize out the code being tested. (Note that some compilers are smart enough to "see trough" the definition even if it is in a separate compilation unit, during the global optimization step) In other words your tests were just testing the spinning loop with empty body, or perhaps not even that (some compilers should be smart enough to remove the loop as well, if its body is empty). In order to test it in a more realistic way, I think you need to put some atomic operations into mutex_exit_a/b/c, to mimic the behaviour of TTASEventMutex, which is: void exit() UNIV_NOTHROW { m_lock_word.store(false); if (m_waiters.load()) { signal(); } } Without such an atomic operations, I think a compiler can still decide to reorder/simplify/optimize the code entirely. For example, see https://godbolt.org/z/8sTv4GEnK where I've declared mutex_exit_a/b/c to perform *non*-atomic writes, and Visual Studio was smart enough to rearrange the code so that do_something_with_id is actually the first thing it does in the function, even before it performs any tests. OTOH, if I use std::atomic<bool> for a,b,c stored by mutex_exit_a/b/c (see https://godbolt.org/z/8rjGPWsMP) then compilers do various things: gcc tries to test the condition just once, and have two specialized variants of the function for each case, Visual Studio tests the condition twice, with the second test leading to conditional move, as opposed to conditional jump, clang tries a different trick, where it does one conditional jump, and puts either the address of b, or c in the register just before the store, clever. As you can see, it's worth looking into the generated assembly before running a synthetic benchmark, otherwise you risk running a tests which doesn't really test the relevant hypothesis.