Bug #74831 InnoDB table flags in bitfield is non-optimal
Submitted: 13 Nov 2014 4:02 Modified: 15 Jul 2016 0:47
Reporter: Stewart Smith (OCA) Email Updates:
Status: Not a Bug Impact on me:
Category:MySQL Server: InnoDB storage engine Severity:S5 (Performance)
Version:5.7.7 OS:Linux
Assigned to:
Tags: innodb, performance, POWER8, PowerPC

[13 Nov 2014 4:02] Stewart Smith
From a sysbench point select benchmark run on POWER8, 2.14% of total execution time is spent in rec_get_offsets_func(), with an amazing 25% of that (so around 0.5% total execution time) on a single instruction:

2.14% rec_get_offsets_func
       │             if (dict_table_is_comp(index->table)) {                                                                                                           
 25.56 │       rldicl r3,r9,39,63                                   

This is the Rotate Left Doubleword Immediate then Clear Left instruction. Basically, this extracts a bit field.

After applying this small patch:
--- mysql-5.7.5-m15.orig/storage/innobase/include/dict0mem.h
+++ mysql-5.7.5-m15/storage/innobase/include/dict0mem.h
@@ -1081,7 +1081,7 @@ struct dict_table_t {
        DICT_TF_HAS_ATOMIC_BLOBS() and DICT_TF_HAS_DATA_DIR() to parse this
        flag. */
-       unsigned                                flags:DICT_TF_BITS;
+       unsigned                                flags;

I saw a small performance improvement:

Before: (avg 427595)

After: (avg 437671)

What I can't account for is the perf after the patch that shows a similar profile:

       │             if (dict_table_is_comp(index->table)) {                                                                                                           
 28.31 │       clrldi r3,r9,63                                  

Maybe there's gains elsewhere, I'm really not all to sure about exactly why this is the case, but my numbers were compelling enough for me to find this

How to repeat:
see description, run benchmarks.

Suggested fix:
stop using C bitfields.
[13 Nov 2014 4:20] Stewart Smith
I think I figured it out.

     408:       e9 29 00 28     ld      r9,40(r9)
     40c:       79 23 3f e3     rldicl. r3,r9,39,63

we have a double word load (64bit) while afterwards we have a lwx instruction instead, which means we have a 32bit load.

This is likely to be the source of the performance improvement - and it's likely going to be doing half the loads all over the place.
[11 Dec 2014 2:38] Stewart Smith
Actually.. since TF_BITS is only 7, if you used an unsigned char here you may be able to get away with a lb instruction (load byte) being generated, which could be even better!
[11 Jan 2015 22:24] Stewart Smith
This seems to also be dependant on compiler version. GCC 4.9 seems to make better decisions than GCC 4.8 (at least on POWER, I have not looked anywhere else).
[10 Apr 2015 7:59] Stewart Smith
Code is still like this in 5.7.7

On modern Ubuntu (e.g. Ubuntu Vivid) and the GCC it ships with, this is less of an issue.

however, the checking still does chew up a bit of time, although it's not as severe as it is on big endian without this patch on slightly older gcc.
[14 Jul 2016 17:53] Sinisa Milivojevic
Vast majority of our users utilize Intel or Intel-like CPUs.

With 5.7.13 and gcc > 4.8.4, I fail to see the problem:

560		if (dict_table_is_comp(index->table)) {
   0x0000000001a6557d <+152>:	mov    -0x20(%rbp),%rax
   0x0000000001a65581 <+156>:	mov    0x20(%rax),%rax
   0x0000000001a65585 <+160>:	mov    %rax,%rdi
   0x0000000001a65588 <+163>:	callq  0x1a6293a <dict_table_is_comp(dict_table_t const*)>

Just FYI, this is compiled on Linux .....

If there is a problem at all, it is a problem of GCC producing a code as you described on those two CPUs .....
[15 Jul 2016 0:47] Stewart Smith
560		if (dict_table_is_comp(index->table)) {
   0x0000000001a6557d <+152>:	mov    -0x20(%rbp),%rax
   0x0000000001a65581 <+156>:	mov    0x20(%rax),%rax
   0x0000000001a65585 <+160>:	mov    %rax,%rdi
   0x0000000001a65588 <+163>:	callq  0x1a6293a <dict_table_is_comp(dict_table_t const*)>

This is just doing a function call, you'd need to look at the ASM for dict_table_is_comp itself.

Anyway, because this bug has languished for so long, it's likely that everybody has moved to a more recent compiler... which isn't excusing C bitfields as a good idea.
[15 Jul 2016 14:46] Sinisa Milivojevic
Hi Stewart,

First of all, in 5.7.13 dict_table_is_comp() is extremely simple:

Check whether the table uses the compact page format.
@return TRUE if table uses the compact page format */
        const dict_table_t*     table)  /*!< in: table */

#error "DICT_TF_COMPACT must be 1"

        return(table->flags & DICT_TF_COMPACT);

I took a look at 5.5 and 5.6 and it is also the same.

Getting the code for the inlined function is not easy, but I got it ..... Here it is:

  0x0000000001a6293a <+0>:	55	push   %rbp
   0x0000000001a6293b <+1>:	48 89 e5	mov    %rsp,%rbp
   0x0000000001a6293e <+4>:	48 83 ec 10	sub    $0x10,%rsp
   0x0000000001a62942 <+8>:	48 89 7d f8	mov    %rdi,-0x8(%rbp)
   0x0000000001a62946 <+12>:	48 83 7d f8 00	cmpq   $0x0,-0x8(%rbp)
   0x0000000001a6294b <+17>:	0f 94 c0	sete   %al
   0x0000000001a6294e <+20>:	0f b6 c0	movzbl %al,%eax
   0x0000000001a62951 <+23>:	48 85 c0	test   %rax,%rax
   0x0000000001a62954 <+26>:	74 18	je     0x1a6296e <dict_table_is_comp(dict_table_t const*)+52>
   0x0000000001a62956 <+28>:	ba 7e 02 00 00	mov    $0x27e,%edx
   0x0000000001a6295b <+33>:	48 8d 35 d6 25 7c 00	lea    0x7c25d6(%rip),%rsi        # 0x2224f38
   0x0000000001a62962 <+40>:	48 8d 3d c9 26 7c 00	lea    0x7c26c9(%rip),%rdi        # 0x2225032
   0x0000000001a62969 <+47>:	e8 7e 52 12 00	callq  0x1b87bec <ut_dbg_assertion_failed(char const*, char const*, unsigned long)>
   0x0000000001a6296e <+52>:	48 8b 45 f8	mov    -0x8(%rbp),%rax
   0x0000000001a62972 <+56>:	0f b6 40 34	movzbl 0x34(%rax),%eax
   0x0000000001a62976 <+60>:	0f b6 c0	movzbl %al,%eax
   0x0000000001a62979 <+63>:	83 e0 01	and    $0x1,%eax
   0x0000000001a6297c <+66>:	c9	leaveq 
   0x0000000001a6297d <+67>:	c3	retq   

You can forget all the code until 0x1a62972, because it is there only in debug version.
As you can see the produced code is very, very simple on Intel. And this is done by using GCC 4.8.3 without any optimizations in the compile arguments, not even -O. And still, it is very, very simple ....

Hence, it is my conclusion that this is a problem with GCC compiler for POWER8 and PowerPC. This is a pity, since both are fantastic CPUs.