Bug #61179 Unoptimized versions of *korr/*store macros in my_global.h are used on x86_64
Submitted: 15 May 2011 15:45 Modified: 5 Jun 2013 12:27
Reporter: Alexey Kopytov Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: General Severity:S3 (Non-critical)
Version:5.1, 5.5 OS:Any
Assigned to: Tor Didriksen CPU Architecture:Any
Triage: Needs Triage: D3 (Medium)

[15 May 2011 15:45] Alexey Kopytov
Description:
my_global.h defines a set of macros for reading/storing integers of various sizes in a machine-independent format. Since the format is little-endian, optimized versions of those macros are used on x86, with more heavy-weight implementations being used on big-endian architectures, for example:

#if defined(__i386__) || defined(_WIN32)
...
#define uint8korr(A)	(*((ulonglong *) (A)))
...
#define int8store(T,A)	*((ulonglong *) (T))= (ulonglong) (A)
...
#else
...
#define uint8korr(A)	((ulonglong)(((uint32) ((uchar) (A)[0])) +\
				    (((uint32) ((uchar) (A)[1])) << 8) +\
				    (((uint32) ((uchar) (A)[2])) << 16) +\
				    (((uint32) ((uchar) (A)[3])) << 24)) +\
			(((ulonglong) (((uint32) ((uchar) (A)[4])) +\
				    (((uint32) ((uchar) (A)[5])) << 8) +\
				    (((uint32) ((uchar) (A)[6])) << 16) +\
				    (((uint32) ((uchar) (A)[7])) << 24))) <<\
				    32))
...
#define int8store(T,A)       do { uint def_temp= (uint) (A), def_temp2= (uint) ((A) >> 32); \
                                  int4store((T),def_temp); \
                                  int4store((T+4),def_temp2); } while(0)
...
#endif

The problem is that the same optimized versions could be used on x86_64 as well, but GCC does not define __i386__ on x86_64 and hence, heavy-weight versions are used in binaries (and no, those implementations are not optimized by the compiler even in release binaries).

How to repeat:
1. Examine the code in my_global.h starting from:

/* Optimized store functions for Intel x86 */
#if defined(__i386__) || defined(_WIN32)

2. gcc -dM -E - </dev/null | grep 86
[17 May 2011 18:09] Sinisa Milivojevic
Kaamos, my friend,

I was not able to repeat the problem. I have new iMac, which has 64-bit CPU.

I was checking your bug by getting pre-processor output only , for mi_key.c source file, with the following command:

 gcc -DHAVE_CONFIG_H -I. -I../../include -I../../include -I../../include -I../../regex -I../../sql -I.    -g -DSAFE_MUTEX -DSAFEMALLOC -Wimplicit -Wreturn-type -Wswitch -Wtrigraphs -Wcomment -W -Wchar-subscripts -Wformat -Wparentheses -Wsign-compare -Wwrite-strings -Wunused-function -Wunused-label -Wunused-value -Wunused-variable -Wunused-parameter -mtune=nocona -DUNIV_MUST_NOT_INLINE -DEXTRA_DEBUG -DFORCE_INIT_OF_VARS  -DSAFEMALLOC -DPEDANTIC_SAFEMALLOC -DSAFE_MUTEX   -D_P1003_1B_VISIBLE -DSIGNAL_WITH_VIO_CLOSE -DSIGNALS_DONT_BREAK_READ -DIGNORE_SIGHUP_SIGQUIT  -DDONT_DECLARE_CXA_PURE_VIRTUAL -MT mi_key.o -MD -MP -MF .deps/mi_key.Tpo -E mi_key.c 

If you look at the bottom of the mi_key.c source file, you will see one switch command. The end of that command , after pre-processing, looks like this:

  case HA_KEYTYPE_LONGLONG:
    s_value= (*((longlong *) (key)));
    break;
  case HA_KEYTYPE_ULONGLONG:
    value= (*((ulonglong *) (key)));
    break;
  default:
    (__builtin_expect(!(0), 0) ? __assert_rtn(__func__, "mi_key.c", 557, "0") : (void)0);
    value=0;
    break;

which shows that it works nicely with those macros.

Tell me, please, if I am wrong and where am I wrong .....
[17 May 2011 19:02] Alexey Kopytov
Dear Sinisa,

This is what I get from the same command using 4.1.2 on a 64-bit Linux machine:

  case HA_KEYTYPE_LONGLONG:
    s_value= (longlong) ((ulonglong)(((uint32) ((uchar) (key)[0])) + (((uint32) ((uchar) (key)[1])) << 8) + (((uint32) ((uchar) (key)[2])) << 16) + (((uint32) ((uchar) (key)[3])) << 24)) + (((ulonglong) (((uint32) ((uchar) (key)[4])) + (((uint32) ((uchar) (key)[5])) << 8) + (((uint32) ((uchar) (key)[6])) << 16) + (((uint32) ((uchar) (key)[7])) << 24))) << 32));
    break;
  case HA_KEYTYPE_ULONGLONG:
    value= ((ulonglong)(((uint32) ((uchar) (key)[0])) + (((uint32) ((uchar) (key)[1])) << 8) + (((uint32) ((uchar) (key)[2])) << 16) + (((uint32) ((uchar) (key)[3])) << 24)) + (((ulonglong) (((uint32) ((uchar) (key)[4])) + (((uint32) ((uchar) (key)[5])) << 8) + (((uint32) ((uchar) (key)[6])) << 16) + (((uint32) ((uchar) (key)[7])) << 24))) << 32));
    break;
  default:
    ((0) ? (void) (0) : (__assert_fail ("0", "mi_key.c", 557, __PRETTY_FUNCTION__), (void) (0)));
    value=0;
    break;
  }

It is the same with gcc 4.2.1 on Mac OS X.

I also disassembled that function in mysql-5.5.10-linux2.6-x86_64.tar.gz downloaded from dev.mysql.com. It's hard to read optimized code, but there are clearly lots of shifts in that code:

...
0x00000000008e0cdc <retrieve_auto_increment+380>:	xchg   %ax,%ax
0x00000000008e0ce0 <retrieve_auto_increment+384>:	movzbl 0x5(%rsi),%ecx
0x00000000008e0ce4 <retrieve_auto_increment+388>:	movzbl 0x6(%rsi),%eax
0x00000000008e0ce8 <retrieve_auto_increment+392>:	movzbl 0x1(%rsi),%edx
0x00000000008e0cec <retrieve_auto_increment+396>:	shl    $0x10,%eax
0x00000000008e0cef <retrieve_auto_increment+399>:	shl    $0x8,%ecx
0x00000000008e0cf2 <retrieve_auto_increment+402>:	add    %eax,%ecx
0x00000000008e0cf4 <retrieve_auto_increment+404>:	movzbl 0x4(%rsi),%eax
0x00000000008e0cf8 <retrieve_auto_increment+408>:	shl    $0x8,%edx
0x00000000008e0cfb <retrieve_auto_increment+411>:	add    %eax,%ecx
0x00000000008e0cfd <retrieve_auto_increment+413>:	movzbl 0x7(%rsi),%eax
0x00000000008e0d01 <retrieve_auto_increment+417>:	shl    $0x18,%eax
0x00000000008e0d04 <retrieve_auto_increment+420>:	add    %eax,%ecx
0x00000000008e0d06 <retrieve_auto_increment+422>:	movzbl 0x2(%rsi),%eax
0x00000000008e0d0a <retrieve_auto_increment+426>:	shl    $0x20,%rcx
0x00000000008e0d0e <retrieve_auto_increment+430>:	shl    $0x10,%eax
0x00000000008e0d11 <retrieve_auto_increment+433>:	add    %eax,%edx
0x00000000008e0d13 <retrieve_auto_increment+435>:	movzbl (%rsi),%eax
0x00000000008e0d16 <retrieve_auto_increment+438>:	add    %eax,%edx
0x00000000008e0d18 <retrieve_auto_increment+440>:	movzbl 0x3(%rsi),%eax
0x00000000008e0d1c <retrieve_auto_increment+444>:	shl    $0x18,%eax
0x00000000008e0d1f <retrieve_auto_increment+447>:	add    %eax,%edx
0x00000000008e0d21 <retrieve_auto_increment+449>:	lea    (%rcx,%rdx,1),%rax
0x00000000008e0d25 <retrieve_auto_increment+453>:	jmpq   0x8e0bd0 <retrieve_auto_increment+112>
...

Are you sure the compiler on your machine generates 64-bit code with those options? What is the output of "gcc -dM -E - </dev/null | grep 86" in your case?
[18 May 2011 5:03] Jonas Oreland
I got the "unoptimized" versions...does it matter that I tried debug build...
probably not...

gcc (Gentoo 4.4.3-r2 p1.2) 4.4.3
Linux eel 2.6.34.5-jonas #5 SMP Mon Nov 22 14:03:35 CET 2010 x86_64 Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz GenuineIntel GNU/Linux

[fk-patches/]:jonas@eel:~/src/telco-7.0/storage/myisam> file mi_key.o 
mi_key.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
[18 May 2011 17:13] Sinisa Milivojevic
Kaamos, Jonas,

If I insist on the tuning to be done for Nocona, which is strictly 64-bit CPU, I get fully optimized code, with gcc 4.0.1. I will try 4.2.1 once I buy USB disk, do backup, install Snow Leopard etc .... ;o)

However, with -m64 I get non-optimized code, which is kind of strange. So, fixing is necessary, but could you tell me why do you think that this is a bug in our code and not in GCC ???

Why couldn't they (GCC) support i386 on 64-bit as well ????
[5 Jun 2013 12:27] Tor Didriksen
Fixed in 5.6 by patch
tor.didriksen@oracle.com-20120111093352-wca2tmilh4sq9sd5
[5 Jun 2013 16:45] Paul Dubois
Noted in 5.6.13 changelog.

Unoptimized versions of the xxxkorr() macros in my_global.h were used
on 64-bit x86 processors.