MySQL Bugs: #34175: Use GCC-provided atomic lock operations to replace pthread_mutex

Bug #34175	Use GCC-provided atomic lock operations to replace pthread_mutex_lock functions
Submitted:	30 Jan 2008 20:24	Modified:	17 Mar 2009 11:48
Reporter:	Larry Zhou	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.0	OS:	Linux (2.6.18.5-gg16-mixed64-32)
Assigned to:	Marko Mäkelä	CPU Architecture:	Any
Tags:	atomic, innodb, lock, mutex, pthread, sync

Description:
The current implementation of InnoDb lock for x86_64 is to use pthread_mutex_* by default.
Using atomic operation provided by GCC would significantly increase performance, especially for multi-core multi-thread applications.

How to repeat:

If you use current TAS macro, it doesn't work.
Also the micro specifies for x86 only, without x86_64.

Suggested fix:
Two files need to be patched, innobase/include/sync0sync.ic and innobase/sync/sync0sync.c

Here are the diffs.
=====================================
$ cat sync0sync.ic.diff
8a9,18
> #if defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86)
> /* %z0: Use the size of operand %0 which in our case is *m to determine
> instruction size, it should end up as xchgl. "1" in the input constraint,
> says that "in" has to go in the same place as "out".*/
> #define TAS(m, in, out) \
>       asm volatile ("xchg%z0 %2, %0" \
>       : "=g" (*(m)), "=r" (out) \
>       : "1" (in))     /* Note: "1" here refers to "=r" (out) */
> #endif
>
87,94c97,102
< #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \
<       (defined(__x86_64__) || defined(__i386__))
<       /* GNUC 4.1.0 and later versions provide built-in functions
<       for atomic memory access. see online doc for details at
<       http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
<       */
<       ulint*  lw = &(mutex->lock_word);
<       return __sync_lock_test_and_set(lw, 1);
---
> #elif defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86)
>       ulint   res;
>
>       TAS(&mutex->lock_word, 1, res);
>
>       return(res);
131,142c139,142
< #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \
<       (defined(__x86_64__) || defined(__i386__))
<       /* GNUC 4.1.0 and later versions provide built-in functions
<       for atomic memory access. see online doc for details at
<       http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html
<       */
<       ulint*  lw = &(mutex->lock_word);
<       /* __sync_lock_release(lw); */
<       /* In theory __sync_lock_release should be used to release the lock.
<       Unfortunately, it does not work properly alone. The workaround is
<       that more conservative __sync_lock_test_and_set is used instead. */
<       __sync_lock_test_and_set(lw, 0);
---
> #elif defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86)
>       ulint   res;
>
>       TAS(&mutex->lock_word, 0, res);
==================================================================
$ cat sync0sync.c.diff
243,245d242
< #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \
<       (defined(__x86_64__) || defined(__i386__))
<       mutex_reset_lock_word(mutex);
331,335c328,329
< #if defined(_WIN32) && defined(UNIV_CAN_USE_X86_ASSEMBLER)
< #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \
<       (defined(__x86_64__) || defined(__i386__))
< #else
<       os_fast_mutex_free("sync0sync mutex_free", &(mutex->os_fast_mutex));
---
> #if !defined(_WIN32) || !defined(UNIV_CAN_USE_X86_ASSEMBLER)
>       os_fast_mutex_free(&(mutex->os_fast_mutex));
=================================================================

change OS details.

Many thanks for writing a feature request.

Here are other things I did:
1. make lock_word smaller, using "byte" instead of "ulint".
2. remove os_fast_mutex_t in mutex_struct if lock_word is used for lock. That saved 40 bytes on 64-bit machine.

I have tested above changes, and all tests I am aware of passed.

My comment 3 ( private to MySQL developer) is not quite right.
For WIN32 and WIN64, different inline assembly code is needed, since we are talking about size of pointer. I was mistaken.

Thank you, Mr. Zhou,

This is very interesting contribution.

Few questions for me:

* are there any solutions for GCC 3.*.* ???
* are there any solutions for Windows
* are there any solutions for other CPU's, like SPARC ???

Larry,

does changing the lock word to a byte help? The processor may actually do more work to change a byte in memory than to change a 32-bit or 64-bit unsigned integer.

I replaced the inline assembly for test-and-set around 2003 because gcc-2.96 (?) was not able to compile it. And using pthread_mutex_trylock() as 'test-and-set' is portable.

But looks like it is time to go back to inline assembly again. I am assigning this feature request to Marko, who is right now working on this matter.

Regards,

Heikki

For Sinisa Milivojevic:

* are there any solutions for GCC 3.*.* ???

Yes, here is the code. Not thoroughly tested.

+#if defined(__GNUC__)
+/* The following piece of code can be found
+ in function AtomicExchange of file atomicops-internals-x86.h
+ under http://google-perftools.googlecode.com/svn/trunk/src/base/
+*/
+#if defined(__x86_64__)
+#define TAS(_lw, _res) \
+ asm volatile("xchgq %1,%0":"=r"(_res):"m"(*_lw),"0"(_res):"memory")
+#elif defined(__i386__)
+#define TAS(_lw, _res) \
+ asm volatile("xchgl %1,%0":"=r"(_res):"m"(*_lw),"0"(_res):"memory")
+#endif
+/* TAS is only defined for GNUC on x86_64 and i386,
+ that is where GNUC x86 inline assembly can be used */
+#endif

For lock, use TAS( lw, 1);
for unlock, use TAS(lw, 0);

We did very limited tests, it works.
Since we will never, ever move back to GCC 3.x, we do bother to do more testing.

* are there any solutions for Windows

Yes, for _WIN32

LOCK:

        __asm   MOV     ECX, lw
        __asm   MOV     EDX, 1
        __asm   XCHG    DL,  BYTE PTR [ECX]
        __asm   MOV     res, DL

UNLOCK:
        __asm   MOV     EDX, 0
        __asm   MOV     ECX, lw
        __asm   XCHG    DL,  BYTE PTR [ECX]

It should be very similar for _WIN64.

* are there any solutions for other CPU's, like SPARC ???

Sort of. 
Whenever the compiler is GCC with version greater than
4.1.0, __sync_lock_test_and_set can be used throughout.
No tests had been done. 

I have been reading a lot lately.
There were AMD hardware bug, and GCC bug reported by Boehm.
So using __sync_lock_test_and_set is most portable one than other implementation with full memory barrier. ( for example, mfence in X86 )

Let me know if you have more questions.

For Heikki Tuuri:

> does changing the lock word to a byte help? The processor may actually do more
> work to change a byte in memory than to change a 32-bit or 64-bit unsigned integer.

At least it doesn't hurt. Since we use xchg for both lock and unlock.
Here is the assembly code:

lock:
00000000000005b0 <mutex_enter_noninline>:
     5b0:	b8 01 00 00 00       	mov    $0x1,%eax
     5b5:	86 47 08             	xchg   %al,0x8(%rdi)
     5b8:	84 c0                	test   %al,%al
     5ba:	75 04                	jne    5c0 <mutex_enter_noninline+0x10>  

unlock:
00000000000003f0 <mutex_exit_noninline>:
     3f0:	48 8d 47 08          	lea    0x8(%rdi),%rax
     3f4:	31 d2                	xor    %edx,%edx
     3f6:	86 10                	xchg   %dl,(%rax)
     3f8:	48 8b 47 10          	mov    0x10(%rdi),%rax
     3fc:	48 85 c0             	test   %rax,%rax
     3ff:	75 02                	jne    403 <mutex_exit_noninline+0x13>           

>I replaced the inline assembly for test-and-set around 2003 because gcc-2.96 (?) was not
> able to compile it. And using pthread_mutex_trylock() as 'test-and-set' is portable.

I think so.

> But looks like it is time to go back to inline assembly again. I am assigning this feature
> request to Marko, who is right now working on this matter.

Let me know if you have any questions.

Thanks,

Larry

check also gcc's __sync_* builtins. Should be as good as assembly but work on more platforms.

yes, I would guess that GCC's __sync_ would work with versions greater than 4.1.0.
regardless of CPUs.

But, yes, here is the but part. Only __sync_lock_test_and_set, as lock and unlock, works reliably on 64-bit Opteron machines compiled with GCC 4.1.1.

interesting, could you elaborate please ?
what does it mean that __sync_lock_test_and_set works "unreliably" ?

eh, sorry. not __sync_lock_test_and_set bot others work unreliably. anyway, what does it mean ?

Sergei,

In the ideal world, I would just use __sync_lock_test_and_set for lock and __sync_lock_release for unlock. It doesn't work now, at least GCC 4.1.1 with 64-bit opteron. reload tests got deadlocked pretty quickly.

There are all sorts of sync primitives and helpers. For example, compiler barrier, memory barrier, here is google implementation:
http://google-perftools.googlecode.com/svn/trunk/src/base/atomicops-internals-x86.h

Hans Boehm's claims ( from his talk and PDF file )

http://www.youtube.com/watch?v=mrvAqvtWYb4

http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++threads.pdf

Page 11:
    *  gcc __sync_synchronize() “full memory barrier”. erroneously generates no-op (& perhaps has to).
    * P4 lfence, sfence instructions appeared to be no-ops in most cases, but the spec didn't say.

After having gone thru all this, we just want to be on the safe side of it.
From _WIN32 implementation, both lock and unlock use xchg on x86, we think that is probably the safest way at present time.

The InnoDB Plugin 1.0.3 makes use of GCC built-in functions for atomic memory access where available. It was released on March 11, 2009.