| Bug #34175 | Use GCC-provided atomic lock operations to replace pthread_mutex_lock functions | ||
|---|---|---|---|
| Submitted: | 30 Jan 2008 20:24 | Modified: | 17 Mar 2009 11:48 |
| Reporter: | Larry Zhou | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S4 (Feature request) |
| Version: | 5.0 | OS: | Linux (2.6.18.5-gg16-mixed64-32) |
| Assigned to: | Marko Mäkelä | CPU Architecture: | Any |
| Tags: | atomic, innodb, lock, mutex, pthread, sync | ||
[30 Jan 2008 20:26]
Larry Zhou
change OS details.
[31 Jan 2008 9:19]
Susanne Ebrecht
Many thanks for writing a feature request.
[5 Feb 2008 23:37]
Larry Zhou
Here are other things I did: 1. make lock_word smaller, using "byte" instead of "ulint". 2. remove os_fast_mutex_t in mutex_struct if lock_word is used for lock. That saved 40 bytes on 64-bit machine. I have tested above changes, and all tests I am aware of passed.
[5 Feb 2008 23:40]
Larry Zhou
My comment 3 ( private to MySQL developer) is not quite right. For WIN32 and WIN64, different inline assembly code is needed, since we are talking about size of pointer. I was mistaken.
[6 Feb 2008 15:23]
MySQL Verification Team
Thank you, Mr. Zhou, This is very interesting contribution. Few questions for me: * are there any solutions for GCC 3.*.* ??? * are there any solutions for Windows * are there any solutions for other CPU's, like SPARC ???
[6 Feb 2008 15:28]
Heikki Tuuri
Larry, does changing the lock word to a byte help? The processor may actually do more work to change a byte in memory than to change a 32-bit or 64-bit unsigned integer. I replaced the inline assembly for test-and-set around 2003 because gcc-2.96 (?) was not able to compile it. And using pthread_mutex_trylock() as 'test-and-set' is portable. But looks like it is time to go back to inline assembly again. I am assigning this feature request to Marko, who is right now working on this matter. Regards, Heikki
[6 Feb 2008 17:03]
Larry Zhou
For Sinisa Milivojevic: * are there any solutions for GCC 3.*.* ??? Yes, here is the code. Not thoroughly tested. +#if defined(__GNUC__) +/* The following piece of code can be found + in function AtomicExchange of file atomicops-internals-x86.h + under http://google-perftools.googlecode.com/svn/trunk/src/base/ +*/ +#if defined(__x86_64__) +#define TAS(_lw, _res) \ + asm volatile("xchgq %1,%0":"=r"(_res):"m"(*_lw),"0"(_res):"memory") +#elif defined(__i386__) +#define TAS(_lw, _res) \ + asm volatile("xchgl %1,%0":"=r"(_res):"m"(*_lw),"0"(_res):"memory") +#endif +/* TAS is only defined for GNUC on x86_64 and i386, + that is where GNUC x86 inline assembly can be used */ +#endif For lock, use TAS( lw, 1); for unlock, use TAS(lw, 0); We did very limited tests, it works. Since we will never, ever move back to GCC 3.x, we do bother to do more testing. * are there any solutions for Windows Yes, for _WIN32 LOCK: __asm MOV ECX, lw __asm MOV EDX, 1 __asm XCHG DL, BYTE PTR [ECX] __asm MOV res, DL UNLOCK: __asm MOV EDX, 0 __asm MOV ECX, lw __asm XCHG DL, BYTE PTR [ECX] It should be very similar for _WIN64. * are there any solutions for other CPU's, like SPARC ??? Sort of. Whenever the compiler is GCC with version greater than 4.1.0, __sync_lock_test_and_set can be used throughout. No tests had been done. I have been reading a lot lately. There were AMD hardware bug, and GCC bug reported by Boehm. So using __sync_lock_test_and_set is most portable one than other implementation with full memory barrier. ( for example, mfence in X86 ) Let me know if you have more questions.
[6 Feb 2008 17:33]
Larry Zhou
For Heikki Tuuri:
> does changing the lock word to a byte help? The processor may actually do more
> work to change a byte in memory than to change a 32-bit or 64-bit unsigned integer.
At least it doesn't hurt. Since we use xchg for both lock and unlock.
Here is the assembly code:
lock:
00000000000005b0 <mutex_enter_noninline>:
5b0: b8 01 00 00 00 mov $0x1,%eax
5b5: 86 47 08 xchg %al,0x8(%rdi)
5b8: 84 c0 test %al,%al
5ba: 75 04 jne 5c0 <mutex_enter_noninline+0x10>
unlock:
00000000000003f0 <mutex_exit_noninline>:
3f0: 48 8d 47 08 lea 0x8(%rdi),%rax
3f4: 31 d2 xor %edx,%edx
3f6: 86 10 xchg %dl,(%rax)
3f8: 48 8b 47 10 mov 0x10(%rdi),%rax
3fc: 48 85 c0 test %rax,%rax
3ff: 75 02 jne 403 <mutex_exit_noninline+0x13>
>I replaced the inline assembly for test-and-set around 2003 because gcc-2.96 (?) was not
> able to compile it. And using pthread_mutex_trylock() as 'test-and-set' is portable.
I think so.
> But looks like it is time to go back to inline assembly again. I am assigning this feature
> request to Marko, who is right now working on this matter.
Let me know if you have any questions.
Thanks,
Larry
[6 Feb 2008 18:12]
Sergei Golubchik
check also gcc's __sync_* builtins. Should be as good as assembly but work on more platforms.
[6 Feb 2008 18:17]
Larry Zhou
yes, I would guess that GCC's __sync_ would work with versions greater than 4.1.0. regardless of CPUs. But, yes, here is the but part. Only __sync_lock_test_and_set, as lock and unlock, works reliably on 64-bit Opteron machines compiled with GCC 4.1.1.
[6 Feb 2008 18:39]
Sergei Golubchik
interesting, could you elaborate please ? what does it mean that __sync_lock_test_and_set works "unreliably" ?
[6 Feb 2008 18:39]
Sergei Golubchik
eh, sorry. not __sync_lock_test_and_set bot others work unreliably. anyway, what does it mean ?
[6 Feb 2008 19:21]
Larry Zhou
Sergei, In the ideal world, I would just use __sync_lock_test_and_set for lock and __sync_lock_release for unlock. It doesn't work now, at least GCC 4.1.1 with 64-bit opteron. reload tests got deadlocked pretty quickly. There are all sorts of sync primitives and helpers. For example, compiler barrier, memory barrier, here is google implementation: http://google-perftools.googlecode.com/svn/trunk/src/base/atomicops-internals-x86.h Hans Boehm's claims ( from his talk and PDF file ) http://www.youtube.com/watch?v=mrvAqvtWYb4 http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++threads.pdf Page 11: * gcc __sync_synchronize() “full memory barrier”. erroneously generates no-op (& perhaps has to). * P4 lfence, sfence instructions appeared to be no-ops in most cases, but the spec didn't say. After having gone thru all this, we just want to be on the safe side of it. From _WIN32 implementation, both lock and unlock use xchg on x86, we think that is probably the safest way at present time.
[17 Mar 2009 11:48]
Marko Mäkelä
The InnoDB Plugin 1.0.3 makes use of GCC built-in functions for atomic memory access where available. It was released on March 11, 2009.

Description: The current implementation of InnoDb lock for x86_64 is to use pthread_mutex_* by default. Using atomic operation provided by GCC would significantly increase performance, especially for multi-core multi-thread applications. How to repeat: If you use current TAS macro, it doesn't work. Also the micro specifies for x86 only, without x86_64. Suggested fix: Two files need to be patched, innobase/include/sync0sync.ic and innobase/sync/sync0sync.c Here are the diffs. ===================================== $ cat sync0sync.ic.diff 8a9,18 > #if defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86) > /* %z0: Use the size of operand %0 which in our case is *m to determine > instruction size, it should end up as xchgl. "1" in the input constraint, > says that "in" has to go in the same place as "out".*/ > #define TAS(m, in, out) \ > asm volatile ("xchg%z0 %2, %0" \ > : "=g" (*(m)), "=r" (out) \ > : "1" (in)) /* Note: "1" here refers to "=r" (out) */ > #endif > 87,94c97,102 < #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \ < (defined(__x86_64__) || defined(__i386__)) < /* GNUC 4.1.0 and later versions provide built-in functions < for atomic memory access. see online doc for details at < http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html < */ < ulint* lw = &(mutex->lock_word); < return __sync_lock_test_and_set(lw, 1); --- > #elif defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86) > ulint res; > > TAS(&mutex->lock_word, 1, res); > > return(res); 131,142c139,142 < #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \ < (defined(__x86_64__) || defined(__i386__)) < /* GNUC 4.1.0 and later versions provide built-in functions < for atomic memory access. see online doc for details at < http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html < */ < ulint* lw = &(mutex->lock_word); < /* __sync_lock_release(lw); */ < /* In theory __sync_lock_release should be used to release the lock. < Unfortunately, it does not work properly alone. The workaround is < that more conservative __sync_lock_test_and_set is used instead. */ < __sync_lock_test_and_set(lw, 0); --- > #elif defined(not_defined) && defined(__GNUC__) && defined(UNIV_INTEL_X86) > ulint res; > > TAS(&mutex->lock_word, 0, res); ================================================================== $ cat sync0sync.c.diff 243,245d242 < #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \ < (defined(__x86_64__) || defined(__i386__)) < mutex_reset_lock_word(mutex); 331,335c328,329 < #if defined(_WIN32) && defined(UNIV_CAN_USE_X86_ASSEMBLER) < #elif ( (__GNUC__ == 4) && (__GNUC_MINOR__ >= 1) || __GNUC__ > 4) && \ < (defined(__x86_64__) || defined(__i386__)) < #else < os_fast_mutex_free("sync0sync mutex_free", &(mutex->os_fast_mutex)); --- > #if !defined(_WIN32) || !defined(UNIV_CAN_USE_X86_ASSEMBLER) > os_fast_mutex_free(&(mutex->os_fast_mutex)); =================================================================