Bug #88399 Using CAS for trylock in place of TAS for EventMutex (arm64)
Submitted: 8 Nov 2017 8:39 Modified: 20 Dec 2017 3:30
Reporter: Debayan Ghosh (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S5 (Performance)
Version: 5.7, 8.0
OS: Linux
Assigned to:
CPU Architecture: ARM
Tags: Contribution, mysql-5.7*, mysql-8.0

[8 Nov 2017 8:39] Debayan Ghosh
Description:
Hi,

Currently the MySQL EventMutex code uses test-and-set (TAS) semantics when trying to acquire a lock.

File: storage/innobase/include/ib0mutex.h

bool tas_lock() UNIV_NOTHROW
{
    return (TAS(&m_lock_word, MUTEX_STATE_LOCKED)
             == MUTEX_STATE_UNLOCKED);
}

When contention is high, with several threads attempting the atomic exchange at the same time, I find that compare-and-swap (__atomic_compare_exchange) performs noticeably better than test-and-set on some arm64 platforms.

bool cas_lock() UNIV_NOTHROW
{
    return (CAS(&m_lock_word, MUTEX_STATE_UNLOCKED, MUTEX_STATE_LOCKED)
             == MUTEX_STATE_UNLOCKED);
}
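
To make the difference concrete, here is a minimal standalone sketch of the two try-lock variants using std::atomic. It is not the InnoDB code: DemoMutex, its lock_word member, and the 0/1 state values are stand-ins invented for illustration. The point it shows is that an exchange always stores LOCKED, even when the lock is already held, while a compare-exchange leaves the word untouched when the comparison fails. That may be why CAS behaves better under contention, but the report does not confirm the cause.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the InnoDB lock-word states (values assumed).
enum : uint32_t { MUTEX_STATE_UNLOCKED = 0, MUTEX_STATE_LOCKED = 1 };

struct DemoMutex {
    std::atomic<uint32_t> lock_word{MUTEX_STATE_UNLOCKED};

    // Test-and-set: unconditionally stores LOCKED and returns the old value,
    // so a failed attempt still writes to the lock word.
    bool tas_lock() noexcept {
        return lock_word.exchange(MUTEX_STATE_LOCKED) == MUTEX_STATE_UNLOCKED;
    }

    // Compare-and-swap: stores LOCKED only if the word is currently UNLOCKED;
    // a failed attempt does not modify the lock word.
    bool cas_lock() noexcept {
        uint32_t expected = MUTEX_STATE_UNLOCKED;
        return lock_word.compare_exchange_strong(expected, MUTEX_STATE_LOCKED);
    }

    void unlock() noexcept { lock_word.store(MUTEX_STATE_UNLOCKED); }
};
```

Both functions return true only for the caller that actually took the lock, so they are interchangeable from the caller's point of view.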

I also see that the futex lock implementation uses CAS for its try-lock.

I used the latest sysbench 1.1 OLTP update/write-only benchmarks to test the impact. The improvement was seen with 32 or more threads.

In addition, weakening the memory ordering of the atomic_compare_exchange from ATOMIC_SEQ_CST/ATOMIC_SEQ_CST to ATOMIC_ACQUIRE/ATOMIC_RELAXED (success/failure) gives some additional improvement, but this would need to be verified on all other platforms and scenarios.
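
As a rough illustration of the weaker ordering, the sketch below expresses the same try-lock with std::atomic, using acquire ordering on success and relaxed ordering on failure. RelaxedDemoMutex and the 0/1 lock-word encoding are hypothetical, and the release store in unlock() is an assumption about how the unlock side would pair with it, not something taken from the patch.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical mutex; 0 = unlocked, 1 = locked (assumed encoding).
struct RelaxedDemoMutex {
    std::atomic<uint32_t> lock_word{0};

    bool try_lock() noexcept {
        uint32_t expected = 0;
        return lock_word.compare_exchange_strong(
            expected, 1,
            std::memory_order_acquire,   // success: later accesses stay after the lock
            std::memory_order_relaxed);  // failure: nothing was acquired, no ordering needed
    }

    // Assumed counterpart: a release store so that writes made inside the
    // critical section are visible to the next thread that acquires the lock.
    void unlock() noexcept { lock_word.store(0, std::memory_order_release); }
};
```

Whether acquire/relaxed is sufficient for every caller of the real mutex is exactly the part that needs checking on other platforms.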

How to repeat:
Run the sysbench 1.1 OLTP write/update benchmarks on ARM64 platforms with 32 or more threads.
[8 Nov 2017 8:53] Debayan Ghosh
Patch attached.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-InnoDB-Use-CAS-for-Eventmutex-trylock.patch (application/octet-stream, text), 2.40 KiB.

[8 Nov 2017 9:10] MySQL Verification Team
Hello Debayan,

Thank you for the report and contribution.

Thanks,
Umesh
[21 Nov 2017 15:37] Debayan Ghosh
Any comments on this one?

Has anyone seen the impact of this on other platforms, including PPC64?
[8 Dec 2017 15:43] Eric Anger
I have tested this patch on several different Arm platforms and have seen it improve performance for large core counts under high lock contention.
[15 Dec 2017 17:25] Daniel Frazier
I tested on a 128 CPU ppc64le system, and I also see improvements there.

My command line:
$ sysbench --max-requests=0 --test=oltp --num-threads=128 --max-time=60 --mysql-user=testuser --mysql-password=testpassword run

sysbench.orig.1:    transactions:                        60718  (1005.43 per sec.)
sysbench.orig.1:    deadlocks:                           8173   (135.34 per sec.)
sysbench.orig.1:    read/write requests:                 1284365 (21267.85 per sec.)
sysbench.orig.1:    other operations:                    129609 (2146.20 per sec.)

sysbench.cas.1:    transactions:                        70365  (1170.24 per sec.)
sysbench.cas.1:    deadlocks:                           9346   (155.43 per sec.)
sysbench.cas.1:    read/write requests:                 1486625 (24724.10 per sec.)
sysbench.cas.1:    other operations:                    150076 (2495.92 per sec.)