Bug #72829 LOCK_grant is major contention point, leaves 30-40% idle CPU.
Submitted: 2 Jun 2014 2:02 Modified: 27 Mar 2015 14:52
Reporter: Stewart Smith
Status: Closed
Category: MySQL Server: Security: Privileges Severity: S5 (Performance)
Version: 5.7.4-m14 OS: Linux
Assigned to: Dmitry Lenev CPU Architecture: Any
Tags: contention, mutex, performance, PowerPC

[2 Jun 2014 2:02] Stewart Smith
Description:
While running key-lookup read-only benchmarks on a dual-socket POWER8, similar to Dimitri's setup that reached 500K QPS here: http://dimitrik.free.fr/blog/archives/2013/10/mysql-performance-the-road-to-500k-qps-with-... I encountered some contended mutexes when trying to use all of the CPU in the system.

With my patches I have posted in other bug reports, I could get good performance that many people would be extremely happy with, but there was still 30-40% idle CPU on the system as reported by top.

Analysis of this with perf identified the LOCK_grant rwlock as the major culprit, due to contention on acquiring a rdlock on LOCK_grant.

After commenting out the acquisition of the rdlock (safe enough for a benchmark; likely not for production, where GRANT statements are executed), I got a 50% increase in performance without even trying (and roughly 60% if I pushed the server).

i.e. a 50% increase in queries per second simply by alleviating the contention point of acquiring a single read lock to check table permissions.

(I'll extract the exact patch snippet a bit later; that machine is currently down.)

How to repeat:
Benchmark on a powerful system; I used a dual-socket POWER8 system. You may be more likely to see this issue on a system with several NUMA nodes, but I have not investigated this very far.

I am not sure how this contention would present itself on Intel systems, but on decently large Intel SMP systems you may be able to get a gain by doing the same (it depends on how visible the bouncing of the cacheline holding the rwlock between CPUs is).

Suggested fix:
There are several approaches that can be taken:
1) As this is a mostly read-only structure (unless you go and modify grants), you could split the lock into several, with a writer having to acquire all of them. This would likely solve the problem pretty well, although it looks a bit funny :)
2) RCU is made for solving this problem.
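The RCU-style approach can be approximated in standard C++ without a full RCU implementation by treating the grant data as an immutable snapshot behind an atomically swapped pointer. This is only an illustrative sketch, not MySQL's actual structures (the `GrantTable` type and its contents are hypothetical): readers take a reference-counted snapshot without any rwlock, and a rare writer (a GRANT/REVOKE) builds a complete copy and publishes it with one atomic store.

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for the privilege data guarded by LOCK_grant.
struct GrantTable {
    std::map<std::string, int> privs;  // user -> privilege bits
};

// Readers never take a lock: they atomically load a shared_ptr snapshot.
// Writers copy the whole table, modify the copy, and swap it in.
class GrantCache {
public:
    GrantCache() : table_(std::make_shared<GrantTable>()) {}

    // Read path: the snapshot stays valid for as long as the caller
    // holds the shared_ptr, even if a writer swaps in a new table.
    std::shared_ptr<const GrantTable> snapshot() const {
        return std::atomic_load(&table_);
    }

    // Copy-on-write update path; writers are expected to be rare.
    void update(const std::string& user, int bits) {
        auto next = std::make_shared<GrantTable>(*std::atomic_load(&table_));
        next->privs[user] = bits;
        std::atomic_store(&table_, std::shared_ptr<const GrantTable>(next));
    }

private:
    std::shared_ptr<const GrantTable> table_;
};

// Example lookup: 0 means "no privileges recorded" in this sketch.
int lookup(const GrantCache& cache, const std::string& user) {
    auto snap = cache.snapshot();
    auto it = snap->privs.find(user);
    return it == snap->privs.end() ? 0 : it->second;
}
```

One honest caveat: the `shared_ptr` refcount is itself a contended cacheline under this scheme (and C++20 deprecates the free `std::atomic_load`/`std::atomic_store` overloads in favour of `std::atomic<std::shared_ptr>`), so a real RCU, which avoids per-read refcounting entirely, would scale better still.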
[3 Jun 2014 5:45] zhai weixiang
I have observed a similar stall while testing MySQL 5.6.16 using sysbench (oltp_simple.lua, simple queries by primary key), 100 threads.

And below is the backtrace of the stalled threads:
     26 __lll_lock_wait(libpthread.so.0),pthread_rwlock_rdlock(libpthread.so.0),inline_mysql_rwlock_rdlock(mysql_thread.h:865),check_grant(mysql_thread.h:865),check_table_access(sql_parse.cc:6073),select_precheck(sql_parse.cc:7817),mysql_execute_command(sql_parse.cc:3025),mysql_parse(sql_parse.cc:6719),dispatch_command(sql_parse.cc:1629),do_handle_one_connection(sql_connect.cc:982),handle_one_connection(sql_connect.cc:898),pfs_spawn_thread(pfs.cc:1858),start_thread(libpthread.so.0),clone(libc.so.6)

mysql> SELECT COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT, EVENT_NAME FROM events_waits_summary_global_by_event_name where COUNT_STAR > 0 and EVENT_NAME like 'wait/synch/%' order by SUM_TIMER_WAIT desc limit 10;
+------------+-----------------+----------------+-------------------------------------------+
| COUNT_STAR | SUM_TIMER_WAIT  | AVG_TIMER_WAIT | EVENT_NAME                                |
+------------+-----------------+----------------+-------------------------------------------+
|  253517384 | 693956170419380 |        2737208 | wait/synch/mutex/innodb/trx_sys_mutex     |
|   84200619 | 170156170825292 |        2020424 | wait/synch/rwlock/sql/LOCK_grant          |
|  169394249 | 162045674872324 |         956584 | wait/synch/mutex/sql/LOCK_table_cache     |
|  169266465 | 102097289542516 |         602988 | wait/synch/rwlock/sql/MDL_lock::rwlock    |
|  158900882 |  85391894604336 |         537152 | wait/synch/mutex/sql/MDL_map::mutex       |
|  422538610 |  47745352665600 |         112924 | wait/synch/mutex/sql/THD::LOCK_thd_data   |
|   86263377 |  40824446804108 |         473060 | wait/synch/rwlock/innodb/btr_search_latch |
|  168992808 |  35817481899112 |         211896 | wait/synch/mutex/mysys/THR_LOCK::mutex    |
|    1204410 |   2312177205408 |        1919708 | wait/synch/mutex/innodb/dict_sys_mutex    |
|     800543 |   2139313523784 |        2672244 | wait/synch/mutex/sql/LOCK_open            |
+------------+-----------------+----------------+-------------------------------------------+
10 rows in set (0.27 sec)
[3 Jun 2014 7:48] Stewart Smith
The difference for sysbench point selects is the difference between ~600K QPS and 1 million QPS. See https://www.flamingspork.com/blog/2014/06/03/1-million-sql-queries-per-second-mysql-5-7-on...

Basically, LOCK_grant is what stands in the way of 1 million queries per second with "unmodified" (apart from my "make it work on POWER" patches) MySQL 5.7.
[3 Jun 2014 19:49] Kristofer Pettersson
Is this the situation for a few-user system, or are there many users? I think there is a linear search hiding in there too, if I'm not mistaken.
[3 Jun 2014 23:03] Stewart Smith
In my benchmark there is only root, with no password and the default setup of "root can access everything".

The connections are merely sysbench.

The overhead is in having the cacheline with the rwlock bounce between CPUs.
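The cacheline-bouncing effect described here is easy to reproduce in miniature. The sketch below is a standalone illustration, not MySQL code: two threads each hammer their own atomic counter. In the `SharedLine` layout both counters fit in one 64-byte line, so every increment invalidates the line in the other core's cache; in `PaddedLines`, `alignas(64)` gives each counter its own line. The final counts are identical either way, only the wall-clock cost differs, which is exactly why this shows up in perf rather than in query results.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr int kIters = 1000000;

// Both counters packed into one 64-byte cache line: every increment
// by one thread invalidates the line in the other core's cache, so
// the line "bounces" between CPUs on each update (false sharing).
struct SharedLine {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter on its own cache line: no false sharing.
struct PaddedLines {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Increment both counters from two threads; returns the combined
// total, which is the same for either layout.
template <typename Layout>
std::uint64_t hammer() {
    Layout c;
    std::thread t1([&] {
        for (int i = 0; i < kIters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < kIters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

How large the timing gap is depends entirely on the hardware (core count, NUMA topology, coherence protocol); `perf stat` on the two variants is the simplest way to see it.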
[4 Jun 2014 0:46] Stewart Smith
On POWER, you can easily see this in perf by looking at how long the load instruction for the rwlock itself takes. I suspect there's a way to extract this data on Intel too.
[1 Jul 2014 3:20] Stewart Smith
Any chance of having this Verified? It seems that others can reproduce it with different workloads, and simply by code review you can see that the rwlock can easily be the bottleneck.
[22 Oct 2014 9:48] zhai weixiang
While testing 5.7.5, LOCK_grant became the major contention point during a PK-query test, after THR_LOCK was removed completely.

Using sysbench, querying by PK, 200 tables with 10 records each.

From performance schema:

root@performance_schema 05:37:18>SELECT COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT, EVENT_NAME FROM events_waits_summary_global_by_event_name where COUNT_STAR > 0 and EVENT_NAME like 'wait/synch/%' order by SUM_TIMER_WAIT desc limit 20;
+------------+----------------+----------------+---------------------------------------------------+
| COUNT_STAR | SUM_TIMER_WAIT | AVG_TIMER_WAIT | EVENT_NAME                                        |
+------------+----------------+----------------+---------------------------------------------------+
|   26086077 | 17068952969200 |         654000 | wait/synch/rwlock/sql/LOCK_grant                  |
|   78512461 | 16715117127344 |         212768 | wait/synch/sxlock/innodb/hash_table_locks         |
|  130770402 | 15605301433540 |         119028 | wait/synch/mutex/sql/THD::LOCK_query_plan         |
|   52314260 | 12553580466128 |         239800 | wait/synch/mutex/sql/LOCK_table_cache             |
|   78446571 |  9825008327284 |         125132 | wait/synch/mutex/sql/THD::LOCK_thd_data           |
|   26175928 |  6455036525220 |         246340 | wait/synch/sxlock/innodb/index_tree_rw_lock       |
|   52299088 |  5974745787424 |         114232 | wait/synch/mutex/sql/THD::LOCK_thd_query          |
|       7568 |     1095036672 |         144316 | wait/synch/mutex/innodb/flush_list_mutex          |
|       7656 |      885055584 |         115540 | wait/synch/mutex/innodb/buf_pool_mutex            |
|       7603 |      638888240 |          83712 | wait/synch/mutex/sql/LOCK_global_system_variables |
|        242 |      354649376 |        1465396 | wait/synch/sxlock/innodb/dict_operation_lock      |
|       3445 |      351421232 |         101588 | wait/synch/mutex/innodb/dict_sys_mutex            |
|       1602 |      339106848 |         211460 | wait/synch/mutex/innodb/innobase_share_mutex      |
|       1602 |      321693008 |         200560 | wait/synch/mutex/sql/LOCK_open                    |
|       2672 |      315876768 |         118156 | wait/synch/mutex/sql/LOCK_plugin                  |
|       1050 |      308335712 |         293428 | wait/synch/mutex/sql/LOCK_connection_count        |
|       1602 |      235623120 |         146932 | wait/synch/mutex/innodb/file_format_max_mutex     |
|       1281 |      217084400 |         169168 | wait/synch/mutex/sql/LOCK_thd_list                |
|        983 |      209930512 |         213204 | wait/synch/mutex/sql/LOCK_status                  |
|       1380 |      199285136 |         144316 | wait/synch/mutex/sql/LOCK_user_conn               |
+------------+----------------+----------------+---------------------------------------------------+
20 rows in set (0.50 sec)
[12 Nov 2014 23:49] Matthew Lord
We've also seen this in our performance testing with MySQL 5.7. We want to do some refactoring around this mutex. 

Marking it as a verified and known issue.
[14 Nov 2014 4:32] Matthew Lord
Thanks, guys! :)
[9 Mar 2015 12:05] Ståle Deraas
Posted by developer:
 
This bug was fixed in WL#8355, and pushed to 5.7.
[10 Mar 2015 1:59] Stewart Smith
I look forward to testing the fix (I can't see anything more recent than 5.7.5 and the worklog is private, so I cannot provide feedback yet).
[27 Mar 2015 14:52] Paul DuBois
Noted in 5.7.7 changelog.

To address a scalability bottleneck for some workloads where
LOCK_grant is locked in read-mode, LOCK_grant locks are now
partitioned. Read lock requests on LOCK_grant now acquire one of
multiple LOCK_grant partitions. Write locks must acquire all
partitions.
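The partitioning scheme the changelog describes can be sketched roughly as follows. This is a simplified standalone illustration, not the actual 5.7.7 code; the partition count and the choice of hashing the thread id to pick a partition are assumptions. Readers spread their rdlocks across N separately cache-line-aligned rwlocks, while a writer acquires every partition in a fixed order.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <shared_mutex>
#include <thread>

// A read-mostly lock split into partitions. A reader takes just one
// partition (chosen by thread id), so rdlock traffic is spread over
// kParts cache lines instead of hammering a single one. A writer
// acquires every partition, always in index order so that two
// concurrent writers cannot deadlock.
class PartitionedRWLock {
    static constexpr std::size_t kParts = 8;  // assumed; tunable
    struct alignas(64) Part {                 // one cache line per lock
        std::shared_mutex m;
    };
    std::array<Part, kParts> parts_;

    std::size_t my_part() const {
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kParts;
    }

public:
    // Returns the partition index; the caller passes it back to rdunlock.
    std::size_t rdlock() {
        std::size_t p = my_part();
        parts_[p].m.lock_shared();
        return p;
    }
    void rdunlock(std::size_t p) { parts_[p].m.unlock_shared(); }

    void wrlock()   { for (auto& p : parts_) p.m.lock(); }
    void wrunlock() { for (auto& p : parts_) p.m.unlock(); }
};
```

The trade-off matches the workload: read locks (every permission check) touch one line, while write locks (GRANT/REVOKE, which are rare) pay N acquisitions.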