Bug #66473 MySQL 5.6.6-m9 has more mutex contention than MySQL 5.1
Submitted: 21 Aug 2012 1:49   Modified: 25 Feb 2013 16:51
Reporter: Mark Callaghan
Status: Closed
Category: MySQL Server: Locking   Severity: S5 (Performance)
Version: 5.6.6-m9   OS: Any
Assigned to: Jon Olav Hauglid   CPU Architecture: Any

[21 Aug 2012 1:49] Mark Callaghan
Description:
This surprised me given the blog posts from Oracle.

I ran sysbench for a read-only workload with all data cached in InnoDB buffer pool. QPS was better for MySQL 5.1 until very high concurrency. There was a lot more mutex contention in MySQL 5.6 and most of it appears to come from MDL. I will know more once I figure out how to use the performance schema (wish that were easier to use).

My test server has 16 cores and 32 vCPUs; hyperthreading was enabled, so it reports 32 cores.

The test used sysbench where each query fetches all columns from one row with a lookup by PK. The results below are for MySQL 5.1.6X with the Facebook patch and unmodified MySQL 5.6. The numbers below are QPS for 8 -> 1024 concurrent clients. The test uses 8 tables with clients uniformly distributed across the tables (each client is limited to one table). Each table has 1M rows.

    8      16       32      64     128     256    512    1024   clients
56104   113944  212192  277937  177672  118408  67525   32031   5.1fb
53726   107174  183502  233566  238878  210891  180060  148195  5.6.5m8
49960   105759  182709  237572  240187  228502  220709  215144  5.6.6m9

How to repeat:
run sysbench

I used this command line to build MySQL 5.6:
~/bin/cmake . -DCMAKE_INSTALL_PREFIX=/data/56orig -DBUILD_CONFIG=mysql_release -DMYSQL_UNIX_ADDR=/data/56orig/data/mysql.sock -DWITH_EMBEDDED_SERVER=0 -DWITH_PERFSCHEMA_STORAGE_ENGINE=0 -DIGNORE_AIO_CHECK=1

Suggested fix:
reduce mutex contention
[21 Aug 2012 1:50] Mark Callaghan
my.cnf options for MySQL 5.6 and 5.1

For 5.1:

innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_doublewrite=1
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table

max_connections=2000
table_cache=2000
key_buffer_size=200M

innodb_io_capacity=1000
skip_log_bin
query_cache_size=0
query_cache_type=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G
For 5.6.6:

innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_doublewrite=1
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80

innodb_file_format=barracuda
innodb_file_per_table

max_connections=2000
table_open_cache=2000
table_open_cache_instances=4

key_buffer_size=200M
innodb_io_capacity=1000

bind-address = 0.0.0.0
skip_log_bin
query_cache_size=0
query_cache_type=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G
innodb_buffer_pool_instances=4
[21 Aug 2012 1:51] Mark Callaghan
MySQL 5.6 isn't faster, and the cause is mutex contention. I think this is above InnoDB, but I might need to use the performance schema to identify the problem. All InnoDB mutex contention for MySQL 5.1 is from kernel_mutex, which gets hot from transaction start and stop. For 5.6 the context switch rate is much higher and the system CPU time is also higher. The test uses auto-commit SELECT statements, so the 5.6 fast path for InnoDB transaction start should be used.

vmstat for MySQL 5.6.6 at 64 clients, there are 530k context switches/second

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
52  1      0 83037752 608844 44932396    0    0     0   102 400039 432435 49 23 28  1  0
55  1      0 83053400 608844 44932428    0    0     0   108 492252 536340 60 30  9  1  0
45  0      0 83051464 608844 44932504    0    0     0   125 490456 531527 61 29  9  1  0

vmstat for MySQL 5.1fb at 64 clients, there are ~340k context switches/second

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
51  0      0 115455424 107292 15152028    0    0     3    80 573477 337272 72 22  6  0  0
53  0      0 115456008 107292 15152176    0    0     3    77 575436 337216 72 22  6  0  0
56  0      0 115288560 107292 15152340    0    0     3    68 570336 335162 72 22  6  0  0
[21 Aug 2012 2:00] Mark Callaghan
I then ran the Linux "perf" utility to get call graphs for the largest sources of CPU time, and the results show why I think the problem is MDL.

This is the #1 source of CPU time

                     |
                     --- _raw_spin_lock
                        |          
                        |--45.79%-- futex_wake
                        |          do_futex
                        |          sys_futex
                        |          system_call
                        |          |          
                        |          |--98.43%-- __lll_mutex_unlock_wake
                        |          |          |          
                        |          |          |--61.73%-- MDL_map::find_or_insert(MDL_key const*)
                        |          |          |          MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
                        |          |          |          MDL_context::acquire_lock(MDL_request*, unsigned long)
                        |          |          |          _ZL23open_table_get_mdl_lockP3THDP18Open_table_contextP11MDL_requestjPP10MDL_ticket
                        |          |          |          open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
                        |          |          |          open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
                        |          |          |          open_normal_and_derived_tables(THD*, TABLE_LIST*, unsigned int)
                        |          |          |          _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
                        |          |          |          mysql_execute_command(THD*)
                        |          |          |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          |          |          do_handle_one_connection(THD*)
                        |          |          |          handle_one_connection
                        |          |          |          start_thread
                        |          |          |          
                        |          |          |--35.23%-- MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)
                        |          |          |          MDL_context::release_locks_stored_before(enum_mdl_duration, MDL_ticket*)
                        |          |          |          mysql_execute_command(THD*)
                        |          |          |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          |          |          do_handle_one_connection(THD*)
                        |          |          |          handle_one_connection
                        |          |          |          start_thread

...

And this is the #2 source of CPU time

                       |--39.47%-- futex_wait_setup
                        |          futex_wait
                        |          do_futex
                        |          sys_futex
                        |          system_call
                        |          |          
                        |          |--99.32%-- __lll_mutex_lock_wait
                        |          |          |          
                        |          |           --100.00%-- select_precheck(THD*, LEX*, TABLE_LIST*, TABLE_LIST*)
                        |          |                     mysql_execute_command(THD*)
                        |          |                     mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          |                     dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          |                     do_handle_one_connection(THD*)
                        |          |                     handle_one_connection
                        |          |                     start_thread
[21 Aug 2012 3:16] Mark Callaghan
I also confirmed that my foreground threads had INSTRUMENTED='YES' when sysbench ran.
[21 Aug 2012 3:41] Sunny Bains
Mark, can you please also run a single threaded test for both 5.1 and 5.6 and post the results. That should rule out mutex contention (usually).
[21 Aug 2012 3:44] Mark Callaghan
I tried to enable the mutex monitoring after server start, but that did not work; adding it to my.cnf fixed it. There is a huge amount of documentation on the PS. It would be nice to have more of a user's guide in addition to the current content, which is a reference.
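For reference, a minimal my.cnf sketch for enabling these instruments at server startup (the same line appears in the 5.6 my.cnf quoted later in this report):

performance_schema_instrument='wait/synch/%=on'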

And the problem appears to be the MDL locks:

mysql> SELECT COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT, EVENT_NAME FROM events_waits_summary_global_by_event_name where COUNT_STAR > 0 and EVENT_NAME like 'wait/synch/%' order by SUM_TIMER_WAIT desc limit 10;
+------------+-----------------+----------------+-------------------------------------------+
| COUNT_STAR | SUM_TIMER_WAIT  | AVG_TIMER_WAIT | EVENT_NAME                                |
+------------+-----------------+----------------+-------------------------------------------+
|   59754588 | 921695778424248 |       15424656 | wait/synch/mutex/sql/MDL_map::mutex       |
|   20482862 | 845991708402096 |       41302200 | wait/synch/rwlock/innodb/btr_search_latch |
|  107961831 | 577757435771256 |        5351160 | wait/synch/rwlock/sql/MDL_lock::rwlock    |
|  157672459 | 255148746921936 |        1617888 | wait/synch/mutex/innodb/trx_sys_mutex     |
|  158603974 |  77097481832376 |         486096 | wait/synch/mutex/innodb/buf_pool_mutex    |
|  144060420 |  59050906585872 |         409488 | wait/synch/mutex/innodb/fil_system_mutex  |
|  253952141 |  44797824799248 |         176016 | wait/synch/rwlock/innodb/hash table locks |
|  105113929 |  44280621423168 |         420888 | wait/synch/mutex/sql/LOCK_table_cache     |
|   52313292 |  35203238215536 |         672600 | wait/synch/rwlock/sql/LOCK_grant          |
|  262783571 |  34965961792944 |         132696 | wait/synch/mutex/sql/THD::LOCK_thd_data   |
+------------+-----------------+----------------+-------------------------------------------+
[21 Aug 2012 4:05] Mark Callaghan
With 1 client thread MySQL 5.1.63 + the FB patch is faster:
5994 -- 5.6.6m9 
6442 -- 5.1.63 + FB 

This workload, short lookup queries, is one for which you are marketing MySQL 5.6, so we should make this better. Table caches were large enough that nothing was evicted. So why is the MDL the top source of contention? Unfortunately I don't have anything to count mutex contention for non-InnoDB code in MySQL 5.1 to determine whether LOCK_open was as bad.

Open_table_definitions  76
Open_tables     128
Opened_files    124
Opened_table_definitions        76
Opened_tables   135
Table_open_cache_hits   56309066
Table_open_cache_misses 135
Table_open_cache_overflows      0
[21 Aug 2012 5:10] Mark Callaghan
How is MDL_map::m_mutex different from LOCK_open from a performance perspective? Both are global mutexes guarding access to a HASH. Both cause a lot of mutex contention.
[21 Aug 2012 6:45] MySQL Verification Team
Does the metadata-locks-cache-size variable exist on non-Windows builds? Could increasing it help?
[21 Aug 2012 9:01] Dmitry Lenev
Hi Shane!

Yes it does. But since its default value is 1024, I doubt that increasing it will help in a case where only 8 tables are used (unless it is set to some lower value in Mark's case).
[21 Aug 2012 11:26] Dmitry Lenev
Hello!

First of all, I would like to emphasize that MDL was added
to solve correctness problems with isolation between
normal transactions and DDL, like the infamous MySQL bug #989.
It is not a replacement for the Table Definition Cache, and
its mutexes/locks are not a replacement for LOCK_open (though
it has allowed us to reduce the size of some of LOCK_open's
critical sections).

So it should be expected that this new synchronization
mechanism introduces some extra overhead compared to
5.1, especially in the case with just 1 connection, where
the new synchronization layer adds pure overhead.

Having said the above, I should also say that we have tried,
and continue to try, to reduce the negative effect on
performance caused by the introduction of MDL.

So what you have reported above indeed looks like a problem
that we would like to investigate.

Could you please clarify a few points for us:

Do you observe the same picture when you use a single table
in your test?

The reason I am asking is that when 8 different
tables are used in the way you have described above
(i.e. each connection uses only one specific table),
the situation with 8 connections using 8 tables will be
pretty similar to the situation with one connection using
one table.

Also, I suspect that in the case where you have 8 tables and
a fairly low number of connections (16, 32) you can probably
observe some overhead caused by extra code being executed
under MDL_map::m_mutex.

MDL_map::m_mutex works best (i.e. is taken for the shortest
time) when a table is constantly used in some connection,
which might not be the case for a low number of connections
and several tables.

Now to your question:

"
How is MDL_map::m_mutex different from LOCK_open from a performance
perspective? Both are global mutexes guarding access to a HASH. Both cause
a lot of mutex contention.
"

Yes you are correct that both MDL_map::m_mutex and LOCK_open are global
mutexes. But there are some significant differences between them as well:

1) - In the best case, MDL_map::m_mutex is supposed to be taken once for
     each table instance used in a transaction.
   - In the best case in 5.1/5.5, LOCK_open is taken twice for each table
     instance used in a statement (in 5.6 the partitioned LOCK_table_cache
     plays a similar role).

So the pressure on MDL_map::m_mutex in the best case should be at most
half of that on LOCK_open in 5.1/5.5. And it will be even lower
if you are using transactions and not running in auto-commit mode.

2) In the best case, the size of the critical section for MDL_map::m_mutex
   should be smaller than the best case for LOCK_open in 5.1/5.5.

As I said above, it is likely that you are not observing this best
case in a scenario with a low number of connections and several tables.

Of course, the above doesn't mean that MDL_map::m_mutex is perfect!
There is definitely room for improvement!
[21 Aug 2012 13:46] Mark Callaghan
Results for 1, 8, 16, 32, 64, 128, 256, 512 and 1024 threads are listed in this thread. QPS for 5.6.6-m9 is lower than 5.1.63 at 1 to 64 threads. Only at 128 threads and above is 5.6 doing better than 5.1.
[21 Aug 2012 15:39] Marc ALFF
Hi Mark.

First, thank you for making the effort to use the performance_schema,
and for the data provided.

About your comment:
"
There is a huge amount of documentation on the PS. It would be nice to have
more of a users guide in addition to the current content which is a
reference.
"

Point well taken.

Now, about the specifics:

mysql> select 921695778424248/power(10, 12) as mdlmap_wait_in_sec;
+--------------------+
| mdlmap_wait_in_sec |
+--------------------+
|   921.695778424248 |
+--------------------+
1 row in set (0.00 sec)

mysql> select 921695778424248/power(10, 12)/32 as mdlmap_wait_in_sec_per_core;
+-----------------------------+
| mdlmap_wait_in_sec_per_core |
+-----------------------------+
|           28.80299307575775 |
+-----------------------------+
1 row in set (0.00 sec)

From the performance_schema data collected, the total time spent on
wait/synch/mutex/sql/MDL_map::mutex is about 15 minutes in total, or 29 sec per core.
This is significant, and is indeed the biggest bottleneck found.

MDL_map::mutex, LOCK_table_cache, and THD::LOCK_thd_data are all mutexes,
implemented using the same underlying code (pthread_mutex_t),
so comparing the data gives interesting results:

a)
THD::LOCK_thd_data is not a singleton, and is likely not contended at all.
Yet these mutexes are locked 262 Million times during the benchmark.
An AVG_TIMER_WAIT of 132696 gives the lower end of what can be expected for a mutex lock, for this hardware and this load, on a non-contended mutex.

b)
LOCK_table_cache is not a singleton, but there are only 4 table cache instances.
Assuming a uniform distribution, 105113929/4 gives 26 million locks per mutex on average, with an AVG_TIMER_WAIT of 420888 per LOCK_table_cache mutex.

c)
AVG_TIMER_WAIT for MDL_map::mutex is 116 times and 36 times larger than for
THD::LOCK_thd_data and LOCK_table_cache, respectively.
Given that the underlying implementation:
- mysql_mutex_lock (the performance schema code)
- pthread_mutex_lock (the lock itself),
is the same in all cases,
and given that this code already scales well with:
- either a great number of calls (262 M) on many instances,
- or a great number of calls using the same instance (26 M),
I think this points to the actual content of the critical section
protected by the MDL_map mutex, which is causing the bottleneck here.

Based on the data reported in this report, the contention on MDL_map::m_mutex
is indeed verified (for the load used).

Dmitry:

Possible suggestions for a fix, based on early analysis:

- Change MDL_map::m_locks to a LF_HASH instead of HASH
- Keep MDL_map::m_mutex to protect MDL_map::m_unused_locks_cache if needed,
  but at least avoid locking this mutex when doing a search in the hash,
  so that multiple threads can search the hash in parallel.

Regards,
-- Marc
[21 Aug 2012 16:32] Dmitry Lenev
Hi Mark!

"Results for 1, 8, 16, 32, 64, 128, 256, 512 and 1024 threads are listed in this thread. QPS for 5.6.6-m9 is lower than 5.1.63 at 1 to 64 threads. Only at 128 threads and above is 5.6 doing better than 5.1.".

Above you say that these data are for the test that "uses 8 tables with clients uniformly distributed across the tables".

My question was: "do you observe the same issue in the case when only 1 table is used and not 8"?

Am I missing any additional data that you have already provided?
[21 Aug 2012 16:53] Mark Callaghan
If you are asking for results with 1 -> 1024 client threads all using one table, I will get results for that later this week.
[21 Aug 2012 16:58] Dmitry Lenev
Hi Mark!

Yes, this is what I am asking for!

Thank you a lot for your feedback and all the efforts!
[24 Aug 2012 23:02] Mark Callaghan
Original data I provided, innodb_buffer_pool_instances=4 for MySQL 5.6
    8      16       32      64     128     256    512    1024   clients
56104   113944  212192  277937  177672  118408  67525   32031   5.1fb
53726   107174  183502  233566  238878  210891  180060  148195  5.6.5m8
49960   105759  182709  237572  240187  228502  220709  215144  5.6.6m9

Results with only 1 table, and innodb_buffer_pool_instances=1,2,4,8 (bpi=X below)
51326   105491  183253  239894  250153  246685  237081  203625  5.6.6-m9, bpi=1
52315   106299  183950  239474  249353  245024  243746  192644  5.6.6-m9, bpi=2
52186   105891  183748  239387  248858  241922  239481  198986  5.6.6-m9, bpi=4
54155   105006  183881  239615  249110  247162  241249  216810  5.6.6-m9, bpi=8

Results with 8 tables, and innodb_buffer_pool_instances=1,2,4,8
48914   93813   155093  197641  203535  205029  58947   67841   5.6.6-m9, bpi=1
49710   94166   156151  201469  208929  208695  75083   76993   5.6.6-m9, bpi=2
48642   93508   152709  196765  203727  201939  140980  129608  5.6.6-m9, bpi=4
48364   92947   150315  190443  196549  197086  192819  147549  5.6.6-m9, bpi=8
[5 Oct 2012 16:56] Paul DuBois
Noted in 5.6.8, 5.7.0 changelogs. 

Metadata locking resulted in excessive contention in read-only 
workloads involving InnoDB tables and a low number of connections. 

Now the set of metadata locks can be partitioned into separate hashes 
to permit connections accessing different objects to use different 
locking hashes and reduce contention. The new 
metadata_locks_hash_instances system variable can be used to specify 
the number of hashes.
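For example, the variable can be set in my.cnf; 256 is the value tried later in this report (the default is 8):

metadata_locks_hash_instances=256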
[17 Feb 2013 1:37] Mark Callaghan
Please keep working on this. I can reproduce the same problem with 5.6.10, and MySQL 5.1.63 + FB patch is a lot faster for the problem workload.

The workload is:
* sysbench with 8 tables
* clients distributed uniformly across tables, each client is limited to one table
* SQL is fetch 1 row by primary key
* mysqld was linked with jemalloc
* bpi=X below means innodb_buffer_pool_instances=X
* orig5610, orig5152, orig5163 == unmodified MySQL 5.6.10, 5.1.52, 5.1.63
* dev5163 == MySQL 5.1.63 + Facebook patch

QPS is listed below
    8           16          32          64        128          256   concurrent clients
56161   98147   152856  171499  170448  170035  orig5610, bpi=1
56386   99076   152547  168290  170256  169034  orig5610, bpi=2
52841   101831  151345  171098  171452  170812  orig5610, bpi=4
51927   92475   151547  166432  169196  168639  orig5610, bpi=8
...
61968   114979  179318  223252  223619  222973  dev5163
65231   117471  191359  222856  227052  85756   orig5152
66971   125003  191782  225055  218248  81998   orig5163

This was measured after the test ran for a few minutes. The top two sources of waits were still growing fast:

bin/mysql -A performance_schema -e "select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/%' and count_star > 0 order by sum_timer_wait desc limit 20"
Warning: Using a password on the command line interface can be insecure.
+---------------------------------------------------+------------+-----------------+
| event_name                                        | count_star | sum_timer_wait  |
+---------------------------------------------------+------------+-----------------+
| wait/synch/mutex/sql/MDL_map::mutex               |   77858908 | 504111071029125 |
| wait/synch/rwlock/sql/MDL_lock::rwlock            |  116294302 | 211569523668750 |
| wait/synch/rwlock/innodb/btr_search_latch         |   48820750 | 100330805167125 |
| wait/synch/mutex/innodb/trx_sys_mutex             |  169600540 |  61770792053625 |
| wait/synch/mutex/sql/THD::LOCK_thd_data           |  282663275 |  35703234132375 |
| wait/synch/mutex/mysys/THR_LOCK::mutex            |  113065534 |  32369886805500 |
| wait/synch/mutex/sql/LOCK_table_cache             |  113065774 |  25974495424500 |
| wait/synch/rwlock/sql/LOCK_grant                  |   56291239 |  14945362568250 |
| wait/synch/rwlock/innodb/hash table locks         |   57572866 |   8973685354125 |
| wait/synch/rwlock/innodb/index_tree_rw_lock       |   16360588 |   2977433323875 |
| wait/synch/mutex/innodb/os_mutex                  |   35510564 |   2751022925250 |
| wait/synch/mutex/innodb/buf_pool_mutex            |    5600301 |   1554670281000 |
| wait/synch/mutex/innodb/fil_system_mutex          |    8928068 |    949883385000 |
| wait/synch/mutex/innodb/log_sys_mutex             |    1984042 |    186805321500 |
| wait/synch/mutex/innodb/rw_lock_list_mutex        |    4194433 |     73857420750 |
| wait/synch/mutex/innodb/mutex_list_mutex          |    4194728 |     73820670000 |
| wait/synch/rwlock/innodb/checkpoint_lock          |        106 |      8239699125 |
| wait/synch/cond/sql/COND_server_started           |          1 |      5888406375 |
| wait/synch/mutex/sql/LOCK_global_system_variables |      13563 |       765520500 |
| wait/synch/mutex/innodb/dict_sys_mutex            |        227 |       632344500 |
+---------------------------------------------------+------------+-----------------+
[17 Feb 2013 1:40] Mark Callaghan
my.cnf used for mysql 5.6

socket=/data/orig5610/data/mysql.sock
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES 

innodb_log_file_size=1900M
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table

table-definition-cache=1000
table-open-cache=2000
max_connections=2000
key_buffer_size=200M
innodb_io_capacity=1000

innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
query_cache_size=0
query_cache_type=0
skip_log_bin
performance_schema_instrument='wait/synch/%=on'
innodb_buffer_pool_size=64g
table_open_cache_instances=8
[17 Feb 2013 4:04] Mark Callaghan
So, things have changed a bit since I first reported this. But the outcome is that for this workload 5.6.10 is slower than MySQL 5.1.63 using the FB patch. At lower concurrencies 5.6.10 is also slower than unmodified 5.1.63.

I think there is a real problem with mutex contention in the MDL code in 5.6.

QPS for the same workload, but only using one sysbench process and one database table

1          2           4           8            16         32          64          128        256        concurrent clients
9164    17986   36376   66656   122862  186483  213858  218781  204757  dev5163
9317    19237   37861   71679   131732  194570  224029  218642  216220  orig5152
9627    18654   36958   69962   130675  192244  217105  222791  218747  orig5163
8185    16458   32734   61961   116189  162483  179290  177457  177203  orig5610

And QPS for 8 sysbench processes, with each process using a different table. Each process uses 1, then 2, then 4, ... then 32 threads.

8            16         32          64          128        256        concurrent clients
61968   114979  179318  223252  223619  222973  dev5163
65231   117471  191359  222856  227052  85756   orig5152
66971   125003  191782  225055  218248  81998   orig5163
56161   98147   152856  171499  170448  170035  orig5610
[17 Feb 2013 4:07] Mark Callaghan
The test servers have 24 cores when HT is enabled. The test tables have 16M rows each, and 16M rows/table X 8 tables == ~32GB of data. The InnoDB buffer pool was 64GB. sysbench ran on a separate host from mysqld.
[18 Feb 2013 9:47] Dimitri Kravtchuk
Mark,

could you replay the same test on your server on MySQL 5.6, but with:
  1. innodb_spin_wait_delay = 96
  2. performance_schema = OFF  (as you're comparing to 5.1)
  3. both 1. & 2.

because on 24 cores with point-selects on 8 tables I'm currently obtaining 250K QPS on 5.6 (and I suppose your CPU is faster than my 2300MHz, no?) - so I'm a little bit surprised by your results..

Rgds,
-Dimitri
[18 Feb 2013 14:51] Mark Callaghan
Dimitri - I am equally surprised by your results. I previously reported both this problem and http://bugs.mysql.com/bug.php?id=68413. They were true in the past for Percona HW and my HW, so perhaps they are still true. CPU speed today is 2.67 GHz.

I shouldn't have to disable PS to make a valid comparison to 5.1. The PS overhead shouldn't be a concern. Any comparison of the FB patch to unmodified MySQL includes the overhead for user_stats/table_stats. It would be nice if one day the PS provided useful metrics that could be enabled all of the time without fear of overhead, as user_stats/table_stats do.

Perhaps the MDL code is sensitive to pthread mutex performance. I use gcc 4.6.2 and glibc 2.13.

I don't think tuning InnoDB mutexes will help here as the problem appears to be mutex contention from MDL. Perhaps trying different values for metadata_locks_hash_instances would help, but the default value is 8 and my tests use 8 tables.

MySQL 5.6 loses about 10% QPS from PS. Results with PS not compiled and with performance_schema=0 are at http://bugs.mysql.com/bug.php?id=68413. 

8          16           32           64         128        256        concurrent clients
57223   109233  165677  190622  193110  191969  without PS compiled
56161    98147  152856  171499  170448  170035  with PS compiled
56885   106798  164094  184594  187841  186187  with PS,performance_schema=0
56978   103470  162754  188200  188200  186618  without PS compiled, innodb_spin_wait_delay=96
[18 Feb 2013 17:19] Dimitri Kravtchuk
Mark, I'm also surprised by your QPS numbers.. - with the faster CPU you have, you should reach a much higher level than my 250K QPS at 2300MHz.. (in fact it's 2.27GHz)..

I've compiled my binaries with gcc 4.4.5, and following options:
CFLAGS="-g -O3" CXXFLAGS="-g -O3" cmake . -DWITH_FAST_MUTEXES=ON

nothing special..
and yes, I use jemalloc (but it's as usual)..

Rgds,
-Dimitri
[19 Feb 2013 2:47] Mark Callaghan
Dimitri - I think you need to be more specific about your QPS results. AFAIK, you get 250k QPS with HT-disabled and 200k QPS with HT-enabled. So I think you are halfway to reproducing my results. Now just repeat the tests for 5.1 (or maybe 5.5) with and without HT.
[19 Feb 2013 3:13] Mark Callaghan
8          16           32           64         128        256        concurrent clients
57223   109233  165677  190622  193110  191969  without PS compiled
56161    98147  152856  171499  170448  170035  with PS compiled
56885   106798  164094  184594  187841  186187  with PS,performance_schema=0
56978   103470  162754  188200  188200  186618  without PS compiled, innodb_spin_wait_delay=96
55940   101771  165779  191814  193125  192441  no PS, compiled with -O3
[19 Feb 2013 3:43] Mark Callaghan
Using taskset -c 0,2,4,6,8,10,12,14,16,18,20,22, results are below with the label "taskset". They were much worse. I might try to disable HT on this host tomorrow.

8          16           32           64         128        256        concurrent clients
57223   109233  165677  190622  193110  191969  without PS compiled
56161    98147  152856  171499  170448  170035  with PS compiled
56885   106798  164094  184594  187841  186187  with PS,performance_schema=0
56978   103470  162754  188200  188200  186618  without PS compiled, innodb_spin_wait_delay=96
55940   101771  165779  191814  193125  192441  no PS, compiled with -O3
48410   91457   114284  116162  114267  111194  taskset
[19 Feb 2013 7:31] Dimitri Kravtchuk
Mark, 

Didn't you say you have 24 cores?.. (24 cores with HT, so 48 vCPUs)
Or do you have 12 cores? (and with HT you're getting 24 vCPUs?)

Well, on Linux the vCPU numbers are not assigned in the same way as on Solaris (0,1 = core0, 2,3 = core1, etc.) - on Linux: 0 = core0.thread0, 1 = core1.thread0, 2 = core2.thread0, etc.

so, if you have true 24 cores, the taskset you have to use is:
$ taskset -c 0-23 ...

but if you have only 12 cores -- no need to test then, you may only see such an impact starting at 24 cores ;-)

Regarding HT: with MySQL 5.6 we're scaling today up to 32 cores for sure.. - however, having HT enabled on 32 cores gives 64 vCPUs (or CPU threads, call them what you want).. - and depending on which mutex we're hitting internal contention on, having 64 vCPUs may give an increase or a decrease.. - because having 64 vCPUs may result in much higher concurrency in locking, or may help the HW resolve such locking (as only 32 threads are really running at the same time).. - well, it all depends on the workload..

You know, there is nothing new here ;-) The first threads on CPU cores were implemented on SPARC by Sun, and we played a lot with such "tuning" to reach the most optimal throughput on applications which do not scale up to, for example, 72 CPUs, when we wanted to see the best possible result on a given HW ;-)

Anyway, could you clarify, please:
  - do you have 24 or 12 cores after all?..
  - could you give me the exact command line of the sysbench you're using (and its version)?

Rgds,
-Dimitri
[19 Feb 2013 9:06] Dmitry Lenev
Hello Mark!

Could you please provide some additional information?

Particularly I am interested in the results of the "select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/%' and count_star > 0 order by sum_timer_wait desc limit 20" query after you run "8 sysbench processes, with each process using a different table and each process using 1 thread". In other words, I am interested in info from P_S only for the sysbench run that you used to produce the data in the "8 concurrent clients" column.

In theory, in this case we should not see much contention on MDL locks at all, as each sysbench instance should use its own MDL_lock object and its own MDL_map partition. So if you see a picture which looks different, it is a clear indication that something "strange" is happening.

Also, could you please provide the exact names of the database and tables that you are using for the sysbench runs? Is it possible that our hash function gives collisions for these names, so that some of the sysbench processes in the 8 process x 1 table x 1 thread case end up working with the same MDL_map partition?

Thank you in advance!
[19 Feb 2013 14:43] Mark Callaghan
Dimitri - 12 real cores, 24 when HT is enabled. But otherwise I am getting too many questions. You report that QPS for 5.6 is much worse with HT enabled on your test. The next step for you is to run the tests for MySQL 5.1 with and without HT. I don't think 5.1 will lose so much from using HT.

You can also use PS to show that the top source of mutex contention is MDL.

I use my fork of sysbench that has better metrics.  It is at https://code.launchpad.net/~mdcallag/sysbench/0.4-dev

8 instances of the loop below are run. Each instance uses a different value for table_num.
At each iteration of the loop, nthreads is doubled (1 -> 2 -> 4 -> ... -> 32), and all 8 instances run each iteration at the same time (so nthreads=32 ran concurrently for all tables).

nthreads=1
table_num=1

./sysbench --batch --batch-delay=10 --test=oltp  --mysql-db=test --oltp-table-size=16000000 --max-time=180 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${table_num} --oltp-read-only --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform --oltp-range-size=100 --oltp-connect-delay=0 --percentile=99 --num-threads=$nthreads --seed-rng=1 run
[19 Feb 2013 15:30] Mark Callaghan
results for taskset2 are from using "taskset -c 0-11 mysqld ..."
Both of the taskset results are from binaries without compiled support for PS

8          16           32           64         128        256        concurrent clients
57223   109233  165677  190622  193110  191969  without PS compiled
56161    98147  152856  171499  170448  170035  with PS compiled
56885   106798  164094  184594  187841  186187  with PS,performance_schema=0
56978   103470  162754  188200  188200  186618  without PS compiled, innodb_spin_wait_delay=96
55940   101771  165779  191814  193125  192441  no PS, compiled with -O3
48410   91457   114284  116162  114267  111194  taskset
55417   103171  139764  143975  142418  141297  taskset2

Table names are sbtest1, sbtest2, ..., sbtest8
db name is "test"

PS synch waits from 8 and then from 32 concurrent clients. mysqld was restarted prior to each test, then the buffer pool was warmed, then the test was run. MDL synch objects degrade more than others from 8 to 32.

PS synch waits from 8 concurrent clients:

event_name      count_star      sum_timer_wait
wait/synch/mutex/sql/MDL_map::mutex     69163351        88867813661625
wait/synch/mutex/sql/THD::LOCK_thd_data 172909410       14895489702000
wait/synch/mutex/sql/LOCK_table_cache   69163474        13136791046250
wait/synch/rwlock/sql/MDL_lock::rwlock  69163634        10921715781375
wait/synch/rwlock/innodb/btr_search_latch       29647287        10169021009625
wait/synch/mutex/innodb/trx_sys_mutex   103745293       9453612002625
wait/synch/mutex/mysys/THR_LOCK::mutex  69163309        8550634552125
wait/synch/rwlock/innodb/hash table locks       47841552        6164445476625
wait/synch/rwlock/sql/LOCK_grant        34538720        5295065168250
wait/synch/mutex/innodb/os_mutex        32892092        2292353206125
wait/synch/mutex/innodb/buf_pool_mutex  5533946 1543466449500
wait/synch/mutex/innodb/fil_system_mutex        8933568 948979432875
wait/synch/rwlock/innodb/index_tree_rw_lock     13243461        793798469625
wait/synch/mutex/innodb/log_sys_mutex   1898188 167180788875
wait/synch/mutex/innodb/mutex_list_mutex        4194708 76603623375
wait/synch/mutex/innodb/rw_lock_list_mutex      4194433 74073824250
wait/synch/rwlock/innodb/checkpoint_lock        168     12754220250
wait/synch/cond/sql/COND_server_started 1       5931476625
wait/synch/mutex/sql/LOCK_global_system_variables       22714   780129375
wait/synch/mutex/innodb/flush_list_mutex        2623    351319500

PS synch waits from 32 concurrent clients
event_name      count_star      sum_timer_wait
wait/synch/mutex/sql/MDL_map::mutex     119230858       841983512712000
wait/synch/rwlock/sql/MDL_lock::rwlock  176383948       344424250576875
wait/synch/rwlock/innodb/btr_search_latch       78559058        103554075033000
wait/synch/mutex/innodb/trx_sys_mutex   256703751       91329344352000
wait/synch/mutex/sql/THD::LOCK_thd_data 427835485       55128454216875
wait/synch/mutex/mysys/THR_LOCK::mutex  171134261       47433195025875
wait/synch/mutex/sql/LOCK_table_cache   171134507       41352937581750
wait/synch/rwlock/sql/LOCK_grant        85225707        22604762027625
wait/synch/rwlock/innodb/hash table locks       63075670        9730971753000
wait/synch/rwlock/innodb/index_tree_rw_lock     18015087        3204438111375
wait/synch/mutex/innodb/os_mutex        35538726        2894663019750
wait/synch/mutex/innodb/buf_pool_mutex  5606625 1687920321375
wait/synch/mutex/innodb/fil_system_mutex        8933310 979599250875
wait/synch/mutex/innodb/log_sys_mutex   2101239 202458754125
wait/synch/mutex/innodb/rw_lock_list_mutex      4194433 75105732000
wait/synch/mutex/innodb/mutex_list_mutex        4194728 74155491375
wait/synch/rwlock/innodb/checkpoint_lock        164     11624814375
wait/synch/cond/sql/COND_server_started 1       6223468875
wait/synch/mutex/sql/LOCK_global_system_variables       20505   912133875
wait/synch/mutex/innodb/flush_list_mutex        2580    353359500
[19 Feb 2013 16:46] Dmitry Lenev
Hello Mark!

I have checked in a debugger, and it turns out that the MDL keys for ALL 8 tables in the test.sbtest1 .. test.sbtest8 sequence have hashes that assign them to the same MDL_map partition!!!

Indeed, this means that all those threads in your benchmark keep banging against one poor MDL_map partition and its lock, while the other seven partitions sit unused. This perfectly explains why you see such high contention on MDL_map::mutex in P_S even in the situation with 8 sysbenches x 1 table x 1 thread, in which normally no contention should be expected at all!

Also, this is probably why we didn't see high contention on MDL_map in our tests (as we use different names for our tables).

We need to figure out what can be done to address such scenarios. Meanwhile, as a workaround, you can use different names for your tables. For example, the sequence test.sbtest_1 ... test.sbtest_8 will nicely and uniformly use all 8 MDL_map partitions available in the default configuration.
[19 Feb 2013 17:45] Mark Callaghan
Dmitry Lenev - that is good news, but is there something wrong with the hash algorithm, or did I get unlucky? I think there is a 1 in 2M chance that all 8 tables choose the same bucket when there are 8 buckets. Assuming the number of buckets is metadata_locks_hash_instances, that is 8 by default. Doing the math for selection with replacement of 8 items from 8 buckets, I think this is the probability that all items end up in one bucket:
((1/8) ^ 8) * 8 --> about 1 in 2M
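A minimal sketch of that arithmetic, assuming each table's bucket is chosen independently and uniformly (the assumption questioned later in this thread):

#include <stdio.h>
#include <math.h>

int main(void)
{
  /* probability that 8 independent, uniformly random picks from 8 buckets
     all land in the same bucket: 8 * (1/8)^8 = (1/8)^7 */
  double p = 8.0 * pow(1.0 / 8.0, 8);
  printf("p = %g (about 1 in %.0f)\n", p, 1.0 / p);
  return 0;
}

Compiled with gcc -lm, this prints about 1 in 2.1M.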
[19 Feb 2013 17:47] Dimitri Kravtchuk
Mark,

on 12 cores there is no need to play with HT :-)
as said before, it may show an impact only on 24-core or bigger servers..

on my 12-core server (2.9GHz with HT) I'm getting over 200K QPS with PFS=on and the synch (mutex, rwlock) and I/O instrumentation enabled ;-)  (and the main contention is on the trx_sys mutex, as expected)..

So, try renaming your tables as DmitryL suggested, and let's see your new test results?..

Rgds,
-Dimitri
[19 Feb 2013 18:12] Mark Callaghan
Results are better with metadata_locks_hash_instances=256 and 5.6.10 is almost as fast as 5.1.63 + FB patch:
* with metadata_locks_hash_instances=256 (default was 8)
* table_open_cache_instances=8 (default was 1, the benefit is 1% or 2%)
* perf_schema disabled at compile time

PS synch waits with 32 concurrent clients

event_name      count_star      sum_timer_wait
wait/synch/rwlock/innodb/btr_search_latch       82516624        106132845340125
wait/synch/mutex/sql/MDL_map::mutex             120409851       100693371306375
wait/synch/mutex/innodb/trx_sys_mutex   269631102       100016743892250
wait/synch/rwlock/sql/MDL_lock::rwlock  181016771       76386925351875
wait/synch/mutex/sql/THD::LOCK_thd_data 449381080       52701740326125
wait/synch/mutex/mysys/THR_LOCK::mutex  179752505       48466235973750
wait/synch/mutex/sql/LOCK_table_cache   179752782       42455525954625
wait/synch/rwlock/sql/LOCK_grant        89509377        27099863204625
wait/synch/rwlock/innodb/hash table locks       63678623        9439458772500
wait/synch/rwlock/innodb/index_tree_rw_lock     18189424        2798154987750
wait/synch/mutex/innodb/os_mutex        35443021        2775027340125
wait/synch/mutex/innodb/buf_pool_mutex  5608060 1624722504375
wait/synch/mutex/innodb/fil_system_mutex        8932983 951317686125
wait/synch/mutex/innodb/log_sys_mutex   2118386 206010405375
wait/synch/mutex/innodb/mutex_list_mutex        4194728 83925673500
wait/synch/mutex/innodb/rw_lock_list_mutex      4194433 82427459250
wait/synch/rwlock/innodb/checkpoint_lock        158     12088848000
wait/synch/cond/sql/COND_server_started 1       5972825250
wait/synch/mutex/sql/LOCK_global_system_variables       20190   868146375
wait/synch/mutex/innodb/flush_list_mutex        2533    366217500

8          16           32           64         128        256        concurrent clients
57223   109233  165677  190622  193110  191969  without PS compiled
56161    98147  152856  171499  170448  170035  with PS compiled
56885   106798  164094  184594  187841  186187  with PS,performance_schema=0
56978   103470  162754  188200  188200  186618  without PS compiled, innodb_spin_wait_delay=96
55940   101771  165779  191814  193125  192441  no PS, compiled with -O3
48410   91457   114284  116162  114267  111194  taskset
55417   103171  139764  143975  142418  141297  taskset2
...
53598   100470  165261  187998  189511  185304  with PS compiled, metadata_locks_hash_instances=256
61024   104393  177109  217812  216948  215513  without PS compiled, metadata_locks_hash_instances=256
...
61968   114979  179318  223252  223619  222973  5.1.63 + FB patch
65231   117471  191359  222856  227052  85756   unmodified 5.1.52
66971   125003  191782  225055  218248  81998   unmodified 5.1.63
[19 Feb 2013 19:34] Dmitry Lenev
Hello Mark!

Your math assumes that each bucket is selected perfectly randomly,
i.e. that for each table the bucket is selected totally randomly and
independently of the selection for the previous table.

But this is not the case, as the hash function is a function of the key,
and the keys in your case are dependent on each other.

Assuming that I understand correctly how the my_hash_sort_bin() hash
function works, here is how we can express the hash function for MDL keys
for test.sbtestX (or more generally somedb.someprefixX) tables:

hash-for-mdl-key-for-"test.sbtestX"-table = const1 ^ (const2 * (48 + x) + const3 << 8) ^ (func1(x) << 8)

where const1 and const2 are functions of the "test.sbtest" prefix, and const3 is
a function of the prefix length.

So the number of the bucket/partition which is chosen looks like:

MDL_map_partition_id = hash-for-mdl-key-for-"test.sbtestX"-table % 8 =
                       (const1 ^ (const2 * (48 + x) + const3 << 8) ^ (func1(x) << 8)) & 7 =
                       (const1 ^ (const2 * (48 + x) + const3 << 8)) & 7 =
                       (const1 ^ (const2 * 48 + const2 * x)) & 7

After a bit of head scratching one can see that if const2 is such that
const2 & 7 == 0, the above expression turns out to be totally independent
of X. The probability of const2 & 7 being 0 can be guesstimated as 1/8.

So in fact the probability of all tables named somedb.someprefix1 ..
somedb.someprefix8 getting assigned to the same partition can be
guesstimated as 1/8.
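A minimal standalone sketch of that observation, using the simplified partition expression above with an arbitrary illustrative const1 and const2 = 64 (any const2 with const2 & 7 == 0 behaves the same way):

#include <stdio.h>

typedef unsigned long ulong;

int main(void)
{
  /* illustrative values only: const1 is arbitrary, const2 = 64 satisfies
     const2 & 7 == 0 */
  ulong const1 = 0xDEADBEEFUL;
  ulong const2 = 64;
  int x;

  for (x = 0; x < 10; x++)
  {
    /* simplified expression from the derivation above */
    ulong partition = (const1 ^ (const2 * (48 + (ulong) x))) & 7;
    printf("x=%d -> partition %lu\n", x, partition);
  }
  return 0;
}

Since const2 * (48 + x) is always a multiple of 8 when const2 & 7 == 0, the XOR never changes the low 3 bits of const1, so every x prints the same partition.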
[19 Feb 2013 19:56] Dimitri Kravtchuk
Mark,

when the code is working as expected, MDL_map::mutex waits completely disappear from the top list.. So please try renaming the tables to get another view of the problem ;-)

(I expect MDL_map::mutex will no longer be there)

Rgds,
-Dimitri
[19 Feb 2013 20:26] Dmitry Lenev
Hello Mark!

After a bit more debugging... It turns out that for the "test.sbtest" prefix, const2 in the above expression is equal to 64. This is rather unfortunate, as it means that even with 256 MDL_map partitions you will be using only 4 of them ("64 * x mod 256" means only 2 of the 3 bits of X are used).
[19 Feb 2013 20:50] Mark Callaghan
Dmitry Lenev - nice debugging, but I am confused by the pseudo-code for my_hash_sort_bin. This is the source I see for 5.1 and 5.6 that implements the hash_sort function pointer for MY_COLLATION_HANDLER. It is called by calc_hash which only uses nr1.

static uint calc_hash(const HASH *hash, const uchar *key, size_t length)
{
  ulong nr1=1, nr2=4;
  hash->charset->coll->hash_sort(hash->charset,(uchar*) key,length,&nr1,&nr2);
  return nr1;
}

void my_hash_sort_bin(CHARSET_INFO *cs __attribute__((unused)),
                      const uchar *key, size_t len,ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;

  key+= len;

  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
             ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

So for strings of the form test.sbtestX, then 
nr1[0] ^= (K1 * X) + K2

Using my test program the strings test.sbtestX for X in 1 to 8 used 4 of 8 buckets (1,3,5 and 7). With 256 buckets there were no collisions. It has been a long time since I tried to analyze the quality of a hash function but I suspect this one needs work or at least should not be used when there are a small number of buckets. burtle was a great reference for me and he used to work at Oracle -- http://burtleburtle.net/bob/hash/hashfaq.html

gcc -o hash hash.c; ./hash
test.sbtest1 : 13154541203486169799, 7, 199
test.sbtest2 : 13154541203486169613, 5, 13
test.sbtest3 : 13154541203486169683, 3, 83
test.sbtest4 : 13154541203486170009, 1, 153
test.sbtest5 : 13154541203486170079, 7, 223
test.sbtest6 : 13154541203486169957, 5, 101
test.sbtest7 : 13154541203486164139, 3, 171
test.sbtest8 : 13154541203486164209, 1, 241

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;
typedef unsigned int uint;

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;

  key+= len;

  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
             ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = { 
 "test.sbtest1", "test.sbtest2", "test.sbtest3", "test.sbtest4",
 "test.sbtest5", "test.sbtest6", "test.sbtest7", "test.sbtest8" };

int main() {
  int x;
  for (x=0; x < 8; ++x) {
    ulong nr1 = 1; ulong nr2 = 4;
    my_hash_sort_bin(s[x], strlen(s[x]), &nr1, &nr2);
    printf("%s : %lu, %lu, %lu\n", s[x], nr1, nr1 % 8, nr1 % 256);
  }
  return 0;
}
[19 Feb 2013 20:52] Mark Callaghan
Dimitri Kravtchuk - I tested with metadata_locks_hash_instances=256 and my analysis shows there should be no hash bucket collisions in that case. MySQL 5.1 + FB patch is still a small amount faster than 5.6 when the PS is disabled at compile time and a lot faster when PS is enabled with default options.
[19 Feb 2013 21:06] Mark Callaghan
Dmitry Lenev - I updated my testcase to include test.sbtest0 and test.sbtest9. The output below is <key> : <nr1> <nr1 % 8> <nr1 % 256>

The test output shows that 4 of 8 buckets and 10 of 256 buckets will be used. So I continue to wonder why we are talking about different behavior for my_hash_sort_bin, but I appreciate your analysis. The hash function my_hash_sort_bin has a common problem for a hash function -- you can't trust the least significant bits.

$ gcc -o hash hash.c; ./hash
test.sbtest1 : 13154541203486169799, 7, 199
test.sbtest2 : 13154541203486169613, 5, 13
test.sbtest3 : 13154541203486169683, 3, 83
test.sbtest4 : 13154541203486170009, 1, 153
test.sbtest5 : 13154541203486170079, 7, 223
test.sbtest6 : 13154541203486169957, 5, 101
test.sbtest7 : 13154541203486164139, 3, 171
test.sbtest8 : 13154541203486164209, 1, 241
test.sbtest0 : 13154541203486169729, 1, 129
test.sbtest9 : 13154541203486164023, 7, 55
[19 Feb 2013 21:11] Dmitry Lenev
Hello Mark!

In your program you assume that the key used for the "test"."sbtest1" table in the MDL subsystem looks exactly like "test.sbtest1", but this assumption is false.

Actually, for test.sbtest1 the exact key used in MDL will be
"\002test\000sbtest1\000" (\002 means table; also note the \000 used as a separator). This is the value which we pass to the hash function.

After adjusting your program to take this fact into account, I get results from it which are consistent with the results of my debugging:
for 8 buckets only 1 bucket is used, and for 256 buckets only 4 buckets are used.

And yes I guess this tells us something about our hash function.
[19 Feb 2013 21:22] Mark Callaghan
Using my test, there is one collision for mod 16 and no collisions for mod 32. My guess is that either a better hash function is needed, or you can't use too few buckets, because the least significant bits of the hash output are not sufficiently random.

I also updated my test program to compare my_hash_sort_bin and the InnoDB hash functions. InnoDB does much better for a small number of hash buckets. The results below show:
<key> : <hash return> <hash % 8> <hash % 16> <hash % 32> <hash % 64> <hash % 128> <hash % 256>

gcc -o hash hash.c; ./hash
MySQL hash
test.sbtest1 : 13154541203486169799, 7, 7, 7, 7, 71, 199
test.sbtest2 : 13154541203486169613, 5, 13, 13, 13, 13, 13
test.sbtest3 : 13154541203486169683, 3, 3, 19, 19, 83, 83
test.sbtest4 : 13154541203486170009, 1, 9, 25, 25, 25, 153
test.sbtest5 : 13154541203486170079, 7, 15, 31, 31, 95, 223
test.sbtest6 : 13154541203486169957, 5, 5, 5, 37, 101, 101
test.sbtest7 : 13154541203486164139, 3, 11, 11, 43, 43, 171
test.sbtest8 : 13154541203486164209, 1, 1, 17, 49, 113, 241
test.sbtest0 : 13154541203486169729, 1, 1, 1, 1, 1, 129
test.sbtest9 : 13154541203486164023, 7, 7, 23, 55, 55, 55

InnoDB hash
test.sbtest1 : 5358432477730207684, 3, 11, 11, 11, 11, 139
test.sbtest2 : 5358432477730206917, 2, 10, 10, 10, 10, 138
test.sbtest3 : 5358432477730207174, 1, 9, 9, 9, 9, 137
test.sbtest4 : 5358432477730206407, 0, 8, 8, 8, 8, 136
test.sbtest5 : 5358432477730206664, 7, 7, 7, 7, 7, 135
test.sbtest6 : 5358432477730205897, 6, 6, 6, 6, 6, 134
test.sbtest7 : 5358432477730206154, 5, 5, 5, 5, 5, 133
test.sbtest8 : 5358432477730197195, 4, 4, 4, 4, 4, 132
test.sbtest0 : 5358432477730207427, 4, 12, 12, 12, 12, 140
test.sbtest9 : 5358432477730197452, 3, 3, 3, 3, 3, 131
[19 Feb 2013 21:22] Mark Callaghan
Updated test program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;
typedef unsigned int uint;
typedef unsigned long ulint;

#define UT_HASH_RANDOM_MASK     1463735687
#define UT_HASH_RANDOM_MASK2    1653893711

ulint
ut_hash_ulint(
/*==========*/
        ulint    key,           /*!< in: value to be hashed */
        ulint    table_size)    /*!< in: hash table size */
{
        key = key ^ UT_HASH_RANDOM_MASK2;

        return(key % table_size);
}

ulint
ut_fold_ulint_pair(
/*===============*/
        ulint   n1,     /*!< in: ulint */
        ulint   n2)     /*!< in: ulint */
{
        return(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1)
                ^ UT_HASH_RANDOM_MASK) + n2);
}

ulint
ut_fold_string(
/*===========*/
        const char*     str)    /*!< in: null-terminated string */
{
        ulint   fold = 0;

        while (*str != '\0') {
                fold = ut_fold_ulint_pair(fold, (ulint)(*str));
                str++;
        }

        return(fold);
}

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;

  key+= len;

  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
             ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = { 
 "test.sbtest1", "test.sbtest2", "test.sbtest3", "test.sbtest4",
 "test.sbtest5", "test.sbtest6", "test.sbtest7", "test.sbtest8",
 "test.sbtest0", "test.sbtest9" };

int main() {
  int x;
  printf("MySQL hash\n");
  for (x=0; x < 10; ++x) {
    ulong nr1 = 1; ulong nr2 = 4;
    my_hash_sort_bin(s[x], strlen(s[x]), &nr1, &nr2);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n",
           s[x], nr1, nr1 % 8, nr1 % 16, nr1 % 32, nr1 % 64, nr1 % 128, nr1 % 256);
  }

  printf("\nInnoDB hash\n");
  for (x=0; x < 10; ++x) {
    ulint k1 = ut_fold_string(s[x]);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n",
           s[x], k1,
           ut_hash_ulint(k1, 8),
           ut_hash_ulint(k1, 16),
           ut_hash_ulint(k1, 32),
           ut_hash_ulint(k1, 64),
           ut_hash_ulint(k1, 128),
           ut_hash_ulint(k1, 256));
  }
  return 0;
}
[20 Feb 2013 2:29] Mark Callaghan
Wow, as you wrote, it is much worse using the \002 and \000 key format.

$ gcc -g -o hash hash.c; ./hash
MySQL hash
sbtest1 : 5004758646725041816, 0, 8, 24, 24, 24, 152
sbtest2 : 5004758646725058136, 0, 8, 24, 24, 88, 88
sbtest3 : 5004758646725074456, 0, 8, 24, 24, 24, 24
sbtest4 : 5004758646725091288, 0, 8, 24, 24, 88, 216
sbtest5 : 5004758646725107608, 0, 8, 24, 24, 24, 152
sbtest6 : 5004758646725123928, 0, 8, 24, 24, 88, 88
sbtest7 : 5004758646725140248, 0, 8, 24, 24, 24, 24
sbtest8 : 5004758646725156056, 0, 8, 24, 24, 88, 216
sbtest0 : 5004758646725025496, 0, 8, 24, 24, 88, 216
sbtest9 : 5004758646725172376, 0, 8, 24, 24, 24, 152

InnoDB hash
sbtest1 : 7632798584130221980, 3, 3, 19, 19, 83, 211
sbtest2 : 7632798583777965725, 2, 2, 18, 18, 82, 210
sbtest3 : 7632798583761515934, 1, 1, 17, 17, 81, 209
sbtest4 : 7632798583394466975, 0, 0, 16, 16, 80, 208
sbtest5 : 7632798584118021024, 7, 15, 15, 47, 111, 239
sbtest6 : 7632798583832611489, 6, 14, 14, 46, 110, 238
sbtest7 : 7632798584123513250, 5, 13, 13, 45, 109, 237
sbtest8 : 7632798583194443939, 4, 12, 12, 44, 108, 236
sbtest0 : 7632798584147183771, 4, 4, 20, 20, 84, 212
sbtest9 : 7632798583379062692, 3, 11, 11, 43, 107, 235

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;
typedef unsigned int uint;
typedef unsigned long ulint;

#define UT_HASH_RANDOM_MASK     1463735687
#define UT_HASH_RANDOM_MASK2    1653893711

ulint
ut_hash_ulint(
/*==========*/
        ulint    key,           /*!< in: value to be hashed */
        ulint    table_size)    /*!< in: hash table size */
{
        key = key ^ UT_HASH_RANDOM_MASK2;

        return(key % table_size);
}

ulint
ut_fold_ulint_pair(
/*===============*/
        ulint   n1,     /*!< in: ulint */
        ulint   n2)     /*!< in: ulint */
{
        return(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1)
                ^ UT_HASH_RANDOM_MASK) + n2);
}

ulint
ut_fold_string(
/*===========*/
        const char*     str, size_t len)    /*!< in: string and its length */
{
        ulint   fold = 0;
        const char *end = str + len;

        while (str < end) { 
                fold = ut_fold_ulint_pair(fold, (ulint)(*str));
                str++;
        }

        return(fold);
}

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;

  key+= len;

  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
             ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = { 
 "sbtest1", "sbtest2", "sbtest3", "sbtest4", "sbtest5",
 "sbtest6", "sbtest7", "sbtest8", "sbtest0", "sbtest9" };

int main() {
  int x;
  printf("MySQL hash\n");
  for (x=0; x < 10; ++x) {
    ulong nr1 = 1; ulong nr2 = 4;
    char sbuf[32];

    sbuf[0] = '\002';
    strcpy(sbuf+1, "test");
    strcpy(sbuf+1+strlen("test")+1, s[x]);
   
    my_hash_sort_bin(sbuf, strlen(s[x])+strlen("test")+3, &nr1, &nr2);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n",
           s[x], nr1, nr1 % 8, nr1 % 16, nr1 % 32, nr1 % 64, nr1 % 128, nr1 % 256);
  }

  printf("\nInnoDB hash\n");
  for (x=0; x < 10; ++x) {
    char sbuf[32];
    ulint k1;

    sbuf[0] = '\002';
    strcpy(sbuf+1, "test");
    strcpy(sbuf+1+strlen("test"), s[x]);

    k1 = ut_fold_string(sbuf, strlen(s[x])+strlen("test")+3);

    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n",
           s[x], k1,
           ut_hash_ulint(k1, 8),
           ut_hash_ulint(k1, 16),
           ut_hash_ulint(k1, 32),
           ut_hash_ulint(k1, 64),
           ut_hash_ulint(k1, 128),
           ut_hash_ulint(k1, 256));
  }
  return 0;
}
[20 Feb 2013 21:14] Mark Callaghan
Dmitry Lenev - I think one or both of the following should be done. What do you think?
* increase the default for metadata_locks_hash_instances (a my.cnf sketch follows below)
* use a better hash function
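For the first option, a my.cnf sketch (the default for this variable is 8 in 5.6; 32 is only an illustrative value, not a recommendation):

# more instances = more, smaller mutex-protected MDL hash partitions
metadata_locks_hash_instances=32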
[21 Feb 2013 19:12] Mark Callaghan
The hash function is used in several places, and it would be nice to reduce this to a few implementations. Also, the nr2 in/out value is ignored by many callers, so I wonder whether the extra work spent on that computation is wasted. A consolidated helper is sketched after the list below.

sql/password.c:    nr^= (((nr & 63)+add)*tmp)+ (nr << 8);
sql/sql_join_buffer.cc:    nr^= (ulong) ((((uint) nr & 63)+nr2)*((uint) *pos))+ (nr << 8);
strings/ctype-bin.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
strings/ctype-bin.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
strings/ctype-latin1.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * X) + (nr1[0] << 8);
strings/ctype-latin1.c:      nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * X) + (nr1[0] << 8);
strings/ctype-mb.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
strings/ctype-simple.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
strings/ctype-uca.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(s_res >> 8))+ (n1[0] << 8);
strings/ctype-uca.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(s_res & 0xFF))+ (n1[0] << 8);
strings/ctype-ucs2.c:    n1[0]^= (((n1[0] & 63) + n2[0]) * (wc & 0xFF)) + (n1[0] << 8);
strings/ctype-ucs2.c:    n1[0]^= (((n1[0] & 63) + n2[0]) * (wc >> 8)) + (n1[0] << 8);
strings/ctype-ucs2.c:    nr1[0]^= (ulong) ((((uint) nr1[0] & 63) + nr2[0]) * 
strings/ctype-ucs2.c:  n1[0]^= (((n1[0] & 63) + n2[0]) * (ch)) + (n1[0] << 8);
strings/ctype-ucs2.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(wc & 0xFF))+ (n1[0] << 8);
strings/ctype-ucs2.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(wc >> 8))+ (n1[0] << 8);
strings/ctype-ucs2.c:    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * 
strings/ctype-utf8.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(wc & 0xFF))+ (n1[0] << 8);
strings/ctype-utf8.c:    n1[0]^= (((n1[0] & 63)+n2[0])*(wc >> 8))+ (n1[0] << 8);
strings/ctype-utf8.c:  n1[0]^= (((n1[0] & 63) + n2[0]) * (ch)) + (n1[0] << 8);
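A hypothetical consolidation of the repeated expression above into one inline helper (the name is mine, not from the tree); callers that ignore nr2 would still pay for the += 3, but that is cheap next to the multiply:

static inline void hash_sort_step(ulong *nr1, ulong *nr2, uint ch)
{
  /* the recurrence repeated across strings/ctype-*.c */
  nr1[0]^= (ulong) ((((uint) nr1[0] & 63) + nr2[0]) * ch) + (nr1[0] << 8);
  nr2[0]+= 3;
}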
[21 Feb 2013 20:54] Dmitry Lenev
Hello Mark!

I guess more digging and analysis should be done before we can understand the best way to address this problem. And yes, both options you mentioned are among the obvious candidates.

Re: several implementations of hash functions.

Please keep in mind that the hash function needs to take into account the equivalence rules of the collation/character set, hence the several similar but slightly different implementations in the strings/ library. OTOH, I agree that there is room for improvement there.
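To illustrate the point with a hedged sketch (simplified from the pattern in strings/, the names are mine): a case-insensitive 8-bit collation has to hash each character's sort value so that 'A' and 'a' hash the same, while the binary collation hashes the raw byte; that per-charset difference is why the loop body is duplicated:

/* simplified sketch; sort_order is the collation's case-folding map,
   or NULL for a binary collation */
static void hash_sort_8bit(const uchar *key, size_t len,
                           const uchar *sort_order,
                           ulong *nr1, ulong *nr2)
{
  const uchar *end = key + len;
  for (; key < end; key++)
  {
    uint ch = sort_order ? sort_order[*key] : *key;
    nr1[0]^= (ulong) ((((uint) nr1[0] & 63) + nr2[0]) * ch) + (nr1[0] << 8);
    nr2[0]+= 3;
  }
}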
[21 Feb 2013 22:51] Mark Callaghan
QPS results for cached & read-only sysbench. Both configurations used metadata_locks_hash_instances=8. The first result used the InnoDB hash algorithm for my_hash_sort_bin.

    8      16      32      64     128     256  concurrent clients
62015  113900  180457  216948  218628  218541  innodb hash
56554  106461  165382  188110  191131  190285  current hash
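For reference, the "innodb hash" row came from changing my_hash_sort_bin. A hedged sketch of what that swap could look like, using the simplified signature and the ut_fold_string function from the test program above (this is my reconstruction, not the exact change that was tested):

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  /* hash the whole key with InnoDB's fold function instead of the
     byte-at-a-time nr1/nr2 recurrence; nr2 is left untouched */
  (void) nr2;
  *nr1 = (ulong) ut_fold_string((const char *) key, len);
}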
[24 Feb 2013 19:34] Mark Callaghan
For a popular set of table names and one set of database names, I computed the distribution of entries across hash buckets for 2, 4, 8, ..., 256 buckets using the current MySQL and InnoDB hash functions. The results also show the standard deviation of entries per bucket. The InnoDB hash does much better, especially at small bucket counts.

The output below is printed for each bucket count to compare the MySQL and InnoDB hash functions. It shows the number of entries per bucket and the stddev.

2 buckets
mysql hash:  6690 6960 : stddev is 135.8
innodb hash:  6825 6825 : stddev is 15.0
4 buckets
mysql hash:  3510 3805 3180 3155 : stddev is 266.5
innodb hash:  3415 3416 3410 3409 : stddev is 8.1
8 buckets
mysql hash:  1890 1945 1915 1983 1620 1860 1265 1172 : stddev is 300.5
innodb hash:  1694 1705 1714 1719 1721 1711 1696 1690 : stddev is 11.7
16 buckets
mysql hash:  900 964 1032 932 955 900 842 662 990 981 883 1051 665 960 423 510 : stddev is 181.4
innodb hash:  821 825 843 854 873 876 874 875 873 880 871 865 848 835 822 815 : stddev is 22.6
32 buckets
mysql hash:  450 305 397 541 372 563 300 371 480 582 234 541 382 417 213 210 450 659 635 391 583 337 542 291 510 399 649 510 283 543 210 300 : stddev is 133.1
innodb hash:  424 422 406 444 414 461 443 490 449 449 464 434 412 427 431 432 397 403 437 410 459 415 431 385 424 431 407 431 436 408 391 383 : stddev is 23.6
64 buckets
mysql hash:  180 177 244 181 122 92 180 151 120 367 76 274 190 180 120 128 270 168 196 120 273 97 387 140 210 232 400 330 188 300 60 210 270 128 153 360 250 471 120 220 360 215 158 267 192 237 93 82 180 491 439 271 310 240 155 151 300 167 249 180 95 243 150 90 : stddev is 98.5
innodb hash:  225 220 220 228 202 229 226 287 198 232 209 213 192 227 219 210 199 194 218 178 227 214 222 172 194 179 190 212 210 179 201 203 199 202 186 216 212 232 217 203 251 217 255 221 220 200 212 222 198 209 219 232 232 201 209 213 230 252 217 219 226 229 190 180 : stddev is 19.8
128 buckets
mysql hash:  93 87 119 93 60 50 90 76 59 203 44 135 92 90 60 56 138 81 107 65 132 52 202 67 90 108 197 167 91 146 26 108 132 58 75 182 125 242 60 108 178 121 83 122 103 121 47 35 84 241 212 129 165 122 79 77 154 81 121 91 48 122 74 47 87 90 125 88 62 42 90 75 61 164 32 139 98 90 60 72 132 87 89 55 141 45 185 73 120 124 203 163 97 154 34 102 138 70 78 178 125 229 60 112 182 94 75 145 89 116 46 47 96 250 227 142 145 118 76 74 146 86 128 89 47 121 76 43 : stddev is 49.6
innodb hash:  112 105 105 108 118 121 101 152 99 104 116 107 86 118 122 92 104 110 104 84 111 119 140 96 90 108 99 97 98 104 96 120 80 108 100 122 115 135 127 101 127 111 138 126 100 83 123 91 94 96 124 143 116 116 156 130 113 131 124 137 116 126 104 94 113 115 115 120 84 108 125 135 99 128 93 106 106 109 97 118 95 84 114 94 116 95 82 76 104 71 91 115 112 75 105 83 119 94 86 94 97 97 90 102 124 106 117 95 120 117 89 131 104 113 95 89 116 85 53 83 117 121 93 82 110 103 86 86 : stddev is 17.1
256 buckets
mysql hash:  53 41 59 47 28 23 42 46 30 93 22 63 46 47 28 25 73 36 53 30 62 32 101 30 40 53 106 78 48 71 14 51 71 28 37 92 66 131 27 60 89 63 41 57 53 57 28 20 47 114 108 68 78 56 41 36 81 39 54 48 26 64 39 27 46 48 62 41 23 25 46 26 30 78 18 69 53 44 32 40 59 48 41 23 67 20 97 38 64 66 103 79 53 75 16 51 65 29 42 88 60 123 32 50 98 47 39 79 46 56 26 23 56 128 116 69 85 59 36 38 72 38 66 47 25 48 33 17 40 46 60 46 32 27 48 30 29 110 22 72 46 43 32 31 65 45 54 35 70 20 101 37 50 55 91 89 43 75 12 57 61 30 38 90 59 111 33 48 89 58 42 65 50 64 19 15 37 127 104 61 87 66 38 41 73 42 67 43 22 58 35 20 41 42 63 47 39 17 44 49 31 86 14 70 45 46 28 32 73 39 48 32 74 25 88 35 56 58 100 84 44 79 18 51 73 41 36 90 65 106 28 62 84 47 36 66 43 60 20 24 40 122 111 73 60 59 40 36 74 48 62 42 22 73 43 26 : stddev is 25.1
innodb hash:  64 42 60 54 51 38 26 45 38 47 46 52 30 55 48 36 51 55 35 33 43 56 71 40 40 53 47 43 42 41 46 53 35 64 50 47 45 41 36 53 46 36 53 48 43 40 55 58 55 43 63 67 31 53 51 41 45 54 67 73 50 66 46 38 56 49 53 44 39 56 71 46 42 55 37 52 58 54 60 64 54 50 55 74 74 51 27 39 58 33 32 45 52 56 60 55 65 48 42 33 48 38 44 50 65 61 63 61 66 60 49 65 73 67 55 51 49 42 29 45 62 71 57 54 59 58 51 50 48 63 45 54 67 83 75 107 61 57 70 55 56 63 74 56 53 55 69 51 68 63 69 56 50 55 52 54 56 63 50 67 45 44 50 75 70 94 91 48 81 75 85 78 57 43 68 33 39 53 61 76 85 63 105 89 68 77 57 64 66 60 58 56 57 66 62 76 45 52 54 89 57 73 56 54 48 55 37 54 41 34 59 20 42 44 55 37 46 38 59 70 60 19 45 28 54 46 44 61 49 59 46 52 59 45 54 34 54 57 40 66 31 46 40 38 67 43 24 38 55 50 36 28 51 45 35 36 : stddev is 14.0
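For reference, a minimal sketch of the per-bucket statistic; the exact script is not in this report, so this only assumes the reported value is (roughly) the population stddev of the per-bucket entry counts, with counts[] filled beforehand via counts[hash(name) % nbuckets]++ for each name:

#include <math.h>

/* population stddev of entries per bucket: sqrt(E[x^2] - E[x]^2) */
static double bucket_stddev(const unsigned long *counts, int nbuckets)
{
  double sum = 0.0, sumsq = 0.0;
  int i;
  for (i = 0; i < nbuckets; i++) {
    sum += (double) counts[i];
    sumsq += (double) counts[i] * (double) counts[i];
  }
  double mean = sum / nbuckets;
  return sqrt(sumsq / nbuckets - mean * mean);
}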
[25 Feb 2013 16:50] Dmitry Lenev
Hello MarkC, All!

I have created a new bug report to track the problem with MDL hash partitioning: see bug #68487 "MDL hash can still be concurrency bottleneck". Please use that new report to track the progress of our work on it and to report additional information/thoughts.

The issue with P_S overhead should be tracked through bug #68413 "performance_schema overhead is at least 10%" that you have created.

AFAIU these two issues cover the problem with mutex contention in 5.6 compared to 5.1 fairly well, so I am closing this bug again.

Please let me know if there is any aspect of the problem that I have missed.
[7 Jul 2013 11:46] James Day
Bug #69653 reports that the fix for this bug introduced a regression.
[4 Aug 2016 0:38] James Day
If affected by this please see:

1. http://mysqlserverteam.com/removing-scalability-bottlenecks-in-the-metadata-locking-and-th... describing how the issues were quite thoroughly removed from 5.7.5 DMR onwards.

2. https://bugs.mysql.com/bug.php?id=68487 for an improvement in the hashing algorithm that should help a lot from 5.6.15 and 5.7.3 DMR.