Bug #66473 | MySQL 5.6.6-m9 has more mutex contention than MySQL 5.1 | ||
---|---|---|---|
Submitted: | 21 Aug 2012 1:49 | Modified: | 25 Feb 2013 16:51 |
Reporter: | Mark Callaghan | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Locking | Severity: | S5 (Performance) |
Version: | 5.6.6-m9 | OS: | Any |
Assigned to: | Jon Olav Hauglid | CPU Architecture: | Any |
[21 Aug 2012 1:49]
Mark Callaghan
[21 Aug 2012 1:50]
Mark Callaghan
my.cnf options for MySQL 5.6 and 5.1

For 5.1:
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_doublewrite=1
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
max_connections=2000
table_cache=2000
key_buffer_size=200M
innodb_io_capacity=1000
skip_log_bin
query_cache_size=0
query_cache_type=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G

For 5.6.6:
innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=1
innodb_doublewrite=1
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
max_connections=2000
table_open_cache=2000
table_open_cache_instances=4
key_buffer_size=200M
innodb_io_capacity=1000
bind-address = 0.0.0.0
skip_log_bin
query_cache_size=0
query_cache_type=0
innodb_thread_concurrency=0
innodb_buffer_pool_size=32G
innodb_buffer_pool_instances=4
[21 Aug 2012 1:51]
Mark Callaghan
MySQL 5.6 isn't faster because of mutex contention. I think this is above InnoDB, but I might need to use the performance schema to identify the problem. All InnoDB mutex contention for MySQL 5.1 is from kernel_mutex, which gets hot from transaction stop and start. For 5.6 the context switch rate is much higher and the system CPU time is also higher. The test uses auto-commit SELECT statements, so the 5.6 fast path for InnoDB transaction start should be used.

vmstat for MySQL 5.6.6 at 64 clients, there are 530k context switches/second:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd     free    buff     cache    si   so   bi   bo     in     cs us sy id wa st
52  1      0 83037752  608844  44932396    0    0    0  102 400039 432435 49 23 28  1  0
55  1      0 83053400  608844  44932428    0    0    0  108 492252 536340 60 30  9  1  0
45  0      0 83051464  608844  44932504    0    0    0  125 490456 531527 61 29  9  1  0

vmstat for MySQL 5.1fb at 64 clients, there are ~340k context switches/second:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd      free   buff     cache    si   so   bi   bo     in     cs us sy id wa st
51  0      0 115455424 107292  15152028    0    0    3   80 573477 337272 72 22  6  0  0
53  0      0 115456008 107292  15152176    0    0    3   77 575436 337216 72 22  6  0  0
56  0      0 115288560 107292  15152340    0    0    3   68 570336 335162 72 22  6  0  0
[21 Aug 2012 2:00]
Mark Callaghan
I then ran the Linux "perf" utility to get call graphs for the largest sources of CPU time, and the results are why I think the problem is MDL.

This is the #1 source of CPU time:

--- _raw_spin_lock
   |
   |--45.79%-- futex_wake
   |          do_futex
   |          sys_futex
   |          system_call
   |          |
   |          |--98.43%-- __lll_mutex_unlock_wake
   |          |           |
   |          |           |--61.73%-- MDL_map::find_or_insert(MDL_key const*)
   |          |           |           MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
   |          |           |           MDL_context::acquire_lock(MDL_request*, unsigned long)
   |          |           |           _ZL23open_table_get_mdl_lockP3THDP18Open_table_contextP11MDL_requestjPP10MDL_ticket
   |          |           |           open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
   |          |           |           open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
   |          |           |           open_normal_and_derived_tables(THD*, TABLE_LIST*, unsigned int)
   |          |           |           _ZL21execute_sqlcom_selectP3THDP10TABLE_LIST
   |          |           |           mysql_execute_command(THD*)
   |          |           |           mysql_parse(THD*, char*, unsigned int, Parser_state*)
   |          |           |           dispatch_command(enum_server_command, THD*, char*, unsigned int)
   |          |           |           do_handle_one_connection(THD*)
   |          |           |           handle_one_connection
   |          |           |           start_thread
   |          |           |
   |          |           |--35.23%-- MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)
   |          |           |           MDL_context::release_locks_stored_before(enum_mdl_duration, MDL_ticket*)
   |          |           |           mysql_execute_command(THD*)
   |          |           |           mysql_parse(THD*, char*, unsigned int, Parser_state*)
   |          |           |           dispatch_command(enum_server_command, THD*, char*, unsigned int)
   |          |           |           do_handle_one_connection(THD*)
   |          |           |           handle_one_connection
   |          |           |           start_thread
   ...

And this is the #2 source of CPU time:

   |--39.47%-- futex_wait_setup
   |          futex_wait
   |          do_futex
   |          sys_futex
   |          system_call
   |          |
   |          |--99.32%-- __lll_mutex_lock_wait
   |          |           |
   |          |            --100.00%-- select_precheck(THD*, LEX*, TABLE_LIST*, TABLE_LIST*)
   |          |                        mysql_execute_command(THD*)
   |          |                        mysql_parse(THD*, char*, unsigned int, Parser_state*)
   |          |                        dispatch_command(enum_server_command, THD*, char*, unsigned int)
   |          |                        do_handle_one_connection(THD*)
   |          |                        handle_one_connection
   |          |                        start_thread
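For anyone reproducing this, call graphs like the one above are typically collected with something along the following lines; the exact perf invocation used here is not stated in this report, so treat the flags as an assumption:

$ perf record -a -g -- sleep 30    # sample all CPUs with call graphs while the benchmark runs
$ perf report -g                   # browse the resulting call-graph report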
[21 Aug 2012 3:16]
Mark Callaghan
I also confirmed that my foreground threads had INSTRUMENTED='YES' when sysbench ran.
[21 Aug 2012 3:41]
Sunny Bains
Mark, can you please also run a single threaded test for both 5.1 and 5.6 and post the results. That should rule out mutex contention (usually).
[21 Aug 2012 3:44]
Mark Callaghan
I tried to enable the mutex monitoring after server start; adding it to my.cnf instead fixed that. There is a huge amount of documentation on the PS. It would be nice to have more of a user's guide in addition to the current content, which is a reference.

And the problem appears to be the MDL locks:

mysql> SELECT COUNT_STAR, SUM_TIMER_WAIT, AVG_TIMER_WAIT, EVENT_NAME FROM events_waits_summary_global_by_event_name where COUNT_STAR > 0 and EVENT_NAME like 'wait/synch/%' order by SUM_TIMER_WAIT desc limit 10;
+------------+-----------------+----------------+-------------------------------------------+
| COUNT_STAR | SUM_TIMER_WAIT | AVG_TIMER_WAIT | EVENT_NAME |
+------------+-----------------+----------------+-------------------------------------------+
| 59754588 | 921695778424248 | 15424656 | wait/synch/mutex/sql/MDL_map::mutex |
| 20482862 | 845991708402096 | 41302200 | wait/synch/rwlock/innodb/btr_search_latch |
| 107961831 | 577757435771256 | 5351160 | wait/synch/rwlock/sql/MDL_lock::rwlock |
| 157672459 | 255148746921936 | 1617888 | wait/synch/mutex/innodb/trx_sys_mutex |
| 158603974 | 77097481832376 | 486096 | wait/synch/mutex/innodb/buf_pool_mutex |
| 144060420 | 59050906585872 | 409488 | wait/synch/mutex/innodb/fil_system_mutex |
| 253952141 | 44797824799248 | 176016 | wait/synch/rwlock/innodb/hash table locks |
| 105113929 | 44280621423168 | 420888 | wait/synch/mutex/sql/LOCK_table_cache |
| 52313292 | 35203238215536 | 672600 | wait/synch/rwlock/sql/LOCK_grant |
| 262783571 | 34965961792944 | 132696 | wait/synch/mutex/sql/THD::LOCK_thd_data |
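As a minimal my.cnf sketch for anyone repeating this, the synchronization wait instruments can be enabled at server startup like so; the performance_schema_instrument line is the same one that appears in the my.cnf posted later in this report, and performance_schema itself should already be ON by default in 5.6.6:

[mysqld]
performance_schema=ON
performance_schema_instrument='wait/synch/%=on'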
[21 Aug 2012 4:05]
Mark Callaghan
With 1 client thread MySQL 5.1.63 + the FB patch is faster:
5994 -- 5.6.6m9
6442 -- 5.1.63 + FB

This workload, short lookup queries, is one for which you are marketing MySQL 5.6, so we should make this better. Table caches were large enough that nothing was evicted. So why is the MDL the top source of contention? Unfortunately I don't have anything to count mutex contention for non-InnoDB code in MySQL 5.1 to determine whether LOCK_open was as bad.

Open_table_definitions 76
Open_tables 128
Opened_files 124
Opened_table_definitions 76
Opened_tables 135
Table_open_cache_hits 56309066
Table_open_cache_misses 135
Table_open_cache_overflows 0
[21 Aug 2012 5:10]
Mark Callaghan
How is MDL_map::m_mutex different from LOCK_open from a performance perspective? Both are global mutexes guarding access to a HASH. Both cause a lot of mutex contention.
[21 Aug 2012 6:45]
MySQL Verification Team
Does the metadata-locks-cache-size variable exist on non-Windows builds? Could increasing it help?
[21 Aug 2012 9:01]
Dmitry Lenev
Hi Shane! Yes it does. But since its default value is 1024, I doubt that increasing it will help for a case when only 8 tables are used (well, unless it is set to some lower value in Mark's case).
[21 Aug 2012 11:26]
Dmitry Lenev
Hello! First of all I would like to emphasize that MDL was added to solve some problems with correctness of isolation between normal transactions and DDL, like the infamous MySQL bug #989. It is not a replacement for the Table Definition Cache, and its mutexes/locks are not a replacement for LOCK_open (though it has allowed us to reduce the size of some of LOCK_open's critical sections). So it should be expected that this new synchronization mechanism introduces some extra overhead compared to 5.1, especially in the case with just 1 connection, since in that case the new synchronization layer adds pure overhead.

Having said that, I should also say that we have tried, and continue to try, to reduce the negative effect on performance caused by the introduction of MDL. So what you have reported above indeed looks like a problem which we would like to investigate.

Could you please clarify a few points for us: do you observe the same picture when you use a single table in your test? The reason why I am asking is that when 8 different tables are used in the way you have described above (i.e. each connection uses only one specific table), the situation with 8 connections using 8 tables will be pretty similar to the situation with one connection using one table. Also I suspect that in the case when you have 8 tables and a fairly low number of connections (16, 32) you can probably observe some overhead caused by extra code being executed under MDL_map::m_mutex. MDL_map::m_mutex works best (i.e. is taken for the shortest time) when a table is constantly used in some connection, which might not be the case for a low number of connections and several tables.

Now to your question: "How is MDL_map::m_mutex different from LOCK_open from a performance perspective? Both are global mutexes guarding access to a HASH. Both cause a lot of mutex contention."

Yes, you are correct that both MDL_map::m_mutex and LOCK_open are global mutexes. But there are some significant differences between them as well:

1) In the best case MDL_map::m_mutex is supposed to be taken once for each table instance used in a transaction. In the best case in 5.1/5.5, LOCK_open is taken twice for each table instance used in the statement (in 5.6 the partitioned LOCK_table_cache plays a similar role). So the pressure on MDL_map::m_mutex in the best case should be at least twice lower than on LOCK_open in 5.1/5.5, and it will be even lower if you are using transactions and not running in auto-commit mode.

2) In the best case, the size of the critical section for MDL_map::m_mutex should be smaller than the best case for LOCK_open in 5.1/5.5.

As I have said above, it is likely that you are not observing this best case in the scenario with a low number of connections and several tables. Indeed, the above doesn't mean that MDL_map::m_mutex is perfect! Definitely there is room for improvement!
[21 Aug 2012 13:46]
Mark Callaghan
Results for 1, 8, 16, 32, 64, 128, 256, 512 and 1024 threads are listed in this thread. QPS for 5.6.6-m9 is lower than 5.1.63 at 1 to 64 threads. Only at 128 threads and above is 5.6 doing better than 5.1.
[21 Aug 2012 15:39]
Marc ALFF
Hi Mark. First, thank you for making the effort to use the performance_schema, and for the data provided.

About your comment: "There is a huge amount of documentation on the PS. It would be nice to have more of a users guide in addition to the current content which is a reference." Point well taken.

Now, about the specifics:

mysql> select 921695778424248/power(10, 12) as mdlmap_wait_in_sec;
+--------------------+
| mdlmap_wait_in_sec |
+--------------------+
| 921.695778424248 |
+--------------------+
1 row in set (0.00 sec)

mysql> select 921695778424248/power(10, 12)/32 as mdlmap_wait_in_sec_per_core;
+-----------------------------+
| mdlmap_wait_in_sec_per_core |
+-----------------------------+
| 28.80299307575775 |
+-----------------------------+
1 row in set (0.00 sec)

From the performance_schema data collected, the total time spent on wait/synch/mutex/sql/MDL_map::mutex is 15 minutes for all, or 29 sec per core. This is significant, and is indeed the biggest bottleneck found.

MDL_map::mutex, LOCK_table_cache, and THD::LOCK_thd_data are all mutexes, implemented using the same underlying code (pthread_mutex_t), so comparing the data gives interesting results:

a) THD::LOCK_thd_data is not a singleton, and likely to be not contended at all. Yet these mutexes are locked 262 million times during the benchmark. An AVG_TIMER_WAIT of 132696 gives the lower end of what can be expected for a mutex lock, for this hardware and this load, on a non-contended mutex.

b) LOCK_table_cache is not a singleton, but there are only 4 table cache instances. Assuming a uniform distribution, 105113929/4 gives 26 million locks per mutex on average, with an AVG_TIMER_WAIT of 420888 per LOCK_table_cache mutex.

c) AVG_TIMER_WAIT for MDL_map::mutex is 116 times and 36 times bigger compared to THD::LOCK_thd_data and LOCK_table_cache, respectively.

Given that the underlying implementation:
- mysql_mutex_lock (the performance schema code)
- pthread_mutex_lock (the lock itself),
is the same in all cases, and given that this code already scales well with:
- either a great number of calls (262 M) on many instances,
- or a great number of calls using the same instance (26 M),
I think this points to the actual content of the critical section protected by the MDL_map mutex, which is causing the bottleneck here.

Based on the data reported in this report, the contention in MDL_map::m_mutex is indeed verified (for the load used).

Dimitry: Possible suggestions for a fix, based on early analysis:
- Change MDL_map::m_locks to a LF_HASH instead of HASH
- Keep MDL_map::m_mutex to protect MDL_map::m_unused_locks_cache if needed, but at least avoid locking this mutex when doing a search in the hash, so that multiple threads can search the hash in parallel.

Regards, -- Marc
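For illustration only (this is not the actual server code, and the names mdl_partition, MDL_PARTITIONS, hash_key and lock_partition below are made up for the sketch), the partitioning idea the thread converges on can be expressed in C roughly as: hash the MDL key once, pick a partition by taking the hash modulo the number of partitions, and take only that partition's mutex while searching its hash, so keys that map to different partitions never contend on a single global mutex:

#include <pthread.h>
#include <stddef.h>

#define MDL_PARTITIONS 8   /* e.g. the metadata_locks_hash_instances value */

struct mdl_partition {     /* hypothetical per-partition state */
  pthread_mutex_t mutex;   /* protects only this partition's hash */
  /* hash of MDL_lock objects whose keys map to this partition */
};

static struct mdl_partition partitions[MDL_PARTITIONS];

static void mdl_partitions_init(void) {
  size_t i;
  for (i = 0; i < MDL_PARTITIONS; i++)
    pthread_mutex_init(&partitions[i].mutex, NULL);
}

/* Any reasonable string hash works for the sketch; the server hashes the
   raw bytes of the MDL key. */
static unsigned long hash_key(const char *key, size_t len) {
  unsigned long h = 5381;
  while (len--) h = h * 33 + (unsigned char)*key++;
  return h;
}

/* Lock only the partition that owns this key. */
static struct mdl_partition *lock_partition(const char *key, size_t len) {
  struct mdl_partition *p = &partitions[hash_key(key, len) % MDL_PARTITIONS];
  pthread_mutex_lock(&p->mutex);
  return p;   /* caller searches/inserts in p's hash, then unlocks p->mutex */
}

This is the shape of the fix that eventually shipped as the metadata_locks_hash_instances variable (see the changelog note later in this report); the LF_HASH alternative would instead make the search itself lock-free.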
[21 Aug 2012 16:32]
Dmitry Lenev
Hi Mark! "Results for 1, 8, 16, 32, 64, 128, 256, 512 and 1024 threads are listed in this thread. QPS for 5.6.6-m9 is lower than 5.1.63 at 1 to 64 threads. Only at 128 threads and above is 5.6 doing better than 5.1.". The above you say that these data are for the test that "uses 8 tables with clients uniformly distributed across the tables". My question was "do you observe the same issue for case when only 1 table is used and not 8"? Do I miss any additional data that you have already provided?
[21 Aug 2012 16:53]
Mark Callaghan
If you are asking for results with 1 -> 1024 client threads all using one table, I will get results for that later this week.
[21 Aug 2012 16:58]
Dmitry Lenev
Hi Mark! Yes, this is what I am asking for! Thank you a lot for your feedback and all the efforts!
[24 Aug 2012 23:02]
Mark Callaghan
Original data I provided, innodb_buffer_pool_instances=4 for MySQL 5.6:

8 16 32 64 128 256 512 1024 clients
56104 113944 212192 277937 177672 118408 67525 32031 5.1fb
53726 107174 183502 233566 238878 210891 180060 148195 5.6.5m8
49960 105759 182709 237572 240187 228502 220709 215144 5.6.6m9

Results with only 1 table, and innodb_buffer_pool_instances=1,2,4,8 (bpi=X below):
51326 105491 183253 239894 250153 246685 237081 203625 5.6.6-m9, bpi=1
52315 106299 183950 239474 249353 245024 243746 192644 5.6.6-m9, bpi=2
52186 105891 183748 239387 248858 241922 239481 198986 5.6.6-m9, bpi=4
54155 105006 183881 239615 249110 247162 241249 216810 5.6.6-m9, bpi=8

Results with 8 tables, and innodb_buffer_pool_instances=1,2,4,8:
48914 93813 155093 197641 203535 205029 58947 67841 5.6.6-m9, bpi=1
49710 94166 156151 201469 208929 208695 75083 76993 5.6.6-m9, bpi=2
48642 93508 152709 196765 203727 201939 140980 129608 5.6.6-m9, bpi=4
48364 92947 150315 190443 196549 197086 192819 147549 5.6.6-m9, bpi=8
[5 Oct 2012 16:56]
Paul DuBois
Noted in 5.6.8, 5.7.0 changelogs. Metadata locking resulted in excessive contention in read-only workloads involving InnoDB tables and a low number of connections. Now the set of metadata locks can be partitioned into separate hashes to permit connections accessing different objects to use different locking hashes and reduce contention. The new metadata_locks_hash_instances system variable can be used to specify the number of hashes.
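A minimal usage sketch, assuming the default of 8 is being raised (the variable name comes from the changelog entry above; the value 16 is only an example, and as far as I know the option is not dynamic, so it goes in my.cnf and takes effect at the next restart):

[mysqld]
metadata_locks_hash_instances = 16

mysql> SHOW GLOBAL VARIABLES LIKE 'metadata_locks_hash_instances';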
[17 Feb 2013 1:37]
Mark Callaghan
Please keep working on this. I reproduce the same problem with 5.6.10, and MySQL 5.1.63 + FB patch is a lot faster for the problem workload.

The workload is:
* sysbench with 8 tables
* clients distributed uniformly across tables, each client is limited to one table
* SQL is fetch 1 row by primary key
* mysqld was linked with jemalloc
* bpi=X below means innodb_buffer_pool_instances=X
* orig5610, orig5152, orig5163 == unmodified MySQL 5.6.10, 5.1.52, 5.1.63
* dev5163 == MySQL 5.1.63 + Facebook patch

QPS is listed below:
8 16 32 64 128 256 concurrent clients
56161 98147 152856 171499 170448 170035 orig5610, bpi=1
56386 99076 152547 168290 170256 169034 orig5610, bpi=2
52841 101831 151345 171098 171452 170812 orig5610, bpi=4
51927 92475 151547 166432 169196 168639 orig5610, bpi=8
...
61968 114979 179318 223252 223619 222973 dev5163
65231 117471 191359 222856 227052 85756 orig5152
66971 125003 191782 225055 218248 81998 orig5163

This was measured after the test ran for a few minutes. The top two sources for waits were still growing fast:

bin/mysql -A performance_schema -e "select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/%' and count_star > 0 order by sum_timer_wait desc limit 20"
Warning: Using a password on the command line interface can be insecure.
+---------------------------------------------------+------------+-----------------+
| event_name | count_star | sum_timer_wait |
+---------------------------------------------------+------------+-----------------+
| wait/synch/mutex/sql/MDL_map::mutex | 77858908 | 504111071029125 |
| wait/synch/rwlock/sql/MDL_lock::rwlock | 116294302 | 211569523668750 |
| wait/synch/rwlock/innodb/btr_search_latch | 48820750 | 100330805167125 |
| wait/synch/mutex/innodb/trx_sys_mutex | 169600540 | 61770792053625 |
| wait/synch/mutex/sql/THD::LOCK_thd_data | 282663275 | 35703234132375 |
| wait/synch/mutex/mysys/THR_LOCK::mutex | 113065534 | 32369886805500 |
| wait/synch/mutex/sql/LOCK_table_cache | 113065774 | 25974495424500 |
| wait/synch/rwlock/sql/LOCK_grant | 56291239 | 14945362568250 |
| wait/synch/rwlock/innodb/hash table locks | 57572866 | 8973685354125 |
| wait/synch/rwlock/innodb/index_tree_rw_lock | 16360588 | 2977433323875 |
| wait/synch/mutex/innodb/os_mutex | 35510564 | 2751022925250 |
| wait/synch/mutex/innodb/buf_pool_mutex | 5600301 | 1554670281000 |
| wait/synch/mutex/innodb/fil_system_mutex | 8928068 | 949883385000 |
| wait/synch/mutex/innodb/log_sys_mutex | 1984042 | 186805321500 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 4194433 | 73857420750 |
| wait/synch/mutex/innodb/mutex_list_mutex | 4194728 | 73820670000 |
| wait/synch/rwlock/innodb/checkpoint_lock | 106 | 8239699125 |
| wait/synch/cond/sql/COND_server_started | 1 | 5888406375 |
| wait/synch/mutex/sql/LOCK_global_system_variables | 13563 | 765520500 |
| wait/synch/mutex/innodb/dict_sys_mutex | 227 | 632344500 |
+---------------------------------------------------+------------+-----------------+
[17 Feb 2013 1:40]
Mark Callaghan
my.cnf used for mysql 5.6:

socket=/data/orig5610/data/mysql.sock
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
innodb_log_file_size=1900M
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
table-definition-cache=1000
table-open-cache=2000
max_connections=2000
key_buffer_size=200M
innodb_io_capacity=1000
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
query_cache_size=0
query_cache_type=0
skip_log_bin
performance_schema_instrument='wait/synch/%=on'
innodb_buffer_pool_size=64g
table_open_cache_instances=8
[17 Feb 2013 4:04]
Mark Callaghan
So, things have changed a bit since I first reported this. But the outcome is that for this workload 5.6.10 is slower than MySQL 5.1.63 using the FB patch. At lower concurrencies 5.6.10 is also slower than unmodified 5.1.63. I think there is a real problem with mutex contention in the MDL code in 5.6.

QPS for the same workload, but only using one sysbench process and one database table:
1 2 4 8 16 32 64 128 256 concurrent clients
9164 17986 36376 66656 122862 186483 213858 218781 204757 dev5163
9317 19237 37861 71679 131732 194570 224029 218642 216220 orig5152
9627 18654 36958 69962 130675 192244 217105 222791 218747 orig5163
8185 16458 32734 61961 116189 162483 179290 177457 177203 orig5610

And QPS for 8 sysbench processes, with each process using a different table. Each process uses 1, then 2, then 4, ... then 32 threads.
8 16 32 64 128 256 concurrent clients
61968 114979 179318 223252 223619 222973 dev5163
65231 117471 191359 222856 227052 85756 orig5152
66971 125003 191782 225055 218248 81998 orig5163
56161 98147 152856 171499 170448 170035 orig5610
[17 Feb 2013 4:07]
Mark Callaghan
The test servers have 24 cores when HT is enabled. The test tables have 16M rows each and 16M rows/table X 8 tables == ~32gb of data. The innodb buffer pool was 64gb. sysbench ran on a separate host from mysqld.
[18 Feb 2013 9:47]
Dimitri KRAVTCHUK
Mark, may you replay the same test on your server on MySQL 5.6, but with:
1. innodb_spin_wait_delay = 96
2. performance_schema = OFF (as you're comparing to 5.1)
3. both 1. & 2.
because on 24 cores and point-selects on 8 tables I'm obtaining 250K QPS on 5.6 currently (and supposing your CPU is faster than my 2300MHz, no?) - so, I'm a little bit surprised by your results..
Rgds, -Dimitri
[18 Feb 2013 14:51]
Mark Callaghan
Dimitri - I am equally surprised by your results. I previously reported both this problem and http://bugs.mysql.com/bug.php?id=68413. They were true in the past for Percona HW and my HW, so perhaps they are still true. CPU speed today is 2.67 GHz.

I shouldn't have to disable the PS to make a valid comparison to 5.1. The PS overhead shouldn't be a concern. Any comparison of the FB patch to unmodified MySQL includes the overhead for user_stats/table_stats. It would be nice one day if the PS provided useful metrics that could be enabled all of the time without fear of overhead, as user_stats/table_stats provide.

Perhaps the MDL code is sensitive to pthread mutex performance. I use gcc 4.6.2 and glibc 2.13. I don't think tuning InnoDB mutexes will help here, as the problem appears to be mutex contention from MDL. Perhaps trying different values for metadata_locks_hash_instances would help, but the default value is 8 and my tests use 8 tables.

MySQL 5.6 loses about 10% QPS from PS. Results with PS not compiled and with performance_schema=0 are at http://bugs.mysql.com/bug.php?id=68413.

8 16 32 64 128 256 concurrent clients
57223 109233 165677 190622 193110 191969 without PS compiled
56161 98147 152856 171499 170448 170035 with PS compiled
56885 106798 164094 184594 187841 186187 with PS,performance_schema=0
56978 103470 162754 188200 188200 186618 without PS compiled, innodb_spin_wait_delay=96
[18 Feb 2013 17:19]
Dimitri KRAVTCHUK
Mark, I'm also surprised by your QPS numbers.. - with the faster CPU you have, you should reach a much higher level than my 250K QPS at 2300MHz.. (in fact it's 2.27GHz)..

I've compiled my binaries with gcc 4.4.5, and the following options:
CFLAGS="-g -O3" CXXFLAGS="-g -O3" cmake . -DWITH_FAST_MUTEXES=ON
nothing special.. and yes, I use jemalloc (but it's as usual)..
Rgds, -Dimitri
[19 Feb 2013 2:47]
Mark Callaghan
Dimitri - I think you need to be more specific about your QPS results. AFAIK, you get 250k QPS with HT-disabled and 200k QPS with HT-enabled. So I think you are halfway to reproducing my results. Now just repeat the tests for 5.1 (or maybe 5.5) with and without HT.
[19 Feb 2013 3:13]
Mark Callaghan
8 16 32 64 128 256 concurrent clients
57223 109233 165677 190622 193110 191969 without PS compiled
56161 98147 152856 171499 170448 170035 with PS compiled
56885 106798 164094 184594 187841 186187 with PS,performance_schema=0
56978 103470 162754 188200 188200 186618 without PS compiled, innodb_spin_wait_delay=96
55940 101771 165779 191814 193125 192441 no PS, compiled with -O3
[19 Feb 2013 3:43]
Mark Callaghan
Using taskset -c 0,2,4,6,8,10.12,14,16,18,20,22 results are below with the label "taskset". They were much worse. I might try to disable HT on this host tomorrow.

8 16 32 64 128 256 concurrent clients
57223 109233 165677 190622 193110 191969 without PS compiled
56161 98147 152856 171499 170448 170035 with PS compiled
56885 106798 164094 184594 187841 186187 with PS,performance_schema=0
56978 103470 162754 188200 188200 186618 without PS compiled, innodb_spin_wait_delay=96
55940 101771 165779 191814 193125 192441 no PS, compiled with -O3
48410 91457 114284 116162 114267 111194 taskset
[19 Feb 2013 7:31]
Dimitri KRAVTCHUK
Mark, didn't you say you have 24 cores?.. (24 cores with HT, so 48 vcpu) or do you have 12 cores? (and with HT you're obtaining 24 vcpu?)

well, on Linux the numbers of vcpu are assigned not in the same way as on Solaris (0,1 = core0, 2,3 = core1, etc.) - on Linux: 0 = core0.thread0, 1 = core1.thread0, 2 = core2.thread0, etc. so, if you have true 24 cores, the taskset you have to use is: $ taskset -c 0-23 ... but if you have only 12 cores -- no need to test then, such an impact you may see only since 24 cores ;-)

regarding HT: well, with MySQL 5.6 we're scaling today up to 32 cores for sure.. - however having HT enabled on 32 cores is giving 64 vcpu (or CPU threads, call it as you want).. - and depending on which mutex we're hitting internal contention, you may have an increase by having 64 vcpu, or a decrease.. - because having 64 vcpu may result in a much higher concurrency in locking, or may help to resolve such locking by HW (as only 32 threads are really running at the same time).. - well, all depends on a workload.. you know, there is nothing new ;-) the first threads on CPU cores were implemented on SPARC by Sun, and we played a lot with such "tuning" to reach the most optimal throughput on various applications which are not scaling up to 72 CPU for ex., and we want to see the best possible result on a given HW ;-)

anyway, may you clarify, please:
- do you have 24 or 12 cores finally?..
- may you give me exactly the command line of sysbench you're using (and its version)?
Rgds, -Dimitri
[19 Feb 2013 9:06]
Dmitry Lenev
Hello Mark! Could you please provide some additional information? Particularly I am interested in the results of

select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/%' and count_star > 0 order by sum_timer_wait desc limit 20

after you run "8 sysbench processes, with each process using a different table and each process uses 1 thread". In other words, I am interested in info from P_S only after the sysbench run that you used to produce data for the "8 concurrent users" column.

In theory in this case we should not be seeing much contention on MDL locks at all, as in this case each sysbench instance should use its own MDL_lock object and its own MDL_map partition. So if you see a picture which looks different, it is a clear indication that something "strange" happens.

Also could you please provide the exact names of the database and tables that you are using for the sysbench runs? Is it possible that our hash function gives collisions for these names, so that some of the sysbenches in the 8 processes x 1 table x 1 thread case end up working with the same MDL_map partition?

Thank you in advance!
[19 Feb 2013 14:43]
Mark Callaghan
Dimitri - 12 real cores, 24 when HT is enabled. But otherwise I am getting too many questions. You report that QPS for 5.6 is much worse with HT enabled on your test. The next step for you is to run the tests for MySQL 5.1 with and without HT. I don't think 5.1 will lose so much from using HT. You can also use PS to show that the top source of mutex contention is MDL.

I use my fork of sysbench that has better metrics. It is at https://code.launchpad.net/~mdcallag/sysbench/0.4-dev

8 instances of the loop below are run. Each instance uses a different value for table_num. Also at each iteration of the loop, nthreads was doubled (1 -> 2 -> 4 -> ... -> 32). And each iteration of the loop ran at the same time (so nthreads=32 for all tables ran concurrently).

nthreads=1
table_num=1
./sysbench --batch --batch-delay=10 --test=oltp --mysql-db=test --oltp-table-size=16000000 --max-time=180 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${table_num} --oltp-read-only --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform --oltp-range-size=100 --oltp-connect-delay=0 --percentile=99 --num-threads=$nthreads --seed-rng=1 run
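A sketch of the full driver implied by that description (this is a reconstruction, not the actual script used, and it omits whatever synchronization kept all 8 processes on the same nthreads step at the same time):

#!/bin/sh
# 8 sysbench processes, one table each; each process doubles its thread
# count from 1 to 32, reusing the command line given above.
for table_num in 1 2 3 4 5 6 7 8; do
  (
    for nthreads in 1 2 4 8 16 32; do
      ./sysbench --batch --batch-delay=10 --test=oltp --mysql-db=test \
        --oltp-table-size=16000000 --max-time=180 --max-requests=0 \
        --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes \
        --oltp-table-name=sbtest${table_num} --oltp-read-only --oltp-skip-trx \
        --oltp-test-mode=simple --oltp-point-select-all-cols \
        --oltp-dist-type=uniform --oltp-range-size=100 --oltp-connect-delay=0 \
        --percentile=99 --num-threads=$nthreads --seed-rng=1 run
    done
  ) &   # each table's loop runs as its own process
done
wait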
[19 Feb 2013 15:30]
Mark Callaghan
results for taskset2 are from using "taskset -c 0-11 mysqld ...". Both of the taskset results are from binaries without compiled support for PS.

8 16 32 64 128 256 concurrent clients
57223 109233 165677 190622 193110 191969 without PS compiled
56161 98147 152856 171499 170448 170035 with PS compiled
56885 106798 164094 184594 187841 186187 with PS,performance_schema=0
56978 103470 162754 188200 188200 186618 without PS compiled, innodb_spin_wait_delay=96
55940 101771 165779 191814 193125 192441 no PS, compiled with -O3
48410 91457 114284 116162 114267 111194 taskset
55417 103171 139764 143975 142418 141297 taskset2

Table names are sbtest1, sbtest2, ..., sbtest8. The db name is "test".

PS synch waits from 8 and then from 32 concurrent clients are below. mysqld was restarted prior to each test, then the buffer pool was warmed, then the test was run. MDL synch objects degrade more than others from 8 to 32.

PS synch waits from 8 concurrent clients:
event_name count_star sum_timer_wait
wait/synch/mutex/sql/MDL_map::mutex 69163351 88867813661625
wait/synch/mutex/sql/THD::LOCK_thd_data 172909410 14895489702000
wait/synch/mutex/sql/LOCK_table_cache 69163474 13136791046250
wait/synch/rwlock/sql/MDL_lock::rwlock 69163634 10921715781375
wait/synch/rwlock/innodb/btr_search_latch 29647287 10169021009625
wait/synch/mutex/innodb/trx_sys_mutex 103745293 9453612002625
wait/synch/mutex/mysys/THR_LOCK::mutex 69163309 8550634552125
wait/synch/rwlock/innodb/hash table locks 47841552 6164445476625
wait/synch/rwlock/sql/LOCK_grant 34538720 5295065168250
wait/synch/mutex/innodb/os_mutex 32892092 2292353206125
wait/synch/mutex/innodb/buf_pool_mutex 5533946 1543466449500
wait/synch/mutex/innodb/fil_system_mutex 8933568 948979432875
wait/synch/rwlock/innodb/index_tree_rw_lock 13243461 793798469625
wait/synch/mutex/innodb/log_sys_mutex 1898188 167180788875
wait/synch/mutex/innodb/mutex_list_mutex 4194708 76603623375
wait/synch/mutex/innodb/rw_lock_list_mutex 4194433 74073824250
wait/synch/rwlock/innodb/checkpoint_lock 168 12754220250
wait/synch/cond/sql/COND_server_started 1 5931476625
wait/synch/mutex/sql/LOCK_global_system_variables 22714 780129375
wait/synch/mutex/innodb/flush_list_mutex 2623 351319500

PS synch waits from 32 concurrent clients:
event_name count_star sum_timer_wait
wait/synch/mutex/sql/MDL_map::mutex 119230858 841983512712000
wait/synch/rwlock/sql/MDL_lock::rwlock 176383948 344424250576875
wait/synch/rwlock/innodb/btr_search_latch 78559058 103554075033000
wait/synch/mutex/innodb/trx_sys_mutex 256703751 91329344352000
wait/synch/mutex/sql/THD::LOCK_thd_data 427835485 55128454216875
wait/synch/mutex/mysys/THR_LOCK::mutex 171134261 47433195025875
wait/synch/mutex/sql/LOCK_table_cache 171134507 41352937581750
wait/synch/rwlock/sql/LOCK_grant 85225707 22604762027625
wait/synch/rwlock/innodb/hash table locks 63075670 9730971753000
wait/synch/rwlock/innodb/index_tree_rw_lock 18015087 3204438111375
wait/synch/mutex/innodb/os_mutex 35538726 2894663019750
wait/synch/mutex/innodb/buf_pool_mutex 5606625 1687920321375
wait/synch/mutex/innodb/fil_system_mutex 8933310 979599250875
wait/synch/mutex/innodb/log_sys_mutex 2101239 202458754125
wait/synch/mutex/innodb/rw_lock_list_mutex 4194433 75105732000
wait/synch/mutex/innodb/mutex_list_mutex 4194728 74155491375
wait/synch/rwlock/innodb/checkpoint_lock 164 11624814375
wait/synch/cond/sql/COND_server_started 1 6223468875
wait/synch/mutex/sql/LOCK_global_system_variables 20505 912133875
wait/synch/mutex/innodb/flush_list_mutex 2580 353359500
[19 Feb 2013 16:46]
Dmitry Lenev
Hello Mark! I have checked in a debugger, and it turns out that the MDL keys for ALL 8 tables in the test.sbtest1 .. test.sbtest8 sequence have such hashes that they end up assigned to the same MDL_map partition!!!

Indeed, this means that all those threads in your benchmark keep banging against one poor MDL_map partition and its lock, while the other seven partitions sit unused. This perfectly explains why you see such high contention on MDL_map::mutex in P_S even in the situation with 8 sysbenches x 1 table x 1 thread, in which normally no contention should be expected at all! Also this is probably why we didn't see high contention on MDL_map in our tests (as we use different names for our tables).

We need to figure out what can be done to address such scenarios. Meanwhile, as a workaround you can use different names for your tables. For example, the sequence test.sbtest_1 ... test.sbtest_8 will nicely and uniformly use all 8 MDL_map partitions available in the default configuration.
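A minimal sketch of the suggested workaround, assuming the existing sysbench tables are simply renamed before the run (sysbench would then be pointed at the new names via --oltp-table-name=sbtest_${table_num}):

RENAME TABLE test.sbtest1 TO test.sbtest_1, test.sbtest2 TO test.sbtest_2,
             test.sbtest3 TO test.sbtest_3, test.sbtest4 TO test.sbtest_4,
             test.sbtest5 TO test.sbtest_5, test.sbtest6 TO test.sbtest_6,
             test.sbtest7 TO test.sbtest_7, test.sbtest8 TO test.sbtest_8;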
[19 Feb 2013 17:45]
Mark Callaghan
Dmitry Lenev - that is good news, but is there something wrong with the hash algorithm or did I get unlucky? I think there is a 1 in 2M chance that all 8 tables choose the same bucket when there are 8 buckets. Assuming the number of buckets is metadata_locks_hash_instances, that is 8 by default. Doing the math for selection with replacement of 8 items from 8 buckets, I think this is the probability that all items end up in one bucket:

((1/8) ^ 8) * 8 --> about 1 in 2M
[19 Feb 2013 17:47]
Dimitri KRAVTCHUK
Mark, on 12 cores no need to play with HT :-) as said before, it may show an impact only on 24 cores or bigger servers.. on my 12-core server (2.9GHz with HT) I'm getting over 200K QPS with PFS=on and synch (mutex, rwlock) and I/O instrumentations enabled ;-) (and the main contention is on the trx_sys mutex, as expected).. So, please try to rename your tables as DmitryL suggested, and let's see your new test results?.. Rgds, -Dimitri
[19 Feb 2013 18:12]
Mark Callaghan
Results are better with metadata_locks_hash_instances=256, and 5.6.10 is almost as fast as 5.1.63 + FB patch:
* with metadata_locks_hash_instances=256 (default was 8)
* table_open_cache_instances=8 (default was 1, the benefit is 1% or 2%)
* perf_schema disabled at compile time

PS synch waits with 32 concurrent clients:
event_name count_star sum_timer_wait
wait/synch/rwlock/innodb/btr_search_latch 82516624 106132845340125
wait/synch/mutex/sql/MDL_map::mutex 120409851 100693371306375
wait/synch/mutex/innodb/trx_sys_mutex 269631102 100016743892250
wait/synch/rwlock/sql/MDL_lock::rwlock 181016771 76386925351875
wait/synch/mutex/sql/THD::LOCK_thd_data 449381080 52701740326125
wait/synch/mutex/mysys/THR_LOCK::mutex 179752505 48466235973750
wait/synch/mutex/sql/LOCK_table_cache 179752782 42455525954625
wait/synch/rwlock/sql/LOCK_grant 89509377 27099863204625
wait/synch/rwlock/innodb/hash table locks 63678623 9439458772500
wait/synch/rwlock/innodb/index_tree_rw_lock 18189424 2798154987750
wait/synch/mutex/innodb/os_mutex 35443021 2775027340125
wait/synch/mutex/innodb/buf_pool_mutex 5608060 1624722504375
wait/synch/mutex/innodb/fil_system_mutex 8932983 951317686125
wait/synch/mutex/innodb/log_sys_mutex 2118386 206010405375
wait/synch/mutex/innodb/mutex_list_mutex 4194728 83925673500
wait/synch/mutex/innodb/rw_lock_list_mutex 4194433 82427459250
wait/synch/rwlock/innodb/checkpoint_lock 158 12088848000
wait/synch/cond/sql/COND_server_started 1 5972825250
wait/synch/mutex/sql/LOCK_global_system_variables 20190 868146375
wait/synch/mutex/innodb/flush_list_mutex 2533 366217500

8 16 32 64 128 256 concurrent clients
57223 109233 165677 190622 193110 191969 without PS compiled
56161 98147 152856 171499 170448 170035 with PS compiled
56885 106798 164094 184594 187841 186187 with PS,performance_schema=0
56978 103470 162754 188200 188200 186618 without PS compiled, innodb_spin_wait_delay=96
55940 101771 165779 191814 193125 192441 no PS, compiled with -O3
48410 91457 114284 116162 114267 111194 taskset
55417 103171 139764 143975 142418 141297 taskset2
...
53598 100470 165261 187998 189511 185304 with PS compiled, metadata_locks_hash_instances=256
61024 104393 177109 217812 216948 215513 without PS compiled, metadata_locks_hash_instances=256
...
61968 114979 179318 223252 223619 222973 5.1.63 + FB patch
65231 117471 191359 222856 227052 85756 unmodified 5.1.52
66971 125003 191782 225055 218248 81998 unmodified 5.1.63
[19 Feb 2013 19:34]
Dmitry Lenev
Hello Mark! Your math assumes that each bucket is selected perfectly randomly, i.e. that for each table the bucket is selected totally randomly and independently of the selection for the previous table. But this is not the case, as the hash function is a function of the key, and the keys in your case are dependent on each other.

Assuming that I understand correctly how the my_hash_sort_bin() hash function works, here is how we can express the hash function for MDL keys for test.sbtestX (or more generally somedb.someprefixX) tables:

hash-for-mdl-key-for-"test.sbtestX"-table = const1 ^ (const2 * (48 + x) + const3 << 8) ^ (func1(x) << 8)

where const1 and const2 are functions of the "test.sbtest" prefix, and const3 is a function of the prefix length. So the number of the bucket/partition which is chosen looks like:

MDL_map_partition_id = hash-for-mdl-key-for-"test.sbtestX"-table % 8
                     = (const1 ^ (const2 * (48 + x) + const3 << 8) ^ (func1(x) << 8)) & 7
                     = (const1 ^ (const2 * (48 + x) + const3 << 8)) & 7
                     = (const1 ^ (const2 * 48 + const2 * x)) & 7

After a bit of head scratching one can see that if const2 is such that const2 & 7 == 0, the above expression turns out to be totally independent of X. The probability of const2 & 7 being 0 can be guesstimated as 1/8. So in fact the probability of all tables named somedb.someprefix1 .. somedb.someprefix8 getting assigned to the same partition can be guesstimated as 1/8.
[19 Feb 2013 19:56]
Dimitri KRAVTCHUK
Mark, when the code is working as expected, MDL_map::mutex waits completely disappear from the top list.. So, please, try to rename the tables to get another view of the problem ;-) (expecting MDL_map::mutex will be no more here) Rgds, -Dimitri
[19 Feb 2013 20:26]
Dmitry Lenev
Hello Mark! After a bit more debugging... It turns out that for the "test.sbtest" prefix, const2 in the above expression is equal to 64. Which is rather unfortunate, as it means that even with 256 MDL_map partitions you will be using only 4 of them ("64 * x mod 256" means only 2 of the 3 bits of X are going to be used).
[19 Feb 2013 20:50]
Mark Callaghan
Dmitry Lenev - nice debugging, but I am confused by the pseudo-code for my_hash_sort_bin. This is the source I see for 5.1 and 5.6 that implements the hash_sort function pointer for MY_COLLATION_HANDLER. It is called by calc_hash, which only uses nr1.

static uint calc_hash(const HASH *hash, const uchar *key, size_t length)
{
  ulong nr1=1, nr2=4;
  hash->charset->coll->hash_sort(hash->charset,(uchar*) key,length,&nr1,&nr2);
  return nr1;
}

void my_hash_sort_bin(CHARSET_INFO *cs __attribute__((unused)),
                      const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;
  key+= len;
  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

So for strings of the form test.sbtestX, then nr1[0] ^= (K1 * X) + K2.

Using my test program, the strings test.sbtestX for X in 1 to 8 used 4 of 8 buckets (1, 3, 5 and 7). With 256 buckets there were no collisions. It has been a long time since I tried to analyze the quality of a hash function, but I suspect this one needs work, or at least should not be used when there are a small number of buckets. burtle was a great reference for me and he used to work at Oracle -- http://burtleburtle.net/bob/hash/hashfaq.html

gcc -o hash hash.c; ./hash
test.sbtest1 : 13154541203486169799, 7, 199
test.sbtest2 : 13154541203486169613, 5, 13
test.sbtest3 : 13154541203486169683, 3, 83
test.sbtest4 : 13154541203486170009, 1, 153
test.sbtest5 : 13154541203486170079, 7, 223
test.sbtest6 : 13154541203486169957, 5, 101
test.sbtest7 : 13154541203486164139, 3, 171
test.sbtest8 : 13154541203486164209, 1, 241

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;
  key+= len;
  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = {
  "test.sbtest1", "test.sbtest2", "test.sbtest3", "test.sbtest4",
  "test.sbtest5", "test.sbtest6", "test.sbtest7", "test.sbtest8"
};

int main() {
  int x;
  for (x=0; x < 8; ++x) {
    ulong nr1 = 1;
    ulong nr2 = 4;
    my_hash_sort_bin(s[x], strlen(s[x]), &nr1, &nr2);
    printf("%s : %lu, %lu, %lu\n", s[x], nr1, nr1 % 8, nr1 % 256);
  }
  return 0;
}
[19 Feb 2013 20:52]
Mark Callaghan
Dimitri Kravtchuk - I tested with metadata_locks_hash_instances=256 and my analysis shows there should be no hash bucket collisions in that case. MySQL 5.1 + FB patch is still a small amount faster than 5.6 when the PS is disabled at compile time and a lot faster when PS is enabled with default options.
[19 Feb 2013 21:06]
Mark Callaghan
Dmitry Lenev - I updated my test case to include test.sbtest0 and test.sbtest9. The output below is <key> : <nr1>, <nr1 % 8>, <nr1 % 256>. The test output shows that 4 of 8 buckets and 10 of 256 buckets will be used. So I continue to wonder why we are talking about different behavior for my_hash_sort_bin, but I appreciate your analysis. The hash function my_hash_sort_bin has a common problem for a hash function -- you can't trust the least significant bits.

$ gcc -o hash hash.c; ./hash
test.sbtest1 : 13154541203486169799, 7, 199
test.sbtest2 : 13154541203486169613, 5, 13
test.sbtest3 : 13154541203486169683, 3, 83
test.sbtest4 : 13154541203486170009, 1, 153
test.sbtest5 : 13154541203486170079, 7, 223
test.sbtest6 : 13154541203486169957, 5, 101
test.sbtest7 : 13154541203486164139, 3, 171
test.sbtest8 : 13154541203486164209, 1, 241
test.sbtest0 : 13154541203486169729, 1, 129
test.sbtest9 : 13154541203486164023, 7, 55
[19 Feb 2013 21:11]
Dmitry Lenev
Hello Mark! In your program you assume that the key which is used for the "test"."sbtest1" table in the MDL subsystem looks exactly like "test.sbtest1", but this assumption is false. Actually, for test.sbtest1 the exact key used in MDL will be "\002test\000sbtest1\000" (\002 means table; also note the \000 used as a separator). This is the value which we pass to the hash function.

After adjusting your program to take this fact into account, I get results from it which are consistent with the results of my debugging: for 8 buckets only 1 bucket is used, for 256 buckets only 4 buckets are used. And yes, I guess this tells us something about our hash function.
[19 Feb 2013 21:22]
Mark Callaghan
Using my test, there is one collision for mod 16 and no collisions for mod 32. My guess is that either a better hash function is needed, or you can't use too few buckets because the least significant bits of the hash output are not sufficiently random. I also updated my test program to compare my_hash_sort_bin and the InnoDB hash functions. InnoDB does much better for a small number of hash buckets. The results below show:
<key> : <hash return>, <hash % 8>, <hash % 16>, <hash % 32>, <hash % 64>, <hash % 128>, <hash % 256>

gcc -o hash hash.c; ./hash
MySQL hash
test.sbtest1 : 13154541203486169799, 7, 7, 7, 7, 71, 199
test.sbtest2 : 13154541203486169613, 5, 13, 13, 13, 13, 13
test.sbtest3 : 13154541203486169683, 3, 3, 19, 19, 83, 83
test.sbtest4 : 13154541203486170009, 1, 9, 25, 25, 25, 153
test.sbtest5 : 13154541203486170079, 7, 15, 31, 31, 95, 223
test.sbtest6 : 13154541203486169957, 5, 5, 5, 37, 101, 101
test.sbtest7 : 13154541203486164139, 3, 11, 11, 43, 43, 171
test.sbtest8 : 13154541203486164209, 1, 1, 17, 49, 113, 241
test.sbtest0 : 13154541203486169729, 1, 1, 1, 1, 1, 129
test.sbtest9 : 13154541203486164023, 7, 7, 23, 55, 55, 55

InnoDB hash
test.sbtest1 : 5358432477730207684, 3, 11, 11, 11, 11, 139
test.sbtest2 : 5358432477730206917, 2, 10, 10, 10, 10, 138
test.sbtest3 : 5358432477730207174, 1, 9, 9, 9, 9, 137
test.sbtest4 : 5358432477730206407, 0, 8, 8, 8, 8, 136
test.sbtest5 : 5358432477730206664, 7, 7, 7, 7, 7, 135
test.sbtest6 : 5358432477730205897, 6, 6, 6, 6, 6, 134
test.sbtest7 : 5358432477730206154, 5, 5, 5, 5, 5, 133
test.sbtest8 : 5358432477730197195, 4, 4, 4, 4, 4, 132
test.sbtest0 : 5358432477730207427, 4, 12, 12, 12, 12, 140
test.sbtest9 : 5358432477730197452, 3, 3, 3, 3, 3, 131
[19 Feb 2013 21:22]
Mark Callaghan
Updated test program:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;
typedef unsigned long ulint;

#define UT_HASH_RANDOM_MASK 1463735687
#define UT_HASH_RANDOM_MASK2 1653893711

ulint
ut_hash_ulint(
/*==========*/
        ulint key,        /*!< in: value to be hashed */
        ulint table_size) /*!< in: hash table size */
{
        key = key ^ UT_HASH_RANDOM_MASK2;
        return(key % table_size);
}

ulint
ut_fold_ulint_pair(
/*===============*/
        ulint n1, /*!< in: ulint */
        ulint n2) /*!< in: ulint */
{
        return(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1) ^ UT_HASH_RANDOM_MASK) + n2);
}

ulint
ut_fold_string(
/*===========*/
        const char* str) /*!< in: null-terminated string */
{
        ulint fold = 0;

        while (*str != '\0') {
                fold = ut_fold_ulint_pair(fold, (ulint)(*str));
                str++;
        }

        return(fold);
}

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;
  key+= len;
  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = {
  "test.sbtest1", "test.sbtest2", "test.sbtest3", "test.sbtest4",
  "test.sbtest5", "test.sbtest6", "test.sbtest7", "test.sbtest8",
  "test.sbtest0", "test.sbtest9"
};

int main() {
  int x;

  printf("MySQL hash\n");
  for (x=0; x < 10; ++x) {
    ulong nr1 = 1;
    ulong nr2 = 4;
    my_hash_sort_bin(s[x], strlen(s[x]), &nr1, &nr2);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n", s[x], nr1,
           nr1 % 8, nr1 % 16, nr1 % 32, nr1 % 64, nr1 % 128, nr1 % 256);
  }

  printf("\nInnoDB hash\n");
  for (x=0; x < 10; ++x) {
    ulint k1 = ut_fold_string(s[x]);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n", s[x], k1,
           ut_hash_ulint(k1, 8), ut_hash_ulint(k1, 16), ut_hash_ulint(k1, 32),
           ut_hash_ulint(k1, 64), ut_hash_ulint(k1, 128), ut_hash_ulint(k1, 256));
  }
  return 0;
}
[20 Feb 2013 2:29]
Mark Callaghan
Wow, as you write, it is much worse using \002 and \001.

$ gcc -g -o hash hash.c; ./hash
MySQL hash
sbtest1 : 5004758646725041816, 0, 8, 24, 24, 24, 152
sbtest2 : 5004758646725058136, 0, 8, 24, 24, 88, 88
sbtest3 : 5004758646725074456, 0, 8, 24, 24, 24, 24
sbtest4 : 5004758646725091288, 0, 8, 24, 24, 88, 216
sbtest5 : 5004758646725107608, 0, 8, 24, 24, 24, 152
sbtest6 : 5004758646725123928, 0, 8, 24, 24, 88, 88
sbtest7 : 5004758646725140248, 0, 8, 24, 24, 24, 24
sbtest8 : 5004758646725156056, 0, 8, 24, 24, 88, 216
sbtest0 : 5004758646725025496, 0, 8, 24, 24, 88, 216
sbtest9 : 5004758646725172376, 0, 8, 24, 24, 24, 152

InnoDB hash
sbtest1 : 7632798584130221980, 3, 3, 19, 19, 83, 211
sbtest2 : 7632798583777965725, 2, 2, 18, 18, 82, 210
sbtest3 : 7632798583761515934, 1, 1, 17, 17, 81, 209
sbtest4 : 7632798583394466975, 0, 0, 16, 16, 80, 208
sbtest5 : 7632798584118021024, 7, 15, 15, 47, 111, 239
sbtest6 : 7632798583832611489, 6, 14, 14, 46, 110, 238
sbtest7 : 7632798584123513250, 5, 13, 13, 45, 109, 237
sbtest8 : 7632798583194443939, 4, 12, 12, 44, 108, 236
sbtest0 : 7632798584147183771, 4, 4, 20, 20, 84, 212
sbtest9 : 7632798583379062692, 3, 11, 11, 43, 107, 235

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef unsigned long ulong;
typedef unsigned char uchar;
typedef unsigned long ulint;

#define UT_HASH_RANDOM_MASK 1463735687
#define UT_HASH_RANDOM_MASK2 1653893711

ulint
ut_hash_ulint(
/*==========*/
        ulint key,        /*!< in: value to be hashed */
        ulint table_size) /*!< in: hash table size */
{
        key = key ^ UT_HASH_RANDOM_MASK2;
        return(key % table_size);
}

ulint
ut_fold_ulint_pair(
/*===============*/
        ulint n1, /*!< in: ulint */
        ulint n2) /*!< in: ulint */
{
        return(((((n1 ^ n2 ^ UT_HASH_RANDOM_MASK2) << 8) + n1) ^ UT_HASH_RANDOM_MASK) + n2);
}

ulint
ut_fold_string(
/*===========*/
        const char* str,
        size_t len) /*!< in: null-terminated string */
{
        ulint fold = 0;
        const char *end = str + len;

        while (str < end) {
                fold = ut_fold_ulint_pair(fold, (ulint)(*str));
                str++;
        }

        return(fold);
}

void my_hash_sort_bin(const uchar *key, size_t len, ulong *nr1, ulong *nr2)
{
  const uchar *pos = key;
  key+= len;
  for (; pos < (uchar*) key ; pos++)
  {
    nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * ((uint)*pos)) + (nr1[0] << 8);
    nr2[0]+=3;
  }
}

char const * s[] = {
  "sbtest1", "sbtest2", "sbtest3", "sbtest4", "sbtest5",
  "sbtest6", "sbtest7", "sbtest8", "sbtest0", "sbtest9"
};

int main() {
  int x;

  printf("MySQL hash\n");
  for (x=0; x < 10; ++x) {
    ulong nr1 = 1;
    ulong nr2 = 4;
    char sbuf[32];

    sbuf[0] = '\002';
    strcpy(sbuf+1, "test");
    strcpy(sbuf+1+strlen("test")+1, s[x]);
    my_hash_sort_bin(sbuf, strlen(s[x])+strlen("test")+3, &nr1, &nr2);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n", s[x], nr1,
           nr1 % 8, nr1 % 16, nr1 % 32, nr1 % 64, nr1 % 128, nr1 % 256);
  }

  printf("\nInnoDB hash\n");
  for (x=0; x < 10; ++x) {
    char sbuf[32];
    ulint k1;

    sbuf[0] = '\002';
    strcpy(sbuf+1, "test");
    strcpy(sbuf+1+strlen("test"), s[x]);
    k1 = ut_fold_string(sbuf, strlen(s[x])+strlen("test")+3);
    printf("%s : %lu, %lu, %lu, %lu, %lu, %lu, %lu\n", s[x], k1,
           ut_hash_ulint(k1, 8), ut_hash_ulint(k1, 16), ut_hash_ulint(k1, 32),
           ut_hash_ulint(k1, 64), ut_hash_ulint(k1, 128), ut_hash_ulint(k1, 256));
  }
  return 0;
}
[20 Feb 2013 21:14]
Mark Callaghan
Dmitry Lenev - I think one or both of the following should be done. What do you think?
* increase the default for metadata_locks_hash_instances
* use a better hash function
[21 Feb 2013 19:12]
Mark Callaghan
The hash function is used in several places. It would be nice to reduce this to a few implementations. Also, the nr2 in/out value is ignored by many callers. I wonder if extra work is wasted for that computation.

ql/password.c: nr^= (((nr & 63)+add)*tmp)+ (nr << 8);
sql/sql_join_buffer.cc: nr^= (ulong) ((((uint) nr & 63)+nr2)*((uint) *pos))+ (nr << 8);
strings/ctype-bin.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
strings/ctype-bin.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
strings/ctype-latin1.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * X) + (nr1[0] << 8);
strings/ctype-latin1.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) * X) + (nr1[0] << 8);
strings/ctype-mb.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
strings/ctype-simple.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
strings/ctype-uca.c: n1[0]^= (((n1[0] & 63)+n2[0])*(s_res >> 8))+ (n1[0] << 8);
strings/ctype-uca.c: n1[0]^= (((n1[0] & 63)+n2[0])*(s_res & 0xFF))+ (n1[0] << 8);
strings/ctype-ucs2.c: n1[0]^= (((n1[0] & 63) + n2[0]) * (wc & 0xFF)) + (n1[0] << 8);
strings/ctype-ucs2.c: n1[0]^= (((n1[0] & 63) + n2[0]) * (wc >> 8)) + (n1[0] << 8);
strings/ctype-ucs2.c: nr1[0]^= (ulong) ((((uint) nr1[0] & 63) + nr2[0]) *
strings/ctype-ucs2.c: n1[0]^= (((n1[0] & 63) + n2[0]) * (ch)) + (n1[0] << 8);
strings/ctype-ucs2.c: n1[0]^= (((n1[0] & 63)+n2[0])*(wc & 0xFF))+ (n1[0] << 8);
strings/ctype-ucs2.c: n1[0]^= (((n1[0] & 63)+n2[0])*(wc >> 8))+ (n1[0] << 8);
strings/ctype-ucs2.c: nr1[0]^=(ulong) ((((uint) nr1[0] & 63)+nr2[0]) *
strings/ctype-utf8.c: n1[0]^= (((n1[0] & 63)+n2[0])*(wc & 0xFF))+ (n1[0] << 8);
strings/ctype-utf8.c: n1[0]^= (((n1[0] & 63)+n2[0])*(wc >> 8))+ (n1[0] << 8);
strings/ctype-utf8.c: n1[0]^= (((n1[0] & 63) + n2[0]) * (ch)) + (n1[0] << 8);
[21 Feb 2013 20:54]
Dmitry Lenev
Hello Mark! I guess more digging and analysis should be done before we can understand what is the best way to address this problem. And yes, both options you have mentioned are among the obvious candidates.

Re several implementations of hash functions: please keep in mind that the hashing function needs to take into account the equivalence rules of the collation/charset. Hence the several similar, but still slightly different, implementations in the strings/ library. OTOH I agree that there is room for improvement there.
[21 Feb 2013 22:51]
Mark Callaghan
QPS results for cached & read-only sysbench. Both used metadata_locks_hash_instances=8. The first result used the InnoDB hash algorithm for my_hash_sort_bin.

8 16 32 64 128 256 concurrent clients
62015 113900 180457 216948 218628 218541 innodb hash
56554 106461 165382 188110 191131 190285 current hash
[24 Feb 2013 19:34]
Mark Callaghan
For a popular set of table names and one set of db names I computed the distribution of entries by hash bucket for 2, 4, 8, ..., 256 hash buckets using the current MySQL and InnoDB hash functions. The results also show the stddev of entries per hash bucket. The InnoDB hash does much better, especially at a small number of buckets. The output below is printed for each number of buckets to compare the MySQL and InnoDB hash functions. It includes the number of entries per bucket and the stddev.

2 buckets
mysql hash: 6690 6960 : stddev is 135.8
innodb hash: 6825 6825 : stddev is 15.0

4 buckets
mysql hash: 3510 3805 3180 3155 : stddev is 266.5
innodb hash: 3415 3416 3410 3409 : stddev is 8.1

8 buckets
mysql hash: 1890 1945 1915 1983 1620 1860 1265 1172 : stddev is 300.5
innodb hash: 1694 1705 1714 1719 1721 1711 1696 1690 : stddev is 11.7

16 buckets
mysql hash: 900 964 1032 932 955 900 842 662 990 981 883 1051 665 960 423 510 : stddev is 181.4
innodb hash: 821 825 843 854 873 876 874 875 873 880 871 865 848 835 822 815 : stddev is 22.6

32 buckets
mysql hash: 450 305 397 541 372 563 300 371 480 582 234 541 382 417 213 210 450 659 635 391 583 337 542 291 510 399 649 510 283 543 210 300 : stddev is 133.1
innodb hash: 424 422 406 444 414 461 443 490 449 449 464 434 412 427 431 432 397 403 437 410 459 415 431 385 424 431 407 431 436 408 391 383 : stddev is 23.6

64 buckets
mysql hash: 180 177 244 181 122 92 180 151 120 367 76 274 190 180 120 128 270 168 196 120 273 97 387 140 210 232 400 330 188 300 60 210 270 128 153 360 250 471 120 220 360 215 158 267 192 237 93 82 180 491 439 271 310 240 155 151 300 167 249 180 95 243 150 90 : stddev is 98.5
innodb hash: 225 220 220 228 202 229 226 287 198 232 209 213 192 227 219 210 199 194 218 178 227 214 222 172 194 179 190 212 210 179 201 203 199 202 186 216 212 232 217 203 251 217 255 221 220 200 212 222 198 209 219 232 232 201 209 213 230 252 217 219 226 229 190 180 : stddev is 19.8

128 buckets
mysql hash: 93 87 119 93 60 50 90 76 59 203 44 135 92 90 60 56 138 81 107 65 132 52 202 67 90 108 197 167 91 146 26 108 132 58 75 182 125 242 60 108 178 121 83 122 103 121 47 35 84 241 212 129 165 122 79 77 154 81 121 91 48 122 74 47 87 90 125 88 62 42 90 75 61 164 32 139 98 90 60 72 132 87 89 55 141 45 185 73 120 124 203 163 97 154 34 102 138 70 78 178 125 229 60 112 182 94 75 145 89 116 46 47 96 250 227 142 145 118 76 74 146 86 128 89 47 121 76 43 : stddev is 49.6
innodb hash: 112 105 105 108 118 121 101 152 99 104 116 107 86 118 122 92 104 110 104 84 111 119 140 96 90 108 99 97 98 104 96 120 80 108 100 122 115 135 127 101 127 111 138 126 100 83 123 91 94 96 124 143 116 116 156 130 113 131 124 137 116 126 104 94 113 115 115 120 84 108 125 135 99 128 93 106 106 109 97 118 95 84 114 94 116 95 82 76 104 71 91 115 112 75 105 83 119 94 86 94 97 97 90 102 124 106 117 95 120 117 89 131 104 113 95 89 116 85 53 83 117 121 93 82 110 103 86 86 : stddev is 17.1

256 buckets
mysql hash: 53 41 59 47 28 23 42 46 30 93 22 63 46 47 28 25 73 36 53 30 62 32 101 30 40 53 106 78 48 71 14 51 71 28 37 92 66 131 27 60 89 63 41 57 53 57 28 20 47 114 108 68 78 56 41 36 81 39 54 48 26 64 39 27 46 48 62 41 23 25 46 26 30 78 18 69 53 44 32 40 59 48 41 23 67 20 97 38 64 66 103 79 53 75 16 51 65 29 42 88 60 123 32 50 98 47 39 79 46 56 26 23 56 128 116 69 85 59 36 38 72 38 66 47 25 48 33 17 40 46 60 46 32 27 48 30 29 110 22 72 46 43 32 31 65 45 54 35 70 20 101 37 50 55 91 89 43 75 12 57 61 30 38 90 59 111 33 48 89 58 42 65 50 64 19 15 37 127 104 61 87 66 38 41 73 42 67 43 22 58 35 20 41 42 63 47 39 17 44 49 31 86 14 70 45 46 28 32 73 39 48 32 74 25 88 35 56 58 100 84 44 79 18 51 73 41 36 90 65 106 28 62 84 47 36 66 43 60 20 24 40 122 111 73 60 59 40 36 74 48 62 42 22 73 43 26 : stddev is 25.1
innodb hash: 64 42 60 54 51 38 26 45 38 47 46 52 30 55 48 36 51 55 35 33 43 56 71 40 40 53 47 43 42 41 46 53 35 64 50 47 45 41 36 53 46 36 53 48 43 40 55 58 55 43 63 67 31 53 51 41 45 54 67 73 50 66 46 38 56 49 53 44 39 56 71 46 42 55 37 52 58 54 60 64 54 50 55 74 74 51 27 39 58 33 32 45 52 56 60 55 65 48 42 33 48 38 44 50 65 61 63 61 66 60 49 65 73 67 55 51 49 42 29 45 62 71 57 54 59 58 51 50 48 63 45 54 67 83 75 107 61 57 70 55 56 63 74 56 53 55 69 51 68 63 69 56 50 55 52 54 56 63 50 67 45 44 50 75 70 94 91 48 81 75 85 78 57 43 68 33 39 53 61 76 85 63 105 89 68 77 57 64 66 60 58 56 57 66 62 76 45 52 54 89 57 73 56 54 48 55 37 54 41 34 59 20 42 44 55 37 46 38 59 70 60 19 45 28 54 46 44 61 49 59 46 52 59 45 54 34 54 57 40 66 31 46 40 38 67 43 24 38 55 50 36 28 51 45 35 36 : stddev is 14.0
[25 Feb 2013 16:50]
Dmitry Lenev
Hello MarkC, All! I have created a new bug report to track the problem with MDL hash partitioning, see bug #68487 "MDL hash can still be concurrency bottleneck". Please use this new report to track the progress of our work on it and for reporting additional information/thoughts. The issue with P_S overhead should be tracked through bug #68413 "performance_schema overhead is at least 10%" that you have created. AFAIU these two issues cover the problem with mutex contention in 5.6 as compared to 5.1 fairly well, so I am closing this bug again. Please let me know if there is any aspect of the problem that I have missed.
[7 Jul 2013 11:46]
James Day
Bug #69653 reports that the fix for this bug introduced a regression bug.
[4 Aug 2016 0:38]
James Day
If affected by this please see:
1. http://mysqlserverteam.com/removing-scalability-bottlenecks-in-the-metadata-locking-and-th... describing how the issues were quite thoroughly removed from the 5.7.5 DMR onwards.
2. https://bugs.mysql.com/bug.php?id=68487 for an improvement in the hashing algorithm that should help a lot from 5.6.15 and the 5.7.3 DMR.