Bug #68079 | InnoDB does not scale well on 12 core system for specific all in memory SELECT | | |
---|---|---|---|
Submitted: | 14 Jan 2013 12:01 | Modified: | 16 Jan 2014 13:15 |
Reporter: | Valeriy Kravchuk | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 5.5.29, 5.6.9-rc | OS: | Linux (x86_64) |
Assigned to: | Arnaud Adant | CPU Architecture: | Any |
Tags: | performance, read only, scalability, throughput |
[14 Jan 2013 12:01]
Valeriy Kravchuk
[14 Jan 2013 12:12]
Valeriy Kravchuk
Dump of tables, in case somebody thinks the exact content matters
Attachment: test.sql.gz (application/x-tar-gz, text), 233.32 KiB.
[14 Jan 2013 16:16]
Patrick Casey
For what it's worth, I think it's the join, rather than the group by, that's doing us in. I use a different benchmark than Valeriy, but I'm seeing the same performance cliff even with selects by primary key against the join product.

For this test, we spin 1, 2, 4, 8, 16, 24, 32, and 48 threads. Hardware is a 12-core (physical), 24-core (hyperthreading), 128G (96G buffer pool) box running mysql 5.5.25a. Each thread executes this query (select by PK, followed by a nested loop join) 100X with no waits between them:

select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='0000004c46567472005844a12a205beb'

Total time to execute:

2012-10-01 17:40:09 (542) main [0:00:00.242] Parallel 1
2012-10-01 17:40:09 (857) main [0:00:00.225] Parallel 2   <--- two threads, same wall clock time
2012-10-01 17:40:10 (379) main [0:00:00.434] Parallel 4   <--- four threads on a 24 core box, but already takes twice as long
2012-10-01 17:40:11 (348) main [0:00:00.849] Parallel 8   <--- 8 threads take 4X as long for the workload as a single thread
2012-10-01 17:40:13 (222) main [0:00:01.696] Parallel 16
2012-10-01 17:40:16 (028) main [0:00:02.553] Parallel 24  <--- all cores fully engaged, and we get 1/10 the throughput per core that we did at parallel 1
2012-10-01 17:40:19 (763) main [0:00:03.401] Parallel 32
2012-10-01 17:40:25 (334) main [0:00:05.085] Parallel 48  <--- interestingly we take twice as long as we do at 24 threads, so once we go non-linear like this, total time just grows linearly with thread count

Some more examples:

*** Bind Variables

Same as above, but using bind variables instead of a literal; essentially identical numbers.

select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='?'
BIND = 0000004c46567472005844a12a205beb
100X in tight loop

mysql:
2012-10-02 16:25:25 (819) main [0:00:00.235] Parallel 1
2012-10-02 16:25:26 (118) main [0:00:00.212] Parallel 2
2012-10-02 16:25:26 (645) main [0:00:00.429] Parallel 4
2012-10-02 16:25:27 (622) main [0:00:00.840] Parallel 8
2012-10-02 16:25:29 (506) main [0:00:01.679] Parallel 16
2012-10-02 16:25:32 (324) main [0:00:02.524] Parallel 24
2012-10-02 16:25:36 (077) main [0:00:03.366] Parallel 32
2012-10-02 16:25:41 (683) main [0:00:05.034] Parallel 48

******* 100 Primary keys.
Executed 1X only per thread (without the 100X per thread loop as in teh first case): select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id IN ('6e7de761296660009a1cdbdba7d294b1','1e2a04eb100820004916aea6bc378a09','c2dd9cc87dec7400c14de29bdcf48088','caa82626f97fe000c5ab2ef1817e96ce','4dd138c687943000c14d83bdff434d9e','db6ae9a387807c00c14d83bdff434d35','2055c9cb8710b0005c9ab3aeef434de6','f66e17878794b4003fff83bdff434d5c','22644a20297f2c003fff9456dc7773f3','8ab5646be0f7e400957b9e05e1466395','145498f487a16400957bb3aeef434dd4','e06560d3e0bf2000957b9e05e146635e','e0bcd817e0bb2000957b9e05e14663fe','68d1245fe03f2000957b9e05e146631a','80274b0814d6ac409a1c431fe5411324','6ae835a908ec68009a1c7e7723d878bc','bb826b9d4c60e4003fffad30c2edccba','c8a8fa3b874c3440491683bdff434d3f','9802c5e187386400957bb3aeef434d10','debee39e7969a40049166a127486e654','089262bb7d5a24004916494baeedb4db','be55360484b7a400957b62b4414c0973','2d42966d8755ac409a1cb3aeef434d0d','d5347e1e85ef6c009a1cacb754d428cc','7752a2a78708bc00c14d83bdff434d3d','45a315eb357e20009a1c6c2a055da761','91fa0f5b6dd07000c14dd31166116763','87e5728a87d8f000491683bdff434d23','a3694c3fb9ab60003fff388835944194','31a091378740b040491683bdff434dd6','21dff80887e8b4003fff83bdff434dab','1e3fc33887302800957bb3aeef434d2f','87150598870118003fff83bdff434d9b','e3f2496787403c00c14d83bdff434d4c','3d67017387d7e4489a1cb3aeef434d09','06859c3bc5e7a0009a1cb4a4c8842448','91bbda5b8787ec40491683bdff434d1d','a374b389875c3400b43083bdff434dbd','97962f268744b4003fff83bdff434d48','ae910eab7c7620009a1ce03d7ffe488d','2d464658871ee0803fff83bdff434d2d','56e964a50c27e0009a1c70511c18ac50','9ebea8238738a4009a1cb3aeef434dbd','dafeb8d3878c34005c9ab3aeef434dfe','087942e7c00b6c843fffc43a2f2c8640','a32ec79587d0b000c5ab83bdff434d67','2d03c1028753a0449a1cb3aeef434d1d','fc5c23c7997d240049163de5a8f9945a','227dd8f8878ed000957bb3aeef434d41','419520e4e5334400957b5ef5706ff670','70f17879c1d364043fff71e055206af6','70175dc487e4b400b43083bdff434d9d','8531984b870874005c9ab3aeef434d57','3be662b4ed26a400957b6bc67ca1af4f','120557cf87d874005c9ab3aeef434dc9','0f8ece42875cb000c14d83bdff434d3e','d642dfc4715864409a1c66e63cf7a2a2','3394ef3185816440491668304b730ed3','550f5c4704bf6000b430af5c1b17094e','f3381d318750f400957bb3aeef434d4f','3960c9fc875cb040b43083bdff434d03','14fdf42987cf9c803fff83bdff434d29','db2ced3d8710f4009a1cb3aeef434d87','e22e102a87c07000c14d83bdff434d67','ed81b8b38558b0009a1c222fca18a36e','747dd8358720100049e6feddff434d73','733aa0ccc1e3ac483fff71e055206ab4','90c5eb0187183400957bb3aeef434ddf','a92086d587103000c14d83bdff434d1b','7d51ce70875434403fff83bdff434ddd','c893557c87987440c14d83bdff434d42','072a8972405c64004916fe7f27fe2ffa','889508003408b000c5aba1f0f97f7986','035b571e8414f000957bc168246c99b0','da0e76b68744f000c14d83bdff434d62','ca561b6a87a290003fff83bdff434dfa','230e1a0c84376400957b62b4414c092a','e3ca48d987547400b43083bdff434d13','1ae15281879b2c809a1cb3aeef434db9','311d1bc78784b0009a1cb3aeef434d13','e99c964b7d983400957be671126fb970','d294e86f71141000957bb8353b1e6c9d','89689f5684bd20003fff21253fbef3df','f3e4043240b3e040957b9a98b2d98de2','be8224ea985ca4009a1c12391b8ce90d','829d5938398cf000b4305f461d471aa1','b5579b2987c16c009a1cb3aeef434d74','546f337c8700b8009a1cb3aeef434df1','5594af0a87d51400957bb3aeef434dde','db7f5f0687951400957bb3aeef434d22','659f56d2904b2004957bad7c9b94a550','8d010bd8a96c24009a1c764c84831176','42e2af4bc083e8443fffc43a2f2c863a','2bbe273b0804ec003fffaf9b706ba5eb','63b4330f878cb000491683bdff434dec','07eefa498754f0003fff83bdff434dcd','a542571210b014009
57b793c9605f286','840d9f8f8704b000491683bdff434d91','3502609e091cb400957b695f3707b695','cdab2fd287799000957bb3aeef434d53') mysql: 2012-10-01 17:46:17 (705) main [0:00:00.016] Parallel 1 2012-10-01 17:46:17 (816) main [0:00:00.025] Parallel 2 2012-10-01 17:46:17 (952) main [0:00:00.045] Parallel 4 2012-10-01 17:46:18 (166) main [0:00:00.086] Parallel 8 2012-10-01 17:46:18 (500) main [0:00:00.167] Parallel 16 2012-10-01 17:46:18 (985) main [0:00:00.249] Parallel 24 2012-10-01 17:46:19 (639) main [0:00:00.329] Parallel 32 2012-10-01 17:46:20 (641) main [0:00:00.492] Parallel 48
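
For reference, a harness along these lines reproduces the shape of the test above, though not the exact tool used; the loop counts, credentials, and the fact that each query pays connection setup cost (unlike a persistent-connection harness) are assumptions, so absolute numbers will differ:

#!/bin/bash
# Rough sketch: N client threads, each running the same PK-join query 100 times.
QUERY="select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='0000004c46567472005844a12a205beb'"
for THREADS in 1 2 4 8 16 24 32 48; do
    START=$(date +%s.%N)
    for i in $(seq "$THREADS"); do
        ( for j in $(seq 100); do
              # each iteration opens a fresh connection, so this overstates per-query cost
              mysql -uroot test -N -e "$QUERY" > /dev/null
          done ) &
    done
    wait
    END=$(date +%s.%N)
    echo "Parallel $THREADS: $(echo "$END - $START" | bc) s"
done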
[15 Jan 2013 10:56]
Valeriy Kravchuk
MySQL 5.6.9 scales better (10 threads run the load faster than 8), but still does not scale even to 12 threads for this query:

[valerii.kravchuk@cisco1 mysql-5.6.9-rc-linux-glibc2.5-x86_64]$ bin/mysqlslap -uroot --concurrency=N --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           37.871   37.853   37.965   1000
 2           21.072   17.819   22.401   500
 4           11.222    8.349   12.143   250
 8            7.795    6.865    8.899   125
10            7.481    6.830    8.329   100
12            7.708    7.153    7.988   83
24            9.634    9.184    9.860   41

Waits at this stage:

mysql> show engine innodb mutex;
+--------+-------------------------+-----------------+
| Type   | Name                    | Status          |
+--------+-------------------------+-----------------+
| InnoDB | log0log.cc:737          | os_waits=6      |
| InnoDB | buf0buf.cc:1225         | os_waits=3      |
| InnoDB | combined buf0buf.cc:975 | os_waits=117391 |
| InnoDB | log0log.cc:799          | os_waits=124    |
+--------+-------------------------+-----------------+
4 rows in set (0.00 sec)
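
For reference, a sweep like the above can be scripted along these lines (paths, credentials, and the concurrency list are illustrative assumptions):

#!/bin/bash
# Run the join+group-by query at increasing client counts and keep the mutex waits per step.
Q='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
for C in 1 2 4 8 10 12 24; do
    bin/mysqlslap -uroot --create-schema=test --no-drop \
        --concurrency=$C --number-of-queries=1000 --iterations=10 \
        --query="$Q" | grep 'number of seconds'
    bin/mysql -uroot -e 'show engine innodb mutex' > mutex_c$C.txt
done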
[15 Jan 2013 12:08]
Valeriy Kravchuk
This is what I see in "top waits" using Performance Schema after running the test with 24 threads twice after a clean startup:

mysql> select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 5;
+--------------------------------------------+------------+----------------+
| event_name                                 | count_star | sum_timer_wait |
+--------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/os_mutex           | 71754726   | 11380485535846 |
| wait/synch/mutex/innodb/trx_sys_mutex      | 60057      | 4191727883     |
| wait/synch/mutex/innodb/buf_pool_mutex     | 8655       | 468303388      |
| wait/synch/mutex/innodb/mutex_list_mutex   | 10310      | 427385546      |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 316946749      |
+--------------------------------------------+------------+----------------+
5 rows in set (0.00 sec)

FYI, this was the runtime:

[valerii.kravchuk@cisco1 mysql-5.6.9-rc-linux-glibc2.5-x86_64]$ bin/mysqlslap -uroot --concurrency=24 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 9.781 seconds
        Minimum number of seconds to run all queries: 9.471 seconds
        Maximum number of seconds to run all queries: 9.947 seconds
        Number of clients running queries: 24
        Average number of queries per client: 41

So one can compare the influence of full P_S instrumentation with default timing for this case (average of 9.634 seconds without it): the overhead is about 1.5%. Looks really good.

For the next runs I see an increase only for the first 4 mutexes:

+--------------------------------------------+------------+----------------+
| event_name                                 | count_star | sum_timer_wait |
+--------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/os_mutex           | 107565498  | 17037167192311 |
| wait/synch/mutex/innodb/trx_sys_mutex      | 90057      | 6242580106     |
| wait/synch/mutex/innodb/buf_pool_mutex     | 13194      | 707529424      |
| wait/synch/mutex/innodb/mutex_list_mutex   | 11270      | 483923608      |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 316946749      |
+--------------------------------------------+------------+----------------+
5 rows in set (0.00 sec)
[15 Jan 2013 23:11]
Dimitri KRAVTCHUK
Valeriy, thanks for reporting it, and for a very good test case to reproduce the problem! Initially I supposed the contention was related to the BP instances, but that's not really the case (although the improvement you're seeing in 5.6 is related to it, since in 5.6 InnoDB has more advanced BP locking management: hash tables were added in addition to BP instances). It seems to me the issue comes from concurrent access to the same pages plus having a join. If I change the test conditions a little and leave each user thread to run within its own data set (by restricting it with a WHERE clause), I see much better performance. You may try other variations if you want, but we have a real problem here, and it seems to be pretty old. (I've also observed scalability issues with simple dbSTRESS RO workloads too, but have not yet taken a deep dive into that problem.) More investigation is needed to understand it better.

BTW, the mutex contention you'll see also depends a lot on the system you're using, on the speed of your CPU cores, and on their number. For example, on my 32-core server the top wait at 16 threads is on the btr_search_latch, even with AHI turned off -- so there is really something pretty interesting here ;-)

Rgds, -Dimitri
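
For illustration, a per-thread restriction of the test query could look roughly like this; splitting on the leading character of sys_id is just a hypothetical way to give each thread a disjoint slice of the data:

-- thread 1 works only on sys_id values starting with '0', thread 2 on '1', and so on
select count(*), category
  from task inner join incident on task.sys_id = incident.sys_id
 where task.sys_id like '0%'
 group by incident.category;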
[16 Jan 2013 17:43]
Valeriy Kravchuk
Thank you for paying attention to this report and test case, Dimitri!

Surely os_mutex waits can be removed from the top by tuning innodb_sync_spin_loops, to get something like:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/trx_sys_mutex      | 30186      | 4.5602 |
| wait/synch/mutex/innodb/os_mutex           | 51546      | 2.3394 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 4269       | 0.5441 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8878       | 0.3815 |
| wait/synch/mutex/innodb/dict_sys_mutex     | 133        | 0.3352 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 0.3198 |
| wait/synch/mutex/innodb/fil_system_mutex   | 2119       | 0.1864 |
| wait/synch/mutex/innodb/buf_dblwr_mutex    | 1099       | 0.1318 |
| wait/synch/mutex/innodb/log_sys_mutex      | 1008       | 0.0964 |
| wait/synch/mutex/innodb/flush_list_mutex   | 311        | 0.0437 |
+--------------------------------------------+------------+--------+
10 rows in set (0.00 sec)

and comparing the above to the result just after starting up and enabling consumers:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/os_mutex           | 50228      | 2.1837 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8300       | 0.3116 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8379       | 0.3054 |
| wait/synch/mutex/innodb/fil_system_mutex   | 1241       | 0.0534 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 818        | 0.0333 |
| wait/synch/mutex/innodb/log_sys_mutex      | 261        | 0.0140 |
| wait/synch/mutex/innodb/rseg_mutex         | 273        | 0.0130 |
| wait/synch/mutex/innodb/recv_sys_mutex     | 185        | 0.0073 |
| wait/synch/mutex/innodb/trx_mutex          | 183        | 0.0067 |
| wait/synch/mutex/innodb/flush_list_mutex   | 25         | 0.0053 |
+--------------------------------------------+------------+--------+
10 rows in set (0.00 sec)

I'd say on my system it is mostly about wait/synch/mutex/innodb/trx_sys_mutex (why?) and somewhat about wait/synch/mutex/innodb/buf_pool_mutex (the latter because of access to the same blocks).

FYI, my older testing attempts showed that using several (up to 12) buffer pool instances on 5.5.x does not help to improve scalability.

I am still a beginner in real-life P_S usage, so my usage and interpretation may be wrong. If somebody wants me to run a specific P_S query after testing, or to apply other non-default settings for instruments and/or consumers or anything else in P_S, please tell me.

Thank you for accepting this report as a real problem.
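
For reference, with the server started with performance_schema enabled, the synch wait instrumentation used in these queries can be switched on and reset between runs roughly like this (the global_instrumentation consumer is on by default, so nothing else should be needed; treat this as a sketch of the standard P_S setup, not the exact commands used here):

mysql> update performance_schema.setup_instruments
    ->    set enabled = 'YES', timed = 'YES'
    ->  where name like 'wait/synch/%';

-- reset the counters before each measured run so results are comparable
mysql> truncate table performance_schema.events_waits_summary_global_by_event_name;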
[16 Jan 2013 19:13]
Arnaud Adant
OK, this bug is now verified as described. On a 4-core test machine with 5.5.29, the test does not even scale to 2 connections.

mysql> select EVENT_NAME as nm, COUNT_STAR as cnt, SUM_TIMER_WAIT as tm
    -> from performance_schema.events_waits_summary_global_by_event_name
    -> where COUNT_STAR > 0 order by 3 desc limit 20;
+-----------------------------------------------+-----------+----------------+
| nm                                            | cnt       | tm             |
+-----------------------------------------------+-----------+----------------+
| wait/synch/mutex/innodb/buf_pool_mutex        | 165112346 | 42207640363152 |
| wait/synch/rwlock/innodb/index_tree_rw_lock   | 81991326  | 22057201473246 |
| wait/synch/rwlock/innodb/btr_search_latch     | 93372604  | 20577738333006 |
| wait/synch/mutex/innodb/kernel_mutex          | 4087505   | 2659589263404  |

That is already 42s + 20s + 20s for mutexes and rwlocks. InnoDB buffer pool instances had a positive effect, although they do not solve the scalability issue. It has been well known for a long time that MySQL does not scale very well if the queries examine a lot of rows.
[17 Jan 2013 18:59]
Patrick Casey
Does anyone have an opinion as to whether or not the underlying OS is going to matter here? Are we spinning on these things long enough to fall back to kernel-level mutexes, or is this all internal to the InnoDB code? The reason I ask is that I can probably (with some difficulty) re-run these tests with MySQL on SunOS, but it would be disruptive here, so I'd only want to do it if folks thought there was a reasonable chance we'd see an improvement.
[17 Jan 2013 19:27]
Arnaud Adant
Hi Patrick,

The contention is not OS related; it comes from InnoDB. So you'll reach the limit even with powerful hardware. A higher CPU frequency might make it go faster, but it won't make it scale. It might be easier to tune your queries: 16k rows examined per statement is really something that does not scale for now. If you cut it by 100, you'll scale better.

Best regards,
[18 Jan 2013 18:13]
Patrick Casey
Unfortunately, that's not really practical in the general case, since these are customer-generated queries. We're moving to shard our servers down more aggressively so that we have more parallel instances of mysql running, which will mitigate the problem, but sharding doesn't scale to infinity and we're going to have an ongoing hot-shard problem.

I have to say, I'm sort of shocked that this isn't a more commonly experienced bottleneck in the field. Now that we know what to look for, we're seeing pretty strong evidence of what we're calling the "10 core problem" on multiple instances in our farm. Is our query pattern really that unusual? Or are we unusual in that we're deploying innodb on bigger iron than most folks, so throughput capping at 10/24 cores is a lot more noticeable than throughput capping at, say, 10/12 cores?
[20 Jan 2013 19:20]
Mark Callaghan
Patrick - performance tuning expertise for MySQL + SunOS is less common than for MySQL + Linux. InnoDB implements a value-added mutex on top of pthread_mutex, and a value-added rw-lock on top of pthread_mutex and pthread_cond_t. The rw-lock can be very efficient, depending on the platform (it works well on Linux). A good pthread library can be very important here. With modern Linux we benefit from the use of futexes, so that many operations don't require a system call. Note that the value-added InnoDB code has its own busy-wait loop where it will spin/pause/spin/pause for a while before going to sleep.

Things you can do (example settings below):
* tune the spin loops -- change innodb_spin_wait_delay, innodb_sync_spin_loops
* limit concurrency in InnoDB -- reduce innodb_thread_concurrency to some number less than the count of CPU cores, increase innodb_concurrency_tickets
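
All four variables are dynamic, so they can be tried online; the values below are illustrative only, not recommendations:

set global innodb_spin_wait_delay     = 24;   -- default 6
set global innodb_sync_spin_loops     = 100;  -- default 30
set global innodb_thread_concurrency  = 16;   -- default 0 = unlimited
set global innodb_concurrency_tickets = 5000; -- default 500 in 5.5, 5000 in 5.6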
[28 Jan 2013 16:13]
Patrick Casey
Mark,

I thought the way the InnoDB mutex worked is that it spun on the lock for a configurable number of spins, then fell through into an OS mutex if the spin count was exceeded. Looking at the wait states here, I'm seeing the wait time dominated by os mutex waits (or at least that's what I assume that counter is tracking).

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/os_mutex           | 50228      | 2.1837 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8300       | 0.3116 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8379       | 0.3054 |
| wait/synch/mutex/innodb/fil_system_mutex   | 1241       | 0.0534 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 818        | 0.0333 |
| wait/synch/mutex/innodb/log_sys_mutex      | 261        | 0.0140 |
| wait/synch/mutex/innodb/rseg_mutex         | 273        | 0.0130 |
| wait/synch/mutex/innodb/recv_sys_mutex     | 185        | 0.0073 |
| wait/synch/mutex/innodb/trx_mutex          | 183        | 0.0067 |
| wait/synch/mutex/innodb/flush_list_mutex   | 25         | 0.0053 |
+--------------------------------------------+------------+--------+

We've tried tuning the spin counts and delays every which way to see if we could move the needle on this thing, but saw no material difference in throughput, so I suspect we're just moving the problem around, e.g. we are probably just shifting the wait off the OS mutex and into the spin loop, but either way we're still waiting :(.

What I'm not following here, though (probably because I don't have a great mental map of the guts of the InnoDB memory subsystem), is why there's an exclusive lock at work here at all. This is a read-mostly workload in production and a read-only workload in the benchmark. I can't see why two threads cannot access the same page in the buffer pool concurrently (at least conceptually). Are we taking a page-level exclusive lock?
[28 Jan 2013 17:07]
Mark Callaghan
Filed http://bugs.mysql.com/bug.php?id=68207 to ask for better granularity in reporting for "os_mutex". That is the top source of contention. In your case I think the problem comes from the os_mutex that guards the sync array -- which is where threads go to sleep. If you haven't tried it, then setting "innodb_thread_concurrency" to limit the number of threads that can be in InnoDB might help, but only if you aren't IO bound. For some workloads where many rows are read per query, raising innodb_concurrency_tickets can help. InnoDB has mutexes protecting various subsystems, just like all the other server code I read. It still has lots of things to make better. The sync array is one of them. One problem in that area is http://bugs.mysql.com/bug.php?id=52806
[28 Jan 2013 17:25]
Dimitri KRAVTCHUK
Mark, Patrick,

I'm looking closely at this issue and will keep you updated. For the moment I think the problem comes from the block locking (and maybe something related to the JOIN). To be honest, it's pretty strange to see it, but this is the reality and we're facing it :-) I was also able to reproduce the problem on Sysbench data (it seems to me that once we get it fixed, we'll add this test to our QA runs to be sure we never hit this regression again). I can confirm that this problem is long-standing, and it's really strange that nobody reported it before (that's why I'm also so surprised). But before we can fix it, we have to find where exactly the issue is coming from. I'm digging, so stay tuned ;-)

Rgds, -Dimitri
[29 Jan 2013 22:01]
Dimitri KRAVTCHUK
Patrick, Valeriy,

It was unclear to me why you were seeing the "os_mutex" waits in the top position and I was not -- but in fact you both used a too-restrictive name pattern in the PFS query:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;

Try instead:

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;

which selects all "synch" wait events; then you should see the RW-locks or others in the top position (I hope so). In my case, on one of the tests, it's:

-- before each test, remember to reset the data
mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ;

-- top-10 wait synch
mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------+------------+--------------+
| event_name                                  | count_star | time         |
+---------------------------------------------+------------+--------------+
| wait/synch/rwlock/innodb/btr_search_latch   | 27065635   | 2526009.2864 |
| wait/synch/rwlock/innodb/hash table locks   | 43618714   | 1302668.4231 |
| wait/synch/mutex/innodb/os_mutex            | 12904603   | 1239325.2118 |
| wait/synch/rwlock/innodb/index_tree_rw_lock | 14451920   | 632423.6294  |
| wait/synch/mutex/sql/THD::LOCK_thd_data     | 2785412    | 18428.1909   |
| wait/synch/rwlock/sql/MDL_lock::rwlock      | 551543     | 13842.7159   |
| wait/synch/mutex/mysys/THR_LOCK::mutex      | 551554     | 9266.0430    |
| wait/synch/mutex/sql/MDL_map::mutex         | 275889     | 6979.9721    |
| wait/synch/mutex/sql/LOCK_table_cache       | 551557     | 6443.5962    |
| wait/synch/mutex/innodb/trx_sys_mutex       | 413552     | 3589.6080    |
+---------------------------------------------+------------+--------------+
10 rows in set (0.01 sec)

I'm curious about your output.

Rgds, -Dimitri
[30 Jan 2013 7:51]
Valeriy Kravchuk
That 12 core box is busy with other benchmarks now, so I've tested on my old QuadCore with 4 threads (it does not scale well to 4 concurrent queries of this kind, based on my older tests). This is what I've got on 5.6.9-rc: [openxs@chief 5.6]$ bin/mysqlslap -uroot --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category' Benchmark Average number of seconds to run all queries: 18.372 seconds Minimum number of seconds to run all queries: 18.108 seconds Maximum number of seconds to run all queries: 18.847 seconds Number of clients running queries: 4 Average number of queries per client: 250 [openxs@chief 5.6]$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 44 Server version: 5.6.9-rc MySQL Community Server (GPL) Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select event_name, count_star, sum_timer_wait/10000000 time from -> performance_schema.events_waits_summary_global_by_event_name where event_name -> like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10; +----------------------------------------------+------------+--------------+ | event_name | count_star | time | +----------------------------------------------+------------+--------------+ | wait/synch/rwlock/innodb/hash table locks | 165128221 | 5412925.4088 | | wait/synch/rwlock/innodb/btr_search_latch | 92493290 | 2755134.4885 | | wait/synch/rwlock/innodb/index_tree_rw_lock | 78284326 | 1696840.0731 | | wait/synch/rwlock/innodb/checkpoint_lock | 60 | 165570.9816 | | wait/synch/cond/sql/COND_server_started | 1 | 105655.9263 | | wait/synch/mutex/innodb/buf_pool_mutex | 15987 | 48932.7542 | | wait/synch/mutex/innodb/os_mutex | 109107 | 9889.8180 | | wait/synch/rwlock/innodb/dict_operation_lock | 1012 | 7411.5460 | | wait/synch/mutex/innodb/log_sys_mutex | 50692 | 2883.7936 | | wait/synch/rwlock/sql/MDL_lock::rwlock | 40211 | 1420.8270 | +----------------------------------------------+------------+--------------+ 10 rows in set (0.01 sec)
[30 Jan 2013 9:11]
Dimitri KRAVTCHUK
Valeriy, could you truncate (reset) the PFS data before your test? Just do:

mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ;

because some events in your output are not related to RO workloads. But I think it will confirm what I'm observing too (on small servers the hash table waits are in the top position, on big ones it's btr_search_latch). BTW, do you have AHI enabled or disabled in your InnoDB instance? (innodb_adaptive_hash_index=1/0)

Rgds, -Dimitri
[31 Jan 2013 13:40]
Valeriy Kravchuk
Sorry, it seems I forgot to truncate after the initial data load. I start the server with --no-defaults + P_S related settings only, so adaptive hash indexing is ON:

[openxs@chief 5.6]$ mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 85
Server version: 5.6.9-rc MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show variables like 'innodb_adaptive_ha%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| innodb_adaptive_hash_index | ON    |
+----------------------------+-------+
1 row in set (0.00 sec)

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------+------------+--------------+
| event_name                                  | count_star | time         |
+---------------------------------------------+------------+--------------+
| wait/synch/rwlock/innodb/hash table locks   | 165049219  | 4315982.3893 |
| wait/synch/rwlock/innodb/btr_search_latch   | 92507461   | 2581370.7551 |
| wait/synch/rwlock/innodb/index_tree_rw_lock | 82236944   | 1601547.3698 |
| wait/synch/rwlock/innodb/checkpoint_lock    | 48         | 98272.5330   |
| wait/synch/mutex/innodb/os_mutex            | 64656      | 13226.1263   |
| wait/synch/mutex/sql/THD::LOCK_thd_data     | 140604     | 1569.4737    |
| wait/synch/mutex/mysys/THR_LOCK::mutex      | 40016      | 815.9076     |
| wait/synch/rwlock/sql/MDL_lock::rwlock      | 40011      | 786.5132     |
| wait/synch/mutex/sql/LOCK_table_cache       | 40008      | 544.1570     |
| wait/synch/mutex/innodb/trx_sys_mutex       | 30082      | 371.1979     |
+---------------------------------------------+------------+--------------+
10 rows in set (0.00 sec)
[31 Jan 2013 21:47]
Patrick Casey
This is on a 24 core (ht) box. 2013-01-31 13:42:11 (158) main Parallel 1 Time 0:00:04.669 2013-01-31 13:42:16 (893) main Parallel 2 Time 0:00:05.632 2013-01-31 13:42:23 (680) main Parallel 4 Time 0:00:06.715 2013-01-31 13:42:30 (911) main Parallel 8 Time 0:00:07.121 2013-01-31 13:42:40 (703) main Parallel 12 Time 0:00:09.636 2013-01-31 13:42:53 (402) main Parallel 16 Time 0:00:12.504 2013-01-31 13:43:09 (561) main Parallel 24 Time 0:00:15.974 2013-01-31 13:43:31 (709) main Parallel 32 Time 0:00:21.923 2013-01-31 13:44:05 (197) main Parallel 48 Time 0:00:32.773 Main wait state looks to be btr_search_latch mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ; Query OK, 0 rows affected, 1 warning (0.02 sec) mysql> use hi02; Database changed mysql> exit Bye [perf-db03.lab3:/]$vmstat 2 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 312 347688 1012580 19995148 0 0 7 12 0 0 0 0 100 0 0 3 0 312 347996 1012580 19995152 0 0 0 20 1150 429 8 0 92 0 0 2 0 312 348000 1012580 19995152 0 0 0 420 1047 346 8 0 92 0 0 4 0 312 348500 1012580 19995152 0 0 0 10 1098 403 14 0 86 0 0 4 0 312 348512 1012580 19995156 0 0 0 20 1068 370 17 0 83 0 0 4 0 312 348560 1012580 19995156 0 0 0 10 1023 312 17 0 83 0 0 8 0 312 348532 1012580 19995160 0 0 0 20 1176 511 20 0 80 0 0 <snip> [perf-db03.lab3:/]$mysql Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 207205 Server version: 5.5.25a-log MySQL Community Server (GPL) Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select event_name, count_star, sum_timer_wait/10000000 time from -> performance_schema.events_waits_summary_global_by_event_name where event_name -> like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10; +-----------------------------------------------+------------+---------------+ | event_name | count_star | time | +-----------------------------------------------+------------+---------------+ | wait/synch/rwlock/innodb/btr_search_latch | 170108654 | 50911891.9512 | | wait/synch/mutex/innodb/buf_pool_mutex | 10192868 | 23605561.2940 | | wait/synch/mutex/innodb/log_flush_order_mutex | 7554710 | 880044.6446 | | wait/synch/rwlock/innodb/index_tree_rw_lock | 883949 | 53036.0507 | | wait/synch/rwlock/innodb/checkpoint_lock | 104 | 37912.8009 | | wait/synch/mutex/innodb/fil_system_mutex | 8726 | 22322.1135 | | wait/synch/rwlock/sql/MDL_lock::rwlock | 249863 | 1749.2439 | | wait/synch/mutex/sql/MDL_map::mutex | 247220 | 1602.0584 | | wait/synch/mutex/sql/THD::LOCK_thd_data | 42815 | 739.8814 | | wait/synch/mutex/sql/LOCK_thread_count | 660 | 451.1033 | +-----------------------------------------------+------------+---------------+ 10 rows in set (0.02 sec)
[31 Jan 2013 21:55]
Patrick Casey
Previous run had adaptive hash index ON. This run has adaptive hash index OFF.

mysql> set global innodb_adaptive_hash_index=OFF;
Query OK, 0 rows affected (0.33 sec)

mysql> show global variables like 'innodb_ad%';
+---------------------------------+---------+
| Variable_name                   | Value   |
+---------------------------------+---------+
| innodb_adaptive_flushing        | ON      |
| innodb_adaptive_hash_index      | OFF     |
| innodb_additional_mem_pool_size | 2097152 |
+---------------------------------+---------+
3 rows in set (0.00 sec)

Performance is much, much worse.

2013-01-31 13:48:55 (249) main Parallel 1 Time 0:00:07.614
2013-01-31 13:49:06 (206) main Parallel 2 Time 0:00:10.857
2013-01-31 13:49:18 (658) main Parallel 4 Time 0:00:12.381
2013-01-31 13:49:37 (999) main Parallel 8 Time 0:00:19.227
2013-01-31 13:50:08 (757) main Parallel 12 Time 0:00:30.609
2013-01-31 13:50:50 (434) main Parallel 16 Time 0:00:41.139
2013-01-31 13:51:55 (855) main Parallel 24 Time 0:01:05.232
(I cancelled the 32 and 48 thread runs)

Wait states are different. It looks like we just shifted the problem onto the buffer pool mutex.

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------------+------------+----------------+
| event_name                                        | count_star | time           |
+---------------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/buf_pool_mutex            | 413972344  | 364754611.9145 |
| wait/synch/rwlock/innodb/index_tree_rw_lock       | 140104101  | 1447339.8741   |
| wait/synch/mutex/innodb/log_flush_order_mutex     | 9600076    | 89843.1713     |
| wait/synch/rwlock/innodb/btr_search_latch         | 68         | 31865.4732     |
| wait/synch/rwlock/innodb/checkpoint_lock          | 68         | 5376.1065      |
| wait/synch/mutex/sql/MDL_map::mutex               | 182725     | 3311.1823      |
| wait/synch/rwlock/sql/MDL_lock::rwlock            | 184404     | 1149.3740      |
| wait/synch/mutex/sql/THD::LOCK_thd_data           | 17225      | 244.7680       |
| wait/synch/mutex/innodb/kernel_mutex              | 7704       | 194.3324       |
| wait/synch/mutex/sql/LOCK_global_system_variables | 9085       | 75.6374        |
+---------------------------------------------------+------------+----------------+
10 rows in set (0.00 sec)
[31 Jan 2013 21:57]
Patrick Casey
For what its worth I'm running with 4 buffer pool instances. mysql> show variables like 'innodb_buf%'; +------------------------------+--------------+ | Variable_name | Value | +------------------------------+--------------+ | innodb_buffer_pool_instances | 4 | | innodb_buffer_pool_size | 103079215104 | +------------------------------+--------------+ 2 rows in set (0.00 sec) mysql> show engine innodb mutex; +--------+------------------------+------------------+ | Type | Name | Status | +--------+------------------------+------------------+ | InnoDB | trx0rseg.c:210 | os_waits=1 | | InnoDB | ibuf0ibuf.c:537 | os_waits=5 | | InnoDB | ibuf0ibuf.c:534 | os_waits=7 | | InnoDB | dict0dict.c:698 | os_waits=63 | | InnoDB | trx0sys.c:196 | os_waits=454637 | | InnoDB | log0log.c:775 | os_waits=14428 | | InnoDB | log0log.c:771 | os_waits=113906 | | InnoDB | buf0buf.c:1201 | os_waits=76 | | InnoDB | buf0buf.c:1166 | os_waits=124819 | | InnoDB | buf0buf.c:1201 | os_waits=74 | | InnoDB | buf0buf.c:1166 | os_waits=4566148 | | InnoDB | buf0buf.c:1201 | os_waits=78 | | InnoDB | buf0buf.c:1166 | os_waits=159743 | | InnoDB | buf0buf.c:1201 | os_waits=44 | | InnoDB | buf0buf.c:1166 | os_waits=91288 | | InnoDB | fil0fil.c:1608 | os_waits=62117 | | InnoDB | srv0srv.c:1018 | os_waits=20542 | | InnoDB | combined buf0buf.c:915 | os_waits=1751 | | InnoDB | dict0dict.c:1725 | os_waits=15 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=7 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=11 | | InnoDB | dict0dict.c:1725 | os_waits=15 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=3 | | InnoDB | dict0dict.c:1725 | os_waits=5 | | InnoDB | dict0dict.c:1725 | os_waits=28 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=10 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=4 | | InnoDB | dict0dict.c:721 | os_waits=52 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=2 | | InnoDB | dict0dict.c:721 | os_waits=6 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=2 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:711 | os_waits=33 | | InnoDB | fil0fil.c:1300 | os_waits=170 | | InnoDB | log0log.c:832 | os_waits=521350 | | InnoDB | btr0sea.c:178 | os_waits=1310965 | | InnoDB | combined buf0buf.c:916 | os_waits=124407 | +--------+------------------------+------------------+ 65 rows in set (0.88 sec)
[31 Jan 2013 22:23]
Dimitri KRAVTCHUK
Patrick,

It's normal that you get better performance with AHI=on; that's expected. In your case you should also increase your number of BP instances (from 4 to, say, 8 or 16) -- you should then see smaller differences between the BP mutex waits in the innodb mutex status output. For the rest: we're working on it. As I've mentioned, the problem is long-standing and not just related to MySQL 5.6. Sure, it's a pity it was not fixed before, but it seems nobody reported it until now either. But we'll fix it for sure, I don't see any other option :-) So please be patient with us now, as it will not be simple. I'll get back to you here once we have any news.

Rgds, -Dimitri
[31 Jan 2013 23:16]
Patrick Casey
No worries on the timing; we're clearly very interested in seeing this resolved, but we're also engineers, so I don't think any of us have the illusion/expectation that this is going to be either quick or easy. One thing I would offer is that I do have access to a fairly robust set of test equipment, so if there are specific tests/scenarios you want to run, I'd be more than happy to try to either A) run them for you or B) get you access to some lab equipment if you want to try out something different.
[1 Feb 2013 8:07]
Dimitri KRAVTCHUK
Patrick, thanks for the offer! :-) Send me a ping at my @oracle.com email or one of the mail addresses from my site (http://dimitrik.free.fr/blog) - I'll then come back to you if we need access to additional HW.

Rgds, -Dimitri
[1 Feb 2013 17:36]
Patrick Casey
I pinged you on your gmail account. Let me know how we can help.
[7 Jun 2013 15:29]
Dimitri KRAVTCHUK
Hi Patrick,

I don't know if it will be too much to ask, but I need this info for further investigation ;-) (and sorry for the long silence, but the issue is not simple either). What I need from you is to replay the same tests on MySQL 5.6 (get the latest 5.6.12 version now) with:

- performance_schema = OFF
- innodb_buffer_pool_instances = 32
- innodb_spin_wait_delay = 6 (default), 12, 24, 48
- for every test iteration I'll need the "show engine innodb mutex" output

There will be a lot of data, but I think you can easily script it (a sketch follows below); "innodb_spin_wait_delay" is dynamic, so you can change it online via "set global innodb_spin_wait_delay = N". Sorry again ;-)

Rgds, -Dimitri
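
A minimal sketch of such a script (output file names, credentials, and the thread list are assumptions):

#!/bin/bash
# For each spin delay, sweep the thread counts and capture the mutex waits per run.
Q='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
for DELAY in 6 12 24 48; do
    mysql -uroot -e "set global innodb_spin_wait_delay = $DELAY"
    for C in 1 2 4 8 16 24 32 48; do
        mysqlslap -uroot --create-schema=test --no-drop \
            --concurrency=$C --number-of-queries=1000 --iterations=10 \
            --query="$Q" > slap_delay${DELAY}_c${C}.txt
        mysql -uroot -e 'show engine innodb mutex' > mutex_delay${DELAY}_c${C}.txt
    done
done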
[10 Jul 2013 20:02]
Robert Sacca
This is really killing us at GE! How can we get traction on this?

Robert Sacca
Senior Lead, Operational Database Technologies, GE
robert.sacca@ge.com
[15 Jul 2013 20:30]
Tim Yim
Hi Dimitri, I work for Pat Casey and I'll be re-running these tests today/tomorrow. I'll send you the results soon. Tim Yim
[16 Jul 2013 17:22]
Arnaud Adant
Hi Robert,

> This is really killing us at GE!
> How can we get traction on this?

You can contact Oracle Support if you have a valid contract. At least, we can see if you run into this problem and whether there are workarounds.

Hi Tim,

> I work for Pat Casey and I'll be re-running these tests today/tomorrow.
> I'll send you the results soon.

Thanks!

Best regards,
Arnaud Adant
[18 Jul 2013 19:58]
Tim Yim
Hi Dimitri,

I scripted the tests so anyone can run them. The files and full results are attached, including the tuning parameters for the tests. If you have any recommendations for tuning parameters, let me know and I will re-run the tests.

I hope this posts correctly; a table of the results is below (if not, just look for chart.txt in the file attachment). It looks like you've made some progress with higher spin_wait and more threads, but the problem still exists.

            Threads:
Spin:        1       2       4      8      16      24      32      48
6:       19.527  13.941   8.146  9.416  11.790  14.982  14.241  15.867
12:      19.736  14.692   8.277  7.824  10.370  11.063  11.224  10.727
24:      19.832  14.637   8.617  7.304   8.590   9.353   9.882  10.035
48:      18.727  14.456   9.326  7.224   8.486   7.627   8.275   8.366

Tests were run against:
MySQL-client-5.6.12-2.el6.x86_64
MySQL-server-5.6.12-2.el6.x86_64

# Percona Toolkit System Summary Report ######################
        Date | 2013-07-17 18:16:22 UTC (local TZ: PDT -0700)
    Hostname | xxxxxxxxxxxxxxxxxxxxxx
      Uptime | 58 days, 7:15, 1 user, load average: 0.07, 0.10, 2.49
    Platform | Linux
     Release | CentOS release 6.4 (Final)
      Kernel | 2.6.32-358.6.2.el6.x86_64
Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.12
    Compiler | GNU CC version 4.4.6 20120305 (Red Hat 4.4.6-4).
     SELinux | Disabled
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 2, cores = 12, virtual = 24, hyperthreading = yes
      Speeds | 24x2666.671
      Models | 24xIntel(R) Xeon(R) CPU X5650 @ 2.67GHz
      Caches | 24x12288 KB
# Memory #####################################################
       Total | 126.0G
        Free | 104.4G
        Used | physical = 21.6G, swap allocated = 2.0G, swap used = 0.0, virtual = 21.6G
     Buffers | 489.3M
      Caches | 11.4G
       Dirty | 11864 kB
     UsedRSS | 8.1G
  Swappiness | 1
[18 Jul 2013 19:59]
Tim Yim
Files and results from the latest tests.
Attachment: innodb-scale-bug-68079.tar.gz (application/x-gzip, text), 34.34 KiB.
[6 Aug 2013 13:40]
MySQL Verification Team
There are several facets to this bug. Right now, the first of the problems has been identified and a plan for the code changes is drafted. There are a couple of other bottlenecks on which we have to work. The time required for the proper fix of this bug and the amount of changes planned cannot possibly be introduced into a GA version like 5.6. Hence, this bug will be fixed in a version that is still in development. So, this bug will be fixed later, and a target for the fixes is already planned.
[6 Aug 2013 13:45]
Hartmut Holzgraefe
"... can not be possibly introduced into a GA version like 5.6" but the "Target Version" field still says 5.6?
[18 Oct 2013 5:45]
Sunny Bains
Posted by developer:

Can somebody please confirm these results separately? The first run is on mysql-trunk (a.k.a. 5.7.3): it scales up to 32 threads reasonably well; the HW has 48 logical CPUs.

subains@supra01 ~> mysqlslap -uroot --concurrency=N --create-schema=test --no-drop --number-of-queries=100000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'

mysql-trunk (5.7.3):

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           16.856   15.642   18.066   100000
32            1.173    1.024    1.341   3125
64            2.060    2.041    2.077   1562

If this were scaling linearly, the average at 32 threads should be approximately 0.7s. Compared to the original report, unless I've missed something, this seems to solve their problem.

mysql-5.6 - has problems at 32 threads:

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           15.692   15.009   16.533   100000
32            2.446    2.418    2.470   3125
[21 Oct 2013 5:18]
justin beech
As I understand the original issue, mysql ceased to scale at fewer than the number of physical cores for this join. So, for example, unlike other simpler queries which scale to close to 2x the number of cores, going beyond 10 concurrent queries on a machine with 12 cores (or 24 hyperthreaded cores) does not improve throughput. Running 5.7.3 on 48 physical cores might get to 32, but does that mean anything is improved? (Because 48 physical cores >> 24 hyperthreaded cores.) Also, your test run on 5.7.3 -- was it done with no data? iterations = 100,000? Not sure if that means anything, but is running with data any different from running with no data? BTW, I ran the test with 5.7.2 on a 32-core virtual server and throughput stalled after 16 threads.
[21 Oct 2013 6:00]
Sunny Bains
The bottlenecks fixed in 5.6+ are different and not directly related to the underlying problem that affects scalability due to the test in this bug report. This bug has to do with fetching and releasing pages from the buffer pool.
[21 Oct 2013 6:01]
Sunny Bains
Also, my comment from [18 Oct 5:45] can be ignored, I was running the wrong test :-(
[29 Oct 2013 14:55]
MySQL Verification Team
We have performed a very, very detailed analysis of the problem, with many, many measurements and much profiling. Our InnoDB team has also made lots of code changes, but they just shifted the bottleneck from one piece of code to another, without any significant difference. It turns out that the problem is not solvable in the near or even the further future. The only solution is to use a workaround, like query rewriting, use of partitions, rethinking of indexes, and similar.
[30 Oct 2013 2:37]
Sunny Bains
There is some confusion here; there are two issues in this bug report. One is the page mutex and the other is the page re-lock. For this query we acquire it in S mode all the time. If we remove the page mutex overhead, then this S-mode rwlock becomes the next bottleneck. However, this doesn't mean that the original problem is not fixed; it simply means that the fix requires further work that is more involved, that's all. As a reference:

mysqlslap -u root --concurrency=4096 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 29.361 seconds
        Minimum number of seconds to run all queries: 29.312 seconds
        Maximum number of seconds to run all queries: 29.439 seconds
        Number of clients running queries: 4096
        Average number of queries per client: 0

mysqlslap -u root --concurrency=2048 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 14.550 seconds
        Minimum number of seconds to run all queries: 14.311 seconds
        Maximum number of seconds to run all queries: 14.808 seconds
        Number of clients running queries: 2048
        Average number of queries per client: 0

These same numbers can be achieved by using "PARTITION BY KEY() PARTITIONS N". By using partitions you can reduce the contention on the block mutex.
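
For reference, the partitioning workaround would look roughly like this; 8 partitions is an arbitrary example, and KEY() with no column list hashes on the primary key (assumed here to be sys_id on both tables):

alter table task     partition by key() partitions 8;
alter table incident partition by key() partitions 8;

Spreading the hot rows across partitions means threads contend on different index trees and buffer pool blocks instead of the same few.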
[30 Oct 2013 3:38]
Sunny Bains
To demonstrate the difference, this is the same test @2K threads without any changes.

mysqlslap -u root --concurrency=2048 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 105.629 seconds
        Minimum number of seconds to run all queries: 105.494 seconds
        Maximum number of seconds to run all queries: 105.725 seconds
        Number of clients running queries: 2048
        Average number of queries per client: 0
[8 Jan 2014 8:48]
Erlend Dahl
Setting to "verified" since we will discontinue the use of "to be fixed later".
[16 Jan 2014 13:15]
Daniel Price
Fixed as of 5.6.16 and 5.7.4; here's the changelog entry: "Fetching and releasing pages from the buffer pool and tracking the page state is an expensive and complex operation. Prior to the bug fix, this operation was performed using a page mutex. Using a page mutex to track several things is expensive and does not scale well. The bug fix separates fetch and release tracking (in-use state) of a page from page I/O state tracking. Fetch and release is now tracked using atomics where available. For portability, a new "CMAKE" build option, "INNODB_PAGE_ATOMIC_REF_COUNT" (default "ON"), disables atomic page reference counting on platforms where atomics support is not available."
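
As a compile-time option, it is set when configuring the source build; for example (illustrative invocation, adjust the source dir and other cmake options as needed):

cmake . -DINNODB_PAGE_ATOMIC_REF_COUNT=ON    # default: use atomics for page ref counting
cmake . -DINNODB_PAGE_ATOMIC_REF_COUNT=OFF   # fall back to the old mutex-based counting
make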
[16 Jan 2014 15:12]
Daniel Price
The following information has been added to the changelog entry: "When atomic page reference counting is enabled (default), "[Note] InnoDB: Using atomics to ref count buffer pool pages" is printed to the error log at server startup. If atomic page reference counting is disabled, "[Note] InnoDB: Using mutexes to ref count buffer pool pages" is printed instead."
[3 Feb 2014 11:07]
Laurynas Biveinis
5.6$ bzr log -r 5677
------------------------------------------------------------
revno: 5677
committer: Sunny Bains <Sunny.Bains@Oracle.Com>
branch nick: 5.6
timestamp: Tue 2013-12-10 14:30:34 +0530
message:
  Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT

  Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.

  If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.

  Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:31]
Laurynas Biveinis
5.6$ bzr log -r 5684
------------------------------------------------------------
revno: 5684
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2013-12-13 12:52:47 +0900
message:
  The adjustment about UNIV_SUNC_DEBUG is needed for Bug#16249481 fix

  Approved by Sunny Bains (IM)

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:32]
Laurynas Biveinis
5.6$ bzr log -r 5685
------------------------------------------------------------
revno: 5685
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2013-12-13 13:23:34 +0900
message:
  Fix the possibile rare race condition at Bug#16249481 fix.

  Approved by Sunny Bains (IM)

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:45]
Laurynas Biveinis
5.6$ bzr log -r 5692
------------------------------------------------------------
revno: 5692
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Mon 2013-12-16 21:03:59 +0900
message:
  The fix for Bug#16249481 was not enabled for builds.
  "#cmakedefine INNODB_PAGE_ATOMIC_REF_COUNT" is added to config.h.cmake

  Patch made by Sunny Bains (by mail)
  Approved by Yasufumi Kinoshita

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:45]
Laurynas Biveinis
5.6$ bzr log -r 5694
------------------------------------------------------------
revno: 5694
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-17 20:03:26 +0900
message:
  Follow up for Bug#16249481 fix.
  Atomic operation macro for Soralis and Windows added by Bug#16249481 fix were wrong.
  Aligned same for mysql-trunk.

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:46]
Laurynas Biveinis
5.6$ bzr log -r 5695
------------------------------------------------------------
revno: 5695
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-17 20:30:16 +0900
message:
  Follow up for Bug#16249481 fix.
  backport from mysql-trunk

  revno: 7152
  committer: Vasil Dimov <vasil.dimov@oracle.com>
  branch nick: mysql-trunk
  timestamp: Tue 2013-12-17 11:29:47 +0200
  message:
    Followup to sunny.bains@oracle.com-20131210092144-0xfzdmj5vgqlqpq2

    Fix a compilation failure on Win32:
    buf0buf.ic(1016): error C2664: '_InterlockedExchangeAdd' : cannot convert parameter 1 from 'ib_uint32_t *' to 'volatile long *'
    Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast
[3 Feb 2014 11:58]
Laurynas Biveinis
5.6$ bzr log -r 5720
------------------------------------------------------------
revno: 5720
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-31 12:29:00 +0900
message:
  Follow up for Bug#16249481 fix.
  If flush_rbt is active (in recovery phase), buf_page_t::space and buf_page_t::offset cannot be invalidated until the bpage is removed from flush_list.

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[28 Feb 2014 6:40]
Laurynas Biveinis
See also bug 71727.