Bug #68079 | InnoDB does not scale well on 12 core system for specific all in memory SELECT | | |
---|---|---|---|
Submitted: | 14 Jan 2013 12:01 | Modified: | 16 Jan 2014 13:15 |
Reporter: | Valeriy Kravchuk | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 5.5.29, 5.6.9-rc | OS: | Linux (x86_64) |
Assigned to: | Arnaud Adant | CPU Architecture: | Any |
Tags: | performance, read only, scalability, throughput |
[14 Jan 2013 12:01]
Valeriy Kravchuk
[14 Jan 2013 12:12]
Valeriy Kravchuk
Dump of tables, in case somebody thinks the exact content matters
Attachment: test.sql.gz (application/x-tar-gz, text), 233.32 KiB.
[14 Jan 2013 16:16]
Patrick Casey
For what it's worth, I think it's the join, rather than the group by, that's doing us in. I use a different benchmark than Valeriy, but I'm seeing the same performance cliff even with selects by primary key against the join product.

For this test, we spin 1, 2, 4, 8, 16, 24, 32, and 48 threads. Hardware is a 12-core (physical), 24-core (hyperthreading), 128G (96G buffer pool) box running mysql 5.5.25a. Each thread executes this query (select by PK, followed by a nested loop join) 100X with no waits between them:

select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='0000004c46567472005844a12a205beb'

Total time to execute:

2012-10-01 17:40:09 (542) main [0:00:00.242] Parallel 1
2012-10-01 17:40:09 (857) main [0:00:00.225] Parallel 2   <--- two threads, same wall clock time
2012-10-01 17:40:10 (379) main [0:00:00.434] Parallel 4   <--- four threads on a 24 core box, but already takes twice as long
2012-10-01 17:40:11 (348) main [0:00:00.849] Parallel 8   <--- 8 threads take 4X as long for the workload as a single thread
2012-10-01 17:40:13 (222) main [0:00:01.696] Parallel 16
2012-10-01 17:40:16 (028) main [0:00:02.553] Parallel 24  <--- all cores fully engaged, and we get 1/10 the throughput per core that we did at parallel 1
2012-10-01 17:40:19 (763) main [0:00:03.401] Parallel 32
2012-10-01 17:40:25 (334) main [0:00:05.085] Parallel 48  <--- interestingly we take twice as long as we do at 24 threads, so once we go non-linear like this, total time just grows linearly with thread count

Some more examples:

*** Bind Variables

Same as above, but using bind variables instead of a literal; essentially identical numbers.

select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='?'
BIND = 0000004c46567472005844a12a205beb
100X in tight loop

mysql:
2012-10-02 16:25:25 (819) main [0:00:00.235] Parallel 1
2012-10-02 16:25:26 (118) main [0:00:00.212] Parallel 2
2012-10-02 16:25:26 (645) main [0:00:00.429] Parallel 4
2012-10-02 16:25:27 (622) main [0:00:00.840] Parallel 8
2012-10-02 16:25:29 (506) main [0:00:01.679] Parallel 16
2012-10-02 16:25:32 (324) main [0:00:02.524] Parallel 24
2012-10-02 16:25:36 (077) main [0:00:03.366] Parallel 32
2012-10-02 16:25:41 (683) main [0:00:05.034] Parallel 48

******* 100 Primary keys.
Executed 1X only per thread (without the 100X per thread loop as in teh first case): select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id IN ('6e7de761296660009a1cdbdba7d294b1','1e2a04eb100820004916aea6bc378a09','c2dd9cc87dec7400c14de29bdcf48088','caa82626f97fe000c5ab2ef1817e96ce','4dd138c687943000c14d83bdff434d9e','db6ae9a387807c00c14d83bdff434d35','2055c9cb8710b0005c9ab3aeef434de6','f66e17878794b4003fff83bdff434d5c','22644a20297f2c003fff9456dc7773f3','8ab5646be0f7e400957b9e05e1466395','145498f487a16400957bb3aeef434dd4','e06560d3e0bf2000957b9e05e146635e','e0bcd817e0bb2000957b9e05e14663fe','68d1245fe03f2000957b9e05e146631a','80274b0814d6ac409a1c431fe5411324','6ae835a908ec68009a1c7e7723d878bc','bb826b9d4c60e4003fffad30c2edccba','c8a8fa3b874c3440491683bdff434d3f','9802c5e187386400957bb3aeef434d10','debee39e7969a40049166a127486e654','089262bb7d5a24004916494baeedb4db','be55360484b7a400957b62b4414c0973','2d42966d8755ac409a1cb3aeef434d0d','d5347e1e85ef6c009a1cacb754d428cc','7752a2a78708bc00c14d83bdff434d3d','45a315eb357e20009a1c6c2a055da761','91fa0f5b6dd07000c14dd31166116763','87e5728a87d8f000491683bdff434d23','a3694c3fb9ab60003fff388835944194','31a091378740b040491683bdff434dd6','21dff80887e8b4003fff83bdff434dab','1e3fc33887302800957bb3aeef434d2f','87150598870118003fff83bdff434d9b','e3f2496787403c00c14d83bdff434d4c','3d67017387d7e4489a1cb3aeef434d09','06859c3bc5e7a0009a1cb4a4c8842448','91bbda5b8787ec40491683bdff434d1d','a374b389875c3400b43083bdff434dbd','97962f268744b4003fff83bdff434d48','ae910eab7c7620009a1ce03d7ffe488d','2d464658871ee0803fff83bdff434d2d','56e964a50c27e0009a1c70511c18ac50','9ebea8238738a4009a1cb3aeef434dbd','dafeb8d3878c34005c9ab3aeef434dfe','087942e7c00b6c843fffc43a2f2c8640','a32ec79587d0b000c5ab83bdff434d67','2d03c1028753a0449a1cb3aeef434d1d','fc5c23c7997d240049163de5a8f9945a','227dd8f8878ed000957bb3aeef434d41','419520e4e5334400957b5ef5706ff670','70f17879c1d364043fff71e055206af6','70175dc487e4b400b43083bdff434d9d','8531984b870874005c9ab3aeef434d57','3be662b4ed26a400957b6bc67ca1af4f','120557cf87d874005c9ab3aeef434dc9','0f8ece42875cb000c14d83bdff434d3e','d642dfc4715864409a1c66e63cf7a2a2','3394ef3185816440491668304b730ed3','550f5c4704bf6000b430af5c1b17094e','f3381d318750f400957bb3aeef434d4f','3960c9fc875cb040b43083bdff434d03','14fdf42987cf9c803fff83bdff434d29','db2ced3d8710f4009a1cb3aeef434d87','e22e102a87c07000c14d83bdff434d67','ed81b8b38558b0009a1c222fca18a36e','747dd8358720100049e6feddff434d73','733aa0ccc1e3ac483fff71e055206ab4','90c5eb0187183400957bb3aeef434ddf','a92086d587103000c14d83bdff434d1b','7d51ce70875434403fff83bdff434ddd','c893557c87987440c14d83bdff434d42','072a8972405c64004916fe7f27fe2ffa','889508003408b000c5aba1f0f97f7986','035b571e8414f000957bc168246c99b0','da0e76b68744f000c14d83bdff434d62','ca561b6a87a290003fff83bdff434dfa','230e1a0c84376400957b62b4414c092a','e3ca48d987547400b43083bdff434d13','1ae15281879b2c809a1cb3aeef434db9','311d1bc78784b0009a1cb3aeef434d13','e99c964b7d983400957be671126fb970','d294e86f71141000957bb8353b1e6c9d','89689f5684bd20003fff21253fbef3df','f3e4043240b3e040957b9a98b2d98de2','be8224ea985ca4009a1c12391b8ce90d','829d5938398cf000b4305f461d471aa1','b5579b2987c16c009a1cb3aeef434d74','546f337c8700b8009a1cb3aeef434df1','5594af0a87d51400957bb3aeef434dde','db7f5f0687951400957bb3aeef434d22','659f56d2904b2004957bad7c9b94a550','8d010bd8a96c24009a1c764c84831176','42e2af4bc083e8443fffc43a2f2c863a','2bbe273b0804ec003fffaf9b706ba5eb','63b4330f878cb000491683bdff434dec','07eefa498754f0003fff83bdff434dcd','a542571210b014009
57b793c9605f286','840d9f8f8704b000491683bdff434d91','3502609e091cb400957b695f3707b695','cdab2fd287799000957bb3aeef434d53') mysql: 2012-10-01 17:46:17 (705) main [0:00:00.016] Parallel 1 2012-10-01 17:46:17 (816) main [0:00:00.025] Parallel 2 2012-10-01 17:46:17 (952) main [0:00:00.045] Parallel 4 2012-10-01 17:46:18 (166) main [0:00:00.086] Parallel 8 2012-10-01 17:46:18 (500) main [0:00:00.167] Parallel 16 2012-10-01 17:46:18 (985) main [0:00:00.249] Parallel 24 2012-10-01 17:46:19 (639) main [0:00:00.329] Parallel 32 2012-10-01 17:46:20 (641) main [0:00:00.492] Parallel 48
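
For reference, a harness along these lines reproduces the shape of the test above, though not the exact tool used; the loop counts, credentials, and the fact that each query pays connection setup cost (unlike a persistent-connection harness) are assumptions, so absolute numbers will differ:

#!/bin/bash
# Rough sketch: N client threads, each running the same PK-join query 100 times.
QUERY="select * from task inner join incident on task.sys_id=incident.sys_id where task.sys_id='0000004c46567472005844a12a205beb'"
for THREADS in 1 2 4 8 16 24 32 48; do
    START=$(date +%s.%N)
    for i in $(seq "$THREADS"); do
        ( for j in $(seq 100); do
              # each iteration opens a fresh connection, so this overstates per-query cost
              mysql -uroot test -N -e "$QUERY" > /dev/null
          done ) &
    done
    wait
    END=$(date +%s.%N)
    echo "Parallel $THREADS: $(echo "$END - $START" | bc) s"
done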
[15 Jan 2013 10:56]
Valeriy Kravchuk
MySQL 5.6.9 scales better (10 threads run the load faster than 8), but still does not scale even to 12 threads for this query:

[valerii.kravchuk@cisco1 mysql-5.6.9-rc-linux-glibc2.5-x86_64]$ bin/mysqlslap -uroot --concurrency=N --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           37.871   37.853   37.965   1000
 2           21.072   17.819   22.401   500
 4           11.222    8.349   12.143   250
 8            7.795    6.865    8.899   125
10            7.481    6.830    8.329   100
12            7.708    7.153    7.988   83
24            9.634    9.184    9.860   41

Waits at this stage:

mysql> show engine innodb mutex;
+--------+-------------------------+-----------------+
| Type   | Name                    | Status          |
+--------+-------------------------+-----------------+
| InnoDB | log0log.cc:737          | os_waits=6      |
| InnoDB | buf0buf.cc:1225         | os_waits=3      |
| InnoDB | combined buf0buf.cc:975 | os_waits=117391 |
| InnoDB | log0log.cc:799          | os_waits=124    |
+--------+-------------------------+-----------------+
4 rows in set (0.00 sec)
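
For reference, a sweep like the above can be scripted along these lines (paths, credentials, and the concurrency list are illustrative assumptions):

#!/bin/bash
# Run the join+group-by query at increasing client counts and keep the mutex waits per step.
Q='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
for C in 1 2 4 8 10 12 24; do
    bin/mysqlslap -uroot --create-schema=test --no-drop \
        --concurrency=$C --number-of-queries=1000 --iterations=10 \
        --query="$Q" | grep 'number of seconds'
    bin/mysql -uroot -e 'show engine innodb mutex' > mutex_c$C.txt
done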
[15 Jan 2013 12:08]
Valeriy Kravchuk
This is what I see in "top waits" using Performance Schema after running the test with 24 threads twice after a clean startup:

mysql> select event_name, count_star, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 5;
+--------------------------------------------+------------+----------------+
| event_name                                 | count_star | sum_timer_wait |
+--------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/os_mutex           | 71754726   | 11380485535846 |
| wait/synch/mutex/innodb/trx_sys_mutex      | 60057      | 4191727883     |
| wait/synch/mutex/innodb/buf_pool_mutex     | 8655       | 468303388      |
| wait/synch/mutex/innodb/mutex_list_mutex   | 10310      | 427385546      |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 316946749      |
+--------------------------------------------+------------+----------------+
5 rows in set (0.00 sec)

FYI, this was the runtime:

[valerii.kravchuk@cisco1 mysql-5.6.9-rc-linux-glibc2.5-x86_64]$ bin/mysqlslap -uroot --concurrency=24 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 9.781 seconds
        Minimum number of seconds to run all queries: 9.471 seconds
        Maximum number of seconds to run all queries: 9.947 seconds
        Number of clients running queries: 24
        Average number of queries per client: 41

So one can compare the influence of full P_S instrumentation with default timing for this case (average of 9.634 seconds without it): the overhead is about 1.5%. Looks really good.

For the next runs I see an increase only for the first 4 mutexes:

+--------------------------------------------+------------+----------------+
| event_name                                 | count_star | sum_timer_wait |
+--------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/os_mutex           | 107565498  | 17037167192311 |
| wait/synch/mutex/innodb/trx_sys_mutex      | 90057      | 6242580106     |
| wait/synch/mutex/innodb/buf_pool_mutex     | 13194      | 707529424      |
| wait/synch/mutex/innodb/mutex_list_mutex   | 11270      | 483923608      |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 316946749      |
+--------------------------------------------+------------+----------------+
5 rows in set (0.00 sec)
[15 Jan 2013 23:11]
Dimitri KRAVTCHUK
Valeriy, thanks for reporting it, and for a very good test case to reproduce the problem! Initially I supposed the contention was related to the BP instances, but that's not really the case (although the improvement you're seeing in 5.6 is related to it, since in 5.6 InnoDB has more advanced BP locking management: hash tables were added in addition to BP instances). It seems to me the issue comes from concurrent access to the same pages plus having a join. If I change the test conditions a little and leave each user thread to run within its own data set (by restricting it with a WHERE clause), I see much better performance. You may try other variations if you want, but we have a real problem here, and it seems to be pretty old. (I've also observed scalability issues with simple dbSTRESS RO workloads too, but have not yet taken a deep dive into that problem.) More investigation is needed to understand it better.

BTW, the mutex contention you'll see also depends a lot on the system you're using, on the speed of your CPU cores, and on their number. For example, on my 32-core server the top wait at 16 threads is on the btr_search_latch, even with AHI turned off -- so there is really something pretty interesting here ;-)

Rgds, -Dimitri
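
For illustration, a per-thread restriction of the test query could look roughly like this; splitting on the leading character of sys_id is just a hypothetical way to give each thread a disjoint slice of the data:

-- thread 1 works only on sys_id values starting with '0', thread 2 on '1', and so on
select count(*), category
  from task inner join incident on task.sys_id = incident.sys_id
 where task.sys_id like '0%'
 group by incident.category;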
[16 Jan 2013 17:43]
Valeriy Kravchuk
Thank you for paying attention to this report and test case, Dimitri!

Surely os_mutex waits can be removed from the top by tuning innodb_sync_spin_loops, to get something like:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/trx_sys_mutex      | 30186      | 4.5602 |
| wait/synch/mutex/innodb/os_mutex           | 51546      | 2.3394 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 4269       | 0.5441 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8878       | 0.3815 |
| wait/synch/mutex/innodb/dict_sys_mutex     | 133        | 0.3352 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8304       | 0.3198 |
| wait/synch/mutex/innodb/fil_system_mutex   | 2119       | 0.1864 |
| wait/synch/mutex/innodb/buf_dblwr_mutex    | 1099       | 0.1318 |
| wait/synch/mutex/innodb/log_sys_mutex      | 1008       | 0.0964 |
| wait/synch/mutex/innodb/flush_list_mutex   | 311        | 0.0437 |
+--------------------------------------------+------------+--------+
10 rows in set (0.00 sec)

and comparing the above to the result just after starting up and enabling consumers:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/os_mutex           | 50228      | 2.1837 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8300       | 0.3116 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8379       | 0.3054 |
| wait/synch/mutex/innodb/fil_system_mutex   | 1241       | 0.0534 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 818        | 0.0333 |
| wait/synch/mutex/innodb/log_sys_mutex      | 261        | 0.0140 |
| wait/synch/mutex/innodb/rseg_mutex         | 273        | 0.0130 |
| wait/synch/mutex/innodb/recv_sys_mutex     | 185        | 0.0073 |
| wait/synch/mutex/innodb/trx_mutex          | 183        | 0.0067 |
| wait/synch/mutex/innodb/flush_list_mutex   | 25         | 0.0053 |
+--------------------------------------------+------------+--------+
10 rows in set (0.00 sec)

I'd say on my system it is mostly about wait/synch/mutex/innodb/trx_sys_mutex (why?) and somewhat about wait/synch/mutex/innodb/buf_pool_mutex (the latter because of access to the same blocks).

FYI, my older testing attempts showed that using several (up to 12) buffer pool instances on 5.5.x does not help to improve scalability.

I am still a beginner in real-life P_S usage, so my usage and interpretation may be wrong. If somebody wants me to run a specific P_S query after testing, or to apply other non-default settings for instruments and/or consumers or anything else in P_S, please tell me.

Thank you for accepting this report as a real problem.
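
For reference, with the server started with performance_schema enabled, the synch wait instrumentation used in these queries can be switched on and reset between runs roughly like this (the global_instrumentation consumer is on by default, so nothing else should be needed; treat this as a sketch of the standard P_S setup, not the exact commands used here):

mysql> update performance_schema.setup_instruments
    ->    set enabled = 'YES', timed = 'YES'
    ->  where name like 'wait/synch/%';

-- reset the counters before each measured run so results are comparable
mysql> truncate table performance_schema.events_waits_summary_global_by_event_name;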
[16 Jan 2013 19:13]
Arnaud Adant
OK, this bug is now verified as described. On a 4-core test machine with 5.5.29, the test does not even scale to 2 connections.

mysql> select EVENT_NAME as nm, COUNT_STAR as cnt, SUM_TIMER_WAIT as tm
    -> from performance_schema.events_waits_summary_global_by_event_name
    -> where COUNT_STAR > 0 order by 3 desc limit 20;
+-----------------------------------------------+-----------+----------------+
| nm                                            | cnt       | tm             |
+-----------------------------------------------+-----------+----------------+
| wait/synch/mutex/innodb/buf_pool_mutex        | 165112346 | 42207640363152 |
| wait/synch/rwlock/innodb/index_tree_rw_lock   | 81991326  | 22057201473246 |
| wait/synch/rwlock/innodb/btr_search_latch     | 93372604  | 20577738333006 |
| wait/synch/mutex/innodb/kernel_mutex          | 4087505   | 2659589263404  |

That is already 42s + 20s + 20s for mutexes and rwlocks. InnoDB buffer pool instances had a positive effect, although they do not solve the scalability issue. It has been well known for a long time that MySQL does not scale very well if the queries examine a lot of rows.
[17 Jan 2013 18:59]
Patrick Casey
Does anyone have an opinion as to whether or not the underlying OS is going to matter here? Are we spinning on these things long enough to fall back to kernel-level mutexes, or is this all internal to the InnoDB code? The reason I ask is that I can probably (with some difficulty) re-run these tests with MySQL on SunOS, but it would be disruptive here, so I'd only want to do it if folks thought there was a reasonable chance we'd see an improvement.
[17 Jan 2013 19:27]
Arnaud Adant
Hi Patrick,

The contention is not OS related; it comes from InnoDB. So you'll reach the limit even with powerful hardware. A higher CPU frequency might make it go faster, but it won't make it scale. It might be easier to tune your queries: 16k rows examined per statement is really something that does not scale for now. If you cut it by 100, you'll scale better.

Best regards,
[18 Jan 2013 18:13]
Patrick Casey
Unfortunately, that's not really practical in the general case, since these are customer-generated queries. We're moving to shard our servers down more aggressively so that we have more parallel instances of mysql running, which will mitigate the problem, but sharding doesn't scale to infinity and we're going to have an ongoing hot-shard problem.

I have to say, I'm sort of shocked that this isn't a more commonly experienced bottleneck in the field. Now that we know what to look for, we're seeing pretty strong evidence of what we're calling the "10 core problem" on multiple instances in our farm. Is our query pattern really that unusual? Or are we unusual in that we're deploying innodb on bigger iron than most folks, so throughput capping at 10/24 cores is a lot more noticeable than throughput capping at, say, 10/12 cores?
[20 Jan 2013 19:20]
Mark Callaghan
Patrick - performance tuning expertise for MySQL + SunOS is less common than for MySQL + Linux. InnoDB implements a value-added mutex on top of pthread_mutex, and a value-added rw-lock on top of pthread_mutex and pthread_cond_t. The rw-lock can be very efficient, depending on the platform (it works well on Linux). A good pthread library can be very important here. With modern Linux we benefit from the use of futexes, so that many operations don't require a system call. Note that the value-added InnoDB code has its own busy-wait loop where it will spin/pause/spin/pause for a while before going to sleep.

Things you can do (example settings below):
* tune the spin loops -- change innodb_spin_wait_delay, innodb_sync_spin_loops
* limit concurrency in InnoDB -- reduce innodb_thread_concurrency to some number less than the count of CPU cores, increase innodb_concurrency_tickets
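
All four variables are dynamic, so they can be tried online; the values below are illustrative only, not recommendations:

set global innodb_spin_wait_delay     = 24;   -- default 6
set global innodb_sync_spin_loops     = 100;  -- default 30
set global innodb_thread_concurrency  = 16;   -- default 0 = unlimited
set global innodb_concurrency_tickets = 5000; -- default 500 in 5.5, 5000 in 5.6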
[28 Jan 2013 16:13]
Patrick Casey
Mark,

I thought the way the InnoDB mutex worked is that it spun on the lock for a configurable number of spins, then fell through into an OS mutex if the spin count was exceeded. Looking at the wait states here, I'm seeing the wait time dominated by os mutex waits (or at least that's what I assume that counter is tracking).

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;
+--------------------------------------------+------------+--------+
| event_name                                 | count_star | time   |
+--------------------------------------------+------------+--------+
| wait/synch/mutex/innodb/os_mutex           | 50228      | 2.1837 |
| wait/synch/mutex/innodb/rw_lock_list_mutex | 8300       | 0.3116 |
| wait/synch/mutex/innodb/mutex_list_mutex   | 8379       | 0.3054 |
| wait/synch/mutex/innodb/fil_system_mutex   | 1241       | 0.0534 |
| wait/synch/mutex/innodb/buf_pool_mutex     | 818        | 0.0333 |
| wait/synch/mutex/innodb/log_sys_mutex      | 261        | 0.0140 |
| wait/synch/mutex/innodb/rseg_mutex         | 273        | 0.0130 |
| wait/synch/mutex/innodb/recv_sys_mutex     | 185        | 0.0073 |
| wait/synch/mutex/innodb/trx_mutex          | 183        | 0.0067 |
| wait/synch/mutex/innodb/flush_list_mutex   | 25         | 0.0053 |
+--------------------------------------------+------------+--------+

We've tried tuning the spin counts and delays every which way to see if we could move the needle on this thing, but saw no material difference in throughput, so I suspect we're just moving the problem around, e.g. we are probably just shifting the wait off the OS mutex and into the spin loop, but either way we're still waiting :(.

What I'm not following here, though (probably because I don't have a great mental map of the guts of the InnoDB memory subsystem), is why there's an exclusive lock at work here at all. This is a read-mostly workload in production and a read-only workload in the benchmark. I can't see why two threads cannot access the same page in the buffer pool concurrently (at least conceptually). Are we taking a page-level exclusive lock?
[28 Jan 2013 17:07]
Mark Callaghan
Filed http://bugs.mysql.com/bug.php?id=68207 to ask for better granularity in reporting for "os_mutex". That is the top source of contention. In your case I think the problem comes from the os_mutex that guards the sync array -- which is where threads go to sleep. If you haven't tried it, then setting "innodb_thread_concurrency" to limit the number of threads that can be in InnoDB might help, but only if you aren't IO bound. For some workloads where many rows are read per query, raising innodb_concurrency_tickets can help. InnoDB has mutexes protecting various subsystems, just like all the other server code I read. It still has lots of things to make better. The sync array is one of them. One problem in that area is http://bugs.mysql.com/bug.php?id=52806
[28 Jan 2013 17:25]
Dimitri KRAVTCHUK
Mark, Patrick,

I'm looking closely at this issue and will keep you updated. For the moment I think the problem comes from the block locking (and maybe something related to the JOIN). To be honest, it's pretty strange to see it, but this is the reality and we're facing it :-) I was also able to reproduce the problem on Sysbench data (it seems to me that once we get it fixed, we'll add this test to our QA runs to be sure we never hit this regression again). I can confirm that this problem is long-standing, and it's really strange that nobody reported it before (that's why I'm also so surprised). But before we can fix it, we have to find where exactly the issue is coming from. I'm digging, so stay tuned ;-)

Rgds, -Dimitri
[29 Jan 2013 22:01]
Dimitri KRAVTCHUK
Patrick, Valeriy,

It was unclear to me why you were seeing the "os_mutex" waits in the top position and I was not -- but in fact you both used a too-restrictive name pattern in the PFS query:

mysql> select event_name, count_star, sum_timer_wait/1000000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch/mutex/innodb%' and count_star > 0 order by sum_timer_wait desc limit 10;

Try instead:

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;

which selects all "synch" wait events; then you should see the RW-locks or others in the top position (I hope so). In my case, on one of the tests, it's:

-- before each test, remember to reset the data
mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ;

-- top-10 wait synch
mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------+------------+--------------+
| event_name                                  | count_star | time         |
+---------------------------------------------+------------+--------------+
| wait/synch/rwlock/innodb/btr_search_latch   | 27065635   | 2526009.2864 |
| wait/synch/rwlock/innodb/hash table locks   | 43618714   | 1302668.4231 |
| wait/synch/mutex/innodb/os_mutex            | 12904603   | 1239325.2118 |
| wait/synch/rwlock/innodb/index_tree_rw_lock | 14451920   | 632423.6294  |
| wait/synch/mutex/sql/THD::LOCK_thd_data     | 2785412    | 18428.1909   |
| wait/synch/rwlock/sql/MDL_lock::rwlock      | 551543     | 13842.7159   |
| wait/synch/mutex/mysys/THR_LOCK::mutex      | 551554     | 9266.0430    |
| wait/synch/mutex/sql/MDL_map::mutex         | 275889     | 6979.9721    |
| wait/synch/mutex/sql/LOCK_table_cache       | 551557     | 6443.5962    |
| wait/synch/mutex/innodb/trx_sys_mutex       | 413552     | 3589.6080    |
+---------------------------------------------+------------+--------------+
10 rows in set (0.01 sec)

I'm curious about your output.

Rgds, -Dimitri
[30 Jan 2013 7:51]
Valeriy Kravchuk
That 12 core box is busy with other benchmarks now, so I've tested on my old QuadCore with 4 threads (it does not scale well to 4 concurrent queries of this kind, based on my older tests). This is what I've got on 5.6.9-rc: [openxs@chief 5.6]$ bin/mysqlslap -uroot --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category' Benchmark Average number of seconds to run all queries: 18.372 seconds Minimum number of seconds to run all queries: 18.108 seconds Maximum number of seconds to run all queries: 18.847 seconds Number of clients running queries: 4 Average number of queries per client: 250 [openxs@chief 5.6]$ bin/mysql -uroot test Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 44 Server version: 5.6.9-rc MySQL Community Server (GPL) Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select event_name, count_star, sum_timer_wait/10000000 time from -> performance_schema.events_waits_summary_global_by_event_name where event_name -> like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10; +----------------------------------------------+------------+--------------+ | event_name | count_star | time | +----------------------------------------------+------------+--------------+ | wait/synch/rwlock/innodb/hash table locks | 165128221 | 5412925.4088 | | wait/synch/rwlock/innodb/btr_search_latch | 92493290 | 2755134.4885 | | wait/synch/rwlock/innodb/index_tree_rw_lock | 78284326 | 1696840.0731 | | wait/synch/rwlock/innodb/checkpoint_lock | 60 | 165570.9816 | | wait/synch/cond/sql/COND_server_started | 1 | 105655.9263 | | wait/synch/mutex/innodb/buf_pool_mutex | 15987 | 48932.7542 | | wait/synch/mutex/innodb/os_mutex | 109107 | 9889.8180 | | wait/synch/rwlock/innodb/dict_operation_lock | 1012 | 7411.5460 | | wait/synch/mutex/innodb/log_sys_mutex | 50692 | 2883.7936 | | wait/synch/rwlock/sql/MDL_lock::rwlock | 40211 | 1420.8270 | +----------------------------------------------+------------+--------------+ 10 rows in set (0.01 sec)
[30 Jan 2013 9:11]
Dimitri KRAVTCHUK
Valeriy, could you truncate (reset) the PFS data before your test? Just do:

mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ;

because some events in your output are not related to RO workloads. But I think it will confirm what I'm observing too (on small servers the hash table waits are in the top position, on big ones it's btr_search_latch). BTW, do you have AHI enabled or disabled in your InnoDB instance? (innodb_adaptive_hash_index=1/0)

Rgds, -Dimitri
[31 Jan 2013 13:40]
Valeriy Kravchuk
Sorry, it seems I forgot to truncate after the initial data load. I start the server with --no-defaults + P_S related settings only, so adaptive hash indexing is ON:

[openxs@chief 5.6]$ mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 85
Server version: 5.6.9-rc MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show variables like 'innodb_adaptive_ha%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| innodb_adaptive_hash_index | ON    |
+----------------------------+-------+
1 row in set (0.00 sec)

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------+------------+--------------+
| event_name                                  | count_star | time         |
+---------------------------------------------+------------+--------------+
| wait/synch/rwlock/innodb/hash table locks   | 165049219  | 4315982.3893 |
| wait/synch/rwlock/innodb/btr_search_latch   | 92507461   | 2581370.7551 |
| wait/synch/rwlock/innodb/index_tree_rw_lock | 82236944   | 1601547.3698 |
| wait/synch/rwlock/innodb/checkpoint_lock    | 48         | 98272.5330   |
| wait/synch/mutex/innodb/os_mutex            | 64656      | 13226.1263   |
| wait/synch/mutex/sql/THD::LOCK_thd_data     | 140604     | 1569.4737    |
| wait/synch/mutex/mysys/THR_LOCK::mutex      | 40016      | 815.9076     |
| wait/synch/rwlock/sql/MDL_lock::rwlock      | 40011      | 786.5132     |
| wait/synch/mutex/sql/LOCK_table_cache       | 40008      | 544.1570     |
| wait/synch/mutex/innodb/trx_sys_mutex       | 30082      | 371.1979     |
+---------------------------------------------+------------+--------------+
10 rows in set (0.00 sec)
[31 Jan 2013 21:47]
Patrick Casey
This is on a 24 core (ht) box. 2013-01-31 13:42:11 (158) main Parallel 1 Time 0:00:04.669 2013-01-31 13:42:16 (893) main Parallel 2 Time 0:00:05.632 2013-01-31 13:42:23 (680) main Parallel 4 Time 0:00:06.715 2013-01-31 13:42:30 (911) main Parallel 8 Time 0:00:07.121 2013-01-31 13:42:40 (703) main Parallel 12 Time 0:00:09.636 2013-01-31 13:42:53 (402) main Parallel 16 Time 0:00:12.504 2013-01-31 13:43:09 (561) main Parallel 24 Time 0:00:15.974 2013-01-31 13:43:31 (709) main Parallel 32 Time 0:00:21.923 2013-01-31 13:44:05 (197) main Parallel 48 Time 0:00:32.773 Main wait state looks to be btr_search_latch mysql> truncate table performance_schema.events_waits_summary_global_by_event_name ; Query OK, 0 rows affected, 1 warning (0.02 sec) mysql> use hi02; Database changed mysql> exit Bye [perf-db03.lab3:/]$vmstat 2 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 2 0 312 347688 1012580 19995148 0 0 7 12 0 0 0 0 100 0 0 3 0 312 347996 1012580 19995152 0 0 0 20 1150 429 8 0 92 0 0 2 0 312 348000 1012580 19995152 0 0 0 420 1047 346 8 0 92 0 0 4 0 312 348500 1012580 19995152 0 0 0 10 1098 403 14 0 86 0 0 4 0 312 348512 1012580 19995156 0 0 0 20 1068 370 17 0 83 0 0 4 0 312 348560 1012580 19995156 0 0 0 10 1023 312 17 0 83 0 0 8 0 312 348532 1012580 19995160 0 0 0 20 1176 511 20 0 80 0 0 <snip> [perf-db03.lab3:/]$mysql Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 207205 Server version: 5.5.25a-log MySQL Community Server (GPL) Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> select event_name, count_star, sum_timer_wait/10000000 time from -> performance_schema.events_waits_summary_global_by_event_name where event_name -> like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10; +-----------------------------------------------+------------+---------------+ | event_name | count_star | time | +-----------------------------------------------+------------+---------------+ | wait/synch/rwlock/innodb/btr_search_latch | 170108654 | 50911891.9512 | | wait/synch/mutex/innodb/buf_pool_mutex | 10192868 | 23605561.2940 | | wait/synch/mutex/innodb/log_flush_order_mutex | 7554710 | 880044.6446 | | wait/synch/rwlock/innodb/index_tree_rw_lock | 883949 | 53036.0507 | | wait/synch/rwlock/innodb/checkpoint_lock | 104 | 37912.8009 | | wait/synch/mutex/innodb/fil_system_mutex | 8726 | 22322.1135 | | wait/synch/rwlock/sql/MDL_lock::rwlock | 249863 | 1749.2439 | | wait/synch/mutex/sql/MDL_map::mutex | 247220 | 1602.0584 | | wait/synch/mutex/sql/THD::LOCK_thd_data | 42815 | 739.8814 | | wait/synch/mutex/sql/LOCK_thread_count | 660 | 451.1033 | +-----------------------------------------------+------------+---------------+ 10 rows in set (0.02 sec)
[31 Jan 2013 21:55]
Patrick Casey
Previous run had adaptive hash index ON. This run has adaptive hash index OFF.

mysql> set global innodb_adaptive_hash_index=OFF;
Query OK, 0 rows affected (0.33 sec)

mysql> show global variables like 'innodb_ad%';
+---------------------------------+---------+
| Variable_name                   | Value   |
+---------------------------------+---------+
| innodb_adaptive_flushing        | ON      |
| innodb_adaptive_hash_index      | OFF     |
| innodb_additional_mem_pool_size | 2097152 |
+---------------------------------+---------+
3 rows in set (0.00 sec)

Performance is much, much worse.

2013-01-31 13:48:55 (249) main Parallel 1 Time 0:00:07.614
2013-01-31 13:49:06 (206) main Parallel 2 Time 0:00:10.857
2013-01-31 13:49:18 (658) main Parallel 4 Time 0:00:12.381
2013-01-31 13:49:37 (999) main Parallel 8 Time 0:00:19.227
2013-01-31 13:50:08 (757) main Parallel 12 Time 0:00:30.609
2013-01-31 13:50:50 (434) main Parallel 16 Time 0:00:41.139
2013-01-31 13:51:55 (855) main Parallel 24 Time 0:01:05.232
(I cancelled the 32 and 48 thread runs)

Wait states are different. It looks like we just shifted the problem onto the buffer pool mutex.

mysql> select event_name, count_star, sum_timer_wait/10000000 time from performance_schema.events_waits_summary_global_by_event_name where event_name like 'wait/synch%' and count_star > 0 order by sum_timer_wait desc limit 10;
+---------------------------------------------------+------------+----------------+
| event_name                                        | count_star | time           |
+---------------------------------------------------+------------+----------------+
| wait/synch/mutex/innodb/buf_pool_mutex            | 413972344  | 364754611.9145 |
| wait/synch/rwlock/innodb/index_tree_rw_lock       | 140104101  | 1447339.8741   |
| wait/synch/mutex/innodb/log_flush_order_mutex     | 9600076    | 89843.1713     |
| wait/synch/rwlock/innodb/btr_search_latch         | 68         | 31865.4732     |
| wait/synch/rwlock/innodb/checkpoint_lock          | 68         | 5376.1065      |
| wait/synch/mutex/sql/MDL_map::mutex               | 182725     | 3311.1823      |
| wait/synch/rwlock/sql/MDL_lock::rwlock            | 184404     | 1149.3740      |
| wait/synch/mutex/sql/THD::LOCK_thd_data           | 17225      | 244.7680       |
| wait/synch/mutex/innodb/kernel_mutex              | 7704       | 194.3324       |
| wait/synch/mutex/sql/LOCK_global_system_variables | 9085       | 75.6374        |
+---------------------------------------------------+------------+----------------+
10 rows in set (0.00 sec)
[31 Jan 2013 21:57]
Patrick Casey
For what its worth I'm running with 4 buffer pool instances. mysql> show variables like 'innodb_buf%'; +------------------------------+--------------+ | Variable_name | Value | +------------------------------+--------------+ | innodb_buffer_pool_instances | 4 | | innodb_buffer_pool_size | 103079215104 | +------------------------------+--------------+ 2 rows in set (0.00 sec) mysql> show engine innodb mutex; +--------+------------------------+------------------+ | Type | Name | Status | +--------+------------------------+------------------+ | InnoDB | trx0rseg.c:210 | os_waits=1 | | InnoDB | ibuf0ibuf.c:537 | os_waits=5 | | InnoDB | ibuf0ibuf.c:534 | os_waits=7 | | InnoDB | dict0dict.c:698 | os_waits=63 | | InnoDB | trx0sys.c:196 | os_waits=454637 | | InnoDB | log0log.c:775 | os_waits=14428 | | InnoDB | log0log.c:771 | os_waits=113906 | | InnoDB | buf0buf.c:1201 | os_waits=76 | | InnoDB | buf0buf.c:1166 | os_waits=124819 | | InnoDB | buf0buf.c:1201 | os_waits=74 | | InnoDB | buf0buf.c:1166 | os_waits=4566148 | | InnoDB | buf0buf.c:1201 | os_waits=78 | | InnoDB | buf0buf.c:1166 | os_waits=159743 | | InnoDB | buf0buf.c:1201 | os_waits=44 | | InnoDB | buf0buf.c:1166 | os_waits=91288 | | InnoDB | fil0fil.c:1608 | os_waits=62117 | | InnoDB | srv0srv.c:1018 | os_waits=20542 | | InnoDB | combined buf0buf.c:915 | os_waits=1751 | | InnoDB | dict0dict.c:1725 | os_waits=15 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=7 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=11 | | InnoDB | dict0dict.c:1725 | os_waits=15 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=2 | | InnoDB | dict0dict.c:1725 | os_waits=1 | | InnoDB | dict0dict.c:1725 | os_waits=3 | | InnoDB | dict0dict.c:1725 | os_waits=5 | | InnoDB | dict0dict.c:1725 | os_waits=28 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=10 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=4 | | InnoDB | dict0dict.c:721 | os_waits=52 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=2 | | InnoDB | dict0dict.c:721 | os_waits=6 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=5 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=3 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:721 | os_waits=2 | | InnoDB | dict0dict.c:721 | os_waits=1 | | InnoDB | dict0dict.c:711 | os_waits=33 | | InnoDB | fil0fil.c:1300 | os_waits=170 | | InnoDB | log0log.c:832 | os_waits=521350 | | InnoDB | btr0sea.c:178 | os_waits=1310965 | | InnoDB | combined buf0buf.c:916 | os_waits=124407 | +--------+------------------------+------------------+ 65 rows in set (0.88 sec)
[31 Jan 2013 22:23]
Dimitri KRAVTCHUK
Patrick,

It's normal that you get better performance with AHI=on; that's expected. In your case you should also increase your number of BP instances (from 4 to, say, 8 or 16) -- you should then see smaller differences between the BP mutex waits in the innodb mutex status output. For the rest: we're working on it. As I've mentioned, the problem is long-standing and not just related to MySQL 5.6. Sure, it's a pity it was not fixed before, but it seems nobody reported it until now either. But we'll fix it for sure, I don't see any other option :-) So please be patient with us now, as it will not be simple. I'll get back to you here once we have any news.

Rgds, -Dimitri
[31 Jan 2013 23:16]
Patrick Casey
No worries on the timing; we're clearly very interested in seeing this resolved, but we're also engineers, so I don't think any of us have the illusion/expectation that this is going to be either quick or easy. One thing I would offer is that I do have access to a fairly robust set of test equipment, so if there are specific tests/scenarios you want to run, I'd be more than happy to try to either A) run them for you or B) get you access to some lab equipment if you want to try out something different.
[1 Feb 2013 8:07]
Dimitri KRAVTCHUK
Patrick, thanks for the offer! :-) Send me a ping at my @oracle.com email or one of the mail addresses from my site (http://dimitrik.free.fr/blog) - I'll then come back to you if we need access to additional HW.

Rgds, -Dimitri
[1 Feb 2013 17:36]
Patrick Casey
I pinged you on your gmail account. Let me know how we can help.
[7 Jun 2013 15:29]
Dimitri KRAVTCHUK
Hi Patrick,

I don't know if it will be too much to ask, but I need this info for further investigation ;-) (and sorry for the long silence, but the issue is not simple either). What I need from you is to replay the same tests on MySQL 5.6 (get the latest 5.6.12 version now) with:

- performance_schema = OFF
- innodb_buffer_pool_instances = 32
- innodb_spin_wait_delay = 6 (default), 12, 24, 48
- for every test iteration I'll need the "show engine innodb mutex" output

There will be a lot of data, but I think you can easily script it (a sketch follows below); "innodb_spin_wait_delay" is dynamic, so you can change it online via "set global innodb_spin_wait_delay = N". Sorry again ;-)

Rgds, -Dimitri
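
A minimal sketch of such a script (output file names, credentials, and the thread list are assumptions):

#!/bin/bash
# For each spin delay, sweep the thread counts and capture the mutex waits per run.
Q='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
for DELAY in 6 12 24 48; do
    mysql -uroot -e "set global innodb_spin_wait_delay = $DELAY"
    for C in 1 2 4 8 16 24 32 48; do
        mysqlslap -uroot --create-schema=test --no-drop \
            --concurrency=$C --number-of-queries=1000 --iterations=10 \
            --query="$Q" > slap_delay${DELAY}_c${C}.txt
        mysql -uroot -e 'show engine innodb mutex' > mutex_delay${DELAY}_c${C}.txt
    done
done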
[10 Jul 2013 20:02]
Robert Sacca
This is really killing us at GE! How can we get traction on this?

Robert Sacca
Senior Lead, Operational Database Technologies, GE
robert.sacca@ge.com
[15 Jul 2013 20:30]
Tim Yim
Hi Dimitri, I work for Pat Casey and I'll be re-running these tests today/tomorrow. I'll send you the results soon. Tim Yim
[16 Jul 2013 17:22]
Arnaud Adant
Hi Robert,

> This is really killing us at GE!
> How can we get traction on this?

You can contact Oracle Support if you have a valid contract. At least, we can see if you run into this problem and whether there are workarounds.

Hi Tim,

> I work for Pat Casey and I'll be re-running these tests today/tomorrow.
> I'll send you the results soon.

Thanks!

Best regards,
Arnaud Adant
[18 Jul 2013 19:58]
Tim Yim
Hi Dimitri,

I scripted the tests so anyone can run them. The files and full results are attached, including the tuning parameters for the tests. If you have any recommendations for tuning parameters, let me know and I will re-run the tests.

I hope this posts correctly; a table of the results is below (if not, just look for chart.txt in the file attachment). It looks like you've made some progress with higher spin_wait and more threads, but the problem still exists.

            Threads:
Spin:        1       2       4      8      16      24      32      48
6:       19.527  13.941   8.146  9.416  11.790  14.982  14.241  15.867
12:      19.736  14.692   8.277  7.824  10.370  11.063  11.224  10.727
24:      19.832  14.637   8.617  7.304   8.590   9.353   9.882  10.035
48:      18.727  14.456   9.326  7.224   8.486   7.627   8.275   8.366

Tests were run against:
MySQL-client-5.6.12-2.el6.x86_64
MySQL-server-5.6.12-2.el6.x86_64

# Percona Toolkit System Summary Report ######################
        Date | 2013-07-17 18:16:22 UTC (local TZ: PDT -0700)
    Hostname | xxxxxxxxxxxxxxxxxxxxxx
      Uptime | 58 days, 7:15, 1 user, load average: 0.07, 0.10, 2.49
    Platform | Linux
     Release | CentOS release 6.4 (Final)
      Kernel | 2.6.32-358.6.2.el6.x86_64
Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.12
    Compiler | GNU CC version 4.4.6 20120305 (Red Hat 4.4.6-4).
     SELinux | Disabled
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 2, cores = 12, virtual = 24, hyperthreading = yes
      Speeds | 24x2666.671
      Models | 24xIntel(R) Xeon(R) CPU X5650 @ 2.67GHz
      Caches | 24x12288 KB
# Memory #####################################################
       Total | 126.0G
        Free | 104.4G
        Used | physical = 21.6G, swap allocated = 2.0G, swap used = 0.0, virtual = 21.6G
     Buffers | 489.3M
      Caches | 11.4G
       Dirty | 11864 kB
     UsedRSS | 8.1G
  Swappiness | 1
[18 Jul 2013 19:59]
Tim Yim
Files and results from the latest tests.
Attachment: innodb-scale-bug-68079.tar.gz (application/x-gzip, text), 34.34 KiB.
[6 Aug 2013 13:40]
MySQL Verification Team
There are several facets to this bug. Right now, the first of the problems has been identified and a plan for the code changes is drafted. There are a couple of other bottlenecks on which we have to work. The time required for the proper fix of this bug and the amount of changes planned cannot possibly be introduced into a GA version like 5.6. Hence, this bug will be fixed in a version that is still in development. So, this bug will be fixed later, and a target for the fixes is already planned.
[6 Aug 2013 13:45]
Hartmut Holzgraefe
"... can not be possibly introduced into a GA version like 5.6" but the "Target Version" field still says 5.6?
[18 Oct 2013 5:45]
Sunny Bains
Posted by developer:

Can somebody please confirm these results separately? The first run is on mysql-trunk (a.k.a. 5.7.3): it scales up to 32 threads reasonably well; the HW has 48 logical CPUs.

subains@supra01 ~> mysqlslap -uroot --concurrency=N --create-schema=test --no-drop --number-of-queries=100000 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'

mysql-trunk (5.7.3):

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           16.856   15.642   18.066   100000
32            1.173    1.024    1.341   3125
64            2.060    2.041    2.077   1562

If this were scaling linearly, the average at 32 threads should be approximately 0.7s. Compared to the original report, unless I've missed something, this seems to solve their problem.

mysql-5.6 - has problems at 32 threads:

Concurrency  Avg (s)  Min (s)  Max (s)  Queries per client
 1           15.692   15.009   16.533   100000
32            2.446    2.418    2.470   3125
[21 Oct 2013 5:18]
justin beech
As I understand the original issue, mysql ceased to scale at fewer than the number of physical cores for this join. So, for example, unlike other simpler queries which scale to close to 2x the number of cores, going beyond 10 concurrent queries on a machine with 12 cores (or 24 hyperthreaded cores) does not improve throughput. Running 5.7.3 on 48 physical cores might get to 32, but does that mean anything is improved? (Because 48 physical cores >> 24 hyperthreaded cores.) Also, your test run on 5.7.3 -- was it done with no data? iterations = 100,000? Not sure if that means anything, but is running with data any different from running with no data? BTW, I ran the test with 5.7.2 on a 32-core virtual server and throughput stalled after 16 threads.
[21 Oct 2013 6:00]
Sunny Bains
The bottlenecks fixed in 5.6+ are different and not directly related to the underlying problem that affects scalability due to the test in this bug report. This bug has to do with fetching and releasing pages from the buffer pool.
[21 Oct 2013 6:01]
Sunny Bains
Also, my comment from [18 Oct 5:45] can be ignored, I was running the wrong test :-(
[29 Oct 2013 14:55]
MySQL Verification Team
We have performed a very, very detailed analysis of the problem, with many, many measurements and much profiling. Our InnoDB team has also made lots of code changes, but they just shifted the bottleneck from one piece of code to another, without any significant difference. It turns out that the problem is not solvable in the near or even the further future. The only solution is to use a workaround, like query rewriting, use of partitions, rethinking of indexes, and similar.
[30 Oct 2013 2:37]
Sunny Bains
There is some confusion here; there are two issues in this bug report. One is the page mutex and the other is the page re-lock. For this query we acquire it in S mode all the time. If we remove the page mutex overhead, then this S-mode rwlock becomes the next bottleneck. However, this doesn't mean that the original problem is not fixed; it simply means that the fix requires further work that is more involved, that's all. As a reference:

mysqlslap -u root --concurrency=4096 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 29.361 seconds
        Minimum number of seconds to run all queries: 29.312 seconds
        Maximum number of seconds to run all queries: 29.439 seconds
        Number of clients running queries: 4096
        Average number of queries per client: 0

mysqlslap -u root --concurrency=2048 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 14.550 seconds
        Minimum number of seconds to run all queries: 14.311 seconds
        Maximum number of seconds to run all queries: 14.808 seconds
        Number of clients running queries: 2048
        Average number of queries per client: 0

These same numbers can be achieved by using "PARTITION BY KEY() PARTITIONS N". By using partitions you can reduce the contention on the block mutex.
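
For reference, the partitioning workaround would look roughly like this; 8 partitions is an arbitrary example, and KEY() with no column list hashes on the primary key (assumed here to be sys_id on both tables):

alter table task     partition by key() partitions 8;
alter table incident partition by key() partitions 8;

Spreading the hot rows across partitions means threads contend on different index trees and buffer pool blocks instead of the same few.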
[30 Oct 2013 3:38]
Sunny Bains
To demonstrate the difference, this is the same test @2K threads without any changes.

mysqlslap -u root --concurrency=2048 --create-schema=test --no-drop --number-of-queries=1000 --iterations=10 --query="select count(*), category from task_int tt, incident_int ii where tt.sys_id=ii.sys_id group by 2"
Benchmark
        Average number of seconds to run all queries: 105.629 seconds
        Minimum number of seconds to run all queries: 105.494 seconds
        Maximum number of seconds to run all queries: 105.725 seconds
        Number of clients running queries: 2048
        Average number of queries per client: 0
[8 Jan 2014 8:48]
Erlend Dahl
Setting to "verified" since we will discontinue the use of "to be fixed later".
[16 Jan 2014 13:15]
Daniel Price
Fixed as of 5.6.16 and 5.7.4; here's the changelog entry: "Fetching and releasing pages from the buffer pool and tracking the page state is an expensive and complex operation. Prior to the bug fix, this operation was performed using a page mutex. Using a page mutex to track several things is expensive and does not scale well. The bug fix separates fetch and release tracking (in-use state) of a page from page I/O state tracking. Fetch and release is now tracked using atomics where available. For portability, a new "CMAKE" build option, "INNODB_PAGE_ATOMIC_REF_COUNT" (default "ON"), disables atomic page reference counting on platforms where atomics support is not available."
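
As a compile-time option, it is set when configuring the source build; for example (illustrative invocation, adjust the source dir and other cmake options as needed):

cmake . -DINNODB_PAGE_ATOMIC_REF_COUNT=ON    # default: use atomics for page ref counting
cmake . -DINNODB_PAGE_ATOMIC_REF_COUNT=OFF   # fall back to the old mutex-based counting
make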
[16 Jan 2014 15:12]
Daniel Price
The following information has been added to the changelog entry: "When atomic page reference counting is enabled (default), "[Note] InnoDB: Using atomics to ref count buffer pool pages" is printed to the error log at server startup. If atomic page reference counting is disabled, "[Note] InnoDB: Using mutexes to ref count buffer pool pages" is printed instead."
[3 Feb 2014 11:07]
Laurynas Biveinis
5.6$ bzr log -r 5677
------------------------------------------------------------
revno: 5677
committer: Sunny Bains <Sunny.Bains@Oracle.Com>
branch nick: 5.6
timestamp: Tue 2013-12-10 14:30:34 +0530
message:
  Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT

  Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.

  If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.

  Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:31]
Laurynas Biveinis
5.6$ bzr log -r 5684
------------------------------------------------------------
revno: 5684
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2013-12-13 12:52:47 +0900
message:
  The adjustment about UNIV_SUNC_DEBUG is needed for Bug#16249481 fix

  Approved by Sunny Bains (IM)

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:32]
Laurynas Biveinis
5.6$ bzr log -r 5685
------------------------------------------------------------
revno: 5685
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Fri 2013-12-13 13:23:34 +0900
message:
  Fix the possibile rare race condition at Bug#16249481 fix.

  Approved by Sunny Bains (IM)

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:45]
Laurynas Biveinis
5.6$ bzr log -r 5692
------------------------------------------------------------
revno: 5692
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Mon 2013-12-16 21:03:59 +0900
message:
  The fix for Bug#16249481 was not enabled for builds.
  "#cmakedefine INNODB_PAGE_ATOMIC_REF_COUNT" is added to config.h.cmake

  Patch made by Sunny Bains (by mail)
  Approved by Yasufumi Kinoshita

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:45]
Laurynas Biveinis
5.6$ bzr log -r 5694
------------------------------------------------------------
revno: 5694
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-17 20:03:26 +0900
message:
  Follow up for Bug#16249481 fix.
  Atomic operation macro for Soralis and Windows added by Bug#16249481 fix were wrong.
  Aligned same for mysql-trunk.

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[3 Feb 2014 11:46]
Laurynas Biveinis
5.6$ bzr log -r 5695
------------------------------------------------------------
revno: 5695
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-17 20:30:16 +0900
message:
  Follow up for Bug#16249481 fix.
  backport from mysql-trunk

  revno: 7152
  committer: Vasil Dimov <vasil.dimov@oracle.com>
  branch nick: mysql-trunk
  timestamp: Tue 2013-12-17 11:29:47 +0200
  message:
    Followup to sunny.bains@oracle.com-20131210092144-0xfzdmj5vgqlqpq2

    Fix a compilation failure on Win32:
    buf0buf.ic(1016): error C2664: '_InterlockedExchangeAdd' : cannot convert parameter 1 from 'ib_uint32_t *' to 'volatile long *'
    Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast
[3 Feb 2014 11:58]
Laurynas Biveinis
5.6$ bzr log -r 5720
------------------------------------------------------------
revno: 5720
committer: Yasufumi Kinoshita <yasufumi.kinoshita@oracle.com>
branch nick: mysql-5.6
timestamp: Tue 2013-12-31 12:29:00 +0900
message:
  Follow up for Bug#16249481 fix.
  If flush_rbt is active (in recovery phase), buf_page_t::space and buf_page_t::offset cannot be invalidated until the bpage is removed from flush_list.

  * revno: 5677
  * committer: Sunny Bains <Sunny.Bains@Oracle.Com>
  * branch nick: 5.6
  * timestamp: Tue 2013-12-10 14:30:34 +0530
  * message:
  *   Bug#16249481 - INNODB DOES NOT SCALE WELL ON 12 CORE SYSTEM FOR SPECIFIC ALL IN MEMORY SELECT
  *
  *   Add a new build option: INNODB_PAGE_ATOMIC_REF_COUNT, default is ON.
  *
  *   If this option is enabled then use atomic reference counting to track block use. If it is off then use the old way.
  *
  *   Approved by Yasufumi Kinoshita rb#3958.
[28 Feb 2014 6:40]
Laurynas Biveinis
See also bug 71727.