Bug #57489 performance regression in 5.5.6-rc for IO bound sysbench
Submitted: 15 Oct 2010 18:05 Modified: 26 Dec 2010 18:46
Reporter: Mark Callaghan Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: InnoDB Plugin storage engine Severity:S5 (Performance)
Version:5.5.6-rc OS:Any
Assigned to: Inaam Rana
Tags: innodb, performance, regression

[15 Oct 2010 18:05] Mark Callaghan
Description:
Using my sysbench branch (https://code.launchpad.net/~mdcallag/sysbench/0.4-dev) to do 50% fetch by primary key and 50% update by primary key on an IO bound database, performance for 5.1.50 is much better than 5.5.6-rc

Results below are transactions/second where a transaction is one SELECT by primary key and one UPDATE by primary key. Results are reported for 8, 16, 32, 64, 128, 256, 512 and 1024 concurrent clients.

This uses ~120GB database file and 4GB InnoDB buffer pool. The test server can do ~1500 random disk reads per second and has 8 physical cores and 16 logical cores with HT enabled. HT is enabled.

Results with innodb_buffer_pool_instances=4
310     454     477     483     476     483     487     493               556rc

Results with innodb_buffer_pool_instances=1 or not set for 5.1.50
363     449     555     640     655     670     664     675            5150orig
303     443     459     471     468     471     481     473               556rc

Why isn't 556 faster?

How to repeat:
Use enhanced sysbench from https://code.launchpad.net/~mdcallag/sysbench/0.4-dev

8 sysbench tables are created with 72m rows per table:

for i in $( seq 1 8 ); do
sysbench --batch --batch-delay=60 --test=oltp  --oltp-table-size=72000000 \
 --max-time=1200 --max-requests=0 --mysql-table-engine=innodb --mysql-db=test \
--db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${i} \
--oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols \
--oltp-simple-update --oltp-dist-type=uniform --oltp-range-size=1 \
--num-threads=1 --seed-rng=1 prepare
done

my.cnf for 5.1.50:
plugin-load=innodb=ha_innodb_plugin.so;innodb_trx=ha_innodb_plugin.so;innodb_locks=ha_innodb_plugin.so;innodb_lock_waits=ha_innodb_plugin.so;innodb_cmp=ha_innodb_plugin.so;innodb_cmp_reset=ha_innodb_plugin.so;innodb_cmpmem=ha_innodb_plugin.so;innodb_cmpmem_reset=ha_innodb_plugin.so

innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
innodb_io_capacity=1000
innodb_thread_concurrency=64
max_connections=2000
table_cache=2000
key_buffer_size=2000M

my.cnf for 5.5.6-rc
innodb_buffer_pool_size=16G
innodb_log_file_size=1900M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT
innodb_thread_concurrency=0
innodb_max_dirty_pages_pct=80
innodb_file_format=barracuda
innodb_file_per_table
innodb_io_capacity=1000
innodb_thread_concurrency=64
innodb_buffer_pool_instances=XXXXX use 1 or 4 here XXXXX
max_connections=2000
table_cache=2000
key_buffer_size=2000M

I then run 8 concurrent instances of sysbench for each of the tables. My script is more complicate than what follows, but it should show the general approach. First run 8 instances using 1 thread each, then 8 instances using 2 threads each ...

for nt in 1 2 4 8 16 32 64 128 ; do

for x in $( seq 1 8 ); do

(sysbench --batch --batch-delay=60 --test=oltp --mysql-db=test --oltp-table-size=72000000 \
--max-time=1200 --max-requests=0 --mysql-table-engine=innodb \
--db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${x} \
 --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols \
--oltp-simple-update --oltp-dist-type=uniform --oltp-range-size=1 \
--num-threads=${nt} --seed-rng=${x} run & )

done

done

Suggested fix:
Make 556 faster :)
[15 Oct 2010 18:17] Mark Callaghan
WIth this test, the first run of sysbench with 8 total threads and 1 thread per might have skewed results as the server is restarted prior to it. The tests that follow are started immediately one after another without restarts of mysqld.
[15 Oct 2010 18:24] Mark Callaghan
Sample vmstat output from the highest concurrency test (128 threads per sysbench instance, 1024 total threads). Sorry that the formatting for this is not preserved

Note that 5.1.50 is doing IO faster per the bi and bo columns: ~20,000 versus ~15,000

5.5.6-rc

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

                                                                                           bi         bo       in      cs        us sy id wa st
 3 63    212 58393104 1367716 7838392    0    0 14318 13932 2367 12726  3  1 64 31  0
 0 66    212 58408920 1367716 7838408    0    0 17453 14446 2549 15041  2  1 57 39  0
 0 65    212 58413156 1367716 7838444    0    0 15510 15376 2498 13726  2  1 62 35  0
 2 64    212 58450244 1367716 7838508    0    0 13931 14569 2435 13084  4  1 63 32  0

from 5.1.50

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

                                                                                            bi        bo       in        cs      us sy id wa st
 0 64    212 58423536 1368152 7869328    0    0 20982 21532 3040 18522  4  2 51 44  0
 1 63    212 58428276 1368152 7869348    0    0 19448 19953 2853 17252  2  1 55 41  0
 0  1    212 58428892 1368152 7869368    0    0 19650 20252 2881 17397  2  1 54 42  0
 1 64    212 58428976 1368152 7869380    0    0 20733 21491 2974 18263  3  1 53 43  0
[15 Oct 2010 18:29] Mark Callaghan
Data from iostat, once again I predict formatting will be lost. This is from the highest concurrency test. Again, 5.1.50 is able to do IO at a higher rate which explains higher performance.

What limits performance for 5.5.6-rc? 
* mutex contention?
* something else?

Last two samples from 5.5.6-rc
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00    16.70  903.50  544.60 28912.00 29902.90    40.62    32.87   22.68   0.68  98.96
cciss/c0d0        0.00    13.20  863.50  547.30 27632.00 29857.90    40.75    32.66   23.19   0.70  98.88

Last two samples from 5.1.50
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00     6.40 1365.20  787.50 43686.40 44665.40    41.04    39.91   18.54   0.46  99.99
cciss/c0d0        0.00    18.38 1359.04  780.92 43489.31 44602.90    41.17    40.12   18.77   0.47  99.90
[15 Oct 2010 18:38] Mark Callaghan
Performance per 60 seconds. This is the performance per interval. Normal sysbench has an option to report cumulative performance per N seconds. I changed that to report the performance per interval per N seconds.

Note that TPS for 5.5.6 quickly drops to 65 but is steady at 85 for 5.1.50

for 5.5.6-rc

[1285921215] min: 0.1746  avg: 1.2644  max: 3.1057  percentile: 2.2962  throughput: 99.6500  requests: 5979  interval_secs: 60 sum_secs: 7560  threads: 128
[1285921275] min: 0.1746  avg: 1.8964  max: 3.1309  percentile: 1.3903  throughput: 67.0167  requests: 4021  interval_secs: 60 sum_secs: 7625  threads: 128
[1285921335] min: 0.1746  avg: 1.9073  max: 3.4103  percentile: 1.3341  throughput: 67.4667  requests: 4048  interval_secs: 60 sum_secs: 7721  threads: 128
[1285921395] min: 0.1746  avg: 1.9746  max: 3.4547  percentile: 1.2901  throughput: 64.7667  requests: 3886  interval_secs: 60 sum_secs: 7673  threads: 128
[1285921455] min: 0.1746  avg: 1.6937  max: 3.4547  percentile: 1.2169  throughput: 76.0333  requests: 4562  interval_secs: 60 sum_secs: 7727  threads: 128
[1285921515] min: 0.1746  avg: 2.2509  max: 3.9495  percentile: 1.0601  throughput: 56.7333  requests: 3404  interval_secs: 60 sum_secs: 7662  threads: 128
[1285921575] min: 0.1746  avg: 1.9835  max: 3.9495  percentile: 1.1035  throughput: 64.5167  requests: 3871  interval_secs: 60 sum_secs: 7678  threads: 128
[1285921635] min: 0.1746  avg: 2.0271  max: 3.9495  percentile: 1.0923  throughput: 63.2000  requests: 3792  interval_secs: 60 sum_secs: 7687  threads: 128
[1285921695] min: 0.1746  avg: 1.9205  max: 3.9495  percentile: 1.1118  throughput: 66.6167  requests: 3997  interval_secs: 60 sum_secs: 7676  threads: 128
[1285921755] min: 0.1746  avg: 1.8471  max: 3.9495  percentile: 1.1201  throughput: 69.3333  requests: 4160  interval_secs: 60 sum_secs: 7684  threads: 128
[1285921815] min: 0.1746  avg: 1.8634  max: 3.9495  percentile: 1.1148  throughput: 68.9667  requests: 4138  interval_secs: 60 sum_secs: 7711  threads: 128
[1285921875] min: 0.1746  avg: 1.8908  max: 3.9827  percentile: 1.0913  throughput: 66.8167  requests: 4009  interval_secs: 60 sum_secs: 7580  threads: 128
[1285921935] min: 0.1746  avg: 2.0538  max: 3.9827  percentile: 1.0703  throughput: 62.8333  requests: 3770  interval_secs: 60 sum_secs: 7743  threads: 128
[1285921995] min: 0.1746  avg: 1.9612  max: 3.9827  percentile: 1.0786  throughput: 64.6167  requests: 3877  interval_secs: 60 sum_secs: 7604  threads: 128
[1285922055] min: 0.1746  avg: 1.8996  max: 3.9827  percentile: 1.0910  throughput: 68.1000  requests: 4086  interval_secs: 60 sum_secs: 7762  threads: 128
[1285922115] min: 0.1746  avg: 1.9592  max: 3.9827  percentile: 1.0767  throughput: 65.5000  requests: 3930  interval_secs: 60 sum_secs: 7699  threads: 128
[1285922175] min: 0.1746  avg: 1.9960  max: 3.9827  percentile: 1.0716  throughput: 64.1333  requests: 3848  interval_secs: 60 sum_secs: 7681  threads: 128
[1285922235] min: 0.1746  avg: 1.9835  max: 3.9827  percentile: 1.0722  throughput: 64.5000  requests: 3870  interval_secs: 60 sum_secs: 7676  threads: 128
[1285922295] min: 0.1746  avg: 1.8981  max: 3.9827  percentile: 1.0838  throughput: 67.2500  requests: 4035  interval_secs: 60 sum_secs: 7659  threads: 128

for 5.1.50

[1285960689] min: 0.7676  avg: 1.4803  max: 3.6626  percentile: 2.8836  throughput: 85.6333  requests: 5138  interval_secs: 60 sum_secs: 7606  threads: 128
[1285960749] min: 0.7676  avg: 1.5318  max: 3.6626  percentile: 1.2213  throughput: 83.0500  requests: 4983  interval_secs: 60 sum_secs: 7633  threads: 128
[1285960809] min: 0.7676  avg: 1.5136  max: 3.6626  percentile: 1.0923  throughput: 85.1333  requests: 5108  interval_secs: 60 sum_secs: 7731  threads: 128
[1285960869] min: 0.7676  avg: 1.5704  max: 3.8074  percentile: 1.0421  throughput: 80.8500  requests: 4851  interval_secs: 60 sum_secs: 7618  threads: 128
[1285960929] min: 0.7676  avg: 1.5745  max: 4.0609  percentile: 1.0166  throughput: 81.8000  requests: 4908  interval_secs: 60 sum_secs: 7728  threads: 128
[1285960989] min: 0.7676  avg: 1.5282  max: 4.0609  percentile: 0.9997  throughput: 83.8167  requests: 5029  interval_secs: 60 sum_secs: 7685  threads: 128
[1285961049] min: 0.7676  avg: 1.5584  max: 4.0901  percentile: 0.9845  throughput: 82.3833  requests: 4943  interval_secs: 60 sum_secs: 7703  threads: 128
[1285961109] min: 0.7676  avg: 1.5292  max: 4.0901  percentile: 0.9731  throughput: 83.6500  requests: 5019  interval_secs: 60 sum_secs: 7675  threads: 128
[1285961169] min: 0.7676  avg: 1.5641  max: 4.0901  percentile: 0.9644  throughput: 81.3833  requests: 4883  interval_secs: 60 sum_secs: 7637  threads: 128
[1285961229] min: 0.7676  avg: 1.5612  max: 4.5812  percentile: 0.9584  throughput: 82.2500  requests: 4935  interval_secs: 60 sum_secs: 7704  threads: 128
[1285961289] min: 0.7676  avg: 1.5422  max: 4.5812  percentile: 0.9529  throughput: 83.0667  requests: 4984  interval_secs: 60 sum_secs: 7686  threads: 128
[1285961349] min: 0.7676  avg: 1.5420  max: 4.5812  percentile: 0.9461  throughput: 83.1500  requests: 4989  interval_secs: 60 sum_secs: 7693  threads: 128
[1285961409] min: 0.7676  avg: 1.5553  max: 4.5812  percentile: 0.9436  throughput: 82.4833  requests: 4949  interval_secs: 60 sum_secs: 7697  threads: 128
[1285961469] min: 0.7676  avg: 1.5648  max: 4.5812  percentile: 0.9374  throughput: 80.3000  requests: 4818  interval_secs: 60 sum_secs: 7539  threads: 128
[1285961529] min: 0.7676  avg: 1.5588  max: 4.5812  percentile: 0.9346  throughput: 82.9500  requests: 4977  interval_secs: 60 sum_secs: 7758  threads: 128
[1285961589] min: 0.7676  avg: 1.6296  max: 4.5812  percentile: 0.9301  throughput: 79.2333  requests: 4754  interval_secs: 60 sum_secs: 7747  threads: 128
[1285961649] min: 0.7676  avg: 1.5128  max: 4.5812  percentile: 0.9295  throughput: 82.9000  requests: 4974  interval_secs: 60 sum_secs: 7525  threads: 128
[1285961709] min: 0.7676  avg: 1.5867  max: 4.5812  percentile: 0.9270  throughput: 81.5333  requests: 4892  interval_secs: 60 sum_secs: 7762  threads: 128
[1285961769] min: 0.7676  avg: 1.5422  max: 4.5812  percentile: 0.9270  throughput: 83.7833  requests: 5027  interval_secs: 60 sum_secs: 7752  threads: 128
[15 Oct 2010 18:45] Mark Callaghan
This is not a problem for a read-only workload:

These are QPS for sysbench configured to fetch by primary key for 8, 16, 32, ..., 1024 concurrent clients using the same database

961     1373    1825    2316    2336    2345    2337    2373           5150orig
962     1375    1825    2314    2332    2348    2354    2373              556rc
[15 Oct 2010 19:30] Davi Arnaut
There were a few performance related regressions in 5.5.6 due to MDL, which are now hopefully fixed. See Bug#56715, Bug#56405, Bug#56585. Would be interesting to repeat this test with 5.5.7.
[15 Oct 2010 22:38] Mark Callaghan
I don't think those have any impact on this. One is a windows bug with regression on a read-only benchmark. I use Linux and have no regression on read-only loads.

The other two are more interesting, but I am not doing anything that should require an exclusive MDL lock.
[15 Oct 2010 22:40] Mark Callaghan
I think this is a regression in InnoDB. Maybe I will repeat tests with MyISAM, but concurrent read-write tests on IO-bound MyISAM frequently shows you that MyISAM and a RW-lock per table isn't good (maybe MongoDB users will notice that one day).
[19 Oct 2010 21:32] Inaam Rana
Mark,

Wondering if you have 'show engine innodb status' and 'show engine innodb mutex' listings for both runs, hopefully multiple of them at regular intervals.
[19 Oct 2010 21:46] Mark Callaghan
I don't have that yet. I think this may be caused by support issue 51606 and I am repeating tests to attempt to reproduce that issue. Until today I was not recording the size of the ibdata1 file during my tests so I don't know when things when bad.

Note for those who can't read this the ibdata1 file gets very large and most of the growth is from pages used for the undo log.

For the results previously reported on this bug, the ibdata1 file was eventually ~190GB for 5.5.6-rc. I don't know when/how/why it got that big so all results for 5.5.6-rc are suspect.

Repeating tests with reasonably sized ibdata1 files (~24M versus 190G), performance looks OK so far:

Throughput in TPS for 8 -> 1024 concurrent clients

Using innodb_io_capacity=1000 and innodb_thread_concurrency=64
    8      16      32     64   128   256   512   1024   concurrent clients
513   621   781   916   912   911   914   915     5.1.50 
386   680   769   911   909   914   927   938     5.5.6-rc
[20 Oct 2010 18:57] Mikhail Izioumtchenko
bug#56933 may be related, I'm not confident it went to the 5.5.6-rc.
[20 Oct 2010 19:03] Calvin Sun
The fix of bug#56933 is in 5.5.7-rc.
[7 Dec 2010 18:21] Mark Callaghan
Can we put this on hold until I repeat my tests after getting the fix for http://bugs.mysql.com/bug.php?id=56933
[23 Dec 2010 18:29] Mark Callaghan
I cannot reproduce this with 5.5.8, my current results are...

483     615     781     781     784     784     784     785            5147orig
508     629     799     802     801     802     800     802        5152facebook
487     607     768     772     770     769     770     769             558orig
[26 Dec 2010 18:46] Valerii Kravchuk
Not repeatable with 5.5.8