Bug #68413 performance_schema overhead is at least 10%
Submitted: 18 Feb 2013 2:45 Modified: 25 Sep 2013 2:58
Reporter: Mark Callaghan
Status: Verified
Category: MySQL Server: Performance Schema
Severity: S5 (Performance)
Version: 5.6.10, 5.6.13
OS: Any
Assigned to:
CPU Architecture: Any

[18 Feb 2013 2:45] Mark Callaghan
Description:
PS doing nothing beyond default behavior costs about 10% in QPS. Is that expected? That is a lot and definitely more than the overhead for user_stats/table_stats from the FB patch. This was reported in 2011 and the overhead hasn't improved for a pure read-only workload -- http://www.mysqlperformanceblog.com/2011/04/25/performance-schema-overhead

I tested:
* 5.6.10 without PS (-DWITH_PERFSCHEMA_STORAGE_ENGINE=0)
* 5.6.10 with PS, and default options. I didn't add anything to my.cnf or change PS tables

This used the same workload as http://bugs.mysql.com/bug.php?id=66473:
* 8 sysbench processes each using a different table
* 1, 2, 4, 8, 16, 32 clients per process
* 16M rows per table; the database was ~32 GB and cached by InnoDB
* sysbench ran on one host, mysqld on a separate host, each host has 24 cores with HT on
* the workload fetches 1 row by PK
* mysqld linked with jemalloc

concurrent clients                 8      16      32      64     128     256
orig5610 without PS compiled   57223  109233  165677  190622  193110  191969
orig5610 with PS compiled      56161   98147  152856  171499  170448  170035

I don't know what else you need to know about the PS configuration. It was the default config:
mysql> select * from setup_consumers;
+--------------------------------+---------+
| NAME                           | ENABLED |
+--------------------------------+---------+
| events_stages_current          | NO      |
| events_stages_history          | NO      |
| events_stages_history_long     | NO      |
| events_statements_current      | YES     |
| events_statements_history      | NO      |
| events_statements_history_long | NO      |
| events_waits_current           | NO      |
| events_waits_history           | NO      |
| events_waits_history_long      | NO      |
| global_instrumentation         | YES     |
| thread_instrumentation         | YES     |
| statements_digest              | YES     |
+--------------------------------+---------+

How to repeat:
run sysbench

Suggested fix:
reduce the PS overhead
[18 Feb 2013 3:19] Mark Callaghan
Added another result -- the 5.6.10 binary with PS support compiled but disabled in my.cnf via performance_schema=0

concurrent clients                 8      16      32      64     128     256
without PS compiled            57223  109233  165677  190622  193110  191969
with PS compiled               56161   98147  152856  171499  170448  170035
with PS,performance_schema=0   56885  106798  164094  184594  187841  186187
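Reading the three rows as a relative QPS loss against the build without PS makes the trend easier to see. The helper below is a small sketch that recomputes that loss from the figures above (the names qps_loss and print_overhead are illustrative, not from the report). By this arithmetic the default PS build loses about 1.9% at 8 clients and between about 7.7% and 11.7% from 16 clients up, while the performance_schema=0 build loses between about 0.6% and 3.2%.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* QPS figures copied from the 18 Feb 2013 results above. */
static const int clients[] = {8, 16, 32, 64, 128, 256};
static const double qps_no_ps[]  = {57223, 109233, 165677, 190622, 193110, 191969};
static const double qps_ps[]     = {56161,  98147, 152856, 171499, 170448, 170035};
static const double qps_ps_off[] = {56885, 106798, 164094, 184594, 187841, 186187};

#define NPOINTS (sizeof clients / sizeof clients[0])

/* Relative QPS loss of a build versus the baseline: (base - qps) / base. */
static double qps_loss(double base, double qps) {
    assert(base > 0.0);
    return (base - qps) / base;
}

/* Prints the loss of both PS builds against the build without PS. */
static void print_overhead(void) {
    printf("clients  PS default  performance_schema=0\n");
    for (size_t i = 0; i < NPOINTS; i++) {
        printf("%7d  %9.1f%%  %19.1f%%\n", clients[i],
               100.0 * qps_loss(qps_no_ps[i], qps_ps[i]),
               100.0 * qps_loss(qps_no_ps[i], qps_ps_off[i]));
    }
}
```

Calling print_overhead() from a main() prints one line per concurrency level.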
[18 Feb 2013 11:19] Dimitri Kravtchuk
Mark,

is it possible to rerun the same tests with HT off?
(Or just start the MySQL server with "taskset -c 0-23 ..." to limit mysqld threads to only one hardware thread per core.)

From my observations, point-selects depend in a pretty interesting way on the number of "runnable cores".

Rgds,
-Dimitri
[18 Feb 2013 14:53] Mark Callaghan
I don't understand the purpose of the last request. The problem is that the PS has too much overhead. The solution is to disable PS when compiling a MySQL binary.
[18 Feb 2013 15:52] Dimitri Kravtchuk
Mark, it's just to evaluate the HT impact on your server: on mine, I get 250K QPS with HT off, and only 200K with HT on, in a similar 8-table point-select test on 24 and 32 cores.

And it's the only test where I see such a PFS overhead..

The point-select workload is the fastest one, so it is very aggressive in mutex calls (which means the mutex instrumentation overhead becomes very visible here); you may compile MySQL 5.6 binaries without mutex instrumentation to check this on your server.

Rgds,
-Dimitri
[19 Feb 2013 2:42] Mark Callaghan
Dimitri - I think you are halfway to reproducing my results. 5.6 does much worse with HT enabled. Is the same true for 5.1?  AFAIK, 5.1 doesn't suffer much from HT enabled and if you are suggesting that many MySQL deployments now need to take downtime to disable HT then 5.6 will have many unhappy users. But this is speculation that can go away if you test 5.1 with and without HT. I will try to repeat some tests with HT disabled.

Note that I run sysbench clients on a different host than mysqld.

I am not the only one who wants MySQL to do well on point-select high-QPS workloads. Note that a lot of the MySQL 5.6 marketing is about how NoSQL isn't needed because 5.6 provides high QPS.
[19 Feb 2013 7:37] Dimitri Kravtchuk
Mark,

it seems the high-QPS discussion is running in parallel with http://bugs.mysql.com/bug.php?id=66473, so let's come back here once we finish with #66473 :-) because I'm getting 250K QPS on 24 and 32 cores with MySQL 5.6 compiled with PFS.

Rgds,
-Dimitri
[19 Feb 2013 14:23] Mark Callaghan
Dimitri - Please change your message from "you get 250k QPS" to "you get 250k QPS without HT and 200k QPS with HT". I think you mentioned that previously, and if that is the case then you have probably reproduced this problem already.
[20 Feb 2013 5:00] Yasufumi Kinoshita
If HT has a negative impact, the problem is likely in the spin-wait of mysqld's mutexes/locks.

InnoDB already uses asm("pause") while spin-waiting on its mutex/rw_lock. On a CPU with HT it is important not to waste CPU time spinning, because the spinning thread competes for the same core as its sibling hardware thread.
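To illustrate what a "pause" in a spin-wait loop looks like, here is a minimal test-and-test-and-set spinlock in C11. It is a generic sketch, not InnoDB's mutex (which spins for a bounded number of rounds before blocking); the demo_spin_* names are hypothetical, and the _mm_pause() intrinsic is only used on x86, with a no-op fallback elsewhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
/* PAUSE hints to the core that this is a spin-wait loop. */
#define CPU_RELAX() _mm_pause()
#else
#define CPU_RELAX() ((void)0) /* no-op fallback on other architectures */
#endif

typedef struct {
    atomic_bool locked;
} demo_spinlock;

static void demo_spin_init(demo_spinlock *l) {
    atomic_init(&l->locked, false);
}

/* Returns true if the lock was acquired. exchange returns the previous
   value, so "false" means nobody held it before us. */
static bool demo_spin_trylock(demo_spinlock *l) {
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

static void demo_spin_lock(demo_spinlock *l) {
    while (!demo_spin_trylock(l)) {
        /* Spin on a plain load (no writes), pausing on every round. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            CPU_RELAX();
    }
}

static void demo_spin_unlock(demo_spinlock *l) {
    /* Unlocking a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed));
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Spinning on a relaxed load keeps the waiter from repeatedly writing the lock's cache line, and the pause on each round leaves more of the core's execution resources to the sibling hardware thread on HT machines.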
[20 Feb 2013 11:07] Dimitri Kravtchuk
Mark,

no, I'm *reaching* 250K QPS with HT enabled, but then QPS decreases,
and I'm *exceeding* 250K QPS with HT disabled and keep QPS above 250K up to 1024 concurrent users ;-)

anyway, my only point was that it's possible to get 250K QPS on point-selects with MySQL server still compiled with PFS, that's all ;-)

Also, from what I have seen so far, it's the only RO test that is sensitive to this kind of thing (and on your 12 cores you'll never see it; you need at least 24 cores).

Rgds,
-Dimitri
[20 Feb 2013 11:12] Dimitri Kravtchuk
Yasufumi,

yes, spin delay plays a big role here, but it all depends on event timing
(I'll analyze it in more detail later).

Rgds,
-Dimitri
[21 Feb 2013 22:30] Mark Callaghan
Results for sysbench with a read-only & cached workload & fast storage. The overhead in this case is about 7% -- http://mysqlha.blogspot.com/2013/02/mysql-56-is-much-faster-on-io-bound.html
[19 Apr 2013 12:42] Yasufumi Kinoshita
I am still worried about the negative effect of "spin without pause".

If the server is built with WITH_FAST_MUTEXES=ON (the default is OFF now), it is linked with the my_pthread_fastmutex_*() functions,
and my_pthread_fastmutex_lock() uses mutex_delay() for its spin-wait.
mutex_delay() doesn't execute asm("pause") the way InnoDB does.

(If WITH_FAST_MUTEXES=OFF (the default), pthread_mutex_lock() seems to be used directly.)

So users who see a negative effect from HT should confirm that the binary was built without the WITH_FAST_MUTEXES option.
[19 Apr 2013 12:57] Mark Callaghan
I too am afraid of spin without pause. Can that be added to official MySQL? I repeated some tests using a binary that did not use fast-mutexes, but the results I report here used fast-mutexes. I did not see a difference. My server had HT enabled. But again, I did not repeat all tests to confirm this.
[22 Apr 2013 2:12] Yasufumi Kinoshita
Mark,

The following patch might avoid the HT-specific scalability problem.
Please report it as a bug if it is effective.

=== modified file 'mysys/thr_mutex.c'
--- mysys/thr_mutex.c   2011-09-07 10:08:09 +0000
+++ mysys/thr_mutex.c   2013-04-22 02:04:34 +0000
@@ -426,7 +426,20 @@
   j = 0;

   for (i = 0; i < delayloops * 50; i++)
+  {
     j += i;
+#if defined(HAVE_PAUSE_INSTRUCTION)
+# ifdef __SUNPRO_CC
+    asm ("pause" );
+# else
+    __asm__ __volatile__ ("pause");
+# endif /* __SUNPRO_CC */
+#elif defined(HAVE_FAKE_PAUSE_INSTRUCTION)
+    __asm__ __volatile__ ("rep; nop");
+#elif defined(MSVC)
+    YieldProcessor();
+#endif
+  }

   return(j);
 }
[22 Apr 2013 12:51] Yasufumi Kinoshita
I was wrong: WITH_FAST_MUTEXES=ON is the default on Linux.
So the patch above may be worth trying in most cases where HT has a negative effect on Linux.
[25 Apr 2013 14:07] Marc Alff
To proceed further with this bug report, more technical data needs to be
collected, which will help analysis.

Changing this report to "Need Feedback"; please rerun the benchmark per the
instructions below.

-- Marc Alff.

-------------------------------------------------------------------

1) Changes to my.cnf

Add the following options to the my.cnf configuration file.

# Disable instrumentation that is not needed
performance_schema_consumer_events_statements_current=OFF
performance_schema_consumer_statements_digest=OFF

# Size the server instrumentation according to the expected workload.
# Good for 1024 user connections:
performance_schema_max_thread_instances=2000
# Good for 1024 connections all opening a few tables at
# the same time in the server:
performance_schema_max_table_handles=10000

2) Run the benchmark

Collect the output of the following SQL statements.

select version();
show global variables like "performance_schema%";
select * from performance_schema.performance_timers;
select * from performance_schema.setup_actors;
select * from performance_schema.setup_objects;
select * from performance_schema.setup_timers;
select * from performance_schema.setup_instruments where enabled='YES';
select * from performance_schema.setup_consumers where enabled='YES';

At the end of the workload, collect the output of the following SQL statements.

show engine performance_schema status;
show global status like "performance_schema%";

Please document the benchmark results, and provide the output collected.

3) Repeat with a custom build

In cmake, use WITH_PERFSCHEMA_STORAGE_ENGINE=ON

Define the following flags:

#define DISABLE_PSI_MUTEX
#define DISABLE_PSI_RWLOCK
#define DISABLE_PSI_COND
#define DISABLE_PSI_SOCKET
#define DISABLE_PSI_STAGE
#define DISABLE_PSI_IDLE
#define DISABLE_PSI_STATEMENT_DIGEST

(See include/mysql/psi/psi.h for details)

In other words, the server will be compiled only with the following
instrumentation:
- File IO
- Table IO
- Threads
- Statements
which are the only parts needed to produce stats comparable to "userstat".

These flags can be added to CMAKE_CXX_FLAGS / CMAKE_C_FLAGS with cmake,
or (which is what I do for simplicity) added in psi.h directly.
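The gap between the "without PS compiled" and "performance_schema=0" rows earlier in this report is consistent with a common instrumentation pattern: a runtime switch still leaves a check on every instrumented call, while a compile-time define removes the hook entirely. The sketch below illustrates that general pattern only; DEMO_DISABLE_INSTR, DEMO_LOCK_HOOK and the other demo_* names are hypothetical and are not the actual macros in include/mysql/psi/psi.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime switch, standing in for a setting such as
   performance_schema=0/1. */
static bool demo_instr_enabled = false;
static unsigned long demo_instr_events = 0;

/* Only reached when instrumentation is switched on. */
static void demo_instr_record_lock(void) {
    assert(demo_instr_enabled);
    demo_instr_events++;
}

#ifdef DEMO_DISABLE_INSTR
/* Compiled out: the hook expands to nothing, so no load, branch or call. */
#define DEMO_LOCK_HOOK() ((void)0)
#else
/* Compiled in: every lock pays at least a load and a branch on the flag,
   even when instrumentation is switched off at runtime. */
#define DEMO_LOCK_HOOK()              \
    do {                              \
        if (demo_instr_enabled)       \
            demo_instr_record_lock(); \
    } while (0)
#endif

static unsigned long demo_locks_taken = 0;

/* Stand-in for an instrumented mutex acquisition. */
static void demo_lock(void) {
    DEMO_LOCK_HOOK();
    demo_locks_taken++; /* the real lock acquisition would go here */
}
```

A single branch is cheap, but on a point-select workload that takes many mutexes per query, per-call costs of this kind add up; defining the DISABLE_PSI_* flags is what removes them at build time.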

Please run your benchmark again with this build,
and document the results.

4) Provide additional platform details.

Assuming Linux, I need the exact, untruncated content of:
- uname -a
- cat /proc/cpuinfo
- gcc --version
- ldd mysqld

From the ldd output, and for each library XXX listed,
provide the output of 'file XXX'.
Each time a symbolic link is found, follow it and repeat until links land on a
real binary.

Example below, expanded for libc.so.6.

malff@linux-3ezv:mysql-5.6> ldd sql/mysqld
        linux-vdso.so.1 (0x00007fff4d9f7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f6131dfd000)
        libaio.so.1 => /lib64/libaio.so.1 (0x00007f6131bfb000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f61319f3000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00007f61317b8000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f61315b4000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f61312ad000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f6130fb6000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f6130da0000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f61309fb000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f6132019000)

malff@linux-3ezv:mysql-5.6> file /lib64/libc.so.6
/lib64/libc.so.6: symbolic link to `libc-2.15.so'

malff@linux-3ezv:mysql-5.6> file /lib64/libc-2.15.so
/lib64/libc-2.15.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked (uses shared libs),
BuildID[sha1]=0xcf045b7b199fa07daa920a7cdfad3f5cbd0f6c12, for GNU/Linux
2.6.32, not stripped
[25 Apr 2013 17:21] Mark Callaghan
That is too much work for me. My problem is solved by building without the PFS.
[25 Apr 2013 17:40] Marc Alff
Closing this bug as no feedback then.
[29 Jun 2013 20:58] Valeriy Kravchuk
I wonder why this bug is still in "No feedback" status, and here is why. Originally it was about P_S overhead. Does anybody doubt that this overhead is high (when P_S is compiled in and all default P_S-related settings are used) for SELECT ... FROM t WHERE PK=value queries, whatever the concurrency?

Personally I have no doubts: I easily got up to 18% overhead for this kind of query on MySQL 5.6.12 compared to P_S set to OFF, and a bit more compared to P_S disabled at compile time, even with a single thread on a slow VM.

If there are no doubts, maybe it's time to set this report to "Verified" (and provide results of testing on different kinds of hardware and/or with different numbers of clients, to see whether it is really 10%, or more, or less, depending on hardware etc.)? I've heard (during PLMCE 2013 and later) that work is in progress internally to understand and reduce the overhead, but then isn't it time to set the bug status accordingly, or was that just a rumor?

If there are doubts that original test case really shows overhead then surely users should add more. But then "Open" or "Need feedback" is a better status.

Is the only way to make this bug "Verified" to provide all the data and run the test as described in recent comments? Then maybe somebody from Oracle MySQL Support should try to do this, as the original bug reporter is not interested in spending more time on it.
[30 Jun 2013 4:55] Shane Bester
Related bug fixed in 5.6.12, but I have done no benchmarks for it:

Bug 16633515 - PERFORMANCE OVERHEAD IN PERFORMANCE_SCHEMA.THREADS.PROCESSLIST_STATE
[1 Jul 2013 10:37] Valeriy Kravchuk
This is what I see with more complex sysbench test run locally:

MySQL 5.6.12 --no-defaults:

[openxs@chief sysbench]$ sysbench --test=./sysbench/tests/db/oltp.lua --db-driver=mysql --mysql-engine-trx=yes --mysql-table-engine=innodb --mysql-socket=/tmp/mysql.sock --mysql-user=root --oltp-table-size=30000 --num-threads=32 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=3000 --max-requests=100000 run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 32
Random number generator seed is 0 and will be ignored

Threads started!

OLTP test statistics:
    queries performed:
        read:                            1412628
        write:                           403605
        other:                           201803
        total:                           2018036
    transactions:                        100901 (312.37 per sec.)
    deadlocks:                           1      (0.00 per sec.)
    read/write requests:                 1816233 (5622.66 per sec.)
    other operations:                    201803 (624.74 per sec.)

General statistics:
    total time:                          323.0200s
    total number of events:              100901
    total time taken by event execution: 10335.2633s
    response time:
         min:                                 36.36ms
         avg:                                102.43ms
         max:                                756.70ms
         approx.  95 percentile:             217.13ms

Threads fairness:
    events (avg/stddev):           3153.1562/9.20
    execution time (avg/stddev):   322.9770/0.03

MySQL 5.6.12 --no-defaults with P_S disabled at compile time:

[openxs@chief sysbench]$ sysbench --test=./sysbench/tests/db/oltp.lua --db-driver=mysql --mysql-engine-trx=yes --mysql-table-engine=innodb --mysql-socket=/tmp/mysql.sock --mysql-user=root --oltp-table-size=30000 --num-threads=32 --init-rng=on --max-requests=0 --oltp-auto-inc=off --max-time=3000 --max-requests=100000 run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 32
Random number generator seed is 0 and will be ignored

Threads started!

OLTP test statistics:
    queries performed:
        read:                            1413664
        write:                           403897
        other:                           201949
        total:                           2019510
    transactions:                        100973 (347.69 per sec.)
    deadlocks:                           3      (0.01 per sec.)
    read/write requests:                 1817561 (6258.59 per sec.)
    other operations:                    201949 (695.39 per sec.)

General statistics:
    total time:                          290.4106s
    total number of events:              100973
    total time taken by event execution: 9291.2103s
    response time:
         min:                                 33.14ms
         avg:                                 92.02ms
         max:                                964.68ms
         approx.  95 percentile:             148.73ms

Threads fairness:
    events (avg/stddev):           3155.4062/7.57
    execution time (avg/stddev):   290.3503/0.02

Simple arithmetic shows an 11% difference:

mysql> select 347.69/312.37;
+---------------+
| 347.69/312.37 |
+---------------+
|      1.113071 |
+---------------+
1 row in set (0.06 sec)

mysql> select 6258.59/5622.66;
+-----------------+
| 6258.59/5622.66 |
+-----------------+
|        1.113101 |
+-----------------+
1 row in set (0.00 sec)
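The 1.113 ratio is a speedup: the build without P_S completes about 11.3% more transactions per second. Measured the other way, as the share of throughput lost by the P_S build, the same numbers give about 10.2%, because the difference is divided by the larger value. Both are valid; they just use different denominators. A small sketch with illustrative helper names:

```c
#include <assert.h>

/* Ratio of baseline throughput to instrumented throughput (> 1 when the
   baseline is faster). This is what 347.69/312.37 computes above. */
static double speedup(double base_tps, double instr_tps) {
    assert(instr_tps > 0.0);
    return base_tps / instr_tps;
}

/* Fraction of baseline throughput lost by the instrumented build. */
static double slowdown(double base_tps, double instr_tps) {
    assert(base_tps > 0.0);
    return (base_tps - instr_tps) / base_tps;
}
```

For the transaction rates above, speedup(347.69, 312.37) is about 1.1131 and slowdown(347.69, 312.37) is about 0.1016.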
[1 Jul 2013 10:39] Valeriy Kravchuk
Some details about the box:

[openxs@chief sysbench]$ uname -a
Linux chief 2.6.35.14-106.fc14.x86_64 #1 SMP Wed Nov 23 13:07:52 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
[openxs@chief sysbench]$ ldd /home/openxs/dbs/5.6/bin/mysqld
        linux-vdso.so.1 =>  (0x00007fffbc7ff000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003ffac00000)
        libaio.so.1 => /lib64/libaio.so.1 (0x00000035b9000000)
        librt.so.1 => /lib64/librt.so.1 (0x0000003ffb400000)
        libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000003007000000)
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003ffb000000)
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003006000000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003ffb800000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ffc000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003ffa800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003ffa400000)
        libfreebl3.so => /lib64/libfreebl3.so (0x0000003006400000)
[openxs@chief sysbench]$ file /lib64/libc.so.6
/lib64/libc.so.6: symbolic link to `libc-2.13.so'
[openxs@chief sysbench]$ file /lib64/libc-2.13.so
/lib64/libc-2.13.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not stripped
[openxs@chief sysbench]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8300  @ 2.50GHz
stepping        : 10
cpu MHz         : 2003.000
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 5000.60
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8300  @ 2.50GHz
stepping        : 10
cpu MHz         : 2003.000
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 4999.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8300  @ 2.50GHz
stepping        : 10
cpu MHz         : 2003.000
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 4999.96
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Quad CPU    Q8300  @ 2.50GHz
stepping        : 10
cpu MHz         : 2003.000
cache size      : 2048 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips        : 4999.98
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

And here is how the binaries were compiled:

...
cd ../../bzr2/mysql-5.6
bzr pull
rm CMakeCache.txt
cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/5.6nops -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DWITH_PERFSCHEMA_STORAGE_ENGINE=0
time make -j 4
make install && make clean
rm CMakeCache.txt
cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/5.6 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF
time make -j 4
make install && make clean
...
[3 Jul 2013 22:11] Sveta Smirnova
Valeriy,

thank you for the feedback. But an enabled Performance Schema is expected to cause slowdowns. What is a bug here is a noticeable slowdown with the Performance Schema disabled, and I could see it on my test machine. Results to follow.
[3 Jul 2013 22:20] Sveta Smirnova
Results on my test machine:

events/sec by number of threads:

threads                                 1            2            4            8           16           32
compiled without P_S           67300.0000  274066.5000  552191.7500  279652.3750  138750.5000   67171.7188
compiled with P_S (defaults)   57558.0000  128095.5000  409787.7500  228752.2500  111974.1250   53722.3750
P_S compiled, but disabled     56159.0000  240625.0000  549090.2500  278385.3750  137816.4375   66896.8125

Machine has 4 cores.

So we can see a 10% slowdown with 1 and 2 threads, and a 1.5 - 2% slowdown with 4 threads and more.

I compiled with options:

cmake . -DCMAKE_INSTALL_PREFIX=/home/ssmirnov/blade12/build/mysql-5.6 -DBUILD_CONFIG=mysql_release -DMYSQL_UNIX_ADDR=/data/56orig/data/mysql.sock -DWITH_EMBEDDED_SERVER=0 -DWITH_PERFSCHEMA_STORAGE_ENGINE=0 -DIGNORE_AIO_CHECK=1

cmake . -DCMAKE_INSTALL_PREFIX=/home/ssmirnov/blade12/build/mysql-5.6 -DBUILD_CONFIG=mysql_release -DMYSQL_UNIX_ADDR=/data/56orig/data/mysql.sock -DWITH_EMBEDDED_SERVER=0 -DWITH_PERFSCHEMA_STORAGE_ENGINE=1 -DIGNORE_AIO_CHECK=1

mysqld was started from the MTR test suite:

perl ./mtr --start innodb
perl ./mtr --start --mysqld=--performance_schema=0 innodb

Sysbench was used as Mark did in the linked bug report:

$cat ~/blade12/src/bugs/bug68413/sysbench.sh 
#!/bin/bash

table_num=8

for nthreads in 1 2 4 8 16 32
do
./0.4-dev/sysbench/sysbench --batch --batch-delay=10 --test=oltp  --mysql-db=test --oltp-table-size=16000000 --max-time=180 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-table-name=sbtest${table_num} --oltp-read-only --oltp-skip-trx --oltp-test-mode=simple --oltp-point-select-all-cols --oltp-dist-type=uniform --oltp-range-size=100 --oltp-connect-delay=0 --percentile=99 --num-threads=$nthreads --seed-rng=1 --mysql-user=root --mysql-port=13000 --mysql-host=127.0.0.1 run >>result.txt
printf '\n\n\n===\n\n\n' >>result.txt
done

But I used a single sysbench instance on the same machine where the server runs, and the results are still repeatable!
[4 Jul 2013 7:04] Marc Alff
Sveta,

"
mysqld started from MTR test suite:

perl ./mtr --start innodb
perl ./mtr --start --mysqld=--performance_schema=0 innodb
"

When mysql is started **from the MTR test suite**,
the performance schema is **not** used in the default configuration (as defined by default configuration values shipped in the binary), far from it.

The performance schema (and the server in general) is using the configuration designed to run the MTR test suite, which:
- enables all possible instrumentation in setup_instruments
- enables all possible consumers in setup_consumers
- provides low sizing parameter values to cause stress
all this to enforce maximum test coverage.
That is expected, to make the tests relevant.

If you want to benchmark the server, you need to perform a clean install and start the server normally, not reuse the MTR test environment.
This is also true for any benchmark, not just the performance schema.

See in particular the file:
mysql-test/include/default_mysqld.cnf

The previous benchmark is invalid.
[4 Jul 2013 13:40] Sveta Smirnova
Marc,

I'm not claiming that my results with `./mtr --start innodb` are something to care about; I added them only for the record. What I think is relevant are the results with --performance_schema=0, when P_S is disabled. The content of mysql-test/include/default_mysqld.cnf does not matter in this case. Before running the test I connected to the server and made sure that P_S was really disabled.
[15 Jul 2013 12:12] Valeriy Kravchuk
Useful related post, and some results for similar tests in other environments:

http://dimitrik.free.fr/blog/archives/2013/07/mysql-performance-why-performance-schema-ove...
[31 Jul 2013 22:05] Yoshinori Matsunobu
Here is another benchmark in which performance schema overhead is 33%:
http://yoshinorimatsunobu.blogspot.com/2013/08/another-reason-to-disable-performance.html
[25 Sep 2013 2:58] Mark Callaghan
The focus was on single-threaded sysbench but there are some results for the overhead of the PS here - http://mysqlha.blogspot.com/2013/09/mysql-572-single-threaded-performance.html