Bug #51285: performance regression after MySQL 5.1.38
Submitted: 18 Feb 2010 15:04    Modified: 17 Mar 2010 18:53
Reporter: Mark Callaghan
Status: Can't repeat
Category: MySQL Server          Severity: S2 (Serious)
Version: 5.1.39                 OS: Any
Assigned to:                    CPU Architecture: Any
Tags: performance, regression, sysbench

[18 Feb 2010 15:04] Mark Callaghan
Description:
There is a performance regression that appears after MySQL 5.1.38. I found it while running more tests for http://bugs.mysql.com/bug.php?id=51249. The performance loss occurs with both MyISAM and InnoDB on 8-core servers running sysbench fetch-by-primary-key tests with many concurrent clients.

See http://bugs.mysql.com/bug.php?id=51249 for details on my.cnf, the server and the sysbench command lines. The server is an 8-core Nehalem with hyperthreading turned off.

My results for 8 cores show a performance regression between 5.1.38 and 5.1.40 at high levels of concurrency. I report results for sysbench with 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 threads. The binaries are:
my5138 - MySQL 5.1.38 (InnoDB plugin 1.0.4)
my5140 - MySQL 5.1.40 (InnoDB plugin 1.0.4)
my5141 - MySQL 5.1.41 (InnoDB plugin 1.0.5)

InnoDB plugin: the regression after 5.1.38 is obvious at 128 threads
  1     2     4     8      16     32     64    128   256   512   1024
8835 17482 38935 85122  84690   84111  82990  81326 75501 45811 21732 my5138
8879 18405 40494 85424  83698   81055  78629  52533 32070 19957 12040 my5140
8534 18023 38781 85857  83245   81657  78754  52102 32207 19919 12431 my5141

MyISAM: the regression after 5.1.38 is obvious at 128 threads
  1     2     4     8      16     32     64    128   256   512   1024
8991 16794 39066 80860  81091   81668  80831  78566 72943 35295 20908 my5138
9287 18211 39800 82929  79917   78941  76866  52276 32674 20039 11963 my5140
8426 17116 39631 83269  80263   78765  77597  57462 32483 19887 12060 my5141

How to repeat:
More details on the test are in http://bugs.mysql.com/bug.php?id=51249.

Suggested fix:
Don't know yet.
[18 Feb 2010 16:09] Mark Callaghan
I used poor man's profiler (pmp) to aggregate stack traces during a test to determine where threads are running or waiting. I don't see a big difference between 5.1.38 and 5.1.40. For both, most of the waiting occurs while opening and closing tables, so I will guess that the regression has something to do with LOCK_open.

A possibly interesting result is the ratio of threads waiting in open_and_lock_tables_derived versus close_thread_tables:

* for 5.1.40 the ratio is 2:1
* for 5.1.38 the ratio is 1:2

Results are: count, thread_stack

top thread stacks by frequency for 5.1.40
    195 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
    120 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone

top thread stacks by frequency for 5.1.38
    209 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
    107 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone

all thread stacks for open/close tables from 5.1.40
    195 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
    120 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
      4 __lll_mutex_lock_wait,_L_lock_14295,free,free_root,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 hashcmp,my_hash_next,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 handler::ha_reset,close_thread_table,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone

all thread stacks for open/close tables from 5.1.38
    209 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
    107 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      2 __lll_mutex_lock_wait,_L_lock_14295,free,free_root,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 __pthread_mutex_unlock_usercnt,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 lock_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 _int_malloc,malloc,my_malloc,init_alloc_root,init_sql_alloc,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 "\200]\t\020",,hashcmp,my_hash_next,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
[18 Feb 2010 17:17] Mark Callaghan
More data from poor man's profiler, this time with more samples. There are many more threads waiting on LOCK_open for 5.1.40 than for 5.1.38. What changed? I compared sql/sql_table.cc and not much has changed. The hash table open_cache uses the same charset.

Note that for 5.1.40 the top two stacks have 252+236 occurrences and for 5.1.38 they have 229+117 occurrences.

Aggregated thread stacks from 5 samples taken 10 seconds apart for 5.1.40
    252 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
    236 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      4 __lll_mutex_lock_wait,_L_lock_14295,free,free_root,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      2 open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,thr_lock,thr_multi_lock,mysql_lock_tables,lock_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 _int_free,free,free_root,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 ha_innodb::store_lock,get_lock_data,mysql_lock_tables,lock_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 "\220�\234\016",,hashcmp,my_hash_next,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone

Aggregated thread stacks from 5 samples taken 10 seconds apart for 5.1.38
    229 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,close_thread_tables,dispatch_command,do_command,handle_one_connection,start_thread,clone
    117 __lll_mutex_lock_wait,_L_mutex_lock_1133,pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      3 __lll_mutex_lock_wait,_L_lock_14295,free,free_root,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 pthread_mutex_lock,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 optimized,hashcmp,my_hash_next,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 malloc_consolidate,_int_malloc,malloc,my_malloc,init_alloc_root,init_sql_alloc,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 hashcmp,my_hash_first,open_table,open_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 ??,??,gettimeofday,my_micro_time,mysql_lock_tables,lock_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
      1 check_trx_exists,ha_innobase::store_lock,get_lock_data,mysql_lock_tables,lock_tables,open_and_lock_tables_derived,execute_sqlcom_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,handle_one_connection,start_thread,clone
[18 Feb 2010 18:13] Mark Callaghan
I have an odd thread stack from 5.1.40 and the InnoDB plugin: innobase_commit() is called from JOIN::optimize. The code in ha_innobase::external_lock has a comment that doesn't match what is done here. We had a discussion on internals@lists not too long ago about the obscurity of the current protocol for detecting end-of-statement. Clearly, this is not the end of a statement.

Extra calls to trx_commit are horrible for InnoDB as they lock kernel_mutex, and kernel_mutex is already way too hot.
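To make the cost concrete, here is a toy model (my own sketch, not InnoDB source; toy_trx_commit is a made-up name): every commit funnels through one global mutex, so each redundant commit is one more pass through a lock that all threads share.

/* Toy model only -- not InnoDB source. kernel_mutex here stands in for
   the real global mutex that trx_commit takes inside InnoDB. */
#include <pthread.h>

static pthread_mutex_t kernel_mutex = PTHREAD_MUTEX_INITIALIZER;

static void toy_trx_commit(void)
{
        pthread_mutex_lock(&kernel_mutex);   /* global serialization point */
        /* ... release locks, advance transaction system state ... */
        pthread_mutex_unlock(&kernel_mutex);
}

/* With N connections, one redundant toy_trx_commit() per statement is one
   extra pass through the same mutex for every one of the N threads. */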

from ha_innobase::external_lock

        /* If the MySQL lock count drops to zero we know that the current SQL
        statement has ended */

        if (trx->n_mysql_tables_in_use == 0) {

                trx->mysql_n_tables_locked = 0;
                prebuilt->used_in_HANDLER = FALSE;

                if (!thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN)) {
                        if (trx->active_trans != 0) {
                                innobase_commit(ht, thd, TRUE);
                        }

Thread 110 (Thread 1309841728 (LWP 11823)):
#0  0x0000003151c0a496 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002aaaab437ec5 in os_event_wait_low (event=0xa1f9d70, reset_sig_count=<value optimized out>) at os/os0sync.c:396
#2  0x00002aaaab47e660 in sync_array_wait_event (arr=0xa1f91e0, index=5) at sync/sync0arr.c:461
#3  0x00002aaaab47f795 in mutex_spin_wait (mutex=0xa1f9d20, file_name=0x2aaaab4b70c4 "trx/trx0trx.c", line=1562) at sync/sync0sync.c:597
#4  0x00002aaaab491106 in trx_commit_for_mysql (trx=0xf174bd0) at ./include/sync0sync.ic:221
#5  0x00002aaaab411d7a in innobase_commit (hton=<value optimized out>, thd=0xeb28c30, all=true) at handler/ha_innodb.cc:2469
#6  0x00002aaaab41584c in ha_innodb::external_lock (this=0x2aab3805c440, thd=0xeb28c30, lock_type=2) at handler/ha_innodb.cc:7920
#7  0x000000000065663c in handler::ha_external_lock (this=0xa1f9dac, thd=0x0, lock_type=42916) at handler.cc:4612
#8  0x000000000057a312 in unlock_external (thd=0xeb28c30, table=<value optimized out>, count=1) at lock.cc:786
#9  0x000000000057a49f in mysql_unlock_tables (thd=0xeb28c30, sql_lock=0x2aab3c004760) at lock.cc:391
#10 0x000000000057a4ea in mysql_unlock_some_tables (thd=0xeb28c30, table=<value optimized out>, count=<value optimized out>) at lock.cc:408
#11 0x00000000005f1187 in JOIN::optimize (this=0x2aab3c01d230) at sql_select.cc:990
#12 0x00000000005f8a8d in mysql_select (thd=0xeb28c30, rref_pointer_array=0xeb2ac20, tables=0xeb345c0, wild_num=0, fields=@0xeb2ab58, conds=0xeb34a78, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2147764736, result=0xeb34c68, unit=0xeb2a628, select_lex=0xeb2aa50) at sql_select.cc:2380
#13 0x00000000005f9427 in handle_select (thd=0xeb28c30, lex=0xeb2a588, result=0xeb34c68, setup_tables_done_option=0) at sql_select.cc:268
[19 Feb 2010 18:23] Alexey Stroganov
Mark, I've run tests on my box (4x2 Intel E5345 @ 2.3GHz) but I don't see the same pattern as in your results. There is some variance around 1024 threads but no drop after 128 threads.

I will rerun the test on 16- and 24-core boxes and will update the bug report with results.

Mark, what binaries did you use for your testing, and how were they built?

#
# Test: POINT_SELECT_ALL_COLUMNS:throughput, TPS
#
# Server 1 - 5.1.38
# Server 2 - 5.1.39
# Server 3 - 5.1.40
# Server 4 - 5.1.41
#
#             Server 1  Server 2  Server 3  Server 4
# Thread        MYISAM    MYISAM    MYISAM    MYISAM
          1   11948.72  11868.56  14799.11  13883.07
          2   25921.76  21581.55  17198.84  28954.01
          4   50690.86  53535.69  39724.03  39001.71
          8   83575.34  83342.21  85081.69  82064.69
         16   80843.22  81240.42  80724.08  81718.55
         32   78039.32  78919.29  79145.31  78654.31
         64   84530.62  82007.11  80995.82  86933.44
        128   85424.93  86250.29  87743.51  86116.62
        256   83867.76  83381.16  85116.40  85915.14
        512   74068.01  71641.60  70509.61  70481.52
       1024   56154.09  65792.35  58510.49  64096.31
#
#             Server 1  Server 2  Server 3  Server 4
# Thread        INNODB    INNODB    INNODB    INNODB
          1     596.19   3965.83   1113.30    784.42
          2   13558.74  12395.01   9551.63  10951.00
          4   32857.94  25267.17  36983.73  43598.07
          8   73366.12  74272.95  75874.34  78999.87
         16   79011.90  79635.48  83797.54  81176.23
         32   79260.00  81626.37  80443.58  80148.26
         64   81436.09  83974.67  82489.75  84962.97
        128   84657.08  86606.88  84226.19  84321.79
        256   80321.24  80861.04  80714.52  81664.59
        512   65189.88  66225.77  67183.30  69624.20
       1024   52572.11  51163.17  51284.12  55172.14
[19 Feb 2010 18:53] Mark Callaghan
I rebuilt and then retested 5.1.38. Interesting results:
* this regression may have been fixed after 5.1.42
* MyISAM is getting progressively slower from 5.1.38 to 5.1.44 at low concurrency

configure command line:
./configure --enable-thread-safe-client --prefix=$b --exec-prefix=$b --with-plugins=csv,blackhole,myisam,heap,innodb_plugin --without-plugin-innobase --without-fast-mutexes --with-unix-socket-path=$b/var/mysql.sock --with-extra-charsets=all C_EXTRA_FLAGS="-g -O2 -fno-omit-frame-pointer"

I use gcc 4.1.2.

Results are TPS from sysbench for 1 to 1024 concurrent threads. The sysbench client and mysqld run on the same host.

MyISAM on 8-core Nehalem (L5520 @ 2.27GHz)
  1      2     4     8    16    32     64  128    256  512   1024 
10101 18452 41772 84840 82536 81078 79353 51093 32531 20233 11985 -- 5.1.38
9976  18237 41707 85008 82282 80366 78908 51571 32721 20275 12272 -- 5.1.40
9409  18941 40796 84854 82595 80995 78993 52073 32574 20167 12024 -- 5.1.41
8562  16498 39998 83907 83547 83442 83001 80164 81565 32641 48719 -- 5.1.44
[19 Feb 2010 18:55] Mark Callaghan
I also use CentOS 5.2. What OS are you using?
[19 Feb 2010 20:39] Alexey Stroganov
Mark,

The binaries I tested were built with gcc 4.3.2, -O3, and --with-fast-mutexes. The tests were run on SuSE 10.
[19 Feb 2010 20:44] Mark Callaghan
I will repeat a few tests after building with --with-fast-mutexes.
Can you do the same after building without it?
It will take me a day to get results.

I doubt that CentOS vs SuSE has anything to do with it.

I think --with-fast-mutexes needs some work. Davi has more details on that. For example, when it is enabled there is no reason to use the fast mutex initializer, as that enables a pthread mutex kind that has its own busy wait loop. Since --with-fast-mutexes already provides a busy wait loop, I doubt the extra one in the pthreads code is a good thing.
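To illustrate, here is a sketch under my assumption (true for the 5.1 tree on Linux glibc, as far as I know) that MY_MUTEX_INIT_FAST selects the PTHREAD_MUTEX_ADAPTIVE_NP mutex kind:

/* Sketch, assuming Linux glibc. The adaptive kind busy-waits briefly
   before sleeping; the default kind sleeps immediately on contention. */
#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t plain_mutex;     /* default kind */
pthread_mutex_t adaptive_mutex;  /* what MY_MUTEX_INIT_FAST gives you */

void init_both(void)
{
        pthread_mutexattr_t attr;

        pthread_mutex_init(&plain_mutex, NULL);

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&adaptive_mutex, &attr);
        pthread_mutexattr_destroy(&attr);
}

/* If the --with-fast-mutexes wrapper already spins in its own loop before
   calling pthread_mutex_lock(), layering it over the adaptive kind means
   spinning twice per contended acquisition. */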
[22 Feb 2010 20:27] Mark Callaghan
I can repeat your results when I build with --with-fast-mutexes; performance is much better with it. Also, I initially claimed in this bug report that there was a regression after 5.1.38, and I cannot repeat that anymore. 5.1.38 is no better than 5.1.40.

In general, 5.1.44 is faster than 5.1.41 at high levels of concurrency.

I think this should be closed. However, I also think that the fast mutex code in thr_mutex.c should not use the 'fast' pthread mutex initializer, as that creates a pthread mutex on Linux that does busy waiting. As the fast mutex code in thr_mutex.c already does that, doing it recursively seems like a bad idea.

binaries:
my5138 - 5.1.38 -O2, -fno-omit-frame-pointer --without-fast-mutexes
my5140 - 5.1.40 -O2, -fno-omit-frame-pointer --without-fast-mutexes
my5141 - 5.1.41 -O2, -fno-omit-frame-pointer --without-fast-mutexes
my5144 - 5.1.44 -O2, -fno-omit-frame-pointer --without-fast-mutexes
my5141wfm - 5.1.41 -O2, -fno-omit-frame-pointer --with-fast-mutexes
my5144wfm - 5.1.44 -O2, -fno-omit-frame-pointer --with-fast-mutexes
my5144opt - 5.1.44 --without-fast-mutexes and uses -O3 by default
my5144wfmopt - 5.1.44 --with-fast-mutexes and uses -O3 by default

Note that for InnoDB, plugin versions 1.0.4, 1.0.5 and 1.0.6 are used.

Results for sysbench readonly fetch by primary key using InnoDB
   1     2     4     8    16    32    64   128   256   512   1024
14100 22319 44794 91443 85632 76836 75243 65416 45812 27146 18475  my5138
13647 23312 44747 85269 85989 81128 75820 62555 46835 28263 18537  my5140
13485 22477 44789 90090 86393 84847 83432 82294 73145 67839 58514  my5141wfm
14009 22682 42102 90478 83948 80668 70553 64446 45552 28230 18229  my5141
12708 24316 46979 90713 89708 87916 87264 86176 75585 69821 59190  my5144wfmopt
14225 23229 45840 90263 86204 84475 82851 81335 75971 66438 63548  my5144wfm
14159 24764 49456 91863 84016 85423 82782 76924 65421 52325 41844  my5144opt
13274 22348 45066 90896 86878 84299 82709 76027 66084 53133 30536  my5144

Results for sysbench readonly fetch by primary key using MyISAM
   1     2     4     8    16    32    64   128   256   512   1024
12630 21826 43186 79955 74217 66743 67702 60953 48784 27125 18574  my5138
13644 22262 42461 79350 71717 67172 66578 59776 44568 28390 18526  my5140
13102 23677 47337 81197 78554 78423 78196 77730 72356 66279 70692  my5141wfm
13569 23542 42107 79825 69132 69302 66142 59934 46317 28135 18416  my5141
12788 23813 44850 83077 80241 79627 79775 79667 70691 71240 59467  my5144wfmopt
12504 22295 47100 83471 78383 77677 77101 79023 74091 61096 53057  my5144wfm
14176 24422 45648 79199 75807 76956 76226 71605 62156 68335 72267  my5144opt
12494 21871 44976 77089 75643 76539 76030 72195 61402 44265 27470  my5144
[17 Mar 2010 18:53] Sveta Smirnova
Thank you for the feedback.

> I think this should be closed.

Closing as "Can't repeat" since we could not repeat the described problem on our side.

> However, I also think that the fast mutex code in
> thr_mutex.c should not use the 'fast' pthread mutex initializer as that will create a
> pthread mutex on Linux that does busy waiting. As the fast mutex code in thr_mutex
> already does that, doing it recursively seems like a bad idea.

I think this should be a separate bug report.
[18 Mar 2010 7:13] Sveta Smirnova
Regarding MY_MUTEX_INIT_FAST, I tried to modify thr_mutex.c:

$bzr diff
=== modified file 'mysys/thr_mutex.c'
--- mysys/thr_mutex.c   2008-10-15 22:21:00 +0000
+++ mysys/thr_mutex.c   2010-03-17 19:17:46 +0000
@@ -47,7 +47,7 @@
 
 void safe_mutex_global_init(void)
 {
-  pthread_mutex_init(&THR_LOCK_mutex,MY_MUTEX_INIT_FAST);
+  pthread_mutex_init(&THR_LOCK_mutex,NULL);
 }

Then I ran the following sysbench test:

sysbench --test=oltp --oltp-table-size=2000000 --max-time=60 --max-requests=0 --mysql-table-engine=innodb --db-ps-mode=disable --mysql-engine-trx=yes --oltp-read-only --oltp-skip-trx --mysql-socket=/tmp/mysql_ssmirnova.sock --mysql-user=root --num-threads=X run

Results show that removing MY_MUTEX_INIT_FAST does not improve performance:

Threads                        1     2     4     8    16
---------------------------------------------------------
with MY_MUTEX_INIT_FAST:    8821  8045  5197  1365   707
without MY_MUTEX_INIT_FAST: 8749  6655  5189  1359   572