Bug #30738 Concurrent disk sorts from InnoDB tables are much slower in MySQL 5
Submitted: 31 Aug 2007 3:36  Modified: 19 Nov 2011 9:50
Reporter: Mark Callaghan
Status: No Feedback
Category: MySQL Server: InnoDB storage engine  Severity: S5 (Performance)
Version: 5.0.37  OS: Any
Assigned to: Assigned Account  CPU Architecture: Any
Tags: concurrent, disk, innodb, regression, slow, sort

[31 Aug 2007 3:36] Mark Callaghan
Description:
Concurrent queries that join InnoDB tables and require disk sorts are much slower in MySQL5 than in MySQL 4.0.26. The slowdown occurs once the temp table used for an ORDER BY spills to disk. At that point, CPU utilization on a 4-core server drops from 85% to 25%. There appears to be a serial bottleneck. If the query result set is made small enough that the temp table does not spill to disk, the problem does not occur and MySQL5 is much faster than MySQL4 (probably because of the fix for bug 15815).
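
A quick way to confirm that the temp tables are spilling is to compare the tmp table counters before and after a run (a rough check, assuming no other sessions are skewing the global counters):

show global status like 'Created_tmp%tables';
-- Created_tmp_disk_tables increments each time an in-memory temp table is converted to an on-disk (MyISAM) table
-- Created_tmp_tables counts all internal temp tables
-- note: the GLOBAL/SESSION modifiers exist only in 5.0; on 4.0 use plain SHOW STATUS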

How to repeat:
The following SQL reproduces the problem on a server with 4 CPU cores and 12 disks using software RAID 0, and these settings:
innodb_thread_concurrency=8
innodb_thread_sleep_delay=50000
innodb_buffer_pool_size=5000M

drop table if exists t2048;
create table t2048 (i int not null auto_increment primary key, j int default 0, c char(50) default "abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde") type=innodb;
insert into t2048 values (), (), (), (), (), (), (), ();
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
insert into t2048 (i) select null from t2048;
update t2048 set j = i;
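
Each of the eight "insert into t2048 (i) select null from t2048" statements doubles the row count, so after the initial insert of 8 rows the table should end up with 8 * 2^8 = 2048 rows (hence the name). A quick sanity check before launching the concurrent clients:

select count(*) from t2048;
-- expect 2048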

for i in $( seq 0 39 ); do ( time mysql -uroot -pymI#2? -A scratch < t1s.sql > /dev/null & ); done

Suggested fix:
I wish I knew. Has code changed between MySQL4 and MySQL5 with respect to calling ha_release_temporary_latches? I searched and did not see any changes, but if changes were made to when an InnoDB session gives up its concurrency tickets, that could make MySQL5 much slower for this case.

Does an InnoDB session give up its tickets when results are written to a temp table?

Are there synchronization bottlenecks while writing to on-disk (MyISAM) temp tables?
[31 Aug 2007 3:38] Mark Callaghan
vmstat output for MySQL 4.0.26. Note that CPU utilization remains at 80%. I cut off the output at the point where the on-disk temp table is full and the sort starts.

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 9  0     56  47160  97520 14167996    0    0   497  1501   68    25 43  8 46  3
 5  0     56  45048  74804 14114144    0    0   174    16 1032 13045 84 11  4  0
 9  0     56  47440  69016 14110616    0    0    92   628 1045 14410 86  9  4  0
 9  3     56  45304  55888 14108036    0    0   110 188246 2018 14476 82 12  5  1
 7  4     56  45040  52832 14106808    0    0    82 198704 1887 12767 81 12  2  5
 8  2     56  43928  50800 14111356    0    0    30 160129 1975 14323 82 11  3  4
 8  1     56  46496  47868 14104768    0    0    86 209886 2213 16641 82 12  4  2
 4  1     56  42032  33940 14121212    0    0    76 203277 2153 18678 81 12  4  3
 6  2     56  45952  28588 14122144    0    0   149 135128 1974 19209 81 12  4  4
 6  2     56  41520  24656 14129680    0    0    88 146142 1933 17340 80 12  4  4
14  1     56  45656  21004 14130476    0    0   100 154403 1935 13303 82 11  3  4
 8  0     56  42176  21572 14134260    0    0    84 49999 1508 13391 85 10  4  1
 9  0     56  42824  22200 14132884    0    0    82 159710 2125 17685 82 12  5  1
 8  0     56  42880  22676 14132476    0    0   121 160562 2085 15818 82 12  5  1
 5  2     56  44872  22092 14130544    0    0   112 182780 2202 14865 81 12  4  4
16  2     56  39000  22048 14135824    0    0   186 153669 2017 14228 80 11  3  6
 8  2     56  41664  21756 14134416    0    0   296 141492 2006 16417 82 11  2  4
 7  1     56  41616  21816 14134628    0    0   160 150432 2024 15184 82 11  5  2
 4  1     56  45456  21748 14130616    0    0    84 163190 2082 14450 83 11  5  1
 3  1     56  45392  22024 14130136    0    0   114 154883 2002 15815 80 11  5  4
 7  3     56  46984  22564 14127964    0    0   134 145262 1960 16131 80 11  3  5
10  0     56  41816  22676 14131116    0    0    49 194023 2165 15921 81 12  3  3
 8  1     56  47408  23024 14126280    0    0    73 175178 2107 13564 82 12  4  2
12  1     56  46632  22876 14125816    0    0   143 196120 2065 13970 82 12  4  3
10  1     56  42768  22832 14130756    0    0   137 155012 2080 14929 82 12  4  2
[31 Aug 2007 3:39] Mark Callaghan
vmstat output for MySQL 5.0.37. Note that CPU utilization quickly falls off.

vmstat 10 100
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  2     56 14207884  48436 1224388    0    0   182  1264  107    44 17  1 79  3
10  1     56 13639396  52024 1177960    0    0  5608  1462 1674  6518 24  6 50 20
10  5     56 9704476  55816 5022356    0    0  1343 104943 5092  2181 83 16  0  0
11  6     56 5873204  59484 8768208    0    0   264 171564 2470  2203 83 16  0  1
14  3     56 1984940  63188 12562576    0    0   224 95187 2525  3110 84 15  0  0
19  5     56  47396  35604 14478452    0    0   345 94204 2496  3673 82 17  0  0
 8  9     56  35396  37768 14487984    0    0   291 91910 2635  3708 79 21  0  1
 0 14     56  36228   6024 14514764    0    0   436 105907 2690  6018 30 14  0 56
 0 14     56  45788   6744 14504796    0    0   554 110822 2698  7916 21  9  0 70
 3 13     56  35372   7560 14509488    0    0   619 128301 2467  6258 24  9  1 66
13  2     56  45788   8824 14495780    0    0   351 202230 2269  4301 35 13  0 52
17  6     56  38108  10084 14504244    0    0   362 188190 2233  2419 35 13  1 51
 0 16     60  35420  11420 14510248    0    0   698 162536 2356  3342 31 12  0 56
 7 12     60  35716  12980 14508008    0    0   572 110507 2554  5145 24 10  0 65
 1 14     60  35292  14280 14502968    0    0   528 105590 2504  4805 22  9  0 68
 2 13     60  35228  15484 14498364    0    0   955 110892 2552  4757 20  8  0 71
 0 13     60  35420  16644 14491696    0    0   869 97495 2375  4620 19  8  1 73
 9  6     60  43868  17860 14477696    0    0   452 115229 2612  5205 23  9  1 66
18  7     60  45860  10264 14479308    0    0   388 113643 2690  3681 60 17  1 22
 0 17     60  35316   8252 14459156    0    0  3402 133678 2710  2617 33 11  0 56
 0 18     60  35452   8864 14429916    0    0 36264 130934 2679  3868 19  9  0 72
[31 Aug 2007 3:45] Mark Callaghan
We don't normally run 40 concurrent queries on a 4-core server with innodb_thread_concurrency=8. But query pileups do happen because of query plans that go bad, usually because of changes in the stats sampled by InnoDB.

The problem is that the performance of MySQL5 is much worse than that of MySQL4 in this case. The problem with my dataset, which I cannot share, is much worse than with the testcase above: CPU utilization drops to 15% and IOs/second drop to 50 while 30+ queries are active in MySQL5, whereas MySQL4 does not slow down.

And when I change the testcase to keep all temp tables in memory, the problem is reversed -- MySQL5 is several times faster (the sync array changes in InnoDB might explain this).

I have used OProfile and Google Perftools to get profiles, but a profile shows where time is spent, not where threads are blocking. I need to know where threads are blocking.

InnoDB has some code to report on wait events on InnoDB mutexes. Does MySQL have anything similar for any blocking that might occur while writing to an on-disk temp table?
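
For reference, the statements used later in this report to look at the InnoDB side can be run from a separate session while the test is in progress (SHOW MUTEX STATUS may be specific to our patched builds; I am not sure whether stock 5.0.37 accepts it):

show innodb status\G
show mutex status;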
[31 Aug 2007 4:04] Mark Callaghan
The testcase might not work for you. For now I will be happy if my questions get answered, as I will try to debug this myself.
[31 Aug 2007 4:12] Mark Callaghan
On my test server, 20 concurrent queries will reproduce the problem. The server has 4 CPU cores and 12 disks.
[31 Aug 2007 4:16] Valeriy Kravchuk
Mark,

Please, send/upload the content of t1s.sql.
[31 Aug 2007 4:19] Mark Callaghan
My test servers use Linux 2.6 with NPTL although MySQL 4.0.26 is also faster than MySQL5 when it uses LinuxThreads.
[31 Aug 2007 14:18] Valeriy Kravchuk
Mark, please send the exact statements you used in your t1s.sql script. Also send the results of

show global variables like 'char%';

from your 5.0.37 testing environment.
[31 Aug 2007 14:30] Mark Callaghan
The contents of t1s.sql:
select straight_join *
from t2048 a, t2048 b , t2048 c, t2048 d, t2048 e, t2048 f, t2048 g, t2048 h, t2048 i
where a.j < 6 and b.j < 6 and c.j < 6 and d.j < 6 and e.j < 6 and
f.j < 6 and g.j < 6 and h.j < 6 and i.j < 6 and
length(a.c) = 50 and length(b.c) = 50 and length(c.c) = 50 and length(d.c) = 50 and
length(e.c) = 50 and length(f.c) = 50 and length(g.c) = 50 and length(h.c) = 50 and length(i.c) = 50
order by a.c, a.i limit 1;
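
Running EXPLAIN on the same statement shows whether the plan really includes the temporary table and filesort that this report depends on (what the Extra column shows here is my assumption; I have not confirmed it on a stock build):

explain select straight_join *
from t2048 a, t2048 b, t2048 c, t2048 d, t2048 e, t2048 f, t2048 g, t2048 h, t2048 i
where a.j < 6 and b.j < 6 and c.j < 6 and d.j < 6 and e.j < 6 and
f.j < 6 and g.j < 6 and h.j < 6 and i.j < 6 and
length(a.c) = 50 and length(b.c) = 50 and length(c.c) = 50 and length(d.c) = 50 and
length(e.c) = 50 and length(f.c) = 50 and length(g.c) = 50 and length(h.c) = 50 and length(i.c) = 50
order by a.c, a.i limit 1;
-- look for 'Using temporary; Using filesort' in the Extra column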
[31 Aug 2007 14:33] Mark Callaghan
For MySQL 4.0.26 (mysql -e 'show global variables like "char%"'):

character_set:  latin1
character_sets: latin1 big5 czech euc_kr gb2312 gbk latin1_de sjis tis620 ujis dec8 dos german1 hp8 koi8_ru latin2 swe7 usa7 cp1251 danish hebrew win1251 estonia hungarian koi8_ukr win1251ukr greek win1250 croat cp1257 latin5

For MySQL 5.0.37:
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
[31 Aug 2007 16:55] Heikki Tuuri
Mark,

what does SHOW INNODB STATUS\G print during the slow operation?

The temp tables are of MyISAM type. Something may also have changed in MySQL/MyISAM.

Regards,

Heikki
[31 Aug 2007 17:25] Mark Callaghan
I am running my original testcase again as the one I posted is flaky.

I temporarily turned off mutex stats to see if the extra cache activity from updating shared variables was part of the problem. From memory, there was more mutex activity per second on MySQL4 (more usage, more OS waits, more rounds), which is reasonable because queries were running faster on MySQL4.

Thread states are much more likely to be *waiting in InnoDB queue* for MySQL5 and *sleeping before joining InnoDB queue* for MySQL4.

This is vmstat output for MySQL5 when the slowdown occurs. Note that the CPU is pinned for some time (while queries are writing to in-memory temp tables) and the drop in CPU utilization occurs when queries begin to spill to disk.

Is ha_release_temporary_latches called?
* when an in-memory temp table is spilled to disk (this can run long)
* when rows are written to a disk-based temp table

 9  0     56 11914172  75324 2155484    0    0   134   174 1120  1482 99  0  1  0
 6  0     56 11911804  75348 2156548    0    0   108    74 1043  1375 99  0  1  0
10  0     56 11909740  75372 2157612    0    0    89    63 1058  1505 99  0  1  0
12  0     56 11905644  75404 2159824    0    0   163   185 1136  1591 99  0  1  0
 9  0     56 11903052  75432 2161428    0    0   114    81 1161  1685 99  0  1  0
11  0     56 11901004  75460 2162488    0    0   105    72 1041  1372 99  0  1  0
 9  1     56 11898324  75520 2164060    0    0   148   170 1051  1405 99  0  1  0
14  0     56 11887540  75568 2165644    0    0   109   377 1052  1446 98  1  1  0
 9  0     56 11886764  75600 2167516    0    0    88   376 1050  1448 99  0  1  0
23  1     56 11881740  75668 2169012    0    0   143   175 1097  1536 99  0  1  0
 0  0     56 11847804  75816 2172060    0    0   268    84 1147 11069 81  2 17  0
 0  0     56 11820052  75860 2173376    0    0   114    54 1082  8882 36  1 62  0
 9  0     56 11800732  75968 2175308    0    0   169   178 1118  6806 22  1 77  1
 0  0     56 11770588  76024 2176952    0    0   132    53 1116 13083 36  2 62  1
 0  0     56 11752220  76080 2178392    0    0   121    10 1130  8352 22  1 77  1
 0  0     56 11723900  76156 2181104    0    0   234   485 1133 12710 31  2 67  0
[31 Aug 2007 17:27] Mark Callaghan
Top 20 functions in MySQL5 from OProfile once the slowdown has started:
samples  %        app name                 symbol name
2639899   9.9375  libpthread-2.3.5.so      pthread_mutex_trylock
2221655   8.3631  libpthread-2.3.5.so      pthread_mutex_unlock
1954730   7.3583  mysqld                   row_search_for_mysql
1665278   6.2687  mysqld                   btr_search_guess_on_hash
1508263   5.6777  libpthread-2.3.5.so      __pthread_mutex_unlock_usercnt
1492409   5.6180  mysqld                   cmp_dtuple_rec_with_match
1455102   5.4775  mysqld                   rec_get_offsets_func
825131    3.1061  mysqld                   btr_cur_search_to_nth_level
810887    3.0525  libc-2.3.5.so            memcpy
744389    2.8022  mysqld                   buf_page_get_gen
597741    2.2501  mysqld                   ut_delay
516241    1.9433  mysqld                   mutex_spin_wait
484341    1.8232  mysqld                   row_sel_store_mysql_rec
413573    1.5568  mysqld                   page_cur_search_with_match
404812    1.5239  mysqld                   ha_innobase::general_fetch(char*, unsigned int, unsigned int)
380421    1.4320  mysqld                   btr_search_info_update_slow
347627    1.3086  mysqld                   mtr_memo_slot_release

Results for MySQL4 from a similar point in the benchmark:
samples  %        app name                 symbol name
26002487 39.5086  libpthread-2.3.5.so      pthread_mutex_trylock
8627212  13.1083  mysqld                   mutex_spin_wait
7681806  11.6719  libpthread-2.3.5.so      __pthread_mutex_unlock_usercnt
2041599   3.1020  mysqld                   .plt
1564911   2.3777  mysqld                   mtr_memo_slot_release
1378054   2.0938  mysqld                   btr_search_guess_on_hash
1340988   2.0375  mysqld                   row_search_for_mysql
1293789   1.9658  libpthread-2.3.5.so      pthread_mutex_unlock
1278852   1.9431  mysqld                   cmp_dtuple_rec_with_match
1087260   1.6520  mysqld                   ha_innobase::general_fetch(char*, unsigned int, unsigned int)
878946    1.3355  mysqld                   buf_page_optimistic_get_func
792475    1.2041  mysqld                   mutex_exit
744995    1.1320  mysqld                   page_dir_find_owner_slot
731574    1.1116  mysqld                   buf_page_get_gen
701585    1.0660  mysqld                   btr_cur_search_to_nth_level
632754    0.9614  mysqld                   row_sel_store_mysql_rec
595511    0.9048  libc-2.3.5.so            memcpy
[31 Aug 2007 17:30] Mark Callaghan
One more point about the vmstat output above: the regression comes from an increase in idle time, not in wait time. Is something sleeping more than desired?
[31 Aug 2007 20:00] Mark Callaghan
This problem continues to baffle me. Eventually the workload pins the CPU again, but there is a stretch in the middle of the test where the system is much more idle than expected. From SHOW PROCESSLIST output, the slowdown occurs immediately before the thread status changes from 'Copying to tmp table' to 'Converting HEAP to MyISAM'. Once the status changes to 'Copy to tmp table on disk', the CPU is pinned again with a lot of system time.
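
For anyone following along, the state transitions can be watched from another session with repeated runs of:

show full processlist;
-- watch the State column move from 'Copying to tmp table' to 'Converting HEAP to MyISAM' to 'Copy to tmp table on disk'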

vmstat output on 10 second intervals
19  3     56 2181060 128716 11658268    0    0   712   135 1060  2364 98  1  1  0
10  0     56 2167876 128724 11659348    0    0    95    15 1041  2259 98  0  1  0
10  0     56 2153668 128728 11659752    0    0    31    54 1026  2317 98  1  1  0
14  0     56 2142796 128756 11661900    0    0   179   144 1058  2204 98  1  1  0
14  0     56 2136460 128760 11665840    0    0   364    57 1044  2184 98  0  1  0
 7  0     56 2128892 128780 11666228    0    0    45    44 1027  2215 99  0  1  0
13  0     56 2124492 128800 11667704    0    0   120   133 1053  2176 98  0  1  0
 1  0     56 2107404 128808 11668580    0    0    69   373 1036  5990 74  1 25  0
 0  0     56 2078476 128816 11669320    0    0    47   378 1048 12748 32  1 65  1
 0  0     56 2061828 128836 11671136    0    0   154   291 1058  5705 12  1 85  2
 0  0     56 2044284 128860 11671860    0    0    74    47 1035 10162 20  1 78  1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0     56 2029276 128872 11673480    0    0   124   379 1068  6768 14  1 84  2
 1  0     56 2015572 128888 11674484    0    0    88   134 1058  5928 13  1 85  1
 0  0     56 1989804 128892 11678492    0    0   386   378 1054 11086 28  1 69  1
 0  0     56 1964932 128904 11679840    0    0    48   559 1055 12258 29  1 69  1
 0  0     56 1942948 128920 11681048    0    0    64   490 1066 11646 28  1 69  1
 0  0     56 1914988 128940 11681708    0    0    52   456 1045 14129 36  2 61  1
 0  0     56 1892316 128952 11682784    0    0    76   389 1063 10528 25  1 73  1
 0  0     56 1866228 128960 11686516    0    0   333   496 1066 10848 25  1 73  0
 7  0     56 1838132 128968 11687324    0    0    57    22 1044 14129 33  2 66  0
 0  0     56 1822324 128996 11688316    0    0    91   366 1057  9242 19  1 78  1
 0  0     56 1806124 129016 11689248    0    0    59   466 1062  7877 17  1 82  1
 0  0     56 1785084 129064 11691716    0    0   221   558 1069  9390 19  1 79  1
 6  0     56 1702084 129132 11745164    0    0    78   544 1039 14984 31  5 64  1
 4  0     56 1556884 129288 11882572    0    0    75 16748 1113 360092 13 50 34  2
 5  0     56 1418420 129488 12016944    0    0   536    29 1168 347147 18 45 36  0
 5  0     56 1288724 129652 12145164    0    0    83    54 1029 373713 11 52 37  1
 8  1     56 1170868 129804 12260408    0    0   160 39867 1140 284897 30 42 26  2
 7  0     56 1063804 129964 12364696    0    0    66   467 1051 200680 49 33 18  0
[1 Sep 2007 16:07] Heikki Tuuri
Mark,

both oprofile outputs show lots of InnoDB mutex activity, much more in 4.0. But 4.0 is the fast version!

Concerning the question about ha_release_temporary_latches(), I do not think it gets called when MySQL decides to convert a HEAP temp table to a MyISAM temp table. Thus, it might happen that threads flock to wait for the adaptive hash latch to be released. But then you should see SHOW INNODB STATUS\G showing lots of threads waiting for it. What does it show?
[2 Sep 2007 1:44] Mark Callaghan
ha_release_temporary_latches is not called when converting a temp table from HEAP to MyISAM. I experimented by adding a callback so it would be called when my_write was called for the IO_CACHE used by the MyISAM temp table, but that did not help.
[2 Sep 2007 2:02] Mark Callaghan
Here is my best description of the performance problem, using data that I cannot give to you. The testcase is ~20 concurrent queries, each doing an ORDER BY that spills to disk (session status changes from 'Copying to tmp table' to 'Copy to tmp table on disk').

0: MySQL4 finishes the queries in ~220 minutes vs ~346 for MySQL5

1: MySQL5.0 drops from 90% CPU to 25% CPU on a 4-core server. The state of the queries remains 'Copying to tmp table' for several minutes after the drop, so the problem appears to start before the tmp table is converted from HEAP to MyISAM and before any MyISAM IO is done. Sample MySQL5 vmstat output before the drop:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
11  0     56 1230348  53536 12576444    0    0   293    18 1033  2269 98  1  1  0
And after the drop. Note that user CPU is down while idle CPU and context switches are up:
 1  1     56 1067740  54320 12603948    0    0   113   432 1038 11848 38  2 60  1

2: When user CPU drops, idle CPU goes up.

3: MySQL5 does IO for the MyISAM tmp tables at half the rate of MySQL4. I will do more tests to confirm that the disk throughput on the test machines is similar independent of MySQL. Sample vmstat output showing a typical IO rate for MySQL4:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 8  0     56  44680  67144 10012020    0    0    87  3621 1055 75087 87  7  6  0
And for MySQL5. Note that the IO rate (bo) is 2037 vs 3621 for MySQL4:
 0  0     56  42900  45900 12410680    0    0    29  2037 1046 10535 24  1 74  1

4: The number of runnable procs reported by vmstat for MySQL4 is usually between 5 and 8. For MySQL5 it is usually 0. This explains the high percentage of idle time mentioned above. Both MySQL4 and MySQL5 use NPTL. I assume that a runnable thread counts as a runnable proc.

5: MySQL4 has many more context switches. I assume some of this is a result of the fixes for the sync array. vmstat output for MySQL4
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 8  0     56  44680  67144 10012020    0    0    87  3621 1055 75087 87  7  6  0
And for MySQL5
 0  0     56  42900  45900 12410680    0    0    29  2037 1046 10535 24  1 74  1
[2 Sep 2007 2:04] Mark Callaghan
SHOW INNODB STATUS for MySQL4 at the end of the testcase:

*************************** 1. row ***************************
Status:
=====================================
070901 18:37:28 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 5 seconds
----------
BACKGROUND THREAD
----------
srv_master_thread loops: 62599 1_second, 62598 sleeps, 6131 10_second, 3456 background, 3156 flush
srv_master_thread log flush: 62650 sync, 45271 async
fsync callers: 26284 buffer pool, 0 other, 152 checkpoint, 10864 log aio, 56549 log sync, 0 archive
----------
SEMAPHORES
----------
Lock wait timeouts 0
OS WAIT ARRAY INFO: reservation count 238728266, signal count 140825445
Mutex spin waits 12029782620, rounds 35419816117, OS waits 64317016
RW-shared spins 385401401, OS waits 96964144; RW-excl spins 319291740, OS waits 40482911
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread) reads 0 writes 0 requests 0 io secs 0.000000 io msecs/request 0.000000 max_io_wait 0.000000
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 8310 requests 8310 io secs 0.208978 io msecs/request 0.025148 max_io_wait 114.045000
I/O thread 2 state: waiting for i/o request (read thread) reads 70741 writes 0 requests 39193 io secs 389.418743 io msecs/request 9.935926 max_io_wait
3366.653000
I/O thread 3 state: waiting for i/o request (write thread) reads 0 writes 736836 requests 464850 io secs 569.769490 io msecs/request 1.225706 max_io_wait 4040.025000
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
279712 OS file reads, 1406249 OS file writes, 93849 OS fsyncs
0.80 reads/s, 16384 avg bytes/read, 21.20 writes/s, 2.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf for space 0: size 1, free list len 1128, seg size 1130,
303440 inserts, 303440 merged recs, 114391 merges
Hash table size 10375991, used cells 7940167, node heap has 22180 buffer(s)
96.98 hash searches/s, 18.20 non-hash searches/s
---
LOG
---
Log sequence number 128 1718444680
Log flushed up to   128 1718441148
Last checkpoint at  128 1718333760
0 pending log writes, 0 pending chkp writes
922687 log i/o's done, 4.60 log i/o's/second, 56543 syncs, 9149 checkpoints
log sync callers: 736836 buffer pool, background 62686 sync and 45271 async, 16 internal, checkpoint 9149 sync and 0 async, 0 archive, commit 0 sync and 840789 async
log sync syncers: 2722 buffer pool, background 53820 sync and 0 async, 0 internal, checkpoint 1 sync and 0 async, 0 archive, commit 0 sync and 0 async
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 5927978378; in additional pool allocated 10121472
Buffer pool size   320000
Free buffers       0
Database pages     297820
Modified db pages  117
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 311122, created 29028, written 736836
0.80 reads/s, 0.00 creates/s, 24.80 writes/s
Buffer pool hit rate 999 / 1000
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
Main thread process no. 19047, id 1132489056, state: sleeping
Number of rows inserted 1891870, updated 2125081, deleted 632148, read 2080952285
0.80 inserts/s, 54.39 updates/s, 0.00 deletes/s, 54.39 reads/s
------------
TRANSACTIONS
------------
Trx id counter 0 2059683696
Purge done for trx's n:o < 0 2059683692 undo n:o < 0 0
Total number of lock structs in row lock hash table 0
[2 Sep 2007 2:05] Mark Callaghan
SHOW INNODB STATUS for MySQL5 at the end of the testcase

*************************** 1. row ***************************
Status:
=====================================
070901 18:37:32 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 32 seconds
----------
BACKGROUND THREAD
----------
srv_master_thread loops: 18782 1_second, 18781 sleeps, 1829 10_second, 1148 background, 1036 flush
srv_master_thread log flush: 18892 sync, 15826 async
srv_wait_thread_mics 233042029178 microseconds, 233042.0 seconds
spinlock delay for 5 delay 20 rounds is 4 mics
fsync callers: 23779 buffer pool, 23 other, 29 checkpoint, 2497 log aio, 16731 log sync, 0 archive
----------
SEMAPHORES
----------
Lock wait timeouts 0
OS WAIT ARRAY INFO: reservation count 99493060, signal count 68066543
Mutex spin waits 1804993118, rounds 8500902893, OS waits 8381279
RW-shared spins 252062961, OS waits 47106255; RW-excl spins 234269355, OS waits 30858427
Spin rounds per wait: 4.71 mutex, 6.16 RW-shared, 3.25 RW-excl
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread) reads 14 writes 0 requests 7 io secs 0.032885 io msecs/request 4.697857 max_io_wait
23.025000
I/O thread 1 state: waiting for i/o request (log thread) reads 0 writes 2497 requests 2497 io secs 1.661534 io msecs/request 0.665412 max_io_wait 331.285000
I/O thread 2 state: waiting for i/o request (read thread) reads 10436 writes 0 requests 6314 io secs 235.918854 io msecs/request 37.364405 max_io_wait
1286.202000
I/O thread 3 state: waiting for i/o request (read thread) reads 4987 writes 0 requests 2756 io secs 104.830433 io msecs/request 38.037167 max_io_wait 1508.804000
I/O thread 4 state: waiting for i/o request (read thread) reads 4000 writes 0 requests 1904 io secs 61.204302 io msecs/request 32.145117 max_io_wait 1438.879000
I/O thread 5 state: waiting for i/o request (read thread) reads 3765 writes 0 requests 2421 io secs 63.081140 io msecs/request 26.055820 max_io_wait 1406.591000
I/O thread 6 state: waiting for i/o request (write thread) reads 0 writes 79623 requests 59754 io secs 89.803908 io msecs/request 1.502894 max_io_wait
2252.062000
I/O thread 7 state: waiting for i/o request (write thread) reads 0 writes 47713 requests 36961 io secs 87.605921 io msecs/request 2.370226 max_io_wait
2034.358000
I/O thread 8 state: waiting for i/o request (write thread) reads 0 writes 43826 requests 34597 io secs 87.361416 io msecs/request 2.525115 max_io_wait
2738.745000
I/O thread 9 state: waiting for i/o request (write thread) reads 0 writes 40770 requests 32503 io secs 87.283270 io msecs/request 2.685391 max_io_wait
1805.893000
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
127533 OS file reads, 284055 OS file writes, 43385 OS fsyncs
0.94 reads/s, 16384 avg bytes/read, 4.81 writes/s, 1.44 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 56642, seg size 56644,
52316 inserts, 101064 merged recs, 42700 merges
Hash table size 10375991, used cells 7218277, node heap has 18217 buffer(s)
48.00 hash searches/s, 24.91 non-hash searches/s
---
LOG
---
Log sequence number 57 3848610184
Log flushed up to   57 3848609713
Last checkpoint at  57 3816843045
0 pending log writes, 0 pending chkp writes
114762 log i/o's done, 3.56 log i/o's/second, 16731 syncs, 2842 checkpoints
log sync callers: 211932 buffer pool, background 18902 sync and 15826 async, 3 internal, checkpoint 2842 sync and 0 async, 0 archive, commit 0 sync and 90080 async
log sync syncers: 676 buffer pool, background 16054 sync and 0 async, 0 internal, checkpoint 1 sync and 0 async, 0 archive, commit 0 sync and 0 async
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 6046890062; in additional pool allocated 17983488
Buffer pool size   320000
Free buffers       156566
Database pages     145217
Modified db pages  19378
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages read 137213, created 8042, written 211932
0.94 reads/s, 0.25 creates/s, 14.12 writes/s
Buffer pool hit rate 997 / 1000
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
Main thread process no. 23724, id 1191237984, state: sleeping
Number of rows inserted 338412, updated 500472, deleted 246240, read 2085938366
8.66 inserts/s, 24.06 updates/s, 0.72 deletes/s, 25.16 reads/s
------------
TRANSACTIONS
------------
Trx id counter 0 39672366
Purge done for trx's n:o < 0 39672350 undo n:o < 0 0
History list length 12
Total number of lock structs in row lock hash table 0
[2 Sep 2007 2:06] Mark Callaghan
Top N waits from SHOW MUTEX STATUS for MySQL5 at the end of the testcase

btr0sea.c       139     6887692
buf0buf.c       545     806330
dict0dict.c     3689    231175
buf0buf.c       494     160150
buf0buf.c       497     112489
buf0buf.c       494     97965
mem0pool.c      205     47795
dict0dict.c     3689    6509
buf0buf.c       494     5761
srv0srv.c       969     3864
ibuf0ibuf.c     399     1781
[2 Sep 2007 3:47] Mark Callaghan
The problem is btr_search_latch. I have simplified my testcase to 4 concurrent queries. In this case, user CPU utilization is at 50% for MySQL5 and 99% for MySQL4. I then recompiled MySQL5 with srv_use_adaptive_hash_indexes=FALSE and the dropoff does not occur.

Does processing a select statement ever get an X lock on btr_search_latch? Or is that only done for insert/update/delete?

I will post numbers tomorrow. What has changed with the adaptive hash index between MySQL4 and MySQL5 other than the changes to the sync array?
[2 Sep 2007 4:01] Mark Callaghan
Disabling the adaptive hash index fixes the problem. Why is this worse with MySQL5?

Time for 4 concurrent queries:
MySQL4 with adaptive hash index -> ~29 minutes each
MySQL5 with adaptive hash index -> ~38 minutes each
MySQL5 without adaptive hash index -> ~17 minutes each
[2 Sep 2007 14:47] Mark Callaghan
Time for ~20 concurrent queries:
MySQL4 with adaptive hash indexes    --> ~220 minutes
MySQL5 with adaptive hash indexes    --> ~346 minutes
MySQL5 without adaptive hash indexes -->  ~95 minutes
[2 Sep 2007 14:49] Mark Callaghan
See http://bugs.mysql.com/bug.php?id=26442
[3 Sep 2007 13:57] Heikki Tuuri
Mark,

InnoDB does take an X-latch on the adaptive hash index even in a SELECT, when it builds the hash index for a page. It has to modify the adaptive hash index, which is why an X-latch is needed.

The difference between 4.0 and 5.0 may be some random effect.

The general principle in the adaptive hash latch is that it should not be held if a query does extensive processing inside MySQL, outside of InnoDB. The conversion HEAP -> MyISAM is a typical operation where the adaptive hash should not be held.

Fix:

1) we could make InnoDB not keep the adaptive hash latch if the MySQL query involves a sort (how do we know this?);

or:

2) we could call release_temporary_latches() when MySQL decides to convert a HEAP temp table to MyISAM (where does it do that?).

I am assigning this to Inaam.

Regards,

Heikki
[3 Sep 2007 15:04] Mark Callaghan
Heikki,

The slowdown begins before converting the tmp table from heap to MyISAM and continues long after the conversion is done.
[3 Sep 2007 15:29] Heikki Tuuri
Mark,

ok, then 1) might be the better solution until we have http://bugs.mysql.com/bug.php?id=26442 fixed.

Best regards,

Heikki
[4 Sep 2007 17:24] Mark Callaghan
When the adaptive hash index is enabled, MySQL 5.0.37 does IO for the tmp table at half the rate of MySQL 4.0.26. The numbers below are for ~20 concurrent queries that all do an external sort.

As Valerii suggested, some of this is from the use of the new external sort in MySQL5 (sort keys and all data) versus the old sort in MySQL4 (only sort keys and row pointers). I reran the tests with max_length_for_sort_data=2 in MySQL5 so that the old sort would be used and the IO rate was higher, but still less than that for MySQL4.

binary                  io_rate
mysql4                  ~3.6 MB/s
mysql5 & new sort       ~2.0 MB/s
mysql5 & old sort       ~2.9 MB/s
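
For anyone repeating the old-sort runs, max_length_for_sort_data is a dynamic variable, so the old sort can be forced without a server restart (new connections pick up the global value; it can also go in my.cnf):

set global max_length_for_sort_data = 2;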

Sample vmstat output showing typical IO rate for MySQL4:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 8  0     56  44680  67144 10012020    0    0    87  3621 1055 75087 87  7  6  0

And for MySQL5. Note that the IO rate is 2037 vs 3621 for MySQL4:
 0  0     56  42900  45900 12410680    0    0    29  2037 1046 10535 24  1 74  0

And for MySQL5 with max_length_for_sort_data=2
 0 0     56 2806804 217044 10597132    0    0     6  2929 1050 12708 36  2 62  0
[4 Sep 2007 21:47] Mark Callaghan
Much of the regression between MySQL4 and MySQL5 is from the use of the new external sort algorithm, as Valerii suggested might be the case. I reran the test for MySQL5 with the adaptive hash index enabled and forced the old external sort algorithm by setting max_length_for_sort_data=2.

Time for ~20 concurrent queries:
MySQL4 with adaptive hash index    --> ~220 minutes
MySQL5 with adaptive hash index    --> ~346 minutes
MySQL5 without adaptive hash index -->  ~95 minutes

MySQL5 with adaptive hash index and old sort -> ~253 minutes
[18 Sep 2007 1:41] James Day
For other readers, the new sorting algorithm that Mark mentioned is described at http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html . It avoids lots of random row reads after sorting, but stores all fields returned by the query instead of just the fields being sorted plus keys to look up the rest of the records. It halves the number of row reads, but it can mean that a larger temporary table size or sort buffer size is helpful for InnoDB in 4.1 and later compared to 4.0, or that a smaller max_length_for_sort_data than the default helps. Queries that are particularly affected are those with large numbers of rows matching the WHERE clause but a much smaller LIMIT, or any query that returns much more data than the size of the fields it sorts by. For Mark's case this raises the time from 253 to 346 minutes.

But the big result he saw is that some 72% of his query time in 5.0 is caused by the adaptive hash index, so there is considerable speed improvement possible by improving it. Because the new sort puts more data on disk, it seems that the existing problem of holding the adaptive hash index lock while the sort happens is made worse in 5.0. Heikki's fix 1 might reduce the 5.0 time from 346 to 253 minutes, but would still leave the potential reduction to 95 minutes on the table. How much of that remaining potential comes from reducing lock contention rather than from the other costs of updating the hash index is unclear. Running tests with different numbers of concurrent queries might help to establish this.
[24 Oct 2007 5:08] Inaam Rana
Mark,

I tried the t1s.sql test and it quickly became IO bound. The vmstat output was more like the one shown by you at the top of this report, i.e. an increase in iowait time, not idle time. A few questions for you:

1) The idle time increase that you mentioned, does it happen on your internal workload or on t1s.sql?

2) I see that the 5.0 binary is using four reader/writer InnoDB threads. Is it a customized build? If yes, what else has been changed in it?

3) When you ran with the adaptive hash index disabled, did you see an increase in iowait time and/or user time? Was the idle time reduced to what it was in 4.0?

4) Can you help us test some tentative fixes? I am asking because the workload that triggers this particular behavior is not available to us for testing. Any help in this regard is appreciated.
[24 Oct 2007 5:10] Inaam Rana
patch to release btr_search_latch if the current THD has build temp tables on disk.

Attachment: ahi_5.0_v0.patch (text/x-diff), 532 bytes.

[24 Oct 2007 5:12] Inaam Rana
patch based on Yasufumi's idea of fixing missed os_event_set() call in sync0arr.

Attachment: os_event_5.0_v2.patch (text/x-diff), 17.22 KiB.

[28 Oct 2007 18:22] Mark Callaghan
>>>>>
I tried the t1s.sql test and it quickly became IO bound. The vmstat output was more like
the one shown by you at the top of this report, i.e. an increase in iowait time, not idle
time. A few questions for you:
>>>>>

We have someone working on this fulltime. I hope to have a better testcase ready soon. The testcase that I shared is not as good as the testcase run on data we can't share.

>>>>>
1) The idle time increase that you mentioned, does it happen on your internal workload or
on t1s.sql?
>>>>>

It happens on our internal workload. t1s.sql is not a good testcase.

>>>>>
2) I see that the 5.0 binary is using four reader/writer InnoDB threads. Is it a
customized build? If yes, what else has been changed in it?
>>>>>

There are many changes. Most are described at http://code.google.com/p/google-mysql-tools/w/list. We use MySQL 4.0.26 and 5.0.37 and both versions have the same changes by us. Performance for 5.0.37 is much worse than 4.0.26 with this bug. 

>>>>>
3) When you ran with the adaptive hash index disabled, did you see an increase in iowait time
and/or user time? Was the idle time reduced to what it was in 4.0?
>>>>>

I will make sure that the person working on this provides an answer.

>>>>>
4) Can you help us test some tentative fixes? I am asking because the workload that
triggers this particular behavior is not available to us for testing. Any help in this
regard is appreciated.
>>>>>

We will do our best to get a testcase to you.

I don't think the patch to release the btr_search_latch when spilling temp tables to disk does much for this problem. I say that because I tried such a change when first debugging this. I am curious why a thread can hold an S lock on btr_search_latch for a long time -- for example when leaving InnoDB. Given that there can be frequent requests for an X lock on this, shouldn't the code minimize the duration for which locks on btr_search_latch are held?

We will try out the other patch for the missing os_event_set call.
[9 Nov 2007 17:36] Heikki Tuuri
Inaam, do we need to change the plugin interface in 5.1?
[5 Dec 2007 4:12] Mark Callaghan
We have begun testing the fix for http://bugs.mysql.com/bug.php?id=29560 and the results look great on servers with 4 CPU cores and 20 concurrent queries.
[16 Dec 2007 18:53] Inaam Rana
We are waiting on (hopefully good) results from Mark and his team on this.
[17 Jan 2008 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[19 Aug 2010 19:25] James Day
For cases where bug #29560 isn't the solution, the work to release the adaptive hash during filesorts that was done in bug #32149 should solve the problem.
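
As an aside for readers trying to reproduce the comparison Mark made by recompiling with srv_use_adaptive_hash_indexes=FALSE: current releases expose the adaptive hash index as a startup option, so it can be disabled without a custom build (exact versions aside, check for the variable first):

show global variables like 'innodb_adaptive_hash_index';
-- if the variable exists, add skip-innodb_adaptive_hash_index (or innodb_adaptive_hash_index=0)
-- to the [mysqld] section of my.cnf and restart to disable the adaptive hash index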
[1 Oct 2011 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[20 Nov 2011 7:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".