Bug #17332 changing key_buffer_size on a running server can crash under load
Submitted: 12 Feb 2006 22:38 Modified: 23 Jan 16:41
Reporter: Shane Bester
Status: Closed
Category:Server: MyISAM Severity:S2 (Serious)
Version:4.1BK,5.0.18, 5.0.19-bk,5.0.21-bk OS:Linux (Linux,Windows)
Assigned to: Ingo Strüwing Target Version:
Tags: myisam, load index, key_buffer_size, crash
Triage: D1 (Critical)

[12 Feb 2006 22:38] Shane Bester
Description:
Occasionally, changing the MyISAM key_buffer_size on a running server can lead to a crash.
Heavy INSERT/UPDATE activity seems to provoke it, and multi-row INSERT statements
increase the chance of a crash.

Below are some stack traces from such crashes:

---
0x816d5f1 handle_segfault + 461
0xfbb420 (?)
0x9b7c00 (?)
0x83ec832 key_cache_write + 2494
0x83ecbae key_cache_write + 3386
0x83ecf1c flush_key_blocks + 425
0x83e9e5e resize_key_cache + 321
0x822a96d _Z19ha_resize_key_cacheP12st_key_cache + 235
0x8174200 _ZN23sys_var_key_buffer_size6updateEP3THDP7set_var + 468
0x8173209 _ZN7set_var6updateEP3THD + 63
0x8174357 _Z17sql_set_variablesP3THDP4ListI12set_var_baseE + 141
0x818454c _Z21mysql_execute_commandP3THD + 12546
0x8188fa6 _Z11mysql_parseP3THDPcj + 330
0x8189832 _Z16dispatch_command19enum_server_commandP3THDPcj + 1924
0x818aa37 _Z10do_commandP3THD + 537
0x818b59a handle_one_connection + 2740
0x880b80 (?)
0x597dee (?)
---

0x816d5f1 handle_segfault + 461
0x1f7420 (?)
0x12c (?)
0x822a96d _Z19ha_resize_key_cacheP12st_key_cache + 235
0x8174200 _ZN23sys_var_key_buffer_size6updateEP3THDP7set_var + 468
0x8173209 _ZN7set_var6updateEP3THD + 63
0x8174357 _Z17sql_set_variablesP3THDP4ListI12set_var_baseE + 141
0x818454c _Z21mysql_execute_commandP3THD + 12546
0x8188fa6 _Z11mysql_parseP3THDPcj + 330
0x8189832 _Z16dispatch_command19enum_server_commandP3THDPcj + 1924
0x818aa37 _Z10do_commandP3THD + 537
0x818b59a handle_one_connection + 2740
0x880b80 (?)
0x597dee (?)

---

0x816d5f1 handle_segfault + 461
0x229420 (?)
0x4cb000 (?)
0x83ec832 key_cache_write + 2494
0x83ecbae key_cache_write + 3386
0x83ece49 flush_key_blocks + 214
0x83be94c mi_close + 540
0x822f747 _ZN9ha_myisam5closeEv + 33
0x81af41f _Z8closefrmP8st_table + 93
0x81a7337 _Z18intern_close_tableP8st_table + 47
0x81a73d0 _Z18intern_close_tableP8st_table + 200
0x83f8110 hash_delete + 927
0x81a5ce9 _Z18close_thread_tableP3THDPP8st_table + 219
0x81a76d8 _Z19close_thread_tablesP3THDbb + 682
0x818a498 _Z16dispatch_command19enum_server_commandP3THDPcj + 5098
0x818aa37 _Z10do_commandP3THD + 537
0x818b59a handle_one_connection + 2740
0x880b80 (?)
0x597dee (?)

How to repeat:
Open two mysql sessions.

[session1]
source thread1.sql;

[session2]
source thread2.sql;

Let the above two scripts run at the same time. If a crash doesn't happen quickly
enough, repeat, or run thread1.sql in multiple sessions at the same time.

Suggested fix:
Not sure..
[12 Feb 2006 22:39] Shane Bester
insert statements

Attachment: thread1.zip (application/zip, text), 4.24 KiB.

[12 Feb 2006 22:42] Shane Bester
set key_buffer_size

Attachment: thread2.zip (application/zip, text), 190.47 KiB.

[12 Feb 2006 23:43] Shane Bester
proper backtrace, using 5.0.19-debug on Windows.

Attachment: win_stack_trace.txt (text/plain), 1.46 KiB.

[17 Feb 2006 16:02] Ingo Strüwing
It took me three attempts with four threads to repeat it. This means that I have to do a
lot of tests until I can feel positive about a fix.

It is a keycache locking problem. When it crashed for me, two threads were flushing the
cache. One for its inserts, one for the cache resizing. Both had a list of modified blocks
(probably the same list) and tried to write them out and free each block. When freeing
they assumed that every block from the list is not freed yet... 

I need to learn more about the keycache architecture until I can propose a reasonable
fix.
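
As a rough illustration of the failure mode described above, the following sketch simulates two flushers that hold the same list of modified blocks. It is a single-threaded simulation of the interleaving, not the code from mf_keycache.c; `struct block`, `flush_list_unchecked`, `flush_list_checked` and `max_frees_after_two_flushers` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified block states; a real key cache has more. */
enum block_state { BLOCK_DIRTY, BLOCK_FREE };

struct block {
    enum block_state state;
    int free_count;              /* how many times the block was "freed" */
};

/* Flusher that trusts its list: assumes no listed block is freed yet. */
void flush_list_unchecked(struct block **list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* the write-out of the block would happen here */
        list[i]->state = BLOCK_FREE;
        list[i]->free_count++;   /* second flusher frees it again */
    }
}

/* Flusher that re-checks each block (under the cache lock in a real server). */
void flush_list_checked(struct block **list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i]->state == BLOCK_FREE)
            continue;            /* already flushed and freed by someone else */
        assert(list[i]->free_count == 0);
        list[i]->state = BLOCK_FREE;
        list[i]->free_count++;
    }
}

/* Run two flushers over the same list; return the highest free count seen. */
int max_frees_after_two_flushers(int checked)
{
    struct block blocks[3] = { {BLOCK_DIRTY, 0}, {BLOCK_DIRTY, 0}, {BLOCK_DIRTY, 0} };
    struct block *list[3] = { &blocks[0], &blocks[1], &blocks[2] };
    int max = 0;

    for (int flusher = 0; flusher < 2; flusher++) {
        if (checked)
            flush_list_checked(list, 3);
        else
            flush_list_unchecked(list, 3);
    }
    for (size_t i = 0; i < 3; i++)
        if (blocks[i].free_count > max)
            max = blocks[i].free_count;
    return max;
}
```

With the unchecked variant every block ends up freed twice; with the re-check each block is freed exactly once. In a real server the check and the free would have to happen atomically with respect to other threads.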
[6 Mar 2006 15:08] Ingo Strüwing
After fixing a couple of crash paths I now have infinite loops. Need more time.
[14 Mar 2006 13:46] Ingo Strüwing
I need more time to fix this satisfactorily.

When I started I could quickly repeat a crash. I saw almost instantly what happened and
found a way to fix it. But then it crashed at another place. After fixing about four crash
paths, I decided to rework the key cache locking more fundamentally. After this I was no
longer able to produce any crashes.

But now I see index corruptions, though only after about 100 runs of the proposed test on
a 4-CPU machine with 4 threads competing on the key cache. This is pretty good, but I'm not
satisfied yet.

Unfortunately, I don't know if the corruptions result from my changes, or if they were
present before, just masked by the crashes.

Finding the corruptions is much more complicated than finding the crashes. While the
crashes leave a stack backtrace behind, which points directly to the place in code where a
problem exists, the corruption is only detected after a command is complete. There is no
hint as to where in the code the index was corrupted.

I plan to instrument the code with a lot of counters, hoping that I would see if one or
more counters were increased before the corruption, but not increased for other commands.
But this takes some time.
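
A minimal sketch of such counter instrumentation, assuming a small fixed set of instrumented code paths. The `kc_` names and the counter list are hypothetical and not taken from the server source; a multi-threaded server would also need atomic increments or a lock around the counters.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical code paths to instrument; the real set would be much larger. */
enum kc_counter { KC_READ_HIT, KC_READ_MISS, KC_WRITE, KC_FLUSH, KC_FREE, KC_COUNTER_MAX };

/* One global counter per instrumented place (not thread-safe in this sketch). */
unsigned long kc_counters[KC_COUNTER_MAX];

/* Called at each instrumented place in the code. */
void kc_count(enum kc_counter c)
{
    assert(c < KC_COUNTER_MAX);
    kc_counters[c]++;
}

/* Take a snapshot of all counters when a command starts. */
void kc_snapshot(unsigned long snap[KC_COUNTER_MAX])
{
    memcpy(snap, kc_counters, sizeof(kc_counters));
}

/* Return a bitmask of the counters that changed since the snapshot. */
unsigned kc_changed_since(const unsigned long snap[KC_COUNTER_MAX])
{
    unsigned mask = 0;
    for (int i = 0; i < KC_COUNTER_MAX; i++)
        if (kc_counters[i] != snap[i])
            mask |= 1u << i;
    return mask;
}
```

Comparing the masks of commands that ended with a corrupted index against the masks of clean commands would show which code paths ran only in the failing cases.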
[5 Apr 2006 8:59] Ingo Strüwing
Other, higher priority tasks are stepping in.
And I still have no clue what could cause the index corruptions. It is difficult to track
down due to the high load required to repeat them.
[20 Apr 2006 23:28] Shane Bester
5.0.21-bk, crash when INSERT and key_cache_block_size changes.

Attachment: key_cache_block_size.stack.txt (text/plain), 2.09 KiB.

[20 Apr 2006 23:34] Shane Bester
Ingo, I have a C load testing program to generate these crashes within ~1 minute. 

Attached is another crash, but instead of using key_buffer_size, I run the following to
provoke it:

50 threads: SET GLOBAL key_cache_block_size={rand(4096,1048576)}
10 threads: INSERT INTO <table> with primary key.
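
One way the randomized SET statement for the first group of threads might be built is sketched below. This is not taken from the load tester; `pick_block_size`, `build_set_stmt` and the modulo mapping are assumptions made for illustration, and the caller supplies the random number.

```c
#include <assert.h>
#include <stdio.h>

#define BLOCK_SIZE_MIN 4096UL
#define BLOCK_SIZE_MAX 1048576UL

/* Map an arbitrary random number r onto [BLOCK_SIZE_MIN, BLOCK_SIZE_MAX]. */
unsigned long pick_block_size(unsigned long r)
{
    return BLOCK_SIZE_MIN + r % (BLOCK_SIZE_MAX - BLOCK_SIZE_MIN + 1);
}

/* Format the SET statement into buf; returns the length snprintf reports. */
int build_set_stmt(char *buf, size_t len, unsigned long r)
{
    unsigned long size = pick_block_size(r);
    assert(size >= BLOCK_SIZE_MIN && size <= BLOCK_SIZE_MAX);
    return snprintf(buf, len, "SET GLOBAL key_cache_block_size=%lu", size);
}
```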

Crash looks similar to the others reported.

!unlink_block  Line 1011 + 0xb	C
!free_block  Line 2226 + 0xd	C
!flush_cached_blocks  Line 2303 + 0xd	C
!flush_key_blocks_int  Line 2456 + 0x1f	C
!flush_all_key_blocks  Line 2580 + 0x15	C
!resize_key_cache  Line 515 + 0x9	C
!ha_resize_key_cache  Line 2259 + 0x32	C++
!sys_var_key_cache_long::update  Line 2434 + 0x9	C++
!set_var::update  Line 3101 + 0x1b	C++
!sql_set_variables  Line 2986 + 0xf	C++
!mysql_execute_command  Line 3530 + 0x10	C++
!mysql_parse  Line 5709 + 0x9	C++
!dispatch_command  Line 1719 + 0x1d	C++
!do_command  Line 1515 + 0x31	C++
!handle_one_connection  Line 1158 + 0x9	C++
!pthread_start  Line 63 + 0x7	C
!_threadstart  Line 196 + 0xd	C
kernel32.dll!_BaseThreadStart@8()  + 0x52
[20 Apr 2006 23:35] Shane Bester
above crash is on today's 5.0.21-bk built on windows.
[26 Apr 2006 12:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/5557
[26 Apr 2006 12:47] Ingo Strüwing
Back to "In progress". The patch is not yet complete. It is just a preview for testing.

Shane, if you have time, please apply the patch to a clean 5.0 tree and test with it.

Regards, Ingo
[10 May 2006 12:45] Ingo Strüwing
I detected an unrelated bug (Bug #19604) that I need to fix before I can continue
testing.
[1 Jun 2006 15:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/7163
[23 Jun 2006 3:01] Shane Bester
multithreaded C program. see top of file for compile details

Attachment: bug20540.c (text/x-csrc), 5.37 KiB.

[23 Jun 2006 3:03] Shane Bester
Ingo, sorry for the delay. Compile the attached C program and run it like this to kill the
server:

create table test(id int not null auto_increment primary key);

Then,

./bug20540 200 60 10 1 "insert into test values (),(),();set global
key_buffer_size=2048675;set global key_buffer_size=1048576;insert into test values
(),(),()" 192.168.250.3 3306 test root

crash happened within 1 minute on my box, over ethernet.
[26 Jun 2006 12:51] Ingo Strüwing
The last patch had a problem with FLUSH_IGNORE_CHANGED.
[26 Jun 2006 12:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8239
[18 Jul 2006 17:10] Ingo Strüwing
I will split the fix in smaller pieces for easier review.
[1 Sep 2006 12:35] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11242

ChangeSet@1.2208, 2006-09-01 12:34:55+02:00, istruewing@chilla.local +2 -0
  Bug#17332 - changing key_buffer_size on a running server
              can crash under load
  
  Resizing a key cache while it was in heavy use could crash the
  server. There were several race conditions.
  
  I reworked some of the algorithms to fix the race conditions.
[4 Oct 2006 20:07] RIchard Murphy
I am experiencing what would seem to be the same problem using Ver 14.7 Distrib 4.1.20,
for pc-linux-gnu
[1 Nov 2006 16:54] Ingo Strüwing
It is quite possible that this bug exists in 4.1 too. I'll check if we should fix it there
too.
[1 Nov 2006 17:38] Shane Bester
yes, 4.1.22 is also affected..

Attachment: 4.1_bk.txt (plain/text, text), 1010 bytes.

[2 Nov 2006 15:12] Ingo Strüwing
In a meeting with Monty, Sanja, and Ingo we came to the following conclusions:

- Ingo's last patch solves a couple of critical problems and needs
  to be added to 5.0 and 5.1. [Comment by Ingo: and probably 4.1,
  but this has not been decided yet.]
- Sanja and Monty have done a full review of every single line of Ingo's
  last patch and only found one problem (which Ingo will correct).
- We don't think that Igor's proposed solution would create a smaller
  patch.  What is certain is that it would be partly the same as Ingo's
  patch, but less efficient.
- Monty, Sanja, and Ingo did an extra review of all code that is not
  part of the new resize code to ensure that there are no new bugs
  in normal operation (ie, when we don't do a resize).
  (As resize is already broken, it's not as critical if there would be
  new bugs in this code)
- Monty found some small cleanup things in the patch that Ingo has
  promised to fix (mostly to move some repeated code into
  functions to make the code easier to manage)
- Some of the key cache description in Ingo's email will be moved to
  internal docs, some will be moved into mf_keycache.c
- Ingo will spend 3-6 hours on running more tests with code coverage
  in an effort to ensure that as many as possible of the changed lines
  are tested.

To ensure we don't break 5.0 (especially as we may not get extensive
community testing on this), we came up with the following plan:

- Ingo will add the patch in 5.1. If we don't get any bug reports for
  the new code in the key cache within one month after the next 5.1
  release, he will also apply the code in the 5.0 tree.

  (The alternative would be to disable resize of key cache in 5.0 and
  5.1)
[2 Nov 2006 15:23] Ingo Strüwing
For the purpose of a better test coverage I changed the test script. This revealed
deadlocks in the key cache code. I have no fix for it yet. But I guess it will require
changes that need another review.
[14 Nov 2006 16:30] Ingo Strüwing
After almost two weeks of fixing one small synchronization problem after the other and
eating all the CPU I could get, I was thrown off the test machine. So I stopped working on
this bug for now.
[30 Nov 2006 9:47] Ingo Strüwing
I do now have test machines for it.
[8 Dec 2006 14:57] Ingo Strüwing
My current test script. Run on an installed version (BASEDIR, DATADIR).

Attachment: bug17332-8.sh (application/x-sh, text), 24.24 KiB.

[11 Jan 2007 16:12] Scott Wilson
Hi, I wasn't able to get this test script to work under FreeBSD (didn't have time to try
very hard), but I did reproduce the server crash caused by changing key_buffer_size on both
5.0.24 and 5.0.32 under FreeBSD 6.1 64-bit.

 -scott
[31 Jan 2007 18:45] Ingo Strüwing
My current test script. Run on an installed version (BASEDIR, DATADIR).

Attachment: bug17332-9.sh (application/x-sh, text), 14.30 KiB.

[31 Jan 2007 18:48] Ingo Strüwing
Detailed description for today's changeset and some general key cache information.

Attachment: keycache-changes.txt (text/plain), 63.38 KiB.

[31 Jan 2007 18:49] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/19110

ChangeSet@1.2406, 2007-01-31 18:49:07+01:00, istruewing@chilla.local +7 -0
  Bug#17332 - changing key_buffer_size on a running server
              can crash under load
  
  Resizing a key cache while it was in heavy use could crash the
  server. There were several race conditions.
  
  I reworked some of the algorithms to fix the race conditions.
  
  No test case. Repeating the crashes requires heavy concurrent
  load on the key cache. A test script is attached to the bug report.
  
  More explanations to the changes are contained in a text file
  attached to the bug report.
[27 Feb 2007 14:39] Ingo Strüwing
I ran the above mentioned program (bug20540.c) successfully for two hours each on a
Pentium 4, a Dual Xeon, and a Quad Itanium (all three with Linux). The latter two could not
create 200 threads so I used 100 there.
[27 Feb 2007 20:42] Ingo Strüwing
I also repeated the tests with the original streams from thread1.zip and thread2.zip.
[22 Mar 2007 9:35] Ingo Strüwing
Again I am having difficulties with the test machines. This will defer the completion of
the tests.
[22 Mar 2007 13:13] Shane Bester
Ingo, the SQL statement "load index into cache `idx`" also crashes if run concurrently.
Just noting it so you can test it with your fixes too. Uploading stack trace now.
[22 Mar 2007 13:15] Shane Bester
stack trace and variables

Attachment: load_index_crash.txt (text/plain), 2.93 KiB.

[23 Mar 2007 11:53] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22752

ChangeSet@1.2496, 2007-03-23 11:52:45+01:00, istruewing@chilla.local +2 -0
  Bug#17332 - changing key_buffer_size on a running server
              can crash under load
  After review fixes
[23 Mar 2007 12:17] Ingo Strüwing
Shane, thank you. I did already notice a couple of problems with LOAD INDEX during my
tests. I believe I have fixed them all.
[10 May 2007 17:22] Ingo Strüwing
Please remember. This will first go to 5.1 only. If it doesn't cause problems for some
time, it will go to 5.0 too.
[14 May 2007 11:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/26580

ChangeSet@1.2516, 2007-05-14 11:33:47+02:00, istruewing@chilla.local +1 -0
  Bug#17332 - changing key_buffer_size on a running server
              can crash under load
  Post-post-review fixes.
  Fixed a typo == -> =
  Optimized normal flush at end of statement (FLUSH_KEEP),
  but let other flush types be stringent.
  Added comments.
  Fixed debugging.
[14 May 2007 11:53] Ingo Strüwing
Queued to 5.1-engines.
Note that it is intended to push it also to 5.0 after a probationary period.

Docs Team:

This patch lifts a restriction of the LOAD INDEX INTO CACHE statement. Only with the
IGNORE LEAVES modifier do all indexes in a table need to have the same block size. In
other words, we can now load the indexes of a table even if they have different block
sizes, but only if we load the leaves too.

You can change the first sentence of the last paragraph in
http://dev.mysql.com/doc/refman/5.1/en/load-index.html to:
LOAD INDEX INTO CACHE ... IGNORE LEAVES fails unless all indexes in a table have the same
block size.
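
The rule above can be expressed as a small predicate. This is only an illustration of the documented behavior, not the server's implementation; `load_index_allowed` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the load may proceed under the rule described above, else 0. */
int load_index_allowed(const unsigned *block_sizes, size_t n_indexes, int ignore_leaves)
{
    assert(block_sizes != NULL || n_indexes == 0);
    if (!ignore_leaves)
        return 1;                /* leaves are loaded too: sizes may differ */
    for (size_t i = 1; i < n_indexes; i++)
        if (block_sizes[i] != block_sizes[0])
            return 0;            /* IGNORE LEAVES needs one common block size */
    return 1;
}
```

For a table whose indexes use block sizes 1024, 2048 and 1024, the predicate allows a plain load and rejects a load with IGNORE LEAVES.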
[14 May 2007 20:52] James Day
The restriction of the LOAD INDEX INTO CACHE statement is in bug #3705. I'm also adding a
note there about the partial removal of the restriction.
[24 May 2007 9:05] Bugs System
Pushed into 5.1.19-beta
[24 May 2007 17:02] Shane Bester
I tried the fix in 5.1.19-BK and couldn't crash the server, even when running 5000 'set
global key_buffer_size'  per second in 50 threads, all that in combination with huge
inserts into many tables.  Fix looks good for now!
[27 May 2007 20:48] Paul DuBois
Noted in 5.1.19 changelog.

Changing the size of a key buffer that is under heavy use could cause
a server crash. The fix partially removes the limitation that LOAD
INDEX INTO CACHE fails unless all indexes in a table have the same
block size. Now the statement fails only if IGNORE LEAVES is
specified.

Resetting report to Patch Queued pending push of fix into 5.0.x.
[23 Jan 16:41] Paul DuBois
Closing this report. No changelog entry for 5.0; the patch may go into 5.0 after 5.1 has
been GA for a while.