Bug #28519 falcon crash with signal 4
Submitted: 18 May 2007 14:42 Modified: 3 Dec 2007 14:21
Reporter: Shane Bester (Platinum Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S1 (Critical)
Version:6.0.1BK OS:Linux (suse 9.3 x86)
Assigned to: Christopher Powers CPU Architecture:Any

[18 May 2007 14:42] Shane Bester
Description:
during some simple stress tests, I found falcon crashed with this Signal 4 crash:

070518 16:03:02 - mysqld got signal 4;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=8388572
read_buffer_size=131072
max_used_connections=16
max_threads=151
threads_connected=16
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 337606 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x8e79750
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x459e83bc, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x81fd57b handle_segfault + 541
0xffffe410 _end + -142580448
0x83c0588 _ZN5Error10debugBreakEv + 18
0x83c04ff _ZN5Error5errorEPKcz + 87
0x83c056e _ZN5Error15assertionFailedEPKci + 32
0x840fea0 _ZN16SRLUpdateRecords4thawEP13RecordVersion + 218
0x839d609 _ZN11Transaction4thawEP13RecordVersion + 101
0x83e8055 _ZN13RecordVersion4thawEv + 85
0x83e7dab _ZN13RecordVersion12fetchVersionEP11Transaction + 97
0x838a770 _ZN15StorageDatabase7nextRowEP12StorageTableib + 126
0x838e772 _ZN12StorageTable4nextEib + 46
0x8383750 _ZN16StorageInterface8rnd_nextEPc + 136
0x82dd683 _Z13rr_sequentialP14st_read_record + 135
0x826bef1 _Z10sub_selectP4JOINP13st_join_tableb + 223
0x826ba25 _Z9do_selectP4JOINP4ListI4ItemEP8st_tableP9Procedure + 759
0x825aeaf _ZN4JOIN4execEv + 7045
0x825b38a _Z12mysql_selectP3THDPPP4ItemP13st_table_listjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_sel + 638
0x825610d _Z13handle_selectP3THDP6st_lexP13select_resultm + 365
0x82114c4 _Z21execute_sqlcom_selectP3THDP13st_table_list + 772
0x820a039 _Z21mysql_execute_commandP3THD + 1595
0x8212fe1 _Z11mysql_parseP3THDPcj + 299
0x82084ad _Z16dispatch_command19enum_server_commandP3THDPcj + 2241
0x8207be0 _Z10do_commandP3THD + 612
0x8206753 handle_one_connection + 253
0x40250aa7 _end + 933596087
0x401e6c2e _end + 933162302
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do 
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x45a19798  is invalid pointer
thd->thread_id=8

How to repeat:
will upload a .c testcase later.

Suggested fix:
.
[18 May 2007 16:18] Kevin Lewis
I am assuming that this assert failed;
	ASSERT(record->state == recChilled);
because it is the only assert in SRLUpdateRecords::thaw(RecordVersion *record)

The problem is that two functions up the call stack in RecordVersion::thaw() we find;
	ASSERT(state == recChilled);
So one moment the record was chilled, and the next moment it wasn't...
We need some coordination between threads here.
[19 May 2007 9:42] Shane Bester
testcase. sorry for the random nature of it, but it seems you already have idea how to fix the bug?

Attachment: bug28519.c (text/plain), 7.40 KiB.

[25 Jun 2007 20:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/29561

ChangeSet@1.2560, 2007-06-25 15:38:16-05:00, chris@terrazzo.site +12 -0
  1. Bug#28519 "Falcon crash with signal 4" - Replaced state == chilled
     assertions in record thaw methods with a benign function return. 
  2. Reformatted chill/thaw debug output.
  3. Check deferredIndex->index == null in Transaction::dropTable(). 
  4. Added RecordVersion::getRecordData() to encapsulate data.record.
     If necessary, getRecordData() peforms a record thaw before returning
     a pointer to the record data.
[25 Jun 2007 20:59] Christopher Powers
The testcase exposes a race condition in record thawing such that one thread thaws a record out from underneath another thread performing a thaw:

StorageInterface::rnd_next()
  StorageTable::next()
     StorageDatabase::nextRow()
        RecordVersion::fetchVersion()          recChilled == TRUE here
           RecordVersion::thaw()
              Transaction::thaw()
                  SRLUpdateRecords::thaw()
                     ASSERT(state == recChilled)   FAILS

In this case, the record being thawed was thawed by another thread sometime between fetchVersion() and SRLUpdateRecords::thaw().

This isn't necessarily a bad thing, because the only operations during a thaw that must be serialized are, 1) activating the serial log window from which the record data will be restored, and 2) setting the data.record pointer after the data has been restored.

Serial log window access is serialized by an exclusive lock, and changes to record.data are performed atomically via compare and exchange. So, if during a thaw the record is thawed on another thread, then the current thaw operation should simply return without error.
[26 Jun 2007 6:24] Hakan Küçükyılmaz
I could successfully run a modified version of the testcase with 51 threads and 1800 seconds runtime. I also tried the original testcase with 25 threads and 5600 seconds runtime.

The test does not crash Falcon anymore but it hangs sometimes.
[27 Nov 2007 8:42] Hakan Küçükyılmaz
I successfully ran the test case for 5600 seconds.
[3 Dec 2007 14:21] MC Brown
A note has been added to the 6.0.4 changelog: 

Under heavy load when updating Falcon tables, a race condition could occur that would ultimately result in a crash.