MySQL Bugs: #33933: Falcon assertionFailed() in fetchNext()

Bug #33933	Falcon assertionFailed() in fetchNext()
Submitted:	18 Jan 2008 20:28	Modified:	1 Oct 2008 18:09
Reporter:	Philip Stoev	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0.4-p3-debug	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any
Tags:	simpler_testcase_needed

Description:
About an hour into the iuds2 test, Falcon crashed with the following stack trace:

#0  0x00002b88d306e4c5 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00000000005df018 in handle_segfault (sig=4) at mysqld.cc:2313
#2  <signal handler called>
#3  0x00002b88d3070fed in raise () from /lib64/libpthread.so.0
#4  0x00000000007b8d4c in Error::debugBreak () at Error.cpp:92
#5  0x00000000007b8e42 in Error::error (string=<value optimized out>) at Error.cpp:69
#6  0x00000000007b8ea2 in Error::assertionFailed (fileName=0x4 <Address 0x4 out of bounds>, line=20000) at Error.cpp:76
#7  0x0000000000785e4f in Table::fetchNext (this=0x2aaaaacfe228, start=<value optimized out>) at Table.cpp:522
#8  0x000000000077949f in StorageDatabase::nextRow (this=<value optimized out>, storageTable=0x2aaaaf01edc0, recordNumber=20491, lockForUpdate=false)
    at StorageDatabase.cpp:244
#9  0x000000000077d65d in StorageTable::next (this=0x0, recordNumber=4, lockForUpdate=255) at StorageTable.cpp:161
#10 0x0000000000775e20 in StorageInterface::rnd_next (this=0x2aaab8fd23d0, buf=0x2aaab8fd2670 "О©Ґ\201\003") at ha_falcon.cpp:506
#11 0x00000000006c8fb3 in rr_sequential (info=0x159f108) at records.cc:369
#12 0x00000000006519aa in sub_select (join=0x2aaab8e4a9e0, join_tab=0x159f080, end_of_records=4) at sql_select.cc:13290
#13 0x0000000000652f6f in do_select (join=0x2aaab8e4a9e0, fields=0x2aaab8e4c430, table=0x0, procedure=0x0) at sql_select.cc:13038
#14 0x0000000000660d30 in JOIN::exec (this=0x2aaab8e4a9e0) at sql_select.cc:2713
#15 0x0000000000661a2e in mysql_select (thd=0x2aaab8f78900, rref_pointer_array=0x1190920, tables=0x10f26a8, wild_num=0, fields=@0x1190840, conds=0x0,
    og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2416200192, result=0x10f2550, unit=0x11902f8, select_lex=0x1190738)
    at sql_select.cc:2901
#16 0x0000000000661f64 in handle_select (thd=0x2aaab8f78900, lex=0x1190258, result=0x10f2550, setup_tables_done_option=0) at sql_select.cc:282
#17 0x00000000005eab2e in execute_sqlcom_select (thd=0x2aaab8f78900, all_tables=0x10f26a8) at sql_parse.cc:4517
#18 0x00000000005ed459 in mysql_execute_command (thd=0x2aaab8f78900) at sql_parse.cc:1845
#19 0x000000000066e569 in Prepared_statement::execute (this=0x118ff40, expanded_query=<value optimized out>, open_cursor=false) at sql_prepare.cc:3100
#20 0x000000000066f2bf in mysql_sql_stmt_execute (thd=<value optimized out>) at sql_prepare.cc:2391
#21 0x00000000005ed486 in mysql_execute_command (thd=0x2aaab8f78900) at sql_parse.cc:1854
#22 0x00000000005f3482 in mysql_parse (thd=0x2aaab8f78900, inBuf=0x159d610 "EXECUTE sel_vals", length=16, found_semicolon=0x45bd9090) at sql_parse.cc:5410
#23 0x00000000005f42b2 in dispatch_command (command=COM_QUERY, thd=0x2aaab8f78900, packet=<value optimized out>, packet_length=16) at sql_parse.cc:921
#24 0x00000000005f50a2 in do_command (thd=0x2aaab8f78900) at sql_parse.cc:697
#25 0x00000000005e6990 in handle_one_connection (arg=<value optimized out>) at sql_connect.cc:1146
#26 0x00002b88d306a193 in start_thread () from /lib64/libpthread.so.0
#27 0x00002b88d37b145d in clone () from /lib64/libc.so.6
#28 0x0000000000000000 in ?? ()

How to repeat:
Core file from debug binary will be available shortly. Please let me know if any additional action is required, and I will do it.

This crash is consistently reproducible with a debug binary.

Philip,  This asserts after three attempts to get a new record added to the table.  The falcon debug log would add more information to what happened.  Can you attach a debug log?

Can you please take a look if the falcon debug log is of any use?

Last week, Chris got this same assertion error while running the test for 33634 (iuds6, I think). That is;

23 Error::assertionFailed() storage/falcon/Error.cpp:76
22 Table::fetchNext() storage/falcon/Table.cpp:523
21 StorageDatabase::nextRow() storage/falcon/StorageDatabase.cpp:244
20 StorageTable::next() storage/falcon/StorageTable.cpp:161
19 StorageInterface::rnd_next() storage/falcon/ha_falcon.cpp:552

Jim responded with this; "Interesting. I suspect it is a relative of the bug I'm trying to run to ground. Remember how you got it. When I find my bug and push a fix, try it again. Maybe it's the same, maybe it's different. It's certainly in the same area...."

Yesterday, Jim found and fixed the record reuse bug he was seeing. He emailed this comment;

/start comment/ "The problem was inadequate synchronization of gopher threads. There are two records in different transactions, a insert and the deletion. The
gopher starts to apply the insert, checks the record tree for conflict, everything looks fine, and calls Section::updateRecord. In the meantime, apparently, the delete happens, get committed, and a gopher fires up, and somehow beats the insert through Section::updateRecord.

I've got to push the test of the record tree until after Section::updateRecord has an exclusive lock on the record locator page. Then I have to rip out a ton of debugging code.

As I've said before, if there is any hole, no matter how small, an SMP system and a heavy load will find it." /end comment/

So it looks like this bug may have been the bug Jim was fixing. I am making this Patch Approved, since I did a code review. Let us know if this call stack occurs again.

Patch is in mysql-6.0-falcon-team and mysql-6.0-release version 6.0.4

This call stack occured again using the latest 6.0.4 binary. I am a bit concerned about the line which says Error::assertionFailed (fileName=0x1f22 <Address 0x1f22 out of bounds>, line=7708) at Error.cpp:76.

#0  0x0000003ba880b132 in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000000650e12 in handle_segfault (sig=4) at mysqld.cc:2313
#2  <signal handler called>
#3  0x0000003ba880dd4d in raise () from /lib64/libpthread.so.0
#4  0x0000000000854f1c in Error::debugBreak () at Error.cpp:92
#5  0x0000000000854e30 in Error::error (string=0x0) at Error.cpp:69
#6  0x0000000000854f06 in Error::assertionFailed (fileName=0x1f22 <Address 0x1f22 out of bounds>, line=7708) at Error.cpp:76
#7  0x0000000000824e68 in Table::fetchNext (this=0x2aaaab0554a8, start=1576) at Table.cpp:524
#8  0x0000000000819eb6 in StorageDatabase::nextRow (this=0x1f22, storageTable=0x2aaab4ae7d40, recordNumber=1574, lockForUpdate=false)
    at StorageDatabase.cpp:244
#9  0x000000000081e87c in StorageTable::next (this=0x0, recordNumber=7708, lockForUpdate=255) at StorageTable.cpp:161
#10 0x000000000080eb26 in StorageInterface::rnd_next (this=0x2aaaff642f00, buf=0x2aaac26b6c98 "О©ҐJ\f") at ha_falcon.cpp:552
#11 0x0000000000740f99 in ha_partition::rnd_next (this=0x2aaac26b6970, buf=0x2aaac26b6c98 "О©ҐJ\f") at ha_partition.cc:3203
#12 0x00000000007300ab in rr_sequential (info=0x2aaab000dc88) at records.cc:383
#13 0x00000000006bdf7c in sub_select (join=0x200a9598, join_tab=0x2aaab000dc00, end_of_records=4) at sql_select.cc:13290
#14 0x00000000006bdb0d in do_select (join=0x200a9598, fields=0x200aafe8, table=0x0, procedure=0x0) at sql_select.cc:13038
#15 0x00000000006a87d2 in JOIN::exec (this=0x200a9598) at sql_select.cc:2713
#16 0x00000000006a8d54 in mysql_select (thd=0x2aaab009e070, rref_pointer_array=0x1fdf3c20, tables=0x3938c8b8, wild_num=0, fields=@0x1fdf3b00, conds=0x0,
    og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2416200192, result=0x3938c760, unit=0x1fdf35f8, select_lex=0x1fdf3a38)
    at sql_select.cc:2901
#17 0x00000000006a3272 in handle_select (thd=0x2aaab009e070, lex=0x1fdf3558, result=0x3938c760, setup_tables_done_option=0) at sql_select.cc:270
#18 0x0000000000661999 in execute_sqlcom_select (thd=0x2aaab009e070, all_tables=0x3938c8b8) at sql_parse.cc:4517
#19 0x000000000065ceb2 in mysql_execute_command (thd=0x2aaab009e070) at sql_parse.cc:1845
#20 0x00000000006d8009 in Prepared_statement::execute (this=0x1fdf3240, expanded_query=0x40a80980, open_cursor=false) at sql_prepare.cc:3100
#21 0x00000000006d69df in mysql_sql_stmt_execute (thd=0x2aaab009e070) at sql_prepare.cc:2391
#22 0x000000000065cecc in mysql_execute_command (thd=0x2aaab009e070) at sql_parse.cc:1854
#23 0x000000000066317e in mysql_parse (thd=0x2aaab009e070, inBuf=0x200a93a0 "EXECUTE sel_vals", length=16, found_semicolon=0x40a81590) at sql_parse.cc:5410
#24 0x000000000065b895 in dispatch_command (command=COM_QUERY, thd=0x2aaab009e070, packet=0x200a93b0 "", packet_length=16) at sql_parse.cc:948
#25 0x000000000065b092 in do_command (thd=0x2aaab009e070) at sql_parse.cc:697
#26 0x0000000000659e4f in handle_one_connection (arg=0x1f22) at sql_connect.cc:1146
#27 0x0000003ba88062f7 in start_thread () from /lib64/libpthread.so.0
#28 0x0000003ba80ce85d in clone () from /lib64/libc.so.6

Those lines are in the falcon_debug_mask-enabled error log from the second the crash occured. Is it possible that Falcon asserts because of bug #33479 ?

080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition
080216  9:29:16 [ERROR] Failed to calculate auto_increment value for partition

Philip, Can you reporoduce this?  Which test were you running?  What MySQL features does it involve besides partitions?  I think the assertion should not occur even with the partition problem in 33479.  Please try to make this repeatable.

Ok I will work on a simpler testcase.

Both iuds2 (no partitions) and iuds6 (partitions) will arrive to that crash sooner or later. The difference is that the backtrace would include a ha_partition segment if partitions are involved.

Philip, There was a patch for Bug#34990 on [5 Mar 2008 21:27] that may have affected this.  Section::insertStub() was returning the same Record Number to
two different threads that did an insert of a new record.  It may be that this code got involved in that mixup.  In this case, a newly read existing record cannot be put into the record cache as the base record version because the same thing is happening by another thread.  Sounds similar to #34990.  Can you check if this callstack still occurs.  The assert is;
	for (int n = 0; (record = databaseFetch(bitNumber)); ++n)
		{
		. . .
		ASSERT(n < 2);
		}

This call stack still occurs with the 6.0.5 build. The assertion is  ASSERT(n < 2), however n = 2.

Philip, Please check if the value of bitNumber is the same as recordNumber when this error occurs.

Verified by code inspection from Jim Starkey.

This may have been fixed with this patch;

ChangeSet@1.2670.3.3, 2008-05-21 10:58:08-04:00, jas@pendragon. +4 -0
  Do some quick fixes for race conditions.

This is fixed in 6.0.6.

When would a user-initiated action(s) would trigger this bug? Could this issue occur with any user-partitioned Falcon table with an auto_increment column?

Thanks.

This assert may have been caused by 2 or three conditions.  They all had to do with inserts and deletes to the same table in a highly concurrent environment.  The assert indicated a failure to find a record that should have been in the table.

Documented in the 6.0.6 changelog as follows:

        In some cases, concurrent INSERT and DELETE statements on the same
        Falcon table could cause the server to crash, due to a failure to find a
        record that should have been in the table.