Bug #33933 | Falcon assertionFailed() in fetchNext() | ||
---|---|---|---|
Submitted: | 18 Jan 2008 20:28 | Modified: | 1 Oct 2008 18:09 |
Reporter: | Philip Stoev | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0.4-p3-debug | OS: | Any |
Assigned to: | Kevin Lewis | CPU Architecture: | Any |
Tags: | simpler_testcase_needed |
[18 Jan 2008 20:28]
Philip Stoev
[18 Jan 2008 21:32]
Philip Stoev
This crash is consistently reproducible with a debug binary.
[22 Jan 2008 21:36]
Kevin Lewis
Philip, This asserts after three attempts to get a new record added to the table. The falcon debug log would add more information to what happened. Can you attach a debug log?
[29 Jan 2008 12:27]
Philip Stoev
Can you please take a look if the falcon debug log is of any use?
[10 Feb 2008 4:21]
Kevin Lewis
Last week, Chris got this same assertion error while running the test for 33634 (iuds6, I think). That is; 23 Error::assertionFailed() storage/falcon/Error.cpp:76 22 Table::fetchNext() storage/falcon/Table.cpp:523 21 StorageDatabase::nextRow() storage/falcon/StorageDatabase.cpp:244 20 StorageTable::next() storage/falcon/StorageTable.cpp:161 19 StorageInterface::rnd_next() storage/falcon/ha_falcon.cpp:552 Jim responded with this; "Interesting. I suspect it is a relative of the bug I'm trying to run to ground. Remember how you got it. When I find my bug and push a fix, try it again. Maybe it's the same, maybe it's different. It's certainly in the same area...." Yesterday, Jim found and fixed the record reuse bug he was seeing. He emailed this comment; /start comment/ "The problem was inadequate synchronization of gopher threads. There are two records in different transactions, a insert and the deletion. The gopher starts to apply the insert, checks the record tree for conflict, everything looks fine, and calls Section::updateRecord. In the meantime, apparently, the delete happens, get committed, and a gopher fires up, and somehow beats the insert through Section::updateRecord. I've got to push the test of the record tree until after Section::updateRecord has an exclusive lock on the record locator page. Then I have to rip out a ton of debugging code. As I've said before, if there is any hole, no matter how small, an SMP system and a heavy load will find it." /end comment/ So it looks like this bug may have been the bug Jim was fixing. I am making this Patch Approved, since I did a code review. Let us know if this call stack occurs again.
[11 Feb 2008 20:51]
Kevin Lewis
Patch is in mysql-6.0-falcon-team and mysql-6.0-release version 6.0.4
[16 Feb 2008 8:45]
Philip Stoev
This call stack occured again using the latest 6.0.4 binary. I am a bit concerned about the line which says Error::assertionFailed (fileName=0x1f22 <Address 0x1f22 out of bounds>, line=7708) at Error.cpp:76. #0 0x0000003ba880b132 in pthread_kill () from /lib64/libpthread.so.0 #1 0x0000000000650e12 in handle_segfault (sig=4) at mysqld.cc:2313 #2 <signal handler called> #3 0x0000003ba880dd4d in raise () from /lib64/libpthread.so.0 #4 0x0000000000854f1c in Error::debugBreak () at Error.cpp:92 #5 0x0000000000854e30 in Error::error (string=0x0) at Error.cpp:69 #6 0x0000000000854f06 in Error::assertionFailed (fileName=0x1f22 <Address 0x1f22 out of bounds>, line=7708) at Error.cpp:76 #7 0x0000000000824e68 in Table::fetchNext (this=0x2aaaab0554a8, start=1576) at Table.cpp:524 #8 0x0000000000819eb6 in StorageDatabase::nextRow (this=0x1f22, storageTable=0x2aaab4ae7d40, recordNumber=1574, lockForUpdate=false) at StorageDatabase.cpp:244 #9 0x000000000081e87c in StorageTable::next (this=0x0, recordNumber=7708, lockForUpdate=255) at StorageTable.cpp:161 #10 0x000000000080eb26 in StorageInterface::rnd_next (this=0x2aaaff642f00, buf=0x2aaac26b6c98 "О©ҐJ\f") at ha_falcon.cpp:552 #11 0x0000000000740f99 in ha_partition::rnd_next (this=0x2aaac26b6970, buf=0x2aaac26b6c98 "О©ҐJ\f") at ha_partition.cc:3203 #12 0x00000000007300ab in rr_sequential (info=0x2aaab000dc88) at records.cc:383 #13 0x00000000006bdf7c in sub_select (join=0x200a9598, join_tab=0x2aaab000dc00, end_of_records=4) at sql_select.cc:13290 #14 0x00000000006bdb0d in do_select (join=0x200a9598, fields=0x200aafe8, table=0x0, procedure=0x0) at sql_select.cc:13038 #15 0x00000000006a87d2 in JOIN::exec (this=0x200a9598) at sql_select.cc:2713 #16 0x00000000006a8d54 in mysql_select (thd=0x2aaab009e070, rref_pointer_array=0x1fdf3c20, tables=0x3938c8b8, wild_num=0, fields=@0x1fdf3b00, conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2416200192, result=0x3938c760, unit=0x1fdf35f8, select_lex=0x1fdf3a38) at sql_select.cc:2901 #17 0x00000000006a3272 in handle_select (thd=0x2aaab009e070, lex=0x1fdf3558, result=0x3938c760, setup_tables_done_option=0) at sql_select.cc:270 #18 0x0000000000661999 in execute_sqlcom_select (thd=0x2aaab009e070, all_tables=0x3938c8b8) at sql_parse.cc:4517 #19 0x000000000065ceb2 in mysql_execute_command (thd=0x2aaab009e070) at sql_parse.cc:1845 #20 0x00000000006d8009 in Prepared_statement::execute (this=0x1fdf3240, expanded_query=0x40a80980, open_cursor=false) at sql_prepare.cc:3100 #21 0x00000000006d69df in mysql_sql_stmt_execute (thd=0x2aaab009e070) at sql_prepare.cc:2391 #22 0x000000000065cecc in mysql_execute_command (thd=0x2aaab009e070) at sql_parse.cc:1854 #23 0x000000000066317e in mysql_parse (thd=0x2aaab009e070, inBuf=0x200a93a0 "EXECUTE sel_vals", length=16, found_semicolon=0x40a81590) at sql_parse.cc:5410 #24 0x000000000065b895 in dispatch_command (command=COM_QUERY, thd=0x2aaab009e070, packet=0x200a93b0 "", packet_length=16) at sql_parse.cc:948 #25 0x000000000065b092 in do_command (thd=0x2aaab009e070) at sql_parse.cc:697 #26 0x0000000000659e4f in handle_one_connection (arg=0x1f22) at sql_connect.cc:1146 #27 0x0000003ba88062f7 in start_thread () from /lib64/libpthread.so.0 #28 0x0000003ba80ce85d in clone () from /lib64/libc.so.6
[16 Feb 2008 8:55]
Philip Stoev
Those lines are in the falcon_debug_mask-enabled error log from the second the crash occured. Is it possible that Falcon asserts because of bug #33479 ? 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition 080216 9:29:16 [ERROR] Failed to calculate auto_increment value for partition
[19 Feb 2008 18:34]
Kevin Lewis
Philip, Can you reporoduce this? Which test were you running? What MySQL features does it involve besides partitions? I think the assertion should not occur even with the partition problem in 33479. Please try to make this repeatable.
[19 Feb 2008 19:56]
Philip Stoev
Ok I will work on a simpler testcase.
[7 Mar 2008 16:25]
Philip Stoev
Both iuds2 (no partitions) and iuds6 (partitions) will arrive to that crash sooner or later. The difference is that the backtrace would include a ha_partition segment if partitions are involved.
[2 May 2008 4:52]
Kevin Lewis
Philip, There was a patch for Bug#34990 on [5 Mar 2008 21:27] that may have affected this. Section::insertStub() was returning the same Record Number to two different threads that did an insert of a new record. It may be that this code got involved in that mixup. In this case, a newly read existing record cannot be put into the record cache as the base record version because the same thing is happening by another thread. Sounds similar to #34990. Can you check if this callstack still occurs. The assert is; for (int n = 0; (record = databaseFetch(bitNumber)); ++n) { . . . ASSERT(n < 2); }
[7 May 2008 13:27]
Philip Stoev
This call stack still occurs with the 6.0.5 build. The assertion is ASSERT(n < 2), however n = 2.
[21 May 2008 13:30]
Kevin Lewis
Philip, Please check if the value of bitNumber is the same as recordNumber when this error occurs.
[21 May 2008 13:34]
Philip Stoev
Verified by code inspection from Jim Starkey.
[21 May 2008 17:02]
Kevin Lewis
This may have been fixed with this patch; ChangeSet@1.2670.3.3, 2008-05-21 10:58:08-04:00, jas@pendragon. +4 -0 Do some quick fixes for race conditions.
[22 Aug 2008 17:14]
Kevin Lewis
This is fixed in 6.0.6.
[30 Sep 2008 18:30]
Jon Stephens
When would a user-initiated action(s) would trigger this bug? Could this issue occur with any user-partitioned Falcon table with an auto_increment column? Thanks.
[1 Oct 2008 14:23]
Kevin Lewis
This assert may have been caused by 2 or three conditions. They all had to do with inserts and deletes to the same table in a highly concurrent environment. The assert indicated a failure to find a record that should have been in the table.
[1 Oct 2008 18:09]
Jon Stephens
Documented in the 6.0.6 changelog as follows: In some cases, concurrent INSERT and DELETE statements on the same Falcon table could cause the server to crash, due to a failure to find a record that should have been in the table.