Bug #45845 | Falcon crashes while running falcon_bug_36294-big test | ||
---|---|---|---|
Submitted: | 30 Jun 2009 8:33 | Modified: | 26 May 2010 17:53 |
Reporter: | John Embretsen | Email Updates: | |
Status: | Unsupported | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0.12-bzr (revno: 2750) | OS: | Any |
Assigned to: | Kevin Lewis | CPU Architecture: | Any |
Tags: | F_MEMORY |
[30 Jun 2009 8:33]
John Embretsen
[30 Jun 2009 8:37]
John Embretsen
Stack trace from core on OpenSolaris (snv_113 X86, 64-bit):

current thread: t@1
  [1] _lwp_kill(0x1, 0x6, 0xffffff11e18c8520, 0x0, 0x0, 0x0), at 0xfffffd7fff2142da
  [2] thr_kill(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2088fd
=>[3] my_write_core(sig = 6), line 309 in "stacktrace.c"
  [4] signal_handler(sig = 6), line 7472 in "mysqltest.cc"
  [5] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20b0a6
  [6] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fdfaf
  [7] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fe1be
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [8] __read(0x7, 0x870088, 0x4000, 0x0, 0x8691f0, 0x8691f0), at 0xfffffd7fff213f0a
  [9] read(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1f9ecf
  [10] vio_read(vio = 0x8691f0, buf = 0x870088 "^G", size = 16384U), line 44 in "viosocket.c"
  [11] vio_read_buff(vio = 0x8691f0, buf = 0x854098 "\n", size = 4U), line 83 in "viosocket.c"
  [12] my_real_read(net = 0x6ca980, complen = 0xfffffd7fffdfcf68), line 837 in "net.c"
  [13] my_net_read(net = 0x6ca980), line 1020 in "net.c"
  [14] cli_safe_read(mysql = 0x6ca980), line 677 in "client.c"
  [15] cli_read_query_result(mysql = 0x6ca980), line 2816 in "client.c"
  [16] mysql_read_query_result(mysql = 0x6ca980), line 4893 in "libmysql.c"
  [17] run_query_normal(cn = 0x6ca980, command = 0x8850f0, flags = 3, query = 0x7091a8 "CALL p1()", query_len = 9, ds = 0x6c5c90, ds_warnings = 0xfffffd7fffdfd250), line 6490 in "mysqltest.cc"
  [18] run_query(cn = 0x6ca980, command = 0x8850f0, flags = 3), line 7193 in "mysqltest.cc"
  [19] main(argc = 12, argv = 0xfffffd7fffdfd608), line 7823 in "mysqltest.cc"

Stack trace from Solaris 10 x86_64:

current thread: t@6
  [1] __lwp_kill(0x6, 0xb, 0xfffffe81dd8ef160, 0x0, 0x0, 0x0), at 0xfffffd7ffedc42ea
  [2] _thr_kill(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedbf2d3
=>[3] my_write_core(sig = ???) (optimized), at 0x12f29b9 (line ~309) in "stacktrace.c"
  [4] handle_segfault(sig = ???) (optimized), at 0xa7211f (line ~2711) in "mysqld.cc"
  [5] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedc1216
  [6] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedb64a2
  [7] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedb6688
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [8] MemFreeBlock::findNextLargest(this = ???, size = ???) (optimized), at 0xce264c (line ~153) in "MemFreeBlock.cpp"
  [9] MemMgr::alloc(this = ???, s = ???) (optimized), at 0xc67daa (line ~453) in "MemMgr.cpp"
  [10] MemMgr::allocate(this = ???, size = ???) (optimized), at 0xc67fd5 (line ~556) in "MemMgr.cpp"
  [11] MemMgrPoolAllocate(pool = ???, s = ???) (optimized), at 0xc674f9 (line ~183) in "MemMgr.cpp"
  [12] operator new[](s = ???, pool = ???) (optimized), at 0xc6bd52 (line ~95) in "MemoryManager.h"
  [13] Record::allocRecordData(this = ???, length = ???) (optimized), at 0xc6b9ae (line ~1081) in "Record.cpp"
  [14] Record::setRecordData(this = ???, dataIn = ???, dataLength = ???, bytesReallocated = ???) (optimized), at 0xc6b58d (line ~911) in "Record.cpp"
  [15] SRLUpdateRecords::thaw(this = ???, record = ???, bytesReallocated = ???) (optimized), at 0xc7ec9f (line ~129) in "SRLUpdateRecords.cpp"
  [16] Transaction::thaw(this = ???, record = ???, bytesReallocated = ???) (optimized), at 0xc9fb06 (line ~572) in "Transaction.cpp"
  [17] RecordVersion::thaw(this = ???) (optimized), at 0xc6c857 (line ~430) in "RecordVersion.cpp"
  [18] RecordVersion::getRecordData(this = ???) (optimized), at 0xc6cc81 (line ~567) in "RecordVersion.cpp"
  [19] Record::isNull(this = ???, fieldId = ???) (optimized), at 0xc6b3b6 (line ~828) in "Record.cpp"
  [20] RecordScavenge::inventoryRecord(this = ???, record = ???) (optimized), at 0xcf05ba (line ~130) in "RecordScavenge.cpp"
  [21] RecordLeaf::pruneRecords(this = ???, table = ???, base = ???, recordScavenge = ???) (optimized), at 0xcef009 (line ~137) in "RecordLeaf.cpp"
  [22] RecordGroup::pruneRecords(this = ???, table = ???, base = ???, recordScavenge = ???) (optimized), at 0xd22ad3 (line ~122) in "RecordGroup.cpp"
  [23] RecordGroup::pruneRecords(this = ???, table = ???, base = ???, recordScavenge = ???) (optimized), at 0xd22a25 (line ~122) in "RecordGroup.cpp"
  [24] Table::pruneRecords(this = ???, recordScavenge = ???) (optimized), at 0xc93cd0 (line ~1956) in "Table.cpp"
  [25] Database::pruneRecords(this = ???, recordScavenge = ???) (optimized), at 0xcc29f2 (line ~1937) in "Database.cpp"
  [26] Database::scavengeRecords(this = ???, forced = ???) (optimized), at 0xcc2851 (line ~1876) in "Database.cpp"
  [27] Database::scavenge(this = ???, forced = ???) (optimized), at 0xcc268f (line ~1814) in "Database.cpp"
  [28] Database::scavengerThreadMain(this = ???) (optimized), at 0xcc2ca7 (line ~2014) in "Database.cpp"
  [29] Database::scavengerThreadMain(database = ???) (optimized), at 0xcc2c49 (line ~2003) in "Database.cpp"
  [30] Thread::thread(this = ???) (optimized), at 0xc9b447 (line ~167) in "Thread.cpp"
  [31] Thread::thread(parameter = ???) (optimized), at 0xc9b3b2 (line ~146) in "Thread.cpp"
  [32] _thr_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedc0edb
  [33] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffedc1110
[30 Jun 2009 9:00]
Olav Sandstå
Some comments: The crash on Linux looks very similar to Bug #45746, "falcon_backlog test fails due to Gopher hitting 'record memory is exhausted'". The call stack on Solaris 10 seems to crash in the memory manager, in the same part of the code as reported in Bug #41472, "Falcon crash in MemFreeBlock::findNextLargest".
[30 Jun 2009 15:29]
Kevin Lewis
This failure is rather unique to the pushbuild environment, where memory is limited, the engine runs in a VM with other things competing for the CPU, and the application quits waiting after a limited time.

This test creates a table with an int and a char(500). It inserts one record and then doubles the table size 22 times. That results in a table with about 4 million records and a tablespace file that is over 2 GB in size.

According to the call stack dump, there is one thread trying to do an insert which cannot allocate a record. It is waiting patiently for the scavenger thread, which is hard at work looking for records to retire.

Thread 2 (process 7695):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  Synchronize::sleep (milliseconds=100, callersMutex=0x0) at Synchronize.cpp:183
#2  Synchronize::sleep (milliseconds=1249796352) at Synchronize.cpp:136
#3  Record::allocRecordData (length=513) at Record.cpp:1098
#4  Record::setEncodedRecord (stream, interlocked=91) at Record.cpp:746
#5  Table::allocRecord (recordNumber=123939, stream) at Table.cpp:3767
#6  Table::databaseFetch (recordNumber=123939) at Table.cpp:732
#7  Table::fetchNext (start=0) at Table.cpp:575
#8  StorageDatabase::nextRow (storageTable, recordNumber=123939, lockForUpdate=false) at StorageDatabase.cpp:286
#9  StorageInterface::rnd_next (buf) at ha_falcon.cpp:654
#10 rr_sequential (info=0x1cf19490) at records.cc:390
#11 sub_select (join, join_tab, end_of_records) at sql_select.cc:16303
#12 do_select (join, fields, table, procedure) at sql_select.cc:15844
#13 JOIN::exec () at sql_select.cc:2468
#14 mysql_select (thd, rref_pointer_array, tables, wild_num, fields, conds, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=3489942016, result, unit=0x1cec88c0, select_lex) at sql_select.cc:3067
#15 handle_select (thd, lex, result, setup_tables_done_option) at sql_select.cc:310
#16 mysql_execute_command (thd) at sql_parse.cc:3316
#17 mysql_parse (thd, inBuf= "INSERT INTO ftest(col1) SELECT col1 FROM ftest", length=46, found_semicolon) at sql_parse.cc:5979

Thread 13 (process 31928):
#0  RecordScavenge::canBeRetired () at RecordScavenge.cpp:93
#1  RecordLeaf::retireRecords (table, base=6029, recordScavenge) at RecordLeaf.cpp:186
#2  RecordGroup::retireRecords (table, base, recordScavenge) at RecordGroup.cpp:135
#3  RecordGroup::retireRecords (table, base, recordScavenge) at RecordGroup.cpp:135
#4  RecordGroup::retireRecords (table, base, recordScavenge) at RecordGroup.cpp:135
#5  Table::retireRecords (recordScavenge) at Table.cpp:1974
#6  Database::retireRecords (recordScavenge) at Database.cpp:1969
#7  Database::scavengeRecords (forced=false) at Database.cpp:1884
#8  Database::scavenge (forced=false) at Database.cpp:1814
#9  Database::scavengerThreadMain () at Database.cpp:2014

This should not be a beta bug. If the virtual machine that runs pushbuild 2 has only 2 GB, then we should probably fix this by taking the last INSERT INTO ftest(col1) SELECT col1 FROM ftest; out of the test, so that the final file size is 1 GB.
[30 Jun 2009 17:45]
Kevin Lewis
I notice in the previous call stack that the scavenge cycle is not a forced cycle. So the client thread that is waiting on memory to be released is not the thread that started the cycle. The scavenge cycle started on a timed event, which means that it will do a full attempt to retire records that are in the oldest age group. The client thread makes a loop like this:

=======================================
for (int n = 1;; ++n)
	{
	try
		{
		POOL_NEW(format->table->database->recordDataPool) char[length];
		}
	catch (SQLException& exception)
		{
		// If the error was out-of-memory, signal the scavenger,
		// sleep(n), and try again. But try a limited number of times.

		if (exception.getSqlcode() != OUT_OF_RECORD_MEMORY_ERROR || n > 10)
			throw;

		format->table->database->signalScavenger(true);

		// Give the scavenger thread a chance to release memory.
		// Increase the wait time per iteration.

		Thread *thread = Thread::getThread("Database::ticker");
		thread->sleep(n * 50);
		}
	}
=======================================

So the client thread will wait 50 + 100 + 150 + ... + 500 msec, or a total of 2.75 seconds, having tried 10 times during that interval to allocate a record buffer. The client thread does not wait to be signaled by the scavenger; the scavenger must accomplish enough memory recycling during that interval. If there is a better algorithm for this test, maybe someone can suggest it.
[30 Jun 2009 19:03]
Ann Harrison
This test stresses Falcon in low-memory situations and shows a number of problems. The first traceback listed (from Linux) shows a situation where we need either an extra catch or one fewer throw.

#8  0x00007f3c7d1ea4e3 in std::terminate () from /usr/lib/libstdc++.so.6
#9  0x00007f3c7d1ea565 in __cxa_rethrow () from /usr/lib/libstdc++.so.6
#10 0x00000000009f0fcb in Thread::thread (this=0x7f3c7bd1cfe8) at Thread.cpp:202
#11 0x00000000009f10ad in Thread::thread (parameter=0x7f3c7bd1cfe8) at Thread.cpp:146
#12 0x00007f3c7da8d3ba in start_thread () from /lib/libpthread.so.0

There is no catch in Thread::thread(void *parameter) (line 146, approx.), so when Thread::thread() rethrows the exception at line 202, after logging a fatal error, it takes down the server. Perhaps when FALCON_DB is defined we shouldn't throw into space? As Olav said, two other tracebacks look like existing bugs. One looks as if the thread waiting for space timed out; Kevin knows more about that.
[2 Jul 2009 13:47]
John Embretsen
This test fails in various ways in various environments, including systems with a large amount of memory. Looking at Bug#36294 it seems it has always been this way. Also, a comment in the test says:

# This test produces no output. It is only here to crash the server
# once insert/select runs into error 305 (record memory exhaused).
# Exact point of crash is generally unpredictable.

So I am not sure what the expected behavior should be here. Do we expect Falcon to fail gracefully (no segmentation fault or other type of "random" crash) when running out of record cache (falcon_record_memory_max)? Are any crashes acceptable, given that we run on systems which can handle Falcon's default memory settings? (The default falcon_record_memory_max is 250 MB.)

Pushbuild 2 runs no virtual machines, so that is not the reason for these crashes. The PB2 Linux host on which this test failed has 8 GB of RAM. It may be shared, but should still have enough RAM to account for the default Falcon memory limits. The OS X machine also has 8 GB of RAM.

I tried running this test on a 64-bit machine with OpenSolaris, using the latest falcon-team branch. This machine is an 8-core with 64 GB of RAM.
With default settings I saw the following stack trace:

current thread: t@6
  [1] _lwp_kill(0x6, 0xb, 0xffffff11e187e200, 0x0, 0x0, 0x0), at 0xfffffd7fff2142da
  [2] thr_kill(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff2088fd
=>[3] my_write_core(sig = 11), line 309 in "stacktrace.c"
  [4] handle_segfault(sig = 11), line 2711 in "mysqld.cc"
  [5] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20b0a6
  [6] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fdfaf
  [7] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fe1be
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [8] TransactionState::committedBefore(this = 0xeeeeeeeeeeeeeeee, transactionId = 176U), line 57 in "TransactionState.cpp"
  [9] RecordVersion::committedBefore(this = 0x59dcf58, transId = 176U), line 249 in "RecordVersion.cpp"
  [10] RecordScavenge::inventoryRecord(this = 0xfffffd7ffe091a50, record = 0x59dcf58), line 157 in "RecordScavenge.cpp"
  [11] RecordLeaf::pruneRecords(this = 0x2c1bcb8, table = 0x2b139f8, base = 250, recordScavenge = 0xfffffd7ffe091a50), line 137 in "RecordLeaf.cpp"
  [12] RecordGroup::pruneRecords(this = 0x2c0f5f0, table = 0x2b139f8, base = 2, recordScavenge = 0xfffffd7ffe091a50), line 122 in "RecordGroup.cpp"
  [13] RecordGroup::pruneRecords(this = 0x2bf65a0, table = 0x2b139f8, base = 0, recordScavenge = 0xfffffd7ffe091a50), line 122 in "RecordGroup.cpp"
  [14] Table::pruneRecords(this = 0x2b139f8, recordScavenge = 0xfffffd7ffe091a50), line 1958 in "Table.cpp"
  [15] Database::pruneRecords(this = 0x29501d0, recordScavenge = 0xfffffd7ffe091a50), line 1937 in "Database.cpp"
  [16] Database::scavengeRecords(this = 0x29501d0, forced = false), line 1876 in "Database.cpp"
  [17] Database::scavenge(this = 0x29501d0, forced = false), line 1814 in "Database.cpp"
  [18] Database::scavengerThreadMain(this = 0x29501d0), line 2014 in "Database.cpp"
  [19] Database::scavengerThreadMain(database = 0x29501d0), line 2003 in "Database.cpp"
  [20] Thread::thread(this = 0x2b8fac8), line 167 in "Thread.cpp"
  [21] Thread::thread(parameter = 0x2b8fac8), line 146 in "Thread.cpp"
  [22] _thrp_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20ad25
  [23] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20afe0

With the MTR option --mysqld=falcon-record_memory_max set to 8 GB I saw this:

=>[3] my_write_core(sig = 11), line 309 in "stacktrace.c"
  [4] handle_segfault(sig = 11), line 2711 in "mysqld.cc"
  [5] __sighndlr(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20b0a6
  [6] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fdfaf
  [7] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff1fe1be
  ---- called from signal handler with signal 11 (SIGSEGV) ------
  [8] Record::isNull(this = 0xae49e8c8, fieldId = -1366968000), line 836 in "Record.cpp"
  [9] RecordScavenge::inventoryRecord(this = 0xfffffd7ffe091a50, record = 0xae49e8c8), line 130 in "RecordScavenge.cpp"
  [10] RecordLeaf::pruneRecords(this = 0xadb32a18, table = 0x2b11218, base = 35066, recordScavenge = 0xfffffd7ffe091a50), line 137 in "RecordLeaf.cpp"
  [11] RecordGroup::pruneRecords(this = 0xac231d28, table = 0x2b11218, base = 350, recordScavenge = 0xfffffd7ffe091a50), line 122 in "RecordGroup.cpp"
  [12] RecordGroup::pruneRecords(this = 0x69cbfe78, table = 0x2b11218, base = 3, recordScavenge = 0xfffffd7ffe091a50), line 122 in "RecordGroup.cpp"
  [13] RecordGroup::pruneRecords(this = 0x21ee2430, table = 0x2b11218, base = 0, recordScavenge = 0xfffffd7ffe091a50), line 122 in "RecordGroup.cpp"
  [14] Table::pruneRecords(this = 0x2b11218, recordScavenge = 0xfffffd7ffe091a50), line 1958 in "Table.cpp"
  [15] Database::pruneRecords(this = 0x29501d0, recordScavenge = 0xfffffd7ffe091a50), line 1937 in "Database.cpp"
  [16] Database::scavengeRecords(this = 0x29501d0, forced = false), line 1876 in "Database.cpp"
  [17] Database::scavenge(this = 0x29501d0, forced = false), line 1814 in "Database.cpp"
  [18] Database::scavengerThreadMain(this = 0x29501d0), line 2014 in "Database.cpp"
  [19] Database::scavengerThreadMain(database = 0x29501d0), line 2003 in "Database.cpp"
  [20] Thread::thread(this = 0x2b8fac8), line 167 in "Thread.cpp"
  [21] Thread::thread(parameter = 0x2b8fac8), line 146 in "Thread.cpp"
  [22] _thrp_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20ad25
  [23] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff20afe0

I am currently running it one more time, and on a different machine, to verify.
[2 Jul 2009 14:00]
John Embretsen
Just to be clear (the previous comment contained a typo), the MTR command I used for running with an 8 GB record cache was: ./mysql-test-run.pl --suite=falcon --big-test --testcase-timeout=1400 --suite-timeout=1400 --mysqld=--falcon_record_memory_max=8589934592 --retry-failure=0 falcon_bug_36294-big
[3 Jul 2009 11:57]
John Embretsen
I did three re-runs of this test on the OpenSolaris 64 GB RAM host using 8 GB record cache, and it passed. The same test run (8 GB record cache) on a Solaris 10 host with a slower chip and 32 GB RAM results in a different failure: mysqltest: At line 67: query 'DROP TABLESPACE gimdata ENGINE $engine' failed: 1727: Tablespace 'gimdata' not empty Not sure why that is, yet.