Bug #33735 mysql hangs when running systems test
Submitted: 8 Jan 2008 9:39 Modified: 5 Oct 2008 14:52
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:6.0.4-p3 OS:Any
Assigned to: Kevin Lewis CPU Architecture:Any

[8 Jan 2008 9:39] Philip Stoev
Description:
When running the iuds2 systems test, MySQL hanged . No new connections were accepted and it was impossible to shutdown the server without kill -9.

The symtoms are exactly as in bug #33480 however the iuds2 test involves only Insert/Update/Delete, triggers, SP and transactions and no partitions or events.

How to repeat:
Run the systems test framework, scenario iuds2, with all instances of innodb and myisam in iuds2.tst changed to Falcon.

More information on runnning the test is available at 

https://inside.mysql.com/wiki/SystemQATestPlan

Sysbench is not required.
[8 Jan 2008 10:49] Philip Stoev
Recovery fails with the following stack:

0x83fce71 Error::error(char const*, ...) + 73
0x8472ea5 Cache::fetchPage(Dbb*, int, PageType, LockType) + 441
0x83f5d22 Dbb::handoffPage(Bdb*, int, PageType, LockType) + 34
0x844a9f2 Section::getSectionPage(Dbb*, int, int, LockType, unsigned int) + 554
0x844ce28 Section::getSectionPage(int, LockType, unsigned int) + 184
0x844b6c1 Section::fetchLocatorPage(int, int, LockType, unsigned int) + 65
0x844b199 Section::reInsertStub(int, unsigned int) + 37
0x83f5f1d Dbb::reInsertStub(int, int, unsigned int) + 141
0x8446abf SRLUpdateRecords::redo() + 403
0x844fed2 SerialLog::recover() + 1230
0x83ec68b Database::openDatabase(char const*) + 367
0x83e8fc1 Connection::getDatabase(char const*, char const*, Threads*) + 193
0x83e68d6 Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*) + 74
0x83bb3cb StorageDatabase::getOpenConnection() + 63
0x83bf170 StorageHandler::initialize() + 108
0x83bebda StorageHandler::getStorageConnection(StorageTableShare*, THD*, int, OpenOption) + 426
0x83b158a StorageInterface::open(char const*, int, unsigned int) + 98
0x830abfa handler::ha_open(st_table*, char const*, int, int) + 34
0x828b39b open_table_from_share(THD*, st_table_share*, char const*, unsigned int, unsigned int, unsigned int, st_table*, open_table_mode) + 783
0x8288600 open_unireg_entry(THD*, st_table*, TABLE_LIST*, char const*, char*, unsigned int, st_mem_root*, unsigned int) + 124
0x8283162 open_table(THD*, TABLE_LIST*, st_mem_root*, bool*, unsigned int) + 1954
0x82843cf open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int) + 1103
0x8284719 open_normal_and_derived_tables(THD*, TABLE_LIST*, unsigned int) + 25
0x8330a5f get_all_tables(THD*, TABLE_LIST*, Item*) + 1367
0x83386fa get_schema_tables_result(JOIN*, enum_schema_table_state) + 370
0x8296b00 JOIN::exec() + 6492
0x8296d93 _Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select + 339
0x82920ab handle_select(THD*, st_lex*, select_result*, unsigned long) + 307
0x825df7d execute_sqlcom_select(THD*, TABLE_LIST*) + 149
0x825543b mysql_execute_command(THD*) + 747
0x825b5ad mysql_parse(THD*, char const*, unsigned int, char const**) + 269
0x82540c5 dispatch_command(enum_server_command, THD*, char*, unsigned int) + 909
[8 Jan 2008 11:07] Philip Stoev
This bug is reproducible. Setting to verified.

The reason no new connections are accepted is that the setup uses query log tables v.s. log files. This means that the server attempts to write the log in attempt into the log table, and than blocks on trying to open it. Here is a backtrace for a thread that is hanging like that:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0061092e in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2  0x0060c79c in _L_mutex_lock_85 () from /lib/libpthread.so.0
#3  0x0060c2dd in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x08282f7f in open_table ()
#5  0x082844e4 in open_ltable ()
#6  0x082880e6 in open_performance_schema_table ()
#7  0x082cba89 in Log_to_csv_event_handler::log_general ()
#8  0x082ccd37 in LOGGER::general_log_write ()
#9  0x082d0c2f in general_log_print ()
#10 0x08251a89 in check_user ()
#11 0x08252db3 in check_connection ()
#12 0x082520fc in login_connection ()
#13 0x082524d4 in handle_one_connection ()
#14 0x0060a2db in start_thread () from /lib/libpthread.so.0
#15 0x0048914e in clone () from /lib/libc.so.6
[8 Jan 2008 11:19] Philip Stoev
Backtraces of interesting Falcon threads taken after no new connections are no longer accepted, during a period of 100% CPU usage (that is, 1 core of 8 occupied).

erorr log is:

Commit transaction 2528545
3916: Transactions: 13118 committed, 89 rolled back, 134 active, 393 post-commit, oldest 3824 seconds
3916: SerialLog: 0 reads, 2569 writes, 1388 transactions, 9433 completed, 6 stalls, 367188 blocks, 2 windows
3916: Activity on falcon_master.fts: 66885 fetches, 2 reads, 1 writes, 23 flushWrites
3916: Activity on falcon_temporary.fts: 20492 fetches, 1 reads, 0 writes, 6 flushWrites
3916: Activity on falcon_user.fts: 299389 fetches, 154 reads, 0 writes, 513 flushWrites
Exception: lock timed out after 1000 milliseconds

Stalled threads
  Thread 0xaf3b30a4 (-321266800) sleep=0, grant=0, locks=1, who 0, parent=(nil)
    Pending Database::commitSystemTransaction state 0 (1) syncObject 0xf72dee14
  Thread 0xaf3bb55c (-319575152) sleep=0, grant=1, locks=0, who 1, parent=(nil)
Stalled synchronization objects:
  SyncObject f72dee14: state -1, monitor 0, waiters 1
    Exclusive thread af3bb55c (-319575152), type 1;
    Waiting th
(the end is missing due to no autoflushing)

thread 13:

#0  0x083d4539 in Transaction::releaseSavePoints ()
#1  0x083d1cd5 in Transaction::commit ()
#2  0x083e594e in Connection::commit ()
#3  0x083ed81c in Database::commitSystemTransaction ()
#4  0x083c905a in Table::drop ()
#5  0x083edf77 in Database::dropTable ()
#6  0x08461488 in Nfs::Statement::executeDDL ()
#7  0x08467609 in Nfs::Statement::execute ()
#8  0x0845fb23 in Nfs::Statement::execute ()
#9  0x083bbd16 in StorageDatabase::deleteTable ()
#10 0x083c16d0 in StorageTableShare::deleteTable ()
#11 0x083bfd63 in StorageTable::deleteTable ()
#12 0x083b235c in StorageInterface::delete_table ()
#13 0x08285154 in rm_temporary_table ()
#14 0x08282218 in close_temporary ()
#15 0x0831cf5d in mysql_alter_table ()
#16 0x08255bcf in mysql_execute_command ()
#17 0x0825b5ad in mysql_parse ()
#18 0x082540c5 in dispatch_command ()
#19 0x08253d08 in do_command ()
#20 0x08252557 in handle_one_connection ()
#21 0x0060a2db in start_thread () from /lib/libpthread.so.0
#22 0x0048914e in clone () from /lib/libc.so.6

thread 28:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0060e146 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x08469d6a in Synchronize::sleep ()
#3  0x083c4709 in SyncObject::wait ()
#4  0x083c4437 in SyncObject::lock ()
#5  0x083c3f4c in Sync::lock ()
#6  0x083ed80e in Database::commitSystemTransaction ()
#7  0x083c905a in Table::drop ()
#8  0x083edf77 in Database::dropTable ()
#9  0x08461488 in Nfs::Statement::executeDDL ()
#10 0x08467609 in Nfs::Statement::execute ()
#11 0x0845fb23 in Nfs::Statement::execute ()
#12 0x083bbd16 in StorageDatabase::deleteTable ()
#13 0x083c16d0 in StorageTableShare::deleteTable ()
#14 0x083bfd63 in StorageTable::deleteTable ()
#15 0x083b235c in StorageInterface::delete_table ()
#16 0x08285154 in rm_temporary_table ()
#17 0x08282218 in close_temporary ()
#18 0x082820d1 in drop_temporary_table ()
#19 0x083197cf in mysql_rm_table_part2 ()
#20 0x083195e1 in mysql_rm_table ()
#21 0x08256059 in mysql_execute_command ()
#22 0x0825b5ad in mysql_parse ()
#23 0x082540c5 in dispatch_command ()
#24 0x08253d08 in do_command ()
#25 0x08252557 in handle_one_connection ()
#26 0x0060a2db in start_thread () from /lib/libpthread.so.0
#27 0x0048914e in clone () from /lib/libc.so.6

thread 60:

[Switching to thread 60 (Thread 4019751824 (LWP 10049))]#0  0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0060e146 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0x08469d6a in Synchronize::sleep ()
#3  0x083c4709 in SyncObject::wait ()
#4  0x083c4437 in SyncObject::lock ()
#5  0x083c3f4c in Sync::lock ()
#6  0x083eff1d in Database::updateCardinalities ()
#7  0x083ee966 in Database::scavenge ()
#8  0x08447577 in Scavenger::execute ()
#9  0x084485c2 in Scheduler::schedule ()
#10 0x083cf957 in Thread::thread ()
#11 0x083cf7e0 in Thread::thread ()
#12 0x0060a2db in start_thread () from /lib/libpthread.so.0
#13 0x0048914e in clone () from /lib/libc.so.6
[8 Jan 2008 11:28] Philip Stoev
Thread #13 appears to be running an ALTER TABLE on a TEMPORARY falcon table. The only ALTER used in this test is:

ALTER TABLE systest1.t1_tmp ADD pk INTEGER AUTO_INCREMENT, ADD PRIMARY KEY(pk)

This table is CREATE-d using SELECT from another Falcon table.
[10 Jan 2008 20:34] Jeffrey Pugh
Kevin believes this is fixed in patch for 33480.
[11 Jan 2008 0:53] Kevin Lewis
Philip,  Since this bug had the symptom of 100% CPU usage (that is, 1
core of 8 occupied), it is likely to have been caused by the undetected deadlock that was fixed in Bug#33480.  Please verify.
[11 Jan 2008 8:49] Philip Stoev
Bug #33759 prevents a test run that is long enough for me to verify that the fix for bug #33480 fixes this bug. I will keep trying though.
[18 Jan 2008 16:04] Philip Stoev
This bug remains to cover only the initial hang condition. The recovery failure has been separated as a new bug #33923.
[11 Feb 2008 20:49] Kevin Lewis
Patch is in mysql-6.0-falcon-team and mysql-6.0-release version 6.0.4
[5 May 2008 20:30] Kevin Lewis
It looks like this bug was deemed to be the same as Bug#33480.  That conclusion was not repeated when it was moved to Patch Pending.  Sorry about the confusion in the handling of this bug.  It prbably should have been put to QA testing and then to Duplicate if the bug went away due to 33480 as expected.

In summary, if a deadlock happens between threads which are checking for duplicates, the code in Table::checkUniqueRecordVersion() was not handling the detected deadlock by throwing an exception.  So it was retrying the checkUniqueRecordVersion() call.  This can cause 100% CPU utilization hangs between threads, where only 1 thread is spinning.  One is waiting and the other is looping in Table::checkUnigueIndexes().  This can also cause the serial log to grow indefinitely.
[7 Jul 2008 18:20] Philip Stoev
I have nothing else to do with this bug, so I am assigning it to Kevin so that it shows up in his reports.
[5 Oct 2008 14:52] Jon Stephens
See Bug #33480 for changelog entry.