Bug #22165 | ALTER crash if two interleaving transactions | ||
---|---|---|---|
Submitted: | 9 Sep 2006 9:40 | Modified: | 30 Sep 2008 17:54 |
Reporter: | Georg Richter | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0, 6.0-falcon tree | OS: | Linux (Linux) |
Assigned to: | Vladislav Vaintroub | CPU Architecture: | Any |
[9 Sep 2006 9:40]
Georg Richter
[7 Oct 2006 21:43]
Hakan Küçükyılmaz
Test case is falcon_bug_195.test. There was a change in the event handling. I updated the test case. Now I get this error: falcon_bug_195 [ fail ] Errors are (from /home/hakan/work/mysql/mysql-5.1-falcon/mysql-test/var/log/mysqltest-time) : mysqltest: Result length mismatch (the last lines may be the most important ones) Below are the diffs between actual and expected results: ------------------------------------------------------- *** r/falcon_bug_195.result 2006-10-07 23:36:43.000000000 +0300 --- r/falcon_bug_195.reject 2006-10-07 23:39:03.000000000 +0300 *************** *** 32,34 **** --- 32,36 ---- DROP event db6.e2// DROP PROCEDURE db6.pf4// DROP DATABASE db6// + Warnings: + Note 1051 Unknown table '#sql-46cb_35' ------------------------------------------------------- Regards, Hakan
[16 Nov 2006 23:44]
Hakan Küçükyılmaz
Still failing with the same error message. I used Linux 32-bit, change set 1.2363, 2006-11-15. TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_195 [ fail ] ERROR: Mantis bug #195 2006-08-24 hakank (Get pushbuild green) Below are the diffs between actual and expected results: ------------------------------------------------------- *** r/falcon_bug_195.result 2006-10-08 01:21:00.000000000 +0300 --- r/falcon_bug_195.reject 2006-11-17 02:02:56.000000000 +0300 *************** *** 32,34 **** --- 32,36 ---- DROP EVENT db1.e2// DROP PROCEDURE db1.p1// DROP DATABASE db1// + Warnings: + Note 1051 Unknown table '#sql-4818_2b' ------------------------------------------------------- Please follow the instructions outlined at http://www.mysql.com/doc/en/Reporting_mysqltest_bugs.html to find the reason to this problem and how to report this. Regards, Hakan
[14 Apr 2007 13:38]
Hakan Küçükyılmaz
Renamed test case to falcon_bug_22165.test.
[30 Apr 2007 18:22]
Kevin Lewis
This crash does not occur anymore with the current code. However, there seems to be a disconnect between the test file and the result. Also, the test script takes a while to kill the scheduler on Linux and is unable to kill it on Windows. Also, the test outputs an unex[ected warning; + Warnings: + Note 1051 Unknown table '#sql-117d_6b' Hakan, can you clean up this test and determine if anything is really wrong with it anymore?
[1 May 2007 9:27]
Hakan Küçükyılmaz
Running the test case with InnoDB reveals no such problems. TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_22165 [ pass ] 60142 ------------------------------------------------------- Stopping All Servers All 1 tests were successful. The servers where restarted 1 times Spent 60.142 seconds actually executing testcases Whereas running the test case with Falcon gives: TEST RESULT TIME (ms) ------------------------------------------------------- falcon_bug_22165 [ fail ] ERROR: Bug#22165 2006-08-24 hakank (Get pushbuild green) Below are the diffs between actual and expected results: ------------------------------------------------------- *** r/falcon_bug_22165.result 2007-05-01 12:25:24.000000000 +0300 --- r/falcon_bug_22165.reject 2007-05-01 12:26:39.000000000 +0300 *************** *** 32,34 **** --- 32,36 ---- DROP EVENT db1.e2// DROP PROCEDURE db1.p1// DROP DATABASE db1// + Warnings: + Note 1051 Unknown table '#sql-1457_7a' ------------------------------------------------------- Best regards, Hakan
[1 May 2007 9:29]
Hakan Küçükyılmaz
Crash does not happen anymore.
[3 May 2007 10:52]
MC Brown
A note has been added to the 6.0.0 changelog.
[25 Apr 2008 8:10]
Vladislav Vaintroub
Crash still appears The test case crashes on about every linux in pushbuild. On Windows, I was able to reproduce it once, while increasing number of threads (i.e events in the test case) to 35. Crash happens during implicit COMMIT during the in ALTER TABLE. From crash report: [thd->query at 000000000C670F30=ALTER TABLE t1 MODIFY COLUMN c timestamp] The valgrind message before the crash, 2 days ago: pb-valgrind: ==3550== Thread 25: ==3550== Invalid read of size 8 ==3550== at 0x8D54CA: Table::updateRecord(RecordVersion*) (LinkedList.h:81) ==3550== by 0x8E5122: Transaction::commit() (Transaction.cpp:281) ==3550== by 0x8FC220: Connection::commit() (Connection.cpp:269) ==3550== by 0x8C68BA: StorageConnection::commit() (StorageConnection.cpp:146) ==3550== by 0x8C9DCD: StorageHandler::commit(THD*) (StorageHandler.cpp:264) ==3550== by 0x8C15C4: StorageInterface::commit(handlerton*, THD*, bool) (ha_falcon.cpp:1143) ==3550== by 0x7AC9D9: ha_commit_one_phase(THD*, bool) (handler.cc:1168) ==3550== by 0x7ACE1F: ha_commit_trans(THD*, bool) (handler.cc:1137) ==3550== by 0x6A1CC9: end_trans(THD*, enum_mysql_completiontype) (sql_parse.cc:579) ==3550== by 0x6A53FF: mysql_execute_command(THD*) (sql_parse.cc:3863) ==3550== by 0x8313BB: sp_instr_stmt::exec_core(THD*, unsigned*) (sp_head.cc:2861) ==3550== by 0x834779: sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned*, bool, sp_instr*) (sp_head.cc:2691) ==3550== by 0x837458: sp_instr_stmt::execute(THD*, unsigned*) (sp_head.cc:2804) ==3550== by 0x835A50: sp_head::execute(THD*) (sp_head.cc:1215) ==3550== by 0x8386F7: sp_head::execute_procedure(THD*, List<Item>*) (sp_head.cc:1939) ==3550== by 0x6A378A: mysql_execute_command(THD*) (sql_parse.cc:4150) Windows callstack (today): 0000000140444BD6 mysqld.exe!Transaction::commit()[transaction.cpp:297] 000000014042A0F3 mysqld.exe!Connection::commit()[connection.cpp:270] 0000000140400F2F mysqld.exe!StorageConnection::commit()[storageconnection.cpp:149] 00000001403F4B2E mysqld.exe!StorageInterface::commit()[ha_falcon.cpp:1153] 000000014002C291 mysqld.exe!ha_commit_one_phase()[handler.cc:1168] 000000014003092F mysqld.exe!ha_commit_trans()[handler.cc:1137] 000000014016B5C4 mysqld.exe!end_active_trans()[sql_parse.cc:132] 0000000140173B2B mysqld.exe!mysql_execute_command()[sql_parse.cc:2591] 0000000140110538 mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:2862] 00000001401131E2 mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:2691] 0000000140114C96 mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:2806] 0000000140115356 mysqld.exe!sp_head::execute()[sp_head.cc:1215] 00000001401165DB mysqld.exe!sp_head::execute_procedure()[sp_head.cc:1939] 000000014017771E mysqld.exe!mysql_execute_command()[sql_parse.cc:4150] 0000000140110538 mysqld.exe!sp_instr_stmt::exec_core()[sp_head.cc:2862] 00000001401131E2 mysqld.exe!sp_lex_keeper::reset_lex_and_exec_core()[sp_head.cc:2691] 0000000140114C96 mysqld.exe!sp_instr_stmt::execute()[sp_head.cc:2806] 0000000140115356 mysqld.exe!sp_head::execute()[sp_head.cc:1215] 00000001401165DB mysqld.exe!sp_head::execute_procedure()[sp_head.cc:1939] 0000000140209D65 mysqld.exe!Event_job_data::execute()[event_data_objects.cc:1987] 0000000140206989 mysqld.exe!Event_worker_thread::run()[event_scheduler.cc:312] 0000000140206AF5 mysqld.exe!event_worker_thread()[event_scheduler.cc:265] 000000014029DE25 mysqld.exe!pthread_start()[my_winthread.c:87] 00000001404DF5D7 mysqld.exe!_callthreadstart()[thread.c:295] 00000001404DF6A5 mysqld.exe!_threadstart()[thread.c:275] 0000000077A2495D kernel32.dll!BaseThreadInitThunk() 0000000077C28791 ntdll.dll!RtlUserThreadStart()
[25 Apr 2008 9:19]
Vladislav Vaintroub
Here is the source snippet of the crash location on Windows, Transaction.cpp , line 295 ff for (RecordVersion *record = firstRecord; record; record = record->nextInTrans) if (!record->getPriorVersion()) ++record->format->table->cardinality; /*<--CRASH HERE */ else if (record->state == recDeleted && record->format->table->cardinality > 0) --record->format->table->cardinality;
[25 Apr 2008 15:56]
MySQL Verification Team
I could not repeat the crash running the >1h on Windows XP with the latest source server.
[25 Apr 2008 17:15]
Hakan Küçükyılmaz
Running falcon_bug_22165 several times in a row eventually gave me this stacktrace: ./mysql-test-run.pl --force --mem --suite=falcon_team falcon_bug_22165 falcon_bug_22165 falcon_bug_22165, ..., falcon_bug_22165 Program terminated with signal 11, Segmentation fault. #0 0x00002b1ae67b76b2 in pthread_kill () from /lib/libpthread.so.0 (gdb) bt #0 0x00002b1ae67b76b2 in pthread_kill () from /lib/libpthread.so.0 #1 0x00000000007c670c in write_core (sig=11) at stacktrace.c:305 #2 0x0000000000662875 in handle_segfault (sig=11) at mysqld.cc:2624 #3 <signal handler called> #4 Table::updateRecord (this=0xeeeeeeeeeeeeeeee, record=0x7b403a0) at Table.cpp:446 #5 0x0000000000842c0c in Transaction::commit (this=0x20cf9e0) at Transaction.cpp:282 #6 0x000000000085737d in Connection::commit (this=0x2aaab06073f0) at Connection.cpp:269 #7 0x0000000000827309 in StorageConnection::commit (this=0x2aaab0607358) at StorageConnection.cpp:146 #8 0x00000000008217a9 in StorageInterface::commit (hton=<value optimized out>, thd=0x2aaab060b938, all=true) at ha_falcon.cpp:1141 #9 0x000000000075efb6 in ha_commit_one_phase (thd=0x2aaab060b938, all=true) at handler.cc:1168 #10 0x000000000075f45b in ha_commit_trans (thd=0x2aaab060b938, all=<value optimized out>) at handler.cc:1137 #11 0x000000000066a2a7 in end_trans (thd=0x2aaab060b938, completion=COMMIT) at sql_parse.cc:579 #12 0x0000000000671598 in mysql_execute_command (thd=0x2aaab060b938) at sql_parse.cc:3863 #13 0x00000000007d8774 in sp_instr_stmt::exec_core (this=0x1316578, thd=0x7b403a0, nextp=0x2b1ae67bc508) at sp_head.cc:2861 #14 0x00000000007d897c in sp_lex_keeper::reset_lex_and_exec_core (this=0x13165b8, thd=0x2aaab060b938, nextp=0x459cfad4, open_tables=false, instr=0x1316578) at sp_head.cc:2691 #15 0x00000000007dfb8e in sp_instr_stmt::execute (this=0x1316578, thd=0x2aaab060b938, nextp=0x459cfad4) at sp_head.cc:2804 #16 0x00000000007dccd5 in sp_head::execute (this=0x1374ef8, thd=0x2aaab060b938) at sp_head.cc:1215 #17 0x00000000007dd682 in sp_head::execute_procedure (this=0x1374ef8, thd=0x2aaab060b938, args=0x136de18) at sp_head.cc:1939
[28 Apr 2008 20:37]
Kevin Lewis
May be related to Bug#28048
[4 Jun 2008 13:11]
Vladislav Vaintroub
Table::updateRecord (this=0xeeeeeeeeeeeeeeee) in the callstack suggests that table has been freed before use (0xee is pattern for free memory)
[21 Jul 2008 14:15]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50118 2751 Vladislav Vaintroub 2008-07-21 Accoding to Kevin, Bug#22165 could have been fixed with his fix to 36438. This change reactivates the test case
[21 Jul 2008 14:17]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50119 2751 Vladislav Vaintroub 2008-07-21 Accoding to Kevin, Bug#22165 could have been fixed with his fix to 36438. This change reactivates the test case
[22 Jul 2008 13:20]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50190 2752 Vladislav Vaintroub 2008-07-22 The problem was not solved with 36438, disable test case again
[11 Aug 2008 13:23]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51311 2772 Vladislav Vaintroub 2008-08-11 Bug#22165 - crash while doing ALTER in parallel with another transactions Eliminated races. Race condition 1. ALTER/DROP/TRUNCATE have to ensure that table has no committed records that are not yet "write complete". Otherwise crash is possible if table is deleted, but then accessed during Transaction::writeComplete() in Record::poke() It is now solved, Storage::external_lock will wait for all transaction on the table to become write complete, if there is no uncommited records. Race condition 2. In Transaction::commit(), Transaction state is set to Committed to early while the transaction is still in the active list. This can produce incorrect results in Transaction::hasUncommitedRecords, DROP is not blocked. When later Transaction::commit() traverses the record list, it can crash accessing freed memory in Table::updateRecord(). This is fixed and transaction state is changed during the transition from active list to committed list.
[14 Aug 2008 7:33]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51602 2684 He Zhenxing 2008-08-14 [merge] Merge 6.0 -> 6.0-rpl-testfixes
[14 Aug 2008 7:36]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/51603 2684 He Zhenxing 2008-08-14 [merge] Merge 6.0 -> 6.0-rpl-testfixes
[25 Aug 2008 15:36]
Kevin Lewis
This fix is in version 6.0.7
[28 Aug 2008 20:14]
Bugs System
Pushed into 6.0.7-alpha (revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (version source revid:cbell@mysql.com-20080822132131-uveo6wiuecy6m2b8) (pib:3)
[9 Sep 2008 13:34]
Kevin Lewis
On 9Sep2008 Vlad pushed this patch http://lists.mysql.com/commits/53551. It is a follow-on change to this bug as well as Bug#38947. This patch moves the assignment of the transaction state and the signalling of waiting threads until the end of the rollback, as he did earlier for Transaction::commit. Here is the reasoning behind this change; Vlad wrote this comment above; Race condition 2. In Transaction::commit(), Transaction state is set to Committed too early while the transaction is still in the active list. This can produce incorrect results in Transaction::hasUncommitedRecords, DROP is not blocked. When later Transaction::commit() traverses the record list, it can crash accessing freed memory in Table::updateRecord(). So if a transaction is still on the active list and we say it is committed, and signal all waiters, then waiting threads get started before it gets moved to the committed list. TransactionManager::waitForWriteComplete(Table* table) only looks into the committed list, not the activeTransaction list. So if signaling early causes the DDL thread to go early, then the table might disappear while the commit is still happening. Why would the DDL thread be waiting on a transaction? It calls TransactionManager::hasUncommittedRecords via StorageTable::alterCheck. This function would wait on activeTransactions, not the transaction itself. But if there is another thread in the engine waiting on that transaction, the DDL would be back in the server waiting for that thread to finish and come out. (The server does not serialize DDL with commit statements since the server does not know what tables are involved.) So here is the scenario. A DDL is waiting in the server for a lock on the table. A table operation is waiting in Falcon for a Transaction to commit or rollback. The commit finally comes in, it signals the waiting thread early, before moving itself from the active to committed lists. The waiting thread finishes with that table and the DDL comes in and calls Transaction::hasRecords. It looks for the transaction in the active list, finds it but it is committed (isActive() == false) and hasRecords() also returns false. So the DDL then calls waitForTransComplete and does not find the transaction in the committed list either, so it does not wait. The table is dropped. Later, as part of the continuing commit, maybe during releaseDependencies, the table->format is accessed. Bang. So why should the rollback also signal othre transactions when it is completely finished? In the same manner as above, the drop table could get started and finished before the rollback has completed. There might be problems in release dependencies or elsewhere. It is better practice be completely finished transitioning an active transaction before signalling that we are done.
[13 Sep 2008 20:55]
Bugs System
Pushed into 6.0.7-alpha (revid:vvaintroub@mysql.com-20080811132253-4go0vfgcn59z91qu) (version source revid:hakan@mysql.com-20080725175322-8wgujj5xuzrjz3ke) (pib:3)
[30 Sep 2008 17:54]
Jon Stephens
Documented as follows in the 6.0.7 changelog: Using ALTER TABLE with interleaving transactions could cause mysqld to crash. This fix supersedes an earlier one for this issue that appeared in MySQL 6.0.0.