Bug #35072 | Falcon crash in RecoveryObjects::findRecoveryObject, line 103 | ||
---|---|---|---|
Submitted: | 5 Mar 2008 10:01 | Modified: | 1 Oct 2008 12:02 |
Reporter: | Philip Stoev | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0-falcon-team | OS: | Any |
Assigned to: | Vladislav Vaintroub | CPU Architecture: | Any |
Tags: | falcon, pushbuild, test failure, widespread |
[5 Mar 2008 10:01]
Philip Stoev
[5 Mar 2008 16:29]
Philip Stoev
I start my servers with perl mysql-test-run.pl --start-and-exit --skip-ndb. [philips@philips build]$ uname -a Linux philips 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:55:12 EDT 2007 i686 i686 i386 GNU/Linux I use a debug build from the 6.0-falcon-team tree.
[7 Mar 2008 6:53]
Hakan Küçükyılmaz
Philip, I ran your test for more than 6 hours. No crash.
[7 Mar 2008 10:37]
Philip Stoev
I did a fresh clone of mysql-6.0-falcon-team, compiled with BUILD/compile-pentium-debug-max-no-ndb, started server with perl mysql-test-run.pl --start-and-exit --skip-ndb and then ran the second test case. It crashed in 10 seconds or so.
[30 Apr 2008 17:11]
Philip Stoev
This stack trace still occurs with 6.0.5
[30 Apr 2008 18:32]
Philip Stoev
Unrolled test case for bug 35072
Attachment: bug35072.test (application/octet-stream, text), 22.63 KiB.
[30 Apr 2008 18:33]
Philip Stoev
I just uploaded a test case where the procedure loop has been unrolled. to run: $ mysql -uroot --socket=var/tmp/master.sock -f < bug35072.test crash is almost immediate on my virtual fedora core, 2 x 1.4 Ghz cpu. Timing appears to be important.
[12 Jul 2008 10:46]
MySQL Verification Team
i got this crash with 1 thread workload. all thread debug info attached.
Attachment: bug35072_debug_info.txt (text/plain), 35.70 KiB.
[15 Jul 2008 9:50]
John Embretsen
The test falcon_bugs2 crashed at the exact same place in RecoveryObjects::findRecoveryObject() during Pushbuild on mysql-6.0-falcon branch, Windows 2003, Mon Jul 14 17:11:24 2008. Test output: falcon.falcon_bugs2 [ fail ] mysqltest: At line 557: query 'DROP TABLE t1' failed: 2006: MySQL server has gone away The result from queries just before the failure was: < snip > SET autocommit=0; SET tx_isolation="READ-COMMITTED"; CREATE TABLE t1 (a int); INSERT INTO t1 VALUES (1); SET autocommit=0; SET tx_isolation="READ-COMMITTED"; SELECT * FROM t1; a COMMIT; SELECT * FROM t1; a 1 SELECT * FROM t1; a 1 DROP TABLE t1; ***************** falcon_bugs2_023 ***************** CREATE TABLE t1 (a int); ALTER TABLE t1 ADD PRIMARY KEY (a); DROP TABLE t1; Stack trace from Master.err: CURRENT_TEST: falcon.falcon_bugs 080714 23:36:15 - mysqld got exception 0xc0000005 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. key_buffer_size=1048576 read_buffer_size=131072 max_used_connections=5 max_threads=151 thread_count=1 connection_count=1 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 59980 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. thd: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... 007E7490 mysqld.exe!RecoveryObjects::findRecoveryObject()[recoveryobjects.cpp:103] 007E764F mysqld.exe!RecoveryObjects::isObjectActive()[recoveryobjects.cpp:90] 007C7904 mysqld.exe!SRLUpdateRecords::commit()[srlupdaterecords.cpp:361] FFFFFFFF kernel32.dll!AlignRects() 77F4725C ntdll.dll!RtlFreeHeap() 65535C2E 65535C2E FFFFF428 The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains information that should help you find out what is causing the crash. Writing a core file CURRENT_TEST: falcon.falcon_bugs2 MiniDumpWriteDump() failed, last error 3489660941
[15 Jul 2008 15:06]
Kevin Lewis
Vlad, This apparently has happened several times. John noticed the same crash on a recent pushbuild, now that we can get call stacks. This is not a recovery, it is a gopher thread doing drop table.
[15 Jul 2008 16:01]
Hakan Küçükyılmaz
Another similar stack trace from Windows Pushbuild: CURRENT_TEST: falcon.falcon_bug_22516 080715 16:43:42 - mysqld got exception 0xc0000005 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. key_buffer_size=1048576 read_buffer_size=131072 max_used_connections=3 max_threads=151 thread_count=1 connection_count=1 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 60395 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. thd: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... 00000001404CF428 mysqld.exe!RecoveryObjects::isObjectActive()[recoveryobjects.cpp:90] 000000014049AFD2 mysqld.exe!SRLUpdateIndex::commit()[srlupdateindex.cpp:157] 00000001404D01A0 mysqld.exe!SerialLogTransaction::commit()[seriallogtransaction.cpp:88] 00000001404D03A2 mysqld.exe!SerialLogTransaction::doAction()[seriallogtransaction.cpp:159] 00000001404CEB45 mysqld.exe!Gopher::gopherThread()[gopher.cpp:73] 000000014042A8C8 mysqld.exe!Thread::thread()[thread.cpp:169] 000000014042AE3E mysqld.exe!Thread::thread()[thread.cpp:147] 0000000077D6B69A kernel32.dll!BaseThreadStart()
[16 Jul 2008 15:51]
Philip Stoev
The crash is here: 98 RecoveryPage* RecoveryObjects::findRecoveryObject(int objectNumber, int tableSpaceId) 99 { 100 int slot = objectNumber % RPG_HASH_SIZE; 101 102 for (RecoveryPage *object = recoveryObjects[slot]; object; object = object->collision) 103 if (object->objectNumber == objectNumber && object->tableSpaceId == tableSpaceId) <<<<<<<<<<<< HEREE 104 return object; 105 106 return NULL; 107 } (gdb) print object $1 = (class RecoveryPage *) 0xeeeeeeee
[18 Jul 2008 8:40]
Hakan Küçükyılmaz
Another crash in Windows Pushbuild with similar stack trace: CURRENT_TEST: falcon.index_merge_falcon 080718 2:28:05 - mysqld got exception 0xc0000005 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. key_buffer_size=1048576 read_buffer_size=131072 max_used_connections=2 max_threads=151 thread_count=1 connection_count=1 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 59980 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. thd: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... 007723D0 mysqld.exe!RecoveryObjects::findRecoveryObject()[recoveryobjects.cpp:103] 0077258F mysqld.exe!RecoveryObjects::isObjectActive()[recoveryobjects.cpp:90] 00752A83 mysqld.exe!SRLUpdateIndex::execute()[srlupdateindex.cpp:163] 007530CE mysqld.exe!SRLUpdateIndex::commit()[srlupdateindex.cpp:158] 00772FE0 mysqld.exe!SerialLogTransaction::commit()[seriallogtransaction.cpp:92] 0077204B mysqld.exe!Gopher::gopherThread()[gopher.cpp:73] 00704D9D mysqld.exe!Thread::thread()[thread.cpp:169] 007050EC mysqld.exe!Thread::thread()[thread.cpp:147] 77E64829 kernel32.dll!GetModuleHandleA() The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains information that should help you find out what is causing the crash. Writing a core file Minidump written to e:\var-falcon-101\master-data\mysqld.dmp
[18 Jul 2008 10:37]
Vladislav Vaintroub
Not Windows specific, Philip indicates that his original test crashes on his Linux machine
[18 Jul 2008 11:55]
Vladislav Vaintroub
Philip, could please you attach the full backtrace, including other threads? I still cannot reproduce. Seems to be a very subtle timing issue. Interestingly, all machines on which the crash happened, are VMWare boxes.
[18 Jul 2008 15:02]
Philip Stoev
Threads for bug 35072
Attachment: bug35072.threads.txt (text/plain), 10.52 KiB.
[18 Jul 2008 21:25]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/50059 2750 Vladislav Vaintroub 2008-07-18 Bug#35072: Crash Falcon crash in RecoveryObjects::findRecoveryObject Problem: The functiion crashed while traversing single linked list that contained a dangling pointer (deleted memory, filled with 0xe bytes). Access and modification of the hash table was not synchronized. When one thread adds a new entry to the hashtable while another is delets from the same hash bucket, hashtable can become corrupted (i.e still have pointer to the deleted object) Solution is to synchronize hashtable access. Here, a reader/writer lock is used, one per hash bucket.
[19 Jul 2008 4:10]
Kevin Lewis
Vlad, before you push this, we need to measure the performance cost in a high concurrency test. Hakan and/or Kelly. Can one of you try out Vlad's patch here to see what affect it has on DBT2? The new RecoveryObjects::syncArray of SyncObjects is a concern expressed by Jim below in an email sent to dev_falcon. Also, I have added Vlad's color commentary. >Vladislav Vaintroub wrote: >> Falconers, >> I'm asking 2 volunteers to review the fix to >> http://lists.mysql.com/commits/50059 , the fix for accessing invalid >> memory in RecoveryObjects::findObject problem, that was plaguing us >> in the last time on pushbuild. >> >> Admittedly, I could not reproduce the problem either on my Windows machine >> or on one of my 2 vmwared Ubuntus. Yet, the problem seems to be obvious. >> It has to be missing synchronization when the hashtable is modified in >> different threads. (I can't think of anythihg else, here luckily all >> memory allocations and deletes are done in the same class. >> >> I also asked Philip to test the fix and he confirms problem does not exist >> on his Virtual machine anymore. >> >> Preliminary answer for the friends of lock-free - it might or might >> not be possible to do the job without any synchronization and with >> compare-exchange and memory-barriers and such. >> I did not even try- it would be Herculean effort for me to prove >> that it works only to find out later, that it does not work on some >> OS and with some compiler. I tried my best to make this as fast as >> possible with a reader/writer lock (i.e SyncObject) per hash bucket. >> Not sure it was necessary, single mutex would probably be just fine here. >Jim Starkey wrote; >Vlad, you may want to think about this some more. I suspect you fix >will work, but at a cost in memory and performance that can probably be >avoided. > >The primary use of recovery objects is during recovery, which is single >threaded. So no synchronization is required there. > >The mechanism is also used to track the states of indexes and data >sections during ordinary processing to avoid posting stuff to deleted >objects. The only places where objects are deleted from the hash table >is when previous deleted index and section ids are reused and become >active again. Insert/insert conflicts are controlled by >SerialLog::syncObject locked by the START_RECORD macro. Insert/read >conflicts are a non-issue because inserting into a hash table is >intrinsically thread safe (though processor visibility can be an >issue). That leaves delete/read problems. > >At the moment, we delete recovery objects when their avatars are >reactivated. If we just left them in the hash table as "active", the >problem evaporates. If we really cared about the tiny bit of memory >involved,we could do a check for superfluous recovery objects when we >already had an exclusive lock. > >Your solution does minimize collisions. But since actual contention can >occur only when creating a new index or table, a rare event in a >production system, the potential for contention is low anyway. My >primary concern is the cost of the interlocked CASes to seize and >release the hash table sync object. But do note that the memory >consumed for a large number of sync objects is probably greater than the >memory consumed by dangling (but correct) recovery objects. > > From time to time I wake up in the middle of the night to worry that >Falcon is 98% lock/unlock calls with only 2% of the cycles actually >slopping data around. For non-critical areas, it makes good engineering >sense to be conservative with the locking strategy. But in the >mainline, we need to be clever. (Of course, we also have to get it >right, which isn't easy.)
[22 Aug 2008 17:36]
Kevin Lewis
This fix is in version 6.0.7
[14 Sep 2008 2:25]
Bugs System
Pushed into 6.0.7-alpha (revid:vvaintroub@mysql.com-20080718212452-v2bgwuzcalgzanoq) (version source revid:sven@mysql.com-20080818195835-r615g9zz6xphmkzg) (pib:3)
[1 Oct 2008 12:02]
Jon Stephens
Documented bugfix in the 6.0.7 changelog as follows: Executing ALTER TABLE ADD PARTITION followed by ALTER TABLE DROP PARTITION on a Falcon table, and then killing the thread performing the ALTER TABLE statements could cause the server to crash.