Bug #42568 | Falcon: Crash in SyncObject::unlock at line 976 in SyncObject.cpp | ||
---|---|---|---|
Submitted: | 3 Feb 2009 15:33 | Modified: | 15 May 2009 15:17 |
Reporter: | John Embretsen | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | 6.0.10-bzr-bugteam | OS: | Linux |
Assigned to: | Olav Sandstå | CPU Architecture: | Any |
Tags: | F_SYNCHRONIZATION, falcon, test failure |
[3 Feb 2009 15:33]
John Embretsen
[3 Feb 2009 15:39]
John Embretsen
MTR output from bug42569 crash, including full stack traces (test run produced two core files).
Attachment: gcc296-failure.txt (text/plain), 17.73 KiB.
[3 Feb 2009 15:54]
John Embretsen
(Bug number in attachment description should of course be 42568, not 42569)
[3 Feb 2009 16:20]
Olav Sandstå
I too think this is related to the same issue as Bug#40633 so I assign it to myself.
[9 Feb 2009 15:29]
Olav Sandstå
I have been able to identify at least one way this assert can happen. The sympthon is that we are unlocking the Cache::syncWait object when it seems like it has not been locked. From the core file produced by John the syncWait object looks like this: syncWait = {<SynchronizationObject> = { _vptr.SynchronizationObject = 0xd7bb70}, monitorCount = 1, mutex = {<SynchronizationObject> = { _vptr.SynchronizationObject = 0xd9e310}, holder = 0x0, mutex = { __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\0' <repeats 39 times>, __align = 0}}, queue = 0x0, exclusiveThread = 0x0, readers = 0, waiters = 0, lockState = 0, stalls = 0, objectId = 45, sharedCount = 0, collisionCount = 0, exclusiveCount = 2, waitCount = 0, queueLength = 0, location = 0x0, name = 0xd93d8e "Cache::syncWait"} As this shows the syncWait object is not locked and it seems valid to assert at this point in time. But the interesting thing to note is that the monitorCount = 1. For an unlocked syncObject the monitorCount should be 0. So this is (IMHO) an illegal state for a syncObject. How can we get a syncObject into this state? 1. To get the monitorCount equal to 1, this syncObject must be locked exclusively twice (on first lock() call the lock state is set to -1, on the second call() the monitorCount is set to 1). 2. To get the state back to lockState=0 you have to call unlock() twice (The first unlock() call will set monitorCount to 0 and the second unlock() call will set the lockState to 0). So how did we get to this illegal state? Theory: the second call to lock() happened in parallel with the second call to unlock() - and was done by two different threads. Thread 2: started on the unlock() operation and executed the following code from SyncObject::unlock(): if (monitorCount) { //ASSERT (monitorCount > 0); --monitorCount; DEBUG_FREEZE; return; } Thread *thread = NULL; for (;;) { //ASSERT ((type == Shared && lockState > 0) || (type == Exclusive && lockState == -1)); long oldState = lockState; long newState = (type == Shared) ? oldState - 1 : 0; Since at this stage the syncObject has only been locked onze the monitorCount is 0 and this threads decides that it should unlock the lock. But then he get interrupted by the other thread: Thread 1: starts on the lock() operation and do the following code from SyncObject::lock(): if (thread == exclusiveThread) { ++monitorCount; BUMP(exclusiveCount); DEBUG_FREEZE; return; } Since it was him that already had locked the same lock one time earlier and has the lock on this object (the test thread == exclusiveThread is true) he just bumps the monitorCount to 1 (and continue..) Thread 2: is allowed to continue and do the rest of the SyncObject::unlock() code: exclusiveThread = NULL; if (COMPARE_EXCHANGE(&lockState, oldState, newState)) { DEBUG_FREEZE; if (waiters) grantLocks(); return; } BACKOFF; } and unlock the lock... So both threads "succeed" with their lock/unlock operation but leaves the SyncObject in an invalid state. I assume this is illegal use of a SyncObject(?) to have to independent threads doing respectively lock() by the first and then unlock() by the second and assume that the "monitor count" property is correct. The underlying problem that lead to this "abuse" of the SyncObject is found in bug 40633 where we due to a bug did lock a syncObject multiple times in a row by the same thread and let another thread to the unlocking. With the fix for bug#40633 this situation should not occur any longer - and this bug should thus also be fixed by the same fix as for bug#40633. Still, I consider replacing the syncWait object that today is a SyncObject with a Muxtex object to avoid that we every succeed to lock it twice.
[9 Feb 2009 16:03]
Kevin Lewis
Cache::syncWait is only locked exclusively and it should not be locked multiple times by the same thread. So I agree that it should be a Mutex instead of a SyncObject.
[12 Feb 2009 14:37]
Olav Sandstå
Having looked more closely about the possibility of replacing the syncWait sync object with a Mutex I do not think it is as simple as I initially had hopped for. The syncWait object is lock mostly exclusively but the is one place where it is locked using a shared lock. This shared locking I had thought could be replaced by locking it exclusively. Unfortunately it then turns out that this object will be attempted locked twice by the same thread, first in Cache::flush() and then in Cache::flushWait(). We could probably rewrite this in some way to make it possible to use a Mutex but since the fix for Bug#40633 already is in place and solves this problem (avoids that we lock the sync object exclusively twice without a call to unlock) I think we should not go ahead and to a rewrite just to be able to replace the SyncObject with a Mutex. Another issue with the Mutex implementation on Unix at least is that it should not be locked and then unlocked by a different thread.
[12 Feb 2009 15:06]
Vladislav Vaintroub
Neither Windows implementation of Mutex (based on critical sections) can work across threads.
[26 Feb 2009 14:23]
Olav Sandstå
This bug should be fixed by the first patch committed for bug#40633.
[15 May 2009 15:17]
MC Brown
Internal/test fix. No changelog entry required.