MySQL Bugs: #42568: Falcon: Crash in SyncObject::unlock at line 976 in SyncObject.cpp

Bug #42568	Falcon: Crash in SyncObject::unlock at line 976 in SyncObject.cpp
Submitted:	3 Feb 2009 15:33	Modified:	15 May 2009 15:17
Reporter:	John Embretsen	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0.10-bzr-bugteam	OS:	Linux
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	F_SYNCHRONIZATION, falcon, test failure

Description:
Falcon crashes or fails doing an assertion in SyncObject::unlock in SyncObject.cpp while running tests in the main suite that are otherwise unrelated to Falcon (Falcon is just being initialized and shut down).

This has manifested itself as test failures in Pushbuild recently.

Stack trace:

Program terminated with signal 6, Aborted.
#0  0x00002b164167f4c5 in pthread_kill () from /lib64/libpthread.so.0
#0  0x00002b164167f4c5 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00000000005aed13 in signal_handler (sig=6) at mysqltest.cc:7423
#2  <signal handler called>
#3  0x00002b164136aaa5 in raise () from /lib64/libc.so.6
#4  0x0000000000a21642 in Error::error (string=<value optimized out>) at Error.cpp:94
#5  0x000000000094a6bc in SyncObject::unlock (this=<value optimized out>) at SyncObject.cpp:976
#6  0x0000000000a013e2 in Cache::ioThread (this=0x2aaaaacffff0) at Cache.cpp:950
#7  0x0000000000959d34 in Thread::thread (this=0x2aaaaaed9330) at Thread.cpp:167
#8  0x0000000000959f19 in Thread::thread (parameter=0x66a5) at Thread.cpp:146
#9  0x00002b164167b193 in start_thread () from /lib64/libpthread.so.0
#10 0x00002b16413fa47d in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()

SyncObject::unlock @ SyncObject.cpp ~line 976:
----------------------------------------------

#else //FAST_SHARED

void SyncObject::unlock(void)
{
	if (lockState > 0)
		unlock (NULL, Shared);
	else if (lockState == -1)
		unlock (NULL, Exclusive);
	else
		ASSERT(false);
}
#endif //FAST_SHARED

Cache.cpp line 950:
----------------------------------------------

	syncWait.unlock();

This bug may be related to Bug#40633 (pushbuild failure: Falcon assertion "lockState == 0" failed in SyncObject.cpp)

How to repeat:
Reproduced manually on Suse Linux 10.1 64-bit (sapsrv2) using mysql-6.0-bugteam branch 2009-02-03:

./mtr  --force --mem --mysqld=--core-file --embedded gcc296 --repeat=1000

(...)

main.gcc296                              [ pass ]     11
main.gcc296                              [ fail ]  timeout after 900 seconds
        Test ended at 2009-02-03 15:33:47

 - saving '/nfstmp1/johnemb/replay/mysql-6.0-bugteam/mysql-test/var/log/main.gcc296/' to
'/nfstmp1/johnemb/replay/mysql-6.0-bugteam/mysql-test/var/log/main.gcc296/'
 - found 'core.26277', moving it to '/nfstmp1/johnemb/replay/mysql-6.0-bugteam/mysql-test/var/log/main.gcc296'
 - found 'core.26277' (1/5)

Trying 'dbx' to get a backtrace

Trying 'gdb' to get a backtrace
Core generated by '/nfstmp1/johnemb/replay/mysql-6.0-bugteam/libmysqld/examples/mysqltest_embedded'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
Using host libthread_db library "/lib64/libthread_db.so.1".

warning: core file may not match specified executable file.
Core was generated by `/nfstmp1/johnemb/replay/mysql-6.0-bugteam/libmysqld/examples/mysqltest_embedded'.
Program terminated with signal 6, Aborted.

Full stack trace will be attached shortly.

Suggested fix:
?

MTR output from bug42569 crash, including full stack traces (test run produced two core files).

Attachment: gcc296-failure.txt (text/plain), 17.73 KiB.

(Bug number in attachment description should of course be 42568, not 42569)

I too think this is related to the same issue as Bug#40633 so I assign it to myself.

I have been able to identify at least one way this assert can
happen. The sympthon is that we are unlocking the Cache::syncWait
object when it seems like it has not been locked.

From the core file produced by John the syncWait object looks like this:

  syncWait = {<SynchronizationObject> = {
      _vptr.SynchronizationObject = 0xd7bb70}, monitorCount = 1,
    mutex = {<SynchronizationObject> = {
        _vptr.SynchronizationObject = 0xd9e310}, holder = 0x0, mutex = {
        __data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
          __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\0' <repeats 39 times>, __align = 0}}, queue = 0x0,
    exclusiveThread = 0x0, readers = 0, waiters = 0, lockState = 0,
    stalls = 0, objectId = 45, sharedCount = 0, collisionCount = 0,
    exclusiveCount = 2, waitCount = 0, queueLength = 0, location = 0x0,
    name = 0xd93d8e "Cache::syncWait"}

As this shows the syncWait object is not locked and it seems valid to
assert at this point in time. But the interesting thing to note is
that the monitorCount = 1. For an unlocked syncObject the monitorCount
should be 0. So this is (IMHO) an illegal state for a syncObject.

How can we get a syncObject into this state?

1. To get the monitorCount equal to 1, this syncObject must be locked
exclusively twice (on first lock() call the lock state is set to -1,
on the second call() the monitorCount is set to 1).

2. To get the state back to lockState=0 you have to call unlock()
twice (The first unlock() call will set monitorCount to 0 and the
second unlock() call will set the lockState to 0).

So how did we get to this illegal state?

Theory: the second call to lock() happened in parallel with the second
call to unlock() - and was done by two different threads.

Thread 2: started on the unlock() operation and executed the following
code from SyncObject::unlock():

	if (monitorCount)
		{
		//ASSERT (monitorCount > 0);
		--monitorCount;
		DEBUG_FREEZE;

		return;
		}
	
	Thread *thread = NULL;
	
	for (;;)
		{
		//ASSERT ((type == Shared && lockState > 0) || (type == Exclusive && lockState == -1));
		long oldState = lockState;
		long newState = (type == Shared) ? oldState - 1 : 0;
  

Since at this stage the syncObject has only been locked onze the
monitorCount is 0 and this threads decides that it should unlock the
lock. But then he get interrupted by the other thread:

Thread 1: starts on the lock() operation and do the following code
from SyncObject::lock():

		if (thread == exclusiveThread)
			{
			++monitorCount;
			BUMP(exclusiveCount);
			DEBUG_FREEZE;
			
			return;
			}

Since it was him that already had locked the same lock one time earlier and has the lock on this object (the test thread == exclusiveThread is true) he just bumps the monitorCount to 1 (and continue..)

Thread 2: is allowed to continue and do the rest of the SyncObject::unlock() code:

		exclusiveThread = NULL;
		
		if (COMPARE_EXCHANGE(&lockState, oldState, newState))
			{
			DEBUG_FREEZE;

			if (waiters)
				grantLocks();
				
			return;
			}
			
		BACKOFF;
		}

and unlock the lock...

So both threads "succeed" with their lock/unlock operation but leaves
the SyncObject in an invalid state.

I assume this is illegal use of a SyncObject(?) to have to independent
threads doing respectively lock() by the first and then unlock() by
the second and assume that the "monitor count" property is correct.

The underlying problem that lead to this "abuse" of the SyncObject is
found in bug 40633 where we due to a bug did lock a syncObject
multiple times in a row by the same thread and let another thread to
the unlocking. With the fix for bug#40633 this situation should not
occur any longer - and this bug should thus also be fixed by the same
fix as for bug#40633.

Still, I consider replacing the syncWait object that today is a
SyncObject with a Muxtex object to avoid that we every succeed to lock
it twice.

Cache::syncWait is only locked exclusively and it should not be locked multiple times by the same thread.  So I agree that it should be a Mutex instead of a SyncObject.

Having looked more closely about the possibility of replacing the syncWait sync object with a Mutex I do not think it is as simple as I initially had hopped for.

The syncWait object is lock mostly exclusively but the is one place where it is locked using a shared lock. This shared locking I had thought could be replaced by locking it exclusively. Unfortunately it then turns out that this object will be attempted locked twice by the same thread, first in Cache::flush() and then in Cache::flushWait(). 

We could probably rewrite this in some way to make it possible to use a Mutex but since the fix for Bug#40633 already is in place and solves this problem (avoids that we lock the sync object  exclusively twice without a call to unlock) I think we should not go ahead and to a rewrite just to be able to replace the SyncObject with a Mutex.

Another issue with the Mutex implementation on Unix at least is that it should not be locked and then unlocked by a different thread.

Neither Windows implementation of Mutex (based on critical sections) can work across threads.

This bug should be fixed by the first patch committed for bug#40633.

Internal/test fix. No changelog entry required.