MySQL Bugs: #42184: Crash in Falcon on shutdown

Bug #42184	Crash in Falcon on shutdown
Submitted:	18 Jan 2009 6:32	Modified:	26 May 2010 17:51
Reporter:	Mark Callaghan	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0.8,6.0.9	OS:	Any
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	crash, F_SHUTDOWN, transaction

Description:
On the bright side, Falcon was astonishingly fast.

It was started with default parameters except for:
falcon_record_memory_max=500M
falcon_page_cache_size=500M

The test was sysbench OLTP readonly with 64 concurrent users. But the crash is on shutdown, and my scripts do:
1) run sysbench test
2) drop database used for sysbench table
3) create database used for sysbench table
4) shutdown mysqld -- crash is here

[Falcon] Error: assertion (lockState == 0) failed at line 132 in file SyncObject.cpp

thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x24) [0x902dd4]
/usr/sbin/mysqld(handle_segfault+0x340) [0x4abe50]
/usr/grte/v1/lib64/libpthread.so.0 [0x2ba63673f550]
/usr/grte/v1/lib64/libpthread.so.0(raise+0x2b) [0x2ba63673f43b]
/usr/sbin/mysqld(Error::error(char const*, ...)+0xea) [0x68485a]
/usr/sbin/mysqld(SyncObject::~SyncObject()+0x26) [0x6952b6]
/usr/sbin/mysqld(Transaction::~Transaction()+0xf9) [0x6a5589]
/usr/sbin/mysqld(TransactionManager::~TransactionManager()+0x3b) [0x6a7c1b]
/usr/sbin/mysqld(Database::~Database()+0x27c) [0x6bf64c]
/usr/sbin/mysqld(StorageDatabase::close()+0x2e) [0x68984e]
/usr/sbin/mysqld(StorageHandler::shutdownHandler()+0x4a) [0x68c89a]
/usr/sbin/mysqld(StorageInterface::falcon_deinit(void*)+0x16) [0x67f736]
/usr/sbin/mysqld(ha_finalize_handlerton(st_plugin_int*)+0x2b) [0x58416b]
/usr/sbin/mysqld [0x5fe7a6]
/usr/sbin/mysqld [0x5ff20c]
/usr/sbin/mysqld(plugin_shutdown()+0x14f) [0x5ff65f]
/usr/sbin/mysqld [0x4aa571]
/usr/sbin/mysqld(unireg_end()+0xe) [0x4aa97e]
/usr/sbin/mysqld(kill_server_thread+0x373) [0x4ae953]
/usr/grte/v1/lib64/libpthread.so.0 [0x2ba63673909a]
/usr/grte/v1/lib64/libc.so.6(clone+0x72) [0x2ba6375ba3d2]

How to repeat:
see above

Suggested fix:
?

This is the same assertion as is being hit in Bug#40633 - "pushbuild failure: Falcon assertion "lockState == 0" failed in SyncObject.cpp", which is great if this is reproducible, because Bug#40633 has so far been hard to reproduce/debug.
Fixing one of the bugs will probably/hopefully fix both.

This is easy for me to reproduce:
1) run sysbench OLTP readonly (create database test, create and insert to table, run query test for 60 seconds)
2) drop database test; create database test
3) shutdown database

The crash happens frequently at step 3 (at least 10% of the time)

I'm not able to reproduce this, at least not on 32-bit linux using the current falcon bzr branch (6.0.10) and sysbench 0.4.10. Using the following sysbench option for the test runs:

sysbench \
 --num-threads=64 \
 --mysql-db=test \
 --mysql-port=10010 \
 --mysql-host=127.0.0.1 \
 --mysql-engine-trx=yes \
 --mysql-table-engine=Falcon \
 --mysql-user=root \
 --oltp-read-only \
 --test=oltp run

15 runs of "start server", "prepare test", "run test", drop database test; create database test, "shutdown server", "examine server log" and no issues seen. Perhaps 6.0.8 and/or some specific platform and/or options is required?

Are you using an SMP server? My servers have 4 or more cores.

I have filed several bugs for falcon crashes using whatever was the latest release version of falcon at the time. Most of them seem to eventually get closed with 'cannot reproduce using the latest internal version of falcon'. Either I am wasting my time trying things out (and I am not trying to crash falcon with my tests) or falcon should be made much more stable before being shared with the public.

Mark, if you got core, would it be possible to provide stack traces for all threads? If any falcon threads exist at this point, it should be only in sleeping state, doing nothing.
That's in  theory, in practice it looks like shutdown inteferes with some other running thread, but it is not obvious which one.

Stacktraces from threads, grabbed from core

Attachment: bug42184-threadstack.txt (text/plain), 9.57 KiB.

I was finally able to reproduce on a Solaris 10 x86_64 (4-cores, 2-sockets AMD CPU) machine, using Sysbench 0.4.10 against MySQL Community Server 6.0.8 alpha. The crash happened after 41 iterations of the described "how to repeat" procedure.

Attached stack traces from dbx. I have the core available, but it is 700 MB so I cannot attach it to the bug report... Let me know if someone needs it.

Mark,

I am sorry that you feel that many of your bug reports are closed as "Can't repeat". Sometimes crashes like this depend on so many things (e.g. timing), and it can be hard for us to create an environment similar to yours, especially when there are few or no environment details in the bug report.

(And yes, the 32-bit linux host I was using previously was a 4-core machine as well.)

Sometimes bugs are also fixed in the time between an official release and the time the bug is reported, that is just the way it is.

Please note that these are alpha releases of MySQL Server, and users are likely to experience some instability (not that I don't think you know that). Development and stabilization efforts are ongoing almost every day of the week, and I think that most of the time the stability of the mysql-6.0 branch on Launchpad (or mysql-6.0-falcon for that matter) is higher/better than the previous official alpha release.

I can assure you that the Falcon team very much appreciates the fact that you are trying out Falcon and reporting the bugs you find, so I hope you will continue doing that :)

Updated call stacks for all threads in core file produced by John

Attachment: bug42184-threadstack2.txt (text/plain), 10.40 KiB.

Comment to the attached call stack for all active threads:

-only the thread doing the shutdown of Falcon is active executing code in Falcon.
-there are five Falcon threads sleeping in Synchronize::sleep(), ideally I would have liked these to be gone before we start destroying central data structures but as long as they are not active we should be fine 

Basically, the call stacks give no hints about that any thread have a lock on the 
SyncObject that leads to this assert.

I have reproduced the crash ion my machine.
The syncObject in question is Cache::syncWait. It is not a usual synchronization object (there is no Sync guarding it that will automatically unlocked when a function calling it returns). It is rather used as an event, I believe to prevent multiple flushes running in parallel.

MY guess is that Cache::flush locks it at a time, where io threads cannot unlock it anymore, because as they become inactive or shut down.

A quick fix would be to set syncWait.lockState to None prior to return from 
~Cache::Cache(). I dunno what would be a good fix though

Please see bug http://bugs.mysql.com/bug.php?id=42343 - crash in Cache::flush

Vlad, Good catch and probably good guess.  Except that the crash you found is probably not this one, since this happens in Transaction::~Transaction.  But just like yours, this mught be a problem with Transaction::syncIsActive which is also an 'event' type syncObject.  Please use a different bug for your crash on Cache::syncWait.

Transaction::syncIsActive is probably also caught in a non-controlled-thread-exit-strategy.  But Transaction::~Transaction is supposed to unlock it only if itis owned by the current thread.  Maybe the TransactionManager could release all these at shutdown...

correct, the crash I saw was  for Bug#40633

Olav, the five threads that are sleeping are gopher threads.

Gopher::shutdown() does not do anything if Gopher::workedThread is NULL. the workerThread is set to NULL at the end Gopher::gopherThread(void).  I have no idea why, maybe some cludge?

I have verified this bug also with mysql-6.0.9-alpha-solaris10-x86_64 on a 4-core host. The assert failed (same stack trace) after 1919 runs of the described how-to-repeat procedure. 

Still running test against relatively up-to-date falcon-team bzr branch (6.0.10) (build options may differ from released builds, though). I am also running the test on a "Niagara 2" machine (8 cores, 64 threads), with no crash so far, so it is not easily reproducible in any SMP environment, given the information available so far.

I changed my scripts to do 'truncate table sbtest' prior to the drop/create database sequence prior to shutdown. Another intermittent error has occurred:

ERROR 1296 (HY000) at line 1: Got error 211 'couldn't update system.indexs for SBTEST..PRIMARY_KEY' from Falcon

This was followed by the same crash on shutdown. There were no error messages prior to the crash on shutdown.

It turns out that Falcon wasn't fast. The inserts done by the 'prepare' run of sysbench failed, so the sbtest table was empty. The failure was on the first insert into an empty Falcon table. This only happened once and there was nothing in the db *.err file. The output from sysbench is:

ALERT: failed to execute MySQL query: `INSERT INTO sbtest(k, c, pad) VALUES (1,' ','qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt'),...,(63,' ',ALERT: Error 1022 Can't write; duplicate key in table 'sbtest'

The crash on shutdown error happens on 6.0.8 and 6.0.9

Olav,  Why is ~TransactionManager(void) committing any transactions?  Shouldn't they be rolled back?  

TransactionManager::~TransactionManager(void)
{
	rolledBackTransaction->release();
	
	for (Transaction *transaction; (transaction = activeTransactions.first);)
		{
		transaction->inList = false;
		transaction->state = Committed;
		activeTransactions.first = transaction->next;
		transaction->release();
		}
}

Second question.  Any idea which SyncObject is still locked?
Third question.  Will a transaction state object with useCounts avoid this crash?