Bug #42184 Crash in Falcon on shutdown
Submitted: 18 Jan 7:32 Modified: 2 Sep 12:06
Reporter: Mark Callaghan
Status: Verified
Category:Server: Falcon Severity:S3 (Non-critical)
Version:6.0.8,6.0.9 OS:Any
Assigned to: Olav Sandstaa Target Version:6.0-rc
Tags: transaction, crash, F_SHUTDOWN
Triage: Triaged: D2 (Serious)

[18 Jan 7:32] Mark Callaghan
Description:
On the bright side, Falcon was astonishingly fast.

It was started with default parameters except for:
falcon_record_memory_max=500M
falcon_page_cache_size=500M

The test was sysbench OLTP readonly with 64 concurrent users. But the crash is on
shutdown, and my scripts do:
1) run sysbench test
2) drop database used for sysbench table
3) create database used for sysbench table
4) shutdown mysqld -- crash is here

[Falcon] Error: assertion (lockState == 0) failed at line 132 in file SyncObject.cpp

thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x24) [0x902dd4]
/usr/sbin/mysqld(handle_segfault+0x340) [0x4abe50]
/usr/grte/v1/lib64/libpthread.so.0 [0x2ba63673f550]
/usr/grte/v1/lib64/libpthread.so.0(raise+0x2b) [0x2ba63673f43b]
/usr/sbin/mysqld(Error::error(char const*, ...)+0xea) [0x68485a]
/usr/sbin/mysqld(SyncObject::~SyncObject()+0x26) [0x6952b6]
/usr/sbin/mysqld(Transaction::~Transaction()+0xf9) [0x6a5589]
/usr/sbin/mysqld(TransactionManager::~TransactionManager()+0x3b) [0x6a7c1b]
/usr/sbin/mysqld(Database::~Database()+0x27c) [0x6bf64c]
/usr/sbin/mysqld(StorageDatabase::close()+0x2e) [0x68984e]
/usr/sbin/mysqld(StorageHandler::shutdownHandler()+0x4a) [0x68c89a]
/usr/sbin/mysqld(StorageInterface::falcon_deinit(void*)+0x16) [0x67f736]
/usr/sbin/mysqld(ha_finalize_handlerton(st_plugin_int*)+0x2b) [0x58416b]
/usr/sbin/mysqld [0x5fe7a6]
/usr/sbin/mysqld [0x5ff20c]
/usr/sbin/mysqld(plugin_shutdown()+0x14f) [0x5ff65f]
/usr/sbin/mysqld [0x4aa571]
/usr/sbin/mysqld(unireg_end()+0xe) [0x4aa97e]
/usr/sbin/mysqld(kill_server_thread+0x373) [0x4ae953]
/usr/grte/v1/lib64/libpthread.so.0 [0x2ba63673909a]
/usr/grte/v1/lib64/libc.so.6(clone+0x72) [0x2ba6375ba3d2]

How to repeat:
see above

Suggested fix:
?
[19 Jan 14:59] John H. Embretsen
This is the same assertion as is being hit in Bug#40633 - "pushbuild failure: Falcon
assertion "lockState == 0" failed in SyncObject.cpp", which is great if this is
reproducible, because Bug#40633 has so far been hard to reproduce/debug.
Fixing one of the bugs will probably/hopefully fix both.
[19 Jan 18:15] Mark Callaghan
This is easy for me to reproduce:
1) run sysbench OLTP readonly (create database test, create and insert to table, run
query test for 60 seconds)
2) drop database test; create database test
3) shutdown database

The crash happens frequently at step 3 (at least 10% of the time)
[22 Jan 16:21] John H. Embretsen
I'm not able to reproduce this, at least not on 32-bit linux using the current falcon bzr
branch (6.0.10) and sysbench 0.4.10. Using the following sysbench option for the test
runs:

sysbench \
 --num-threads=64 \
 --mysql-db=test \
 --mysql-port=10010 \
 --mysql-host=127.0.0.1 \
 --mysql-engine-trx=yes \
 --mysql-table-engine=Falcon \
 --mysql-user=root \
 --oltp-read-only \
 --test=oltp run

15 runs of "start server", "prepare test", "run test", drop database test; create
database test, "shutdown server", "examine server log" and no issues seen. Perhaps 6.0.8
and/or some specific platform and/or options is required?
[22 Jan 20:04] Mark Callaghan
Are you using an SMP server? My servers have 4 or more cores.

I have filed several bugs for falcon crashes using whatever was the latest release
version of falcon at the time. Most of them seem to eventually get closed with 'cannot
reproduce using the latest internal version of falcon'. Either I am wasting my time
trying things out (and I am not trying to crash falcon with my tests) or falcon should be
made much more stable before being shared with the public.
[22 Jan 20:33] Vladislav Vaintroub
Mark, if you got core, would it be possible to provide stack traces for all threads? If
any falcon threads exist at this point, it should be only in sleeping state, doing
nothing.
That's in  theory, in practice it looks like shutdown inteferes with some other running
thread, but it is not obvious which one.
[23 Jan 14:45] John H. Embretsen
Stacktraces from threads, grabbed from core

Attachment: bug42184-threadstack.txt (text/plain), 9.57 KiB.

[23 Jan 15:14] John H. Embretsen
I was finally able to reproduce on a Solaris 10 x86_64 (4-cores, 2-sockets AMD CPU)
machine, using Sysbench 0.4.10 against MySQL Community Server 6.0.8 alpha. The crash
happened after 41 iterations of the described "how to repeat" procedure. 

Attached stack traces from dbx. I have the core available, but it is 700 MB so I cannot
attach it to the bug report... Let me know if someone needs it.

Mark, 

I am sorry that you feel that many of your bug reports are closed as "Can't repeat".
Sometimes crashes like this depend on so many things (e.g. timing), and it can be hard
for us to create an environment similar to yours, especially when there are few or no
environment details in the bug report.

(And yes, the 32-bit linux host I was using previously was a 4-core machine as well.)

Sometimes bugs are also fixed in the time between an official release and the time the
bug is reported, that is just the way it is.

Please note that these are alpha releases of MySQL Server, and users are likely to
experience some instability (not that I don't think you know that). Development and
stabilization efforts are ongoing almost every day of the week, and I think that most of
the time the stability of the mysql-6.0 branch on Launchpad (or mysql-6.0-falcon for that
matter) is higher/better than the previous official alpha release.

I can assure you that the Falcon team very much appreciates the fact that you are trying
out Falcon and reporting the bugs you find, so I hope you will continue doing that :)
[26 Jan 10:49] Olav Sandstaa
Updated call stacks for all threads in core file produced by John

Attachment: bug42184-threadstack2.txt (text/plain), 10.40 KiB.

[26 Jan 10:55] Olav Sandstaa
Comment to the attached call stack for all active threads:

-only the thread doing the shutdown of Falcon is active executing code in Falcon.
-there are five Falcon threads sleeping in Synchronize::sleep(), ideally I would have
liked these to be gone before we start destroying central data structures but as long as
they are not active we should be fine 

Basically, the call stacks give no hints about that any thread have a lock on the 
SyncObject that leads to this assert.
[26 Jan 16:22] Vladislav Vaintroub
I have reproduced the crash ion my machine.
The syncObject in question is Cache::syncWait. It is not a usual synchronization object
(there is no Sync guarding it that will automatically unlocked when a function calling it
returns). It is rather used as an event, I believe to prevent multiple flushes running in
parallel.

MY guess is that Cache::flush locks it at a time, where io threads cannot unlock it
anymore, because as they become inactive or shut down.

A quick fix would be to set syncWait.lockState to None prior to return from 
~Cache::Cache(). I dunno what would be a good fix though
[26 Jan 16:39] Philip Stoev
Please see bug http://bugs.mysql.com/bug.php?id=42343 - crash in Cache::flush
[26 Jan 19:53] Kevin Lewis
Vlad, Good catch and probably good guess.  Except that the crash you found is probably not
this one, since this happens in Transaction::~Transaction.  But just like yours, this
mught be a problem with Transaction::syncIsActive which is also an 'event' type
syncObject.  Please use a different bug for your crash on Cache::syncWait.

Transaction::syncIsActive is probably also caught in a
non-controlled-thread-exit-strategy.  But Transaction::~Transaction is supposed to unlock
it only if itis owned by the current thread.  Maybe the TransactionManager could release
all these at shutdown...
[26 Jan 20:01] Vladislav Vaintroub
correct, the crash I saw was  for Bug#40633
[26 Jan 20:09] Vladislav Vaintroub
Olav, the five threads that are sleeping are gopher threads.

Gopher::shutdown() does not do anything if Gopher::workedThread is NULL. the workerThread
is set to NULL at the end Gopher::gopherThread(void).  I have no idea why, maybe some
cludge?
[28 Jan 10:20] John H. Embretsen
I have verified this bug also with mysql-6.0.9-alpha-solaris10-x86_64 on a 4-core host.
The assert failed (same stack trace) after 1919 runs of the described how-to-repeat
procedure. 

Still running test against relatively up-to-date falcon-team bzr branch (6.0.10) (build
options may differ from released builds, though). I am also running the test on a
"Niagara 2" machine (8 cores, 64 threads), with no crash so far, so it is not easily
reproducible in any SMP environment, given the information available so far.
[29 Jan 5:18] Mark Callaghan
I changed my scripts to do 'truncate table sbtest' prior to the drop/create database
sequence prior to shutdown. Another intermittent error has occurred:

ERROR 1296 (HY000) at line 1: Got error 211 'couldn't update system.indexs for
SBTEST..PRIMARY_KEY' from Falcon

This was followed by the same crash on shutdown. There were no error messages prior to
the crash on shutdown.
[29 Jan 21:58] Mark Callaghan
It turns out that Falcon wasn't fast. The inserts done by the 'prepare' run of sysbench
failed, so the sbtest table was empty. The failure was on the first insert into an empty
Falcon table. This only happened once and there was nothing in the db *.err file. The
output from sysbench is:

ALERT: failed to execute MySQL query: `INSERT INTO sbtest(k, c, pad) VALUES (1,'
','qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrtttttttttt'),...,(63,' ',ALERT: Error 1022
Can't write; duplicate key in table 'sbtest'

The crash on shutdown error happens on 6.0.8 and 6.0.9
[13 Mar 17:06] Kevin Lewis
Olav,  Why is ~TransactionManager(void) committing any transactions?  Shouldn't they be
rolled back?  

TransactionManager::~TransactionManager(void)
{
	rolledBackTransaction->release();
	
	for (Transaction *transaction; (transaction = activeTransactions.first);)
		{
		transaction->inList = false;
		transaction->state = Committed;
		activeTransactions.first = transaction->next;
		transaction->release();
		}
}

Second question.  Any idea which SyncObject is still locked?
Third question.  Will a transaction state object with useCounts avoid this crash?