Bug #38970 Crash in function called from falcon_init when running test cases
Submitted: 22 Aug 2008 19:45 Modified: 15 May 14:52
Reporter: Sven Sandberg
Status: Closed
Category:Server: Falcon Severity:S1 (Critical)
Version:6.0-rpl OS:Any
Assigned to: Vladislav Vaintroub Target Version:6.0-beta
Tags: 6.0-rpl-green, replication, test failure, core, crash, F_ERROR HANDLING
Triage: Triaged: D1 (Critical)

[22 Aug 2008 19:45] Sven Sandberg
Description:
When running the test suite in 6.0-rpl locally, I got coredumps inside Falcon code during
the slave's startup in five replication tests. The coredumps do not occur on every run.

I will attach the stack traces, obtained with:

  gdb /path/to/mysqld /path/to/core --eval-command='thread apply all bt' --batch

How to repeat:
Run the test suite in 6.0-rpl.
[22 Aug 2008 19:47] Sven Sandberg
list of stack traces for five crashes

Attachment: stack-traces (application/octet-stream, text), 31.18 KiB.

[22 Aug 2008 20:11] Sven Sandberg
Correction: the coredumps happened not only during the startup of slave servers but also
during the startup of master servers.
[25 Aug 2008 4:11] Kevin Lewis
This looks less like a Falcon issue than a disk or I/O subsystem issue. There are two kinds
of errors. The first is in IO::writePages (IO.cpp:338), "write error on page %d (%d/%d/%d)
of \"%s\": %s (%d)", which sets a fatal error flag. After that, other threads hit the
second kind, also in IO::writePages (IO.cpp:308): "can't continue after fatal error".

Since the error is intermittent, my guess is that the problem is not in the way the file
is opened.
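The failure pattern Kevin describes, where one failed write sets a fatal flag and every later write from any thread refuses to proceed, can be modeled roughly as below. This is a hypothetical sketch, not Falcon's IO.cpp: the class `PageWriter`, its members and the helper `tryWrite` are invented for illustration, the `deviceOk` flag stands in for the result of a real `pwrite()` call, and the sketch throws at both points for simplicity.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical page-level I/O layer modeling a "sticky" fatal error flag.
class PageWriter {
public:
    // Simulates writing one page; deviceOk stands in for pwrite() succeeding.
    void writePage(int pageNumber, bool deviceOk) {
        // Once any thread has hit a fatal error, every later write is refused.
        if (fatalError_.load())
            throw std::runtime_error("can't continue after fatal error");

        if (!deviceOk) {
            // First failure: record it so that all other writers stop too.
            fatalError_.store(true);
            throw std::runtime_error("write error on page " + std::to_string(pageNumber));
        }
        ++pagesWritten_;
    }

    bool hasFatalError() const { return fatalError_.load(); }
    int pagesWritten() const { return pagesWritten_.load(); }

private:
    std::atomic<bool> fatalError_{false};
    std::atomic<int> pagesWritten_{0};
};

// Hypothetical helper: returns the error text, or "" if the write succeeded.
std::string tryWrite(PageWriter& writer, int page, bool deviceOk) {
    try {
        writer.writePage(page, deviceOk);
        return "";
    } catch (const std::runtime_error& e) {
        return e.what();
    }
}
```

In this model a single real I/O failure explains why most stack traces show only the secondary "can't continue after fatal error" message: every write issued after the first failure reports that text, even if the device would now accept it.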
[25 Aug 2008 19:12] Vladislav Vaintroub
Found this in the stack traces:

[Falcon] Error: write error on page 0 (4096/4096/4) of
"/home/sven/bzr/merge/6.0-rpl_from_5.1-rpl/mysql-test/var/2/mysqld.2/data/falcon_master.fts":
Input/output error (5)

So pwrite is failing with an I/O error (errno 5, EIO). I have no idea how this can happen.
My wild guess is that file descriptor 5 was opened by Falcon, then closed by somebody else,
and we are now calling pwrite on a socket or something similar.
[25 Aug 2008 19:13] Vladislav Vaintroub
Correction: "file descriptor 5" in my previous comment should be "file descriptor 4".
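Vladislav's hypothesis rests on a POSIX rule: open() returns the lowest-numbered descriptor that is not currently open. If some other component closes a descriptor it does not own and a new file or socket is then opened, the old number silently refers to the new object, and a later pwrite() through the stale number hits the wrong target. The sketch below demonstrates the hazard with two temporary files; the names `makeTempFile`, `ReuseResult` and `demonstrateStaleDescriptor` and the /tmp paths are hypothetical and assume a single-threaded POSIX process.

```cpp
#include <cassert>
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: creates an empty temporary file and returns its path.
std::string makeTempFile() {
    char path[] = "/tmp/fdreuseXXXXXX";
    int fd = mkstemp(path);  // creates and opens the file
    if (fd < 0) return "";
    close(fd);               // only the name is needed
    return path;
}

struct ReuseResult {
    bool sameNumber;             // did the second open() reuse the first number?
    std::string victimContents;  // what the stale write put into the other file
};

ReuseResult demonstrateStaleDescriptor() {
    std::string owned = makeTempFile();   // file the "owner" believes it holds
    std::string victim = makeTempFile();  // unrelated file opened later

    int ownerFd = open(owned.c_str(), O_RDWR);
    close(ownerFd);                       // closed by "somebody else"

    // The lowest free descriptor is ownerFd's old number, so it is reused here.
    int otherFd = open(victim.c_str(), O_RDWR);

    // The owner still writes through its old number; the data lands in victim.
    const char page[] = "PAGE";
    ssize_t n = pwrite(ownerFd, page, 4, 0);

    char buf[8] = {0};
    ssize_t r = pread(otherFd, buf, 4, 0);

    ReuseResult result{ownerFd == otherFd,
                       (n == 4 && r == 4) ? std::string(buf, 4) : std::string()};
    close(otherFd);
    unlink(owned.c_str());
    unlink(victim.c_str());
    return result;
}
```

Nothing fails at the moment of the stale write; the damage only shows up later, which is one reason this kind of bug tends to look intermittent.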
[13 Nov 2008 19:52] Sven Sandberg
Note that only crash 3 and crash 4 contain the text "Error: write error on page 0
(4096/4096/4)..."

Crash 1, 2, and 5 instead contain the text "can't continue after fatal error" in the
stack trace. I just reproduced the crash 1/2/5 type on my local machine with a 6.0-rpl tree
that already includes the fix for BUG#39458, so this is not a symptom of BUG#39458.
[13 Nov 2008 19:56] Sven Sandberg
stack trace from crash 6

Attachment: stacktrace (application/octet-stream, text), 7.86 KiB.

[13 Nov 2008 21:22] Kevin Lewis
Sven, I assume that you now consider this bug verified. Has it happened again since Aug
25? Assigning this to Vlad to look into. If this is a one-time problem, we may have to
make it Can't Repeat.
[14 Nov 2008 9:45] Sven Sandberg
Kevin, yes, crash #6 happened yesterday with a 6.0-rpl tree. With 6.0-rpl it usually
happens at least a couple of times every time I run the suite. Let me know if I should try
to reproduce it with 6.0 main.
[14 Nov 2008 12:24] Vladislav Vaintroub
Sven, please try to reproduce with main. It is not a typical crash; it looks very much as
if somebody (pointing at rpl ;)) closes file descriptors that belong to Falcon.
[18 Dec 2008 13:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/61964

2945 Vladislav Vaintroub	2008-12-18
       Bug #38970 Crash in function called from falcon_init when running test cases 
      Problem: Upon encountering I/O errors, Falcon crashes with an assert.
      Solution: Instead of asserting, throw an exception. This allows more graceful error
      handling during Falcon startup: the error text is written to the error log and Falcon
      does not load.
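The approach in the commit message (replace an assert with an exception that startup code catches, logs and turns into a failed initialization) can be sketched as below. All names (`StorageIoError`, `openTablespace`, `engineInit`) are hypothetical, the I/O failure is simulated by a flag, and returning a nonzero value is assumed here to be how the engine signals that it should not load; the actual patch at the commits URL above may be structured differently.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical exception type standing in for the engine's own error class.
struct StorageIoError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Simulated startup step that may hit an I/O error while writing page 0.
void openTablespace(bool ioFails) {
    if (ioFails)
        // Before the fix, this condition ended in an assert and a core dump.
        throw StorageIoError("write error on page 0: Input/output error (5)");
}

// Hypothetical init hook: returns 0 on success, 1 if the engine must not load.
// The error text goes to errorLog instead of taking down the whole server.
int engineInit(bool ioFails, std::ostream& errorLog) {
    try {
        openTablespace(ioFails);
        return 0;
    } catch (const StorageIoError& e) {
        errorLog << "[Falcon] Error: " << e.what() << '\n';
        return 1;
    }
}
```

The design point is that the decision to abort moves from the low-level I/O code, which cannot know the context, to the initialization code, which can fail just this engine and leave a readable message behind.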
[18 Dec 2008 14:04] Vladislav Vaintroub
Pushed into falcon-team
[13 Feb 8:25] Bugs System
Pushed into 6.0.10-alpha (revid:alik@sun.com-20090211182317-uagkyj01fk30p1f8) (version
source revid:hky@sun.com-20081218223730-ujuygclo2fezfurq) (merge vers: 6.0.9-alpha)
(pib:6)
[15 May 14:52] MC Brown
A note has been added to the 6.0.10 changelog: 

When the Falcon storage engine encountered an I/O error, mysqld would crash. Errors now
raise an exception, which is reported in the error log, and Falcon fails to initialize.