Bug #33608 Falcon crash and corruption on blob DML and restart
Submitted: 1 Jan 2008 17:53    Modified: 5 May 2008 18:27
Reporter: Philip Stoev
Status: Closed
Category: MySQL Server: Falcon storage engine    Severity: S1 (Critical)
Version: 6.0.4-p2    OS: Any
Assigned to: Kevin Lewis    CPU Architecture: Any

[1 Jan 2008 17:53] Philip Stoev
Description:
A mix of blob inserts, updates, and deletes, combined with server restarts, will eventually produce a tablespace in which a particular query consistently causes a crash.

Stack trace is:

0x824a181 handle_segfault + 441
0xbbc402 (?)
0x83fce5d Error::error(char const*, ...) + 73
0x8472e7d Cache::fetchPage(Dbb*, int, PageType, LockType) + 441
0x83f5c11 Dbb::fetchPage(int, PageType, LockType) + 33
0x844b3ab Section::updateRecord(int, Stream*, unsigned int, bool) + 355
0x83f6353 Dbb::updateRecord(int, int, Stream*, unsigned int, bool) + 175
0x8446a8a SRLUpdateRecords::redo() + 370
0x844feaa SerialLog::recover() + 1230
0x83ec677 Database::openDatabase(char const*) + 367
0x83e8fad Connection::getDatabase(char const*, char const*, Threads*) + 193
0x83e68c2 Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*) + 74
0x83bb3bf StorageDatabase::getOpenConnection() + 63
0x83bf164 StorageHandler::initialize() + 108
0x83bebce StorageHandler::getStorageConnection(StorageTableShare*, THD*, int, OpenOption) + 426
0x83b1582 StorageInterface::open(char const*, int, unsigned int) + 98
0x830abf2 handler::ha_open(st_table*, char const*, int, int) + 34
0x828b393 open_table_from_share(THD*, st_table_share*, char const*, unsigned int, unsigned int, unsigned int, st_table*, open_table_mode) + 783
0x82885f8 open_unireg_entry(THD*, st_table*, TABLE_LIST*, char const*, char*, unsigned int, st_mem_root*, unsigned int) + 124
0x828315a open_table(THD*, TABLE_LIST*, st_mem_root*, bool*, unsigned int) + 1954
0x82843c7 open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int) + 1103
0x82c3d66 mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool) + 186
0x8255ccd mysql_execute_command(THD*) + 2949
0x825b5a5 mysql_parse(THD*, char const*, unsigned int, char const**) + 269
0x82540bd dispatch_command(enum_server_command, THD*, char*, unsigned int) + 909
0x8253d00 do_command(THD*) + 176
0x825254f handle_one_connection + 287

How to repeat:
Option #1. Run attached .pl script and wait

Option #2. Mount attached tablespace and issue:

UPDATE blobtest.blobtest SET bl = REPEAT('X', 140288), md5 = MD5(REPEAT('X', 140288)) WHERE id = 2;
[1 Jan 2008 17:54] Philip Stoev
File for reproducing bug #33608; please edit the top of the file to match your setup.

Attachment: bug33608.pl (application/octet-stream, text), 2.16 KiB.

[1 Jan 2008 17:57] Philip Stoev
ZIP of bug #33608 tablespace causing crash on future queries.

Attachment: bug33608.zip (application/x-zip-compressed, text), 32.12 KiB.

[1 Jan 2008 18:06] Philip Stoev
In fact, any query issued against the corrupted tablespace will cause a crash. It is not known which query caused the corruption itself.

Please note that what is specific about this test is that the server is restarted repeatedly throughout the run, and that the transactions contain blob updates that change the size of the blob. Only one connection thread is used.

The problem does not occur without the periodic restarts.

Please let me know if you want me to isolate the issue further.
[2 Jan 2008 14:01] Philip Stoev
The bug33608-2.zip file was produced using the same script; however, instead of a crash, the server starts to report that the blobtest.blobtest table does not exist, even though SHOW TABLES shows that it is there.
[2 Jan 2008 14:02] Philip Stoev
Corrupted tablespace produced by the Perl script from bug #33608

Attachment: bug33608-2.zip (application/x-zip-compressed, text), 29.62 KiB.

[7 Jan 2008 14:28] Kevin Lewis
Jim Starkey wrote:
I'm having next to no luck with 33608.  The test is driven by a random 
number generator.  Sometimes it inserts, updates, or deletes blobs and 
sometimes it restarts the server.

The files in question show that the server was restarted twice.  The 
symptom of the corruption is a record locator page with a dangling data 
page.  Normally, we will recreate the record locator page unless we know 
for sure it has been written.  In this case, however, the last known 
write was before the most recent restart.  At the moment, however, we 
don't carry that information in the serial log, though we have provision 
to add it.
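
For illustration only, the decision described above might be sketched like this, using hypothetical names (the actual Falcon recovery code is different): the record locator page is rebuilt unless the serial log proves the page already reached disk, and the log does not yet carry that proof across a restart.

struct LocatorPageInfo
{
    bool knownWritten;   // hypothetical flag; not currently carried in the serial log
};

// Recreate the record locator page unless we know for sure it was written.
// A write that happened before the most recent restart is not remembered,
// so the page can end up pointing at a dangling data page.
bool shouldRecreateLocatorPage(const LocatorPageInfo &info)
{
    return !info.knownWritten;
}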

I think this is a bona fide bug, but unless we expect our beta users to 
start and stop the server in a loop commingled with update 
operations, I think we should consider postponing a fix until after 
beta.  I'm quite sure that if the server had stayed up long enough for 
an I/O thread to finish writing recovered pages we probably wouldn't 
have seen this (how's that for hedging...).
[10 Jan 2008 20:30] Jeffrey Pugh
Re Kevin: Double recovery issue; high risk to fix in 6.0.4
[23 Jan 2008 21:39] Kevin Lewis
Jim Starkey pushed a change to Falcon that requires it to wait for a page cache flush at the end of recovery.  It used to do this, but when the pool of I/O threads was added, the recovery process just called for a flush to happen and then returned.  Now the flush waits for all the I/O threads to complete.

When recovery does not wait for the flush to complete, new blocks can be written to the serial log before all the pages from the previous recovery have been written to disk.  These new blocks carry a recovery start point that lies after the previous one.  If a shutdown happens before those pages are fully written, the following recovery will not start early enough.

This very likely solves the double recovery problems exhibited by Bugs 32992, 33608 & 33665.  I am putting these bugs into Patch Pending so that this fix can be verified, or not.
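
To make the ordering concrete, here is a minimal sketch of the fix under hypothetical names (the real Falcon classes and methods differ): recovery now blocks on the page cache flush instead of returning as soon as the flush has been requested.

class PageCache {
public:
    void scheduleFlush() { /* hand the dirty pages to the pool of I/O threads */ }
    void waitForWrites() { /* block until every queued page write has reached disk */ }
};

class SerialLog {
public:
    void redoPendingRecords(PageCache &cache) { /* replay log records into page images */ }
    void resumeLogging() { /* blocks written from here carry a later recovery start point */ }
};

void recoverDatabase(SerialLog &log, PageCache &cache)
{
    log.redoPendingRecords(cache);  // apply the serial log to the page cache
    cache.scheduleFlush();          // before the fix, recovery returned right after this call
    cache.waitForWrites();          // fix: wait for all the I/O threads to complete
    log.resumeLogging();            // only now may new serial log blocks move the start point
}
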
[25 Jan 2008 23:07] Kevin Lewis
Explanation of the double recovery problem, by Ann Harrison:
==============================
We have been plagued recently with failures in recovery after
a previous recovery had not succeeded.  That turns out to
have been the result of a change to the page cache checkpoint
algorithm.  The fix was simple, once the cause was understood.
I'm going to try to explain the interaction of recovery and
checkpointing.

When Falcon starts, it determines whether it was shut down
cleanly.  If not, it invokes recovery to apply the remaining
active parts of the serial log.  Recovery begins by finding
the position in the log after which changes may need to be
applied.  That position is called the recovery start point.
Every entry in the serial log includes the recovery start
point that was current when the entry was written.  If the
shutdown was not clean, Falcon begins recovery by identifying
the active log, reading the last entry, then backing up to its
recovery start point.
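
As a rough illustration of that search (hypothetical types; Falcon's serial log format is more involved), the last entry of the active log is enough to locate the point recovery must back up to:

#include <cstdint>
#include <vector>

struct LogEntry {
    uint64_t position;            // where this entry sits in the serial log
    uint64_t recoveryStartPoint;  // start point that was current when the entry was written
};

// Because every entry carries the start point in force when it was written,
// reading the last entry of the active log tells recovery how far back to go.
uint64_t findRecoveryStartPoint(const std::vector<LogEntry> &activeLog)
{
    if (activeLog.empty())
        return 0;                 // nothing to recover
    return activeLog.back().recoveryStartPoint;
}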

Recovery consists of applying changes in the serial log to
page images in the page cache, until they are all done, then
flushing the cache by forcing a checkpoint.

The test systems often shut down the server before Falcon has
moved all changes from the serial log to the main database.
After some errors, the test system restarts the server again,
causing Falcon to start recovery, then shut down again.  For
the last several weeks, recovery has failed on the second restart.
One odd characteristic Jim noticed was that if he restarted from
an earlier stable point in the serial log, recovery succeeded.

Originally, a cache checkpoint consisted of a single I/O thread
writing all the dirty pages from the cache, using buffered I/O
without an fsync.  That was pretty fast, but not durable after
an ill-timed crash.  Adding an fsync caused a checkpoint to take
longer than the interval between checkpoints, which was a
performance problem.

A checkpoint now causes dirty pages in the cache to be
consolidated in a separate area so contiguous pages can be
flushed in a single write, then a bevy of I/O threads push
the page clusters to disk using direct I/O or buffered I/O
with a sync on each write (O_SYNC) depending on the capabilities
of the system.
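
A rough sketch of that clustering step, with made-up structures rather than Falcon's real page cache, might look like this: sort the dirty pages and group runs of contiguous page numbers so each run can go out in one direct or O_SYNC write.

#include <algorithm>
#include <cstdint>
#include <vector>

struct DirtyPage {
    int32_t pageNumber;
    std::vector<char> image;
};

// Group contiguous dirty pages into clusters; an I/O thread can then flush
// each cluster with a single synchronous write instead of one write per page.
std::vector<std::vector<DirtyPage>> buildWriteClusters(std::vector<DirtyPage> pages)
{
    std::vector<std::vector<DirtyPage>> clusters;
    std::sort(pages.begin(), pages.end(),
              [](const DirtyPage &a, const DirtyPage &b) { return a.pageNumber < b.pageNumber; });

    for (DirtyPage &page : pages)
    {
        if (clusters.empty() || page.pageNumber != clusters.back().back().pageNumber + 1)
            clusters.emplace_back();
        clusters.back().push_back(std::move(page));
    }
    return clusters;
}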

The original page cache checkpoint got all pages into the
system file cache before it continued.  The new system is
asynchronous.

The problem was that the recovery system declared the
database recovered before the dirty pages in the page cache
were actually on disk. Other processing began, including
writes to the serial log with a new recovery start point.
If the server stopped at that point, the next attempt to
recover would fail because the first recovery did not get
fully applied to the database.

The solution is for recovery to wait until the I/O threads
are complete.
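
One way to picture that wait (generic C++, not Falcon's actual thread code) is a counter of outstanding page writes that the checkpoint increments, the I/O threads decrement, and recovery blocks on until it drains to zero:

#include <condition_variable>
#include <mutex>

class WriteTracker {
public:
    void writeScheduled()                       // checkpoint queues a page cluster
    {
        std::lock_guard<std::mutex> guard(mutex);
        ++pendingWrites;
    }

    void writeCompleted()                       // an I/O thread finished its synchronous write
    {
        std::lock_guard<std::mutex> guard(mutex);
        if (--pendingWrites == 0)
            allWritten.notify_all();
    }

    void waitForWrites()                        // recovery blocks here before declaring success
    {
        std::unique_lock<std::mutex> guard(mutex);
        allWritten.wait(guard, [this] { return pendingWrites == 0; });
    }

private:
    std::mutex mutex;
    std::condition_variable allWritten;
    int pendingWrites = 0;
};
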
[11 Feb 2008 20:45] Kevin Lewis
Patch is in mysql-6.0-falcon-team and mysql-6.0-release version 6.0.4
[5 May 2008 18:27] Paul DuBois
Noted in 6.0.4 changelog.

Incomplete Falcon recovery after server restarts eventually resulted
in tablespace corruption.