Bug #43964 Falcon Exception: corrupted record after recovery
Submitted: 30 Mar 18:21 Modified: 28 Jun 23:27
Reporter: Philip Stoev
Status: In progress
Category:Server: Falcon Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Christopher Powers Target Version:6.0-beta
Tags: F_ENCODING
Triage: D1 (Critical)

[30 Mar 18:21] Philip Stoev
Description:
When recovering after the falcon_transaction RQG test, Falcon printed the following in
the log:

...
# 23:43:38 corrupted record (overrun), table 25, record 21951
# 23:43:38 Compressed
# 23:43:38     13823, 13845, -5323, 13714, 9302, 
# 23:43:38 Compressed
# 23:43:38     .5.65..5V.
# 23:43:38 Exception: corrupted record, table 25, record 21951
# 23:43:39 corrupted record (overrun), table 25, record 21951
# 23:43:39 Compressed
# 23:43:39     13823, 13845, -5323, 13714, 9302, 
# 23:43:39 Compressed
# 23:43:39     .5.65..5V.
# 23:43:39 Exception: corrupted record, table 25, record 21951
# 23:43:39 Table::databaseFetch record 21951 in table TEST.TABLE10_FALCON_INT_AUTOINC:
corrupted record, table 25, record 21951
...

How to repeat:
This only happened once and a tablespace is not available at this time. This bug will be
updated when a tablespace becomes available.
[23 Apr 23:13] Vladislav Vaintroub
Why this is a critical bug? Did it affect any functionality? Is there anything else than
some printout in the log file?
[25 Apr 2:57] Philip Stoev
Vlad, when it comes to cryptic recovery messages, I need to assume the worst. In this
case, I assume that one or more records are either inaccessible or have bad data.

If you are not happy with any of the triage values, please modify them and/or set the
"need triage" flag so that this bug is triaged again.
[29 Apr 22:58] Kevin Lewis
We need to determine if this message truly indicates data corruption as Philip rightfully
assumes.  Also, Vlad suggested that this bug may have been fixed by recent
RecordLocatorPage bugs, like Bug#39890
[15 May 21:53] Christopher Powers
Unable to recreate the error with the tablespace files provided. The recovery process
cannot initialize the FALCON_USER tablespace because it cannot find the Page Inventory
Page. (Note that the RQG testcase is falcon_pagesize_32K and that falcon_page_size =
32K.)

Recovering database
/home/cpowers/work/dev/dev-03/mysql/mysql-test/var/mysqld.1/data/falcon_master.fts ...
First block in the serial log is 202
Last block in the serial log is 3023
First block used in recovery is 3023
Recovery phase 1...
Processed:        2
Recovery phase 2...
TableSpaceManager::bootstrap() : table space FALCON_USER, id 1, type 0, filename
falcon_user.fts
TableSpaceManager::bootstrap() : table space FALCON_TEMPORARY, id 2, type 0, filename
falcon_temporary.fts
Recovering from lost page inventory page 1
Exception: Transaction inventory page not found
Can't open tablespace FALCON_USER, id 1 : Transaction inventory page not found
Processed:        2
Recovery phase 3...
Processed:        2
Recovery complete
Exception: write error on page 1 (4096/4096/-1) of
"/home/cpowers/work/dev/dev-03/mysql/mysql-test/var/mysqld.1/data/falcon_user.fts": Bad
file descriptor (9)
Thread::thread: thread 1082132816: write error on page 1 (4096/4096/-1) of
"/home/cpowers/work/dev/dev-03/mysql/mysql-test/var/mysqld.1/data/falcon_user.fts": Bad
file descriptor (9)

The exception occurs when Dbb::getLastPage() is called. Dbb::getLastPage() is not
essential to the recovery process, and if it is disabled, the recovery completes but
still leaves the tablespaces unusable.

It appears that the database is not viable. 

There have been several Recovery fixes since the bug was opened. I suggest monitoring PB2
for another occurrence.
[24 Jun 10:21] Olav Sandstaa
This situation happened again. This time it was during validation of the tables in the
database after a "successful" recovery in the falcon_many_indexes test. Multiple records
in a table are reported as "corrupted". The log file contains:

# 21:59:56 Verifying table: table100_falcon_key_pk_parts_2_int_autoinc; database: test
# 21:59:56 28: TransID=19  CommitNoUpdates  (null)
# 21:59:56 28: TransID=20  CommitNoUpdates  (null)
# 21:59:59 corrupted record (overrun), table 26, record 1049
# 21:59:59 Compressed
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     11, 27, 11, 27, 11, 27, 11, 27, 11, 27, 
# 21:59:59     5131, 18964, -5120, 26726, 26726, 31076, 31076, 29043, 29043, 26229, 
# 21:59:59     26229, 28010, 28010, 26722, 26722, 25963, 25963, 26982, 26982, 31091, 
# 21:59:59     31091, 27499, 27499, 27238, 27238, 29801, 29801, 26477, 26477, 26731, 
# 21:59:59     26731, 30066, 30066, 26729, 26729, 29295, 29295, 30050, 30050, 24944, 
# 21:59:59     24944, 30570, 30570, 28534, 28534, 30312, 30312, 27746, 27746, 27249, 
# 21:59:59     27249, 25716, 25716, 30576, 30576, 27750, 27750, 26218, 26218, 30320, 
# 21:59:59     30320, 26216, 26216, 28784, 28784, 30832, 30832, 28259, 28259, 25195, 
# 21:59:59     25195, 30063, 30063, 28262, 28262, 25190, 25190, 30064, 30064, 25700, 
# 21:59:59     25700, 30070, 30070, 27246, 27246, 28777, 28777, 31092, 31092, 28013, 
# 21:59:59     28013, 29815, 29815, 28779, 28779, 26988, 26988, 28513, 28513, 24933, 
# 21:59:59     24933, 27505, 27505, 26724, 26724, 30580, 30580, 25718, 25718, 27237, 
# 21:59:59     27237, 29556, 29556, 29815, 29815, 28769, 28769, 26726, 26726, 27753, 
# 21:59:59     27753, 330, 12670, -28989, -15567, 12686, -28989, -15567, 12686, -28989, 
# 21:59:59     -15567, 12686, -28989, -15567, 12686, -28989, -15567, 12686, -28989,
-15567, 
# 21:59:59     12686, -28989, -15567, 12686, -28989, -15567, 12686, -28989, -15567,
12686, 
# 21:59:59     -28989, -15567, 12686, -28989, -15567, 12686, -28989, -15567, 12686,
-28989, 
# 21:59:59     -15567, 12686, -28989, -15567, 12686, -28989, -15567, 12686, -28989,
-15567,
[28 Jun 23:27] Christopher Powers
Associated push: http://pb2.norway.sun.com/web.py?template=push_details&push=427310
[30 Jun 1:50] Christopher Powers
This is not a recovery problem. There appears to be an issue with records that are stored
across one or more overflow pages.

The error occurs when Stream::decompress() encounters record 1049 in
table100_falcon_key_pk_parts_2_int_autoinc. Record 1049 was apparently too large to fit
onto the one page, so the record "tail" was relegated to an overflow page.

The failure can be easily recreated using the database files noted above. The error
occurs when record 1049 is accessed for the first time, after recovery completes. The
error does not occur after that.

Debugging reveals that record 1049 is stored on primary page 2516 and overflow page is
810. Neither of these pages is accessed during recovery, nor is record 1049 accessed
during recovery.

Interestingly, none of the record data is stored on the primary page, only the overflow
page. The next step is to review the overflow mechanism, especially for records that
equal the page size.