MySQL Bugs: #43459: falcon_ddl fails due to missing record after recovery

Bug #43459	falcon_ddl fails due to missing record after recovery
Submitted:	6 Mar 2009 15:52	Modified:	21 Feb 2014 18:04
Reporter:	Olav Sandstå	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0.10-alpha	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	error handling, F_DDL

Description:
This failure happened when running falcon_ddl:

After Falcon has done a "successfull" recovery, the test does a check for the content of the database and fails with the following written to the log:

# 19:58:51 Verifying table: c; database: x
# 19:58:51 13: CommitNoUpdates transaction 647
# 19:58:51 13: CommitNoUpdates transaction 648
# 19:58:51 13: CommitNoUpdates transaction 649
# 19:58:51 13: CommitNoUpdates transaction 650
# 19:58:51 13: CommitNoUpdates transaction 651
DBD::mysql::st execute failed: Can't find record in 'c' at lib/GenTest/Reporter/Recovery.pm line 196, <RDRFH> line 26.

How to repeat:
Might be reproducable by running falcon_ddl.

Suggested fix:
After recovery all tables should be there....

philip, please explain the error message. What did fail?

One of the SELECT queries that is run after recovery in order to validate the table failed with a "can't find record" message, which is not acceptable.

My understanding is that this message is caused by Falcon reporting that a given row exists , e.g. during an index scan, however refusing to retrieve the data from the record at the second pass. This signifies that the table is corrupted.

I will try to have the RQG print out the exact SQL of the failing query.

Binary:

http://loki08.norway.sun.com:8080/372537.mysql-6.0.11-alpha-solaris10-sparc.tar.gz

Tablespace:

http://loki08.norway.sun.com:8080/372630.mysql-6.0.11-alpha-solaris10-sparc-test.tar.gz , in master-data

To reproduce:

MTR_VERSION=1 perl mysql-test-run.pl --start-and-exit --mysqld=--datadir=master-data 1st

and then issue:

mysql> select * from `x`.`c`;
Empty set (0.00 sec)

mysql>  SELECT * FROM `x`.`c` WHERE `pk` >= -9;
ERROR 1032: Can't find record in 'c'

Here is an update on what I have found by using the Falcon database from the failing test:

The MySQL server claims that is has one index on the table named "c":

+-------+---------+------+-----+---------+----------------+
| Field | Type    | Null | Key | Default | Extra          |
+-------+---------+------+-----+---------+----------------+
| pk    | int(11) | NO   | PRI | NULL    | auto_increment | 
+-------+---------+------+-----+---------+----------------+
1 row in set (0.08 sec)

mysql> show index from c;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| c     |          0 | PRIMARY  |            1 | pk          | NULL      |        NULL |     NULL | NULL   |      | BTREE      |         |               | 
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.03 sec)

But by adding tracing to the Falcon code it seems like Falcon does not have any index on this table after recovery.

Some details:

1. In the call to StorageInterface::index_init() we are not finding the requested index but return 0 (OK).

2. In the call to StorageInterface::read_range_first() does a call to StorageInterface::scanRange() which again calls StorageTable::checkCurrentIndex(). The result of this due to the index not existing is that error code 120 (HA_ERR_KEY_NOT_FOUND). This results in the client get the error message:

ERROR 1032 (HY000): Can't find record in 'c'

The failing statement will "work as expected" if we changes the return value in StorageInterface::read_range_first() from HA_ERR_KEY_NOT_FOUND to HA_ERR_END_OF_FILE when StorageInterface::scanRange() returns an error. The same conversion of return value is done a few lines later in the function if index_next() returns an error.

Still, I do not consider this as a solution since we should not hide the fact that the index is missing. 

A "real solution" should involve avoiding that the server and Falcon have different views on what indexes a table has. Or at least we should detect this and either report or repair the missing index after recovery.

Setting the status back to verified and un-assign myself until we have a more clearer strategy for how this should be solved.

It may "work as expected" if we return EOF, but it probably won't get
the right answer - unless we were very lucky and the statement was in
fact at the end of the result set.

In order for work to continue on this bug, I need to modify the RQG so that it uses kill -11 and not kill -9 when killing the server before recovery. This way a record of the threads that existed at the time of the crash will be preserved (alternatively, gdb can be used to stop the process, list the threads and then kill the server).

Once this is in place and the bug has been reproduced again, we will know better what caused it, including the actual query that was in progress at the time of the kill.