Bug #43459 falcon_ddl fails due to missing record after recovery
Submitted: 6 Mar 16:52 Modified: 10 Mar 13:46
Reporter: Olav Sandstaa
Status: Analyzing
Category:Server: Falcon Severity:S3 (Non-critical)
Version:6.0.10-alpha OS:Any
Assigned to: Philip Stoev Target Version:6.0+
Tags: F_DDL, error handling
Triage: Needs Triage: D2 (Serious)

[6 Mar 16:52] Olav Sandstaa
Description:
This failure happened when running falcon_ddl:

After Falcon has done a "successfull" recovery, the test does a check for the content of
the database and fails with the following written to the log:

# 19:58:51 Verifying table: c; database: x
# 19:58:51 13: CommitNoUpdates transaction 647
# 19:58:51 13: CommitNoUpdates transaction 648
# 19:58:51 13: CommitNoUpdates transaction 649
# 19:58:51 13: CommitNoUpdates transaction 650
# 19:58:51 13: CommitNoUpdates transaction 651
DBD::mysql::st execute failed: Can't find record in 'c' at
lib/GenTest/Reporter/Recovery.pm line 196, <RDRFH> line 26.

How to repeat:
Might be reproducable by running falcon_ddl.

Suggested fix:
After recovery all tables should be there....
[7 Mar 7:04] Vladislav Vaintroub
philip, please explain the error message. What did fail?
[7 Mar 7:36] Philip Stoev
One of the SELECT queries that is run after recovery in order to validate the table failed
with a "can't find record" message, which is not acceptable.

My understanding is that this message is caused by Falcon reporting that a given row
exists , e.g. during an index scan, however refusing to retrieve the data from the record
at the second pass. This signifies that the table is corrupted.

I will try to have the RQG print out the exact SQL of the failing query.
[7 Mar 9:12] Philip Stoev
Binary:

http://loki08.norway.sun.com:8080/372537.mysql-6.0.11-alpha-solaris10-sparc.tar.gz

Tablespace:

http://loki08.norway.sun.com:8080/372630.mysql-6.0.11-alpha-solaris10-sparc-test.tar.gz ,
in master-data

To reproduce:

MTR_VERSION=1 perl mysql-test-run.pl --start-and-exit --mysqld=--datadir=master-data 1st

and then issue:

mysql> select * from `x`.`c`;
Empty set (0.00 sec)

mysql>  SELECT * FROM `x`.`c` WHERE `pk` >= -9;
ERROR 1032: Can't find record in 'c'
[9 Mar 18:32] Olav Sandstaa
Here is an update on what I have found by using the Falcon database from the failing
test:

The MySQL server claims that is has one index on the table named "c":

+-------+---------+------+-----+---------+----------------+
| Field | Type    | Null | Key | Default | Extra          |
+-------+---------+------+-----+---------+----------------+
| pk    | int(11) | NO   | PRI | NULL    | auto_increment | 
+-------+---------+------+-----+---------+----------------+
1 row in set (0.08 sec)

mysql> show index from c;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality |
Sub_part | Packed | Null | Index_type | Comment | Index_Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| c     |          0 | PRIMARY  |            1 | pk          | NULL      |        NULL | 
   NULL | NULL   |      | BTREE      |         |               | 
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
1 row in set (0.03 sec)

But by adding tracing to the Falcon code it seems like Falcon does not have any index on
this table after recovery.

Some details:

1. In the call to StorageInterface::index_init() we are not finding the requested index
but return 0 (OK).

2. In the call to StorageInterface::read_range_first() does a call to
StorageInterface::scanRange() which again calls StorageTable::checkCurrentIndex(). The
result of this due to the index not existing is that error code 120
(HA_ERR_KEY_NOT_FOUND). This results in the client get the error message:

ERROR 1032 (HY000): Can't find record in 'c'
[9 Mar 20:00] Olav Sandstaa
The failing statement will "work as expected" if we changes the return value in
StorageInterface::read_range_first() from HA_ERR_KEY_NOT_FOUND to HA_ERR_END_OF_FILE when
StorageInterface::scanRange() returns an error. The same conversion of return value is
done a few lines later in the function if index_next() returns an error.

Still, I do not consider this as a solution since we should not hide the fact that the
index is missing. 

A "real solution" should involve avoiding that the server and Falcon have different views
on what indexes a table has. Or at least we should detect this and either report or repair
the missing index after recovery.

Setting the status back to verified and un-assign myself until we have a more clearer
strategy for how this should be solved.
[9 Mar 21:39] Ann Harrison
It may "work as expected" if we return EOF, but it probably won't get
the right answer - unless we were very lucky and the statement was in
fact at the end of the result set.
[10 Mar 13:41] Philip Stoev
In order for work to continue on this bug, I need to modify the RQG so that it uses kill
-11 and not kill -9 when killing the server before recovery. This way a record of the
threads that existed at the time of the crash will be preserved (alternatively, gdb can
be used to stop the process, list the threads and then kill the server).

Once this is in place and the bug has been reproduced again, we will know better what
caused it, including the actual query that was in progress at the time of the kill.