MySQL Bugs: #42903: Falcon: Inconsistent tablespace after recovery

Bug #42903	Falcon: Inconsistent tablespace after recovery
Submitted:	16 Feb 2009 20:11	Modified:	26 May 2010 17:51
Reporter:	Philip Stoev	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Lars-Erik BjÃ¸rk	CPU Architecture:	Any
Tags:	F_INDEX

Description:
After executing a RQG kill -9 test like this one:

$ perl runall.pl \
  --mem \
  --rows=10 \
  --threads=64 \
  --mysqld=--falcon-index-chill-threshold=1 \
  --basedir=/build/bzr/6.0-falcon-team \
  --mask=949 \
  --queries=10000000 \
  --duration=900 \
  --engine=Falcon \
  --grammar=conf/transactions-flat.yy \
  --gendata=conf/transactions.zz \
  --reporters=Deadlock,ErrorLog,Backtrace,Recovery \
  --mysqld=--loose-falcon-lock-wait-timeout=1 \
  --mysqld=--loose-innodb-lock-wait-timeout=1 \
  --mysqld=--log-output=none \
  --mysqld=--skip-safemalloc 

The Falcon table remains inconsistent in a sense that a full table scan will return more rows as compared to a range query with an index.

This issue has been observed across various workloads, thread counts and Falcon options. Usually the difference between the queries is 1 row.

How to repeat:
A corrupted tablespace will be uploaded shortly.

The tablespace is available at

http://fedora10.dyndns.org/bug42903-2.zip

mysql> SELECT COUNT(*) FROM `test`.`table10_falcon_int_autoinc`;
+----------+
| COUNT(*) |
+----------+
|       28 |
+----------+
1 row in set (0.20 sec)

mysql> SELECT COUNT(*) FROM `test`.`table10_falcon_int_autoinc` WHERE `pk` >= 0;
+----------+
| COUNT(*) |
+----------+
|       23 |
+----------+
1 row in set (0.14 sec)

Multiple tablespaces that are corrupted in the same manner are available. This one was picked because it is small and the discrepancy between the two queries is greater than the usual 1 row.

I did two dozen test runs with graceful shutdown + startup and the inconsistency issue appeared only once, when full-blown recovery was triggered for some reason.

If recovery is only a token operation, then the inconsistency does not show up.

Therefore, this appears to be a geniune recovery problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68011

3043 Vladislav Vaintroub	2009-03-02
      Bug#42344  Falcon recovery Exception: read error on page 
      Bug#42903  Falcon: Inconsistent tablespace after recovery
      
      The exception is caused during recovery of an index page, when 
      trying to  insert a first record to the parent page of the current 
      page. Parent page was not existing - this caused reads behind the 
      EOF of falcon_user.fts 
      
      The problem is that parent page itself was not logged.
      Another (worse) problem that can result into corrupted index is that 
      that parent pointer is not reliable and often can be wrong for 
      non-leaf pages. When non-leaf page is split, half or its nodes are 
      transferred to the newly created page. For  consistency, parentPage
      of corresponding child pages had to be changed and change had to be 
      logged and this was not done.
      
      The fix is to remove all occurenced parentPage pointer from the 
      Falcon code. Having a link to parent page was an idea that is not 
      implemented consistently. Also, this patch removed all occurenced 
      to priorPage from the Falcon code.This was another inconsistently 
      implemented idea.
      
      IndexPage::priorPage and IndexPage::parentPage are renamed to  
      int32 IndexPage::unused[2], for future use and to prevent the need
      to upgrade for existing databases.
      
      Recovery is changed to perform careful writes in the right order 
      during the index split (newly created page, then page that overflowed, 
      then parent page or the newly creted page).
      
      Doing recovery like this could result in a slightly detuned index, if 
      crash or kill occures before parent page is logged. Detuned means
      there is a page without a parent. Index traversal will still find it
      when traversing the page level via IndexPage::next links. 
      
      "Detuned" problem can be fixed  by logging  all pages involved in index 
      split at once. This is something to consider in the future.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68025

3043 Vladislav Vaintroub	2009-03-02
      Bug#42344  Falcon recovery Exception: read error on page 
      Bug#42903  Falcon: Inconsistent tablespace after recovery
      
      The exception is caused during recovery of an index page, when 
      trying to  insert a first record to the parent page of the current 
      page. Parent page was not existing - this caused reads behind the 
      EOF of falcon_user.fts 
      
      The problem is that parent page itself was not logged.
      Another (worse) problem that can result into corrupted index is that 
      that parent pointer is not reliable and often can be wrong for 
      non-leaf pages. When non-leaf page is split, half or its nodes are 
      transferred to the newly created page. For  consistency, parentPage
      of corresponding child pages had to be changed and change had to be 
      logged and this was not done.
      
      The fix is to remove all occurenced parentPage pointer from the 
      Falcon code. Having a link to parent page was an idea that is not 
      implemented consistently. Also, this patch removed all occurenced 
      to priorPage from the Falcon code.This was another inconsistently 
      implemented idea.
      
      IndexPage::priorPage and IndexPage::parentPage are renamed to  
      int32 IndexPage::unused[2], for future use and to prevent the need
      to upgrade for existing databases.
      
      Recovery is changed to perform careful writes in the right order 
      during the index split (newly created page, then page that overflowed, 
      then parent page or the newly creted page).
      
      Doing recovery like this could result in a slightly detuned index, if 
      crash or kill occures before parent page is logged. Detuned means
      there is a page without a parent. Index traversal will still find it
      when traversing the page level via IndexPage::next links. 
      
      "Detuned" problem can be fixed  by logging  all pages involved in index 
      split at once. This is something to consider in the future.

With the changes to the Recovery reporting in the RQG, even a very simple test run is sufficient to demonstrate the problem:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team \
  --engine=Falcon \
  --gendata=conf/combinations.zz \
  --grammar=conf/combinations.yy \
  --duration=40 \
  --queries=10000000 \
  --reporter=Recovery

I'm somewhat skeptical about the last comment.
I run that test 500 times without an error.

Vlad, now that LIMIT queries have been eliminated from recovery, an inconsistent tablespace is detected much less frequently. We have two options regarding this bug:

A. Take a corrupted tablespace from the RQG falcon_online_alter test, which frequently produces those in PB2. The advantage is that the tablespace is inconsistent across many indexes, the disadvantage is that the test uses online alter.

B. Let me find you and give you an inconsistent tablespace that is produced without ONLINE ALTER.

Please let me know.

May I opt for 
b) - let you find a tablespace that is not produced with ALTER

Thanks

A tablespace produced without any ALTER is available at

http://fedora10.dyndns.org/var-bug42903-2.zip

note that the inconsistency in this particular tablespace appears transient, e.g. if the queries are executed the second time, the table appears consistent.

This particular sequence of commands demonstrates the issue:

rm -rf master-data-copy

cp -r master-data master-data-copy

./mysqld --datadir=master-data-copy --language=/build/bzr/6.0-falcon-team/sql/share/english --falcon-debug-mask=65535 --socket=/tmp/mysql.sock > mysqld.log 2>&1 &

sleep 10

mysql -uroot --protocol=tcp --port=3306 -e "SELECT LPAD(pk, 8, ' ') FROM test.table10_falcon_int_autoinc" > query1.out

mysql -uroot --protocol=tcp --port=3306 -e "SELECT LPAD(pk, 8, ' ') FROM test.table10_falcon_int_autoinc FORCE INDEX (int_key) WHERE int_key >= -9223372036854775808 OR int_key IS NULL" > query2.out

diff -u query1.out query2.out

And here is a tablespace that is permanently inconsistent:

http://fedora10.dyndns.org/var-bug42903-B2.zip

mysql> select * from table10_falcon_int_autoinc where pk = 30823;
Empty set (0.01 sec)

mysql> select * from table10_falcon_int_autoinc IGNORE KEY (primary) where pk = 30823;
+---------+-------+------+
| int_key | pk    | int  |
+---------+-------+------+
|     100 | 30823 |  100 |
+---------+-------+------+
1 row in set (0.01 sec)

mysql> select * from table10_falcon_int_autoinc WHERE int_key = 100;
Empty set (0.00 sec)

mysql> select * from table10_falcon_int_autoinc IGNORE KEY (int_key) WHERE int_key = 100;
+---------+-------+------+
| int_key | pk    | int  |
+---------+-------+------+
|     100 | 30823 |  100 |
+---------+-------+------+
1 row in set (0.00 sec)

Note: the latest tablespace was produced using --falcon-use-deferred-index-hash=1

Pushed into 6.0.11-alpha (revid:hky@sun.com-20090402144811-yc5kp8g0rjnhz7vy) (version source revid:christopher.powers@sun.com-20090304040340-b4zoglfws0iswqm1) (merge vers: 6.0.11-alpha) (pib:6)

A permanently inconsistent tablespace was just observed after the server has been left idle for 10 minutes, then shut down cleanly and then restarted.

To clarify you last comment, no recovery happened? What does error log look like (e.g number processed serial log entries reported)?

Ann wrote; The problem appears to be in these lines of code:

    Transaction *sysTransaction = database->getSystemTransaction();
    table->dropIndex(name, sysTransaction);
-    database->commitSystemTransaction();

    Index::deleteIndex (database, schema, name);
    database->commitSystemTransaction(); 

 The first commit is what keeps the serial log from rotating after a delete index. The SRLDeleteIndex record goes into the log with a transaction id, but no commit SRL entry is ever included for that operation because its done by a system transaction that is committed without having been marked as having done updates.  Without the first commit, everything seems to work in terms of
rotating the log.

Sorry, the previous comment was for 39130.  This is still in verified state.

A more stable location for the tablespaces:

http://mysql-systemqa.s3.amazonaws.com/var-bug42903-B2.zip

Olav reports that pushbuild2 shows numerous failures in each push that are similar to this bug, in which SELECT statements return different results whether they are using an index or not.  It is usually seen after recovery, but that is the time when RQG does the comparison, not during the run.  So in a team discussion, we think this is related to index issues that Lars-Erik is working on.

This bug should be monitored in pushbuild 2 after the other repeatable index issues are fixed.