Bug #42903 Falcon: Inconsistent tablespace after recovery
Submitted: 16 Feb 21:11 Modified: 13 Apr 20:05
Reporter: Philip Stoev
Status: Verified
Category:Server: Falcon Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Lars-Erik Bjørk Target Version:6.0-beta
Tags: F_INDEX
Triage: Triaged: D2 (Serious)

[16 Feb 21:11] Philip Stoev
Description:
After executing a RQG kill -9 test like this one:

$ perl runall.pl \
  --mem \
  --rows=10 \
  --threads=64 \
  --mysqld=--falcon-index-chill-threshold=1 \
  --basedir=/build/bzr/6.0-falcon-team \
  --mask=949 \
  --queries=10000000 \
  --duration=900 \
  --engine=Falcon \
  --grammar=conf/transactions-flat.yy \
  --gendata=conf/transactions.zz \
  --reporters=Deadlock,ErrorLog,Backtrace,Recovery \
  --mysqld=--loose-falcon-lock-wait-timeout=1 \
  --mysqld=--loose-innodb-lock-wait-timeout=1 \
  --mysqld=--log-output=none \
  --mysqld=--skip-safemalloc 

The Falcon table remains inconsistent in a sense that a full table scan will return more
rows as compared to a range query with an index.

This issue has been observed across various workloads, thread counts and Falcon options.
Usually the difference between the queries is 1 row.

How to repeat:
A corrupted tablespace will be uploaded shortly.
[16 Feb 21:15] Philip Stoev
The tablespace is available at

bug42903-2.zip">http://fedora10.dyndns.org/bug42903-2.zip

mysql> SELECT COUNT(*) FROM `test`.`table10_falcon_int_autoinc`;
+----------+
| COUNT(*) |
+----------+
|       28 |
+----------+
1 row in set (0.20 sec)

mysql> SELECT COUNT(*) FROM `test`.`table10_falcon_int_autoinc` WHERE `pk` >= 0;
+----------+
| COUNT(*) |
+----------+
|       23 |
+----------+
1 row in set (0.14 sec)

Multiple tablespaces that are corrupted in the same manner are available. This one was
picked because it is small and the discrepancy between the two queries is greater than
the usual 1 row.
[2 Mar 15:23] Philip Stoev
I did two dozen test runs with graceful shutdown + startup and the inconsistency issue
appeared only once, when full-blown recovery was triggered for some reason.

If recovery is only a token operation, then the inconsistency does not show up.

Therefore, this appears to be a geniune recovery problem.
[2 Mar 17:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68011

3043 Vladislav Vaintroub	2009-03-02
      Bug#42344  Falcon recovery Exception: read error on page 
      Bug#42903  Falcon: Inconsistent tablespace after recovery
      
      The exception is caused during recovery of an index page, when 
      trying to  insert a first record to the parent page of the current 
      page. Parent page was not existing - this caused reads behind the 
      EOF of falcon_user.fts 
      
      The problem is that parent page itself was not logged.
      Another (worse) problem that can result into corrupted index is that 
      that parent pointer is not reliable and often can be wrong for 
      non-leaf pages. When non-leaf page is split, half or its nodes are 
      transferred to the newly created page. For  consistency, parentPage
      of corresponding child pages had to be changed and change had to be 
      logged and this was not done.
      
      The fix is to remove all occurenced parentPage pointer from the 
      Falcon code. Having a link to parent page was an idea that is not 
      implemented consistently. Also, this patch removed all occurenced 
      to priorPage from the Falcon code.This was another inconsistently 
      implemented idea.
      
      IndexPage::priorPage and IndexPage::parentPage are renamed to  
      int32 IndexPage::unused[2], for future use and to prevent the need
      to upgrade for existing databases.
      
      Recovery is changed to perform careful writes in the right order 
      during the index split (newly created page, then page that overflowed, 
      then parent page or the newly creted page).
      
      Doing recovery like this could result in a slightly detuned index, if 
      crash or kill occures before parent page is logged. Detuned means
      there is a page without a parent. Index traversal will still find it
      when traversing the page level via IndexPage::next links. 
      
      "Detuned" problem can be fixed  by logging  all pages involved in index 
      split at once. This is something to consider in the future.
[2 Mar 19:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68025

3043 Vladislav Vaintroub	2009-03-02
      Bug#42344  Falcon recovery Exception: read error on page 
      Bug#42903  Falcon: Inconsistent tablespace after recovery
      
      The exception is caused during recovery of an index page, when 
      trying to  insert a first record to the parent page of the current 
      page. Parent page was not existing - this caused reads behind the 
      EOF of falcon_user.fts 
      
      The problem is that parent page itself was not logged.
      Another (worse) problem that can result into corrupted index is that 
      that parent pointer is not reliable and often can be wrong for 
      non-leaf pages. When non-leaf page is split, half or its nodes are 
      transferred to the newly created page. For  consistency, parentPage
      of corresponding child pages had to be changed and change had to be 
      logged and this was not done.
      
      The fix is to remove all occurenced parentPage pointer from the 
      Falcon code. Having a link to parent page was an idea that is not 
      implemented consistently. Also, this patch removed all occurenced 
      to priorPage from the Falcon code.This was another inconsistently 
      implemented idea.
      
      IndexPage::priorPage and IndexPage::parentPage are renamed to  
      int32 IndexPage::unused[2], for future use and to prevent the need
      to upgrade for existing databases.
      
      Recovery is changed to perform careful writes in the right order 
      during the index split (newly created page, then page that overflowed, 
      then parent page or the newly creted page).
      
      Doing recovery like this could result in a slightly detuned index, if 
      crash or kill occures before parent page is logged. Detuned means
      there is a page without a parent. Index traversal will still find it
      when traversing the page level via IndexPage::next links. 
      
      "Detuned" problem can be fixed  by logging  all pages involved in index 
      split at once. This is something to consider in the future.
[24 Mar 20:02] Philip Stoev
With the changes to the Recovery reporting in the RQG, even a very simple test run is
sufficient to demonstrate the problem:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team \
  --engine=Falcon \
  --gendata=conf/combinations.zz \
  --grammar=conf/combinations.yy \
  --duration=40 \
  --queries=10000000 \
  --reporter=Recovery
[1 Apr 23:30] Vladislav Vaintroub
I'm somewhat skeptical about the last comment.
I run that test 500 times without an error.
[1 Apr 23:37] Philip Stoev
Vlad, now that LIMIT queries have been eliminated from recovery, an inconsistent
tablespace is detected much less frequently. We have two options regarding this bug:

A. Take a corrupted tablespace from the RQG falcon_online_alter test, which frequently
produces those in PB2. The advantage is that the tablespace is inconsistent across many
indexes, the disadvantage is that the test uses online alter.

B. Let me find you and give you an inconsistent tablespace that is produced without
ONLINE ALTER.

Please let me know.
[1 Apr 23:57] Vladislav Vaintroub
May I opt for 
b) - let you find a tablespace that is not produced with ALTER

Thanks
[2 Apr 9:25] Philip Stoev
A tablespace produced without any ALTER is available at

bug42903-2.zip">http://fedora10.dyndns.org/var-bug42903-2.zip

note that the inconsistency in this particular tablespace appears transient, e.g. if the
queries are executed the second time, the table appears consistent.

This particular sequence of commands demonstrates the issue:

rm -rf master-data-copy

cp -r master-data master-data-copy

./mysqld --datadir=master-data-copy
--language=/build/bzr/6.0-falcon-team/sql/share/english --falcon-debug-mask=65535
--socket=/tmp/mysql.sock > mysqld.log 2>&1 &

sleep 10

mysql -uroot --protocol=tcp --port=3306 -e "SELECT LPAD(pk, 8, ' ') FROM
test.table10_falcon_int_autoinc" > query1.out

mysql -uroot --protocol=tcp --port=3306 -e "SELECT LPAD(pk, 8, ' ') FROM
test.table10_falcon_int_autoinc FORCE INDEX (int_key) WHERE int_key >=
-9223372036854775808 OR int_key IS NULL" > query2.out

diff -u query1.out query2.out
[2 Apr 11:40] Philip Stoev
And here is a tablespace that is permanently inconsistent:

bug42903-B2.zip">http://fedora10.dyndns.org/var-bug42903-B2.zip

mysql> select * from table10_falcon_int_autoinc where pk = 30823;
Empty set (0.01 sec)

mysql> select * from table10_falcon_int_autoinc IGNORE KEY (primary) where pk = 30823;
+---------+-------+------+
| int_key | pk    | int  |
+---------+-------+------+
|     100 | 30823 |  100 |
+---------+-------+------+
1 row in set (0.01 sec)

mysql> select * from table10_falcon_int_autoinc WHERE int_key = 100;
Empty set (0.00 sec)

mysql> select * from table10_falcon_int_autoinc IGNORE KEY (int_key) WHERE int_key =
100;
+---------+-------+------+
| int_key | pk    | int  |
+---------+-------+------+
|     100 | 30823 |  100 |
+---------+-------+------+
1 row in set (0.00 sec)
[2 Apr 11:46] Philip Stoev
Note: the latest tablespace was produced using --falcon-use-deferred-index-hash=1
[2 Apr 19:39] Bugs System
Pushed into 6.0.11-alpha (revid:hky@sun.com-20090402144811-yc5kp8g0rjnhz7vy) (version
source revid:christopher.powers@sun.com-20090304040340-b4zoglfws0iswqm1) (merge vers:
6.0.11-alpha) (pib:6)
[6 Apr 14:55] Philip Stoev
A permanently inconsistent tablespace was just observed after the server has been left
idle for 10 minutes, then shut down cleanly and then restarted.
[6 Apr 15:31] Vladislav Vaintroub
To clarify you last comment, no recovery happened? What does error log look like (e.g
number processed serial log entries reported)?
[13 Apr 20:00] Kevin Lewis
Ann wrote; The problem appears to be in these lines of code:

    Transaction *sysTransaction = database->getSystemTransaction();
    table->dropIndex(name, sysTransaction);
-    database->commitSystemTransaction();

    Index::deleteIndex (database, schema, name);
    database->commitSystemTransaction(); 

 The first commit is what keeps the serial log from rotating after a delete index. The
SRLDeleteIndex record goes into the log with a transaction id, but no commit SRL entry is
ever included for that operation because its done by a system transaction that is
committed without having been marked as having done updates.  Without the first commit,
everything seems to work in terms of
rotating the log.
[13 Apr 20:05] Kevin Lewis
Sorry, the previous comment was for 39130.  This is still in verified state.
[29 Apr 1:07] Philip Stoev
A more stable location for the tablespaces:

bug42903-B2.zip">http://mysql-systemqa.s3.amazonaws.com/var-bug42903-B2.zip
[24 Jun 20:56] Kevin Lewis
Olav reports that pushbuild2 shows numerous failures in each push that are similar to this
bug, in which SELECT statements return different results whether they are using an index
or not.  It is usually seen after recovery, but that is the time when RQG does the
comparison, not during the run.  So in a team discussion, we think this is related to
index issues that Lars-Erik is working on.

This bug should be monitored in pushbuild 2 after the other repeatable index issues are
fixed.