Bug #40112 Two falcon servers unable to agree on the result from a simple SELECT
Submitted: 17 Oct 2008 15:26 Modified: 13 Dec 2008 10:03
Reporter: Philip Stoev
Status: Closed
Category:MySQL Server: Falcon storage engine Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Kevin Lewis CPU Architecture:Any
Tags: F_INDEX

[17 Oct 2008 15:26] Philip Stoev
Description:
Two identical Falcon servers, when issued the same single-thread workload containing SELECT/INSERT/UPDATE/DELETE in transactions, show disagreement in the result of a SELECT query:

# 18:23:10 Query: SELECT `int_key` FROM B WHERE `int_nokey` > 125 failed: result content mismatch between servers.
# 18:23:10 Query: SELECT `int_nokey` FROM B WHERE `int_key` > 108 failed: result length mismatch between servers.

When given the same workload, InnoDB and MyISAM return consistent results on both servers.

If all statements are issued in autocommit mode, no differences are reported with Falcon.

How to repeat:
A test case will be provided shortly.
[17 Oct 2008 15:29] Philip Stoev
Grammar file for bug 40112

Attachment: bug40112.yy (application/octet-stream, text), 1.18 KiB.

[17 Oct 2008 15:38] Philip Stoev
To reproduce using the Random Query Generator

$ perl runall.pl \
  --basedir=/path/to/6.0-falcon-team \
  --vardir1=/tmp/vardir1 \
  --vardir2=/tmp/vardir2 \
  --engine=Falcon \
  --grammar=bug40112.yy \
  --threads=1 \
  --queries=10000 \
  --validators=ResultsetComparator \
  --reporters=

This will start two identical servers and proceed to issue the same queries against both. Any situations where the two servers return different result sets will be reported.
[22 Oct 2008 17:02] Hakan Küçükyılmaz
Philip,

Are both connections running the exact same sequence of queries? Do you use the same seed for random()?
[22 Oct 2008 17:17] Philip Stoev
Yes, both servers are running exactly the same queries. The stream of queries is generated only once and then executed on both servers.
[2 Nov 2008 22:09] Kevin Lewis
I saved the SQL statements made by the Random Query Generator when it reproduced the problem.  After running these SQL statements many times, I found that the table can get into a condition in which the index does not account for all the records in the table.  But later, everything would get corrected.

The following log contains debug messages added to diagnose this bug.  They show that transaction 97 is committed, and each of its 6 DeferredIndexes are written to the serial log.  But during writeComplete, when the gopher thread reads the log and puts them into key pages, only 5 of the 6 are handled.

51: Commit Transaction 97
SRLUpdateIndex::append Transaction 97 Index 5
SRLUpdateIndex::append Transaction 97 Index 4
SRLUpdateIndex::append Transaction 97 Index 3
SRLUpdateIndex::append Transaction 97 Index 2
SRLUpdateIndex::append Transaction 97 Index 1
SRLUpdateIndex::append Transaction 97 Index 0
SRLUpdateIndex::execute Transaction 97 Index 5
SRLUpdateIndex::execute Transaction 97 Index 4
SRLUpdateIndex::execute Transaction 97 Index 3
SRLUpdateIndex::execute Transaction 97 Index 2
SRLUpdateIndex::execute Transaction 97 Index 0
51: WriteComplete Transaction 97

It turns out that SRLUpdateIndex::append did not actually put anything into the log for Index 1 of this table.  The reason is that DeferredIndexWalker::next() returns NULL for the first node and SRLUpdateIndex::append thinks the DeferredIndex is empty.  DeferredIndexWalker::next() has a special mode for the first call.  But if there is nothing in the first leaf page, this special mode returns NULL.  

DeferredIndexes are non-destructive during their life.  They do not lose pages, but they do lose key values.  At the end of each verb within a longer transaction, Transaction::releaseSavepoint() calls an immediate record->scavenge() for record versions that have been changed more than once in the same transaction.  These scavenged records were never visible to any other transaction and are no longer needed.  Neither are their index entries, so they are dropped from the DeferredIndexes.  If a leaf bucket becomes empty, it is not deleted, since the whole DeferredIndex has a short lifespan.

DeferredIndexWalker::next() was not prepared for the first leaf bucket to be empty.  In this case, on the first call, it would return NULL, and as a result nothing would be found in the DeferredIndex.
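The following is a minimal, hypothetical C++ model of this failure mode. The Bucket and Walker types and their fields are assumptions for illustration only, not the actual Falcon DeferredIndex code; the point is only to show how a first-call shortcut that inspects just the first leaf bucket can report the whole index as empty.

// Minimal, hypothetical model of the failure mode described above.
// Bucket/Walker and their fields are illustrative assumptions, not the
// actual Falcon DeferredIndex code.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Bucket {
    std::vector<int> keys;     // key slots left after per-verb scavenging
    Bucket *next = nullptr;    // sibling leaf bucket
};

struct Walker {
    Bucket *leaf;              // current leaf bucket
    size_t slot = 0;
    bool firstCall = true;

    explicit Walker(Bucket *first) : leaf(first) {}

    // Buggy first-call shortcut: it only inspects the first leaf bucket.
    // If scavenging emptied that bucket, NULL is returned and the caller
    // (think SRLUpdateIndex::append) treats the whole index as empty.
    int *next() {
        if (firstCall) {
            firstCall = false;
            if (leaf->keys.empty())
                return nullptr;            // <-- the bug
            return &leaf->keys[slot++];
        }
        while (leaf) {                     // normal advance
            if (slot < leaf->keys.size())
                return &leaf->keys[slot++];
            leaf = leaf->next;
            slot = 0;
        }
        return nullptr;
    }
};

int main() {
    Bucket second;  second.keys = {42, 43};    // still holds index entries
    Bucket first;   first.next = &second;      // emptied by the scavenger

    Walker walker(&first);
    if (int *key = walker.next())
        std::printf("first key: %d\n", *key);
    else
        std::printf("index appears empty\n");  // printed despite keys 42, 43
    return 0;
}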

This happened many times during a typical GENTEST run.  It did not always produce incorrect results, because completed index pages contain the same record number with many different key values, and the scavenger cannot clean them up fast enough.  A later, smaller transaction that does not have an empty first leaf bucket can update all of the records previously inserted by a transaction that hit this problem.  So over time, GENTEST was able to correct for missing index entries by updating the records again, based on a non-indexed WHERE clause.

Nevertheless, this is a serious and far-reaching bug. It affects indexes only by leaving data out of them.  Since Falcon always compares against the record before returning, Falcon would not return the wrong record, but it might not return all the matching records.
[3 Nov 2008 0:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/57650

2897 Kevin Lewis	2008-11-02
      Bug#40112 - DeferredIndexWalker::next() was not prepared for the first 
      leaf bucket to be empty.  But it can become empty after several verbs 
      in the same transaction change the same records.  
      
      When the same record is changed more than once in the same transaction, 
      the older unneeded versions are deleted along with the index entries.  
      Even though DeferredIndex key nodes can be deleted, the buckets are not.  
      
      If the first leaf bucket is empty, DeferredIndexWalker::next() would 
      return NULL on the first call.  In SRLUpdateIndex::append(), this meant 
      that nothing would be put into the serial log at commit time for this 
      DeferredIndex.  Then this DeferredIndex would not get put into 
      index pages.
      
      The solution is to only return NULL on the first call to 
      DeferredIndexWalker::next() if the first leaf bucket is empty AND there 
      is no parent bucket.  Otherwise, use the normal mechanism to advance to 
      the next slot.
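Below is a minimal, hypothetical sketch of the corrected first-call behaviour, using the same illustrative Bucket/Walker model as above (assumed names, not the actual Falcon DeferredIndexWalker implementation): an empty first leaf only means the index is empty when there is no parent bucket; otherwise the walker falls through to the normal advance.

// Hypothetical sketch of the fixed first-call logic; Bucket/Walker are
// illustrative assumptions, not the actual Falcon source.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Bucket {
    std::vector<int> keys;     // key slots left after per-verb scavenging
    Bucket *parent = nullptr;  // non-null when the index has more levels
    Bucket *next = nullptr;    // sibling leaf bucket
};

struct Walker {
    Bucket *leaf;
    size_t slot = 0;
    bool firstCall = true;

    explicit Walker(Bucket *first) : leaf(first) {}

    int *next() {
        if (firstCall) {
            firstCall = false;
            // Fixed: an empty first leaf only means "index is empty" when
            // there is no parent bucket; otherwise fall through and advance
            // to the next slot with the normal mechanism.
            if (leaf->keys.empty() && leaf->parent == nullptr)
                return nullptr;
        } else if (slot < leaf->keys.size()) {
            return &leaf->keys[slot++];
        }
        while (leaf) {                     // normal advance
            if (slot < leaf->keys.size())
                return &leaf->keys[slot++];
            leaf = leaf->next;
            slot = 0;
        }
        return nullptr;
    }
};

int main() {
    Bucket parent;
    Bucket second;  second.keys = {42, 43};
    Bucket first;   first.parent = &parent;  first.next = &second;

    Walker walker(&first);
    // Now finds keys 42 and 43 despite the empty first leaf bucket.
    for (int *key = walker.next(); key; key = walker.next())
        std::printf("key: %d\n", *key);
    return 0;
}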
[20 Nov 2008 12:56] Bugs System
Pushed into 6.0.8-alpha  (revid:klewis@mysql.com-20081103003304-csz2vz6s191gqm32) (version source revid:klewis@mysql.com-20081103003304-csz2vz6s191gqm32) (pib:5)
[13 Dec 2008 10:03] MC Brown
A note has been added to the 6.0.8 changelog: 

The indexes and record contents of a Falcon table could get out of synchronization during a large number of updates. Because Falcon returns data only if it matches both the index and the record data, the result sets returned could be invalid when comparing the results of an index-based and a non-index-based SELECT.