Bug #40112 Two falcon servers unable to agree on the result from a simple SELECT
Submitted: 17 Oct 2008 17:26 Modified: 13 Dec 2008 11:03
Reporter: Philip Stoev
Status: Closed
Category:Server: Falcon Severity:S1 (Critical)
Version:6.0-falcon-team OS:Any
Assigned to: Kevin Lewis Target Version:6.0.9
Tags: F_INDEX
Triage: Triaged: D2 (Serious)

[17 Oct 2008 17:26] Philip Stoev
Description:
Two identical Falcon servers, when issued the same single-thread workload containing
SELECT/INSERT/UPDATE/DELETE in transactions, show disagreement in the result of a SELECT
query:

# 18:23:10 Query: SELECT `int_key` FROM B WHERE `int_nokey` > 125 failed: result content
mismatch between servers.
# 18:23:10 Query: SELECT `int_nokey` FROM B WHERE `int_key` > 108 failed: result length
mismatch between servers.

When given the same workload, Innodb and MyISAM show consistency between the two servers.

If all statements are issued as autocommit, no differences are reported with Falcon.

How to repeat:
A test case will be provided shortly.
[17 Oct 2008 17:29] Philip Stoev
Grammar file for bug 40112

Attachment: bug40112.yy (application/octet-stream, text), 1.18 KiB.

[17 Oct 2008 17:38] Philip Stoev
To reproduce using the Random Query Generator

$ perl runall.pl \
  --basedir=/path/to/6.0-falcon-team \
  --vardir1=/tmp/vardir1 \
  --vardir2=/tmp/vardir2 \
  --engine=Falcon \
  --grammar=bug40112.yy \
  --threads=1 \
  --queries=10000 \
  --validators=ResultsetComparator \
  --reporters=

This will start two identical servers and proceed to issue the same queries against both.
Any situations where the two servers return different result sets will be reported.
[22 Oct 2008 19:02] Hakan Kuecuekyilmaz
Philip,

are both connections are running the exact same sequence? Do you use the same seed for
random()?
[22 Oct 2008 19:17] Philip Stoev
Yes, both servers are running exactly the same queries. The stream of queries is generated
only once and then executed on both servers.
[2 Nov 2008 23:09] Kevin Lewis
I saved the SQL statements made by the Random Query Generator when it reproduced the
problem.  After running these SQL statements many times I found that the table can get
into a condition in which the index did not have all the records in the table accounted
for.  But later, everything would get corrected.

The following log contains debug messages added to diagnose this bug.  They show that
transaction 97 is committed, and each of its 6 DeferredIndexes are written to the serial
log.  But during writeComplete, when the gopher thread reads the log and puts them into
key pages, only 5 of the 6 are handled.

51: Commit Transaction 97
SRLUpdateIndex::append Transaction 97 Index 5
SRLUpdateIndex::append Transaction 97 Index 4
SRLUpdateIndex::append Transaction 97 Index 3
SRLUpdateIndex::append Transaction 97 Index 2
SRLUpdateIndex::append Transaction 97 Index 1
SRLUpdateIndex::append Transaction 97 Index 0
SRLUpdateIndex::execute Transaction 97 Index 5
SRLUpdateIndex::execute Transaction 97 Index 4
SRLUpdateIndex::execute Transaction 97 Index 3
SRLUpdateIndex::execute Transaction 97 Index 2
SRLUpdateIndex::execute Transaction 97 Index 0
51: WriteComplete Transaction 97

It turns out that SRLUpdateIndex::append did not actually put anything into the log for
Index 1 of this table.  The reason is that DeferredIndexWalker::next() returns NULL for
the first node and SRLUpdateIndex::append thinks the DeferredIndex is empty. 
DeferredIndexWalker::next() has a special mode for the first call.  But if there is
nothing in the first leaf page, this special mode returns NULL.  

DeferredIndexes are non-destructive during their life.  They do not loose pages, but the
do loose key values.  At the end of each verb within a longer transaction
Transaction::releaseSavepoint() calls an immediate record->scavenge() for record versions
which have been changed more than once in the same transaction.  These scavenged records
were never visible to any other transaction and are no longer needed.  Nether are the
index entries.  So they are dropped from the Deferredindexes.  If a leaf bucket becomes
empty, it is not deleted since the whole deferredindex has a short lifespan.  

DeferredIndexWalker::next() was not prepared for the first leaf bucket to be empty.  In
this case, on the first call, it would return NULL.  So then nothing would be found in the
deferredIndex.  

This happened a lot of times during a typical GENTEST run.  It was not always incorrect
because completed index pages contain the same record number with lots of different key
values.  The scavenger cannot clean them up fast enough.  A smaller transaction that does
not have an empty first leaf bucket can update all of the records previously inserted in a
transaction that had this problem.  So over time, GENTEST was able to correct for missing
index entries by updating the records again, based on a non-indexed WHERE clause.

Nevertheless, this is a serious and far-reaching bug. It affects indexes only by leaving
data out of them.  Since Falcon always compares against the record before returning,
Falcon would not return the wrong record, but it might not return all the matching
records.
[3 Nov 2008 1:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/57650

2897 Kevin Lewis	2008-11-02
      Bug#40112 - DeferredIndexWalker::next() was not prepared for the first 
      leaf bucket to be empty.  But it can become empty after several verbs 
      in the same transaction change the same records.  
      
      When the same record is changed more than once in the same transaction, 
      the older unneeded versions are deleted along with the index entries.  
      Even though DeferredIndex key nodes can be deleted, the buckets are not.  
      
      If the first leaf bucket is empty, DeferredIndexWalker::next() would 
      return NULL on the first call.  In SRLUpdateIndex::append(), this would
      result that nothing would be put into the Serial log at commit time for 
      this DeferredIndex.  Then this DefferedIndex would not get put into 
      index pages.
      
      The solution is to only return NULL on the first call to 
      DeferredIndexWalker::next() if the first leaf bucket is empty AND there 
      is no parent bucket.  Otherwise, use the normal mechanism to advance to 
      the next slot.
[20 Nov 2008 13:56] Bugs System
Pushed into 6.0.8-alpha  (revid:klewis@mysql.com-20081103003304-csz2vz6s191gqm32) (version
source revid:klewis@mysql.com-20081103003304-csz2vz6s191gqm32) (pib:5)
[13 Dec 2008 11:03] MC Brown
A note has been added to the 6.0.8 changelog: 

The indexes and record contents of a FALCON table could get out of synchronization during
a lrge number of updates. Because FALCON returns data only if it matches both the index
and record data the result sets returned could be invalid when comparing the results of an
index and non-index based SELECT.