MySQL Bugs: #41742: Falcon REPEATABLE READ does not return repeatable results

Bug #41742	Falcon REPEATABLE READ does not return repeatable results
Submitted:	25 Dec 2008 9:39	Modified:	15 May 2009 13:39
Reporter:	Philip Stoev	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S3 (Non-critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any
Tags:	F_ISOLATION

Description:
When executing a transactional workload in REPEATABLE READ mode (--falcon-co nsistent-read=1) in Falcon, equivalent SELECT statements within the same transaction will return different data based on the access method that is used.

For example, the query (full table scan):
 
SELECT `pk` FROM `table10_falcon_int_autoinc` WHERE 1 = 1

and the query (index scan):

SELECT `pk` FROM `table10_falcon_int_autoinc` WHERE 1 = 1 AND `pk` > -16777216

will return different results even though the queries are equvalent for a table with positive PK values.

If the query is not mutated (e.g. same access plan is used twice), the same result is returned.

This bug is different from bug 41478 because bug 41478 uses exactly the same query over and over again to produce different results. This bug requires that different access plans are triggered to produce the inconsistent results.

How to repeat:
Freshly branch the mysql-test-extra-6.0 tree and then run:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team/ \
  --engine=Falcon \
  --mysqld=--loose-falcon-consistent-read=1 \
  --gendata=conf/transactions.zz \
  --grammar=conf/repeatable_read.yy \
  --validator=RepeatableRead \
  --queries=50000 \
  --threads=4

The RepeatableRead validator will take each SELECT query from the workload and add additional predicates in order to change the execution plan. The results from the original and the modified queries will be compared and any differences will cause the test to fail.

Please use --threads=3 if the bug does not manifest itself. About a minute of runtime would be sufficient.

If the test reports that a query involving LIMIT is at fault, you may wish to edit lib/Validator/RepeatableRead.pm and comment out the LIMIT lines in the @predicates array on line 19. This will remove any issues with LIMIT from interfering with this bug.

In initial testing, I found that there is a conflict between Falcon's LIMIT support and range queries that use sets. 

mysql> SELECT * FROM `table10_falcon_int_autoinc`;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       9 | 37 |    9 |
|       2 | 40 |    7 |
|       9 | 41 |    7 |
|       6 | 45 |    5 |
|       8 | 47 |    6 |
|       5 | 49 |    5 |
|       7 | 51 |    1 |
|       4 | 52 |    3 |
|       2 | 53 |    2 |
|       7 | 54 |    5 |
|       8 | 55 |    7 |
|       7 | 56 |    5 |
|       0 | 57 |    0 |
|       6 | 58 |    5 |
|       4 | 59 |    5 |
|       9 | 60 |    7 |
|       5 | 61 |    3 |
|       4 | 62 |    9 |
|       6 | 63 |    1 |
+---------+----+------+
19 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` IN ( 0 , 6 , 3 ) LIMIT 10;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       6 | 45 |    5 |
|       0 | 57 |    0 |
|       6 | 58 |    5 |
|       6 | 63 |    1 |
+---------+----+------+
4 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` IN ( 0 , 6 , 3 ) ORDER BY pk LIMIT 10;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       6 | 45 |    5 |
|       0 | 57 |    0 |
|       6 | 58 |    5 |
|       6 | 63 |    1 |
+---------+----+------+
4 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` IN ( 0 , 6 , 3 ) ORDER BY int_key LIMIT 10;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       6 | 45 |    5 |
|       6 | 58 |    5 |
|       6 | 63 |    1 |
+---------+----+------+
3 rows in set (0.00 sec)

### This query is different in that it is ordered by the same indexed field used in the LIMIT

This problem is really a duplicate of Bug#42405.  In that case, the filter used '<>'instead of IN.  That causes the same problem.  The optimizer is using multiple ranges in a single LIMIT.

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` <> 6;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       9 | 37 |    9 |
|       2 | 40 |    7 |
|       9 | 41 |    7 |
|       8 | 47 |    6 |
|       5 | 49 |    5 |
|       7 | 51 |    1 |
|       4 | 52 |    3 |
|       2 | 53 |    2 |
|       7 | 54 |    5 |
|       8 | 55 |    7 |
|       7 | 56 |    5 |
|       0 | 57 |    0 |
|       4 | 59 |    5 |
|       9 | 60 |    7 |
|       5 | 61 |    3 |
|       4 | 62 |    9 |
+---------+----+------+
16 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` <> 6 ORDER BY int_key LIMIT 20;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       7 | 51 |    1 |
|       7 | 54 |    5 |
|       7 | 56 |    5 |
|       8 | 47 |    6 |
|       8 | 55 |    7 |
|       9 | 37 |    9 |
|       9 | 41 |    7 |
|       9 | 60 |    7 |
+---------+----+------+
8 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `int_key` <> 6 ORDER BY pk LIMIT 20;
+---------+----+------+
| int_key | pk | int  |
+---------+----+------+
|       9 | 37 |    9 |
|       2 | 40 |    7 |
|       9 | 41 |    7 |
|       8 | 47 |    6 |
|       5 | 49 |    5 |
|       7 | 51 |    1 |
|       4 | 52 |    3 |
|       2 | 53 |    2 |
|       7 | 54 |    5 |
|       8 | 55 |    7 |
|       7 | 56 |    5 |
|       0 | 57 |    0 |
|       4 | 59 |    5 |
|       9 | 60 |    7 |
|       5 | 61 |    3 |
|       4 | 62 |    9 |
+---------+----+------+
16 rows in set (0.00 sec)

Kavin, the bug is still there. To reproduce, please edit your RepeatableRead.pm to be like this:

my @predicates = (
        'AND `pk` > -16777216'
);

And then run:

$ runall.pl \
 --basedir=/build/bzr/6.0-falcon-team \
 --mysqld=--loose-innodb-lock-wait-timeout=1 \
 --mysqld=--table-lock-wait-timeout=1 \
 --mysqld=--loose-falcon-lock-wait-timeout=1 \
 --mysqld=--loose-falcon-debug-mask=2 \
 --mysqld=--skip-safemalloc \
 --grammar=conf/repeatable_read.yy \
 --gendata=conf/transactions.zz \
 --mysqld=--falcon-consistent-read=1 \
 --mysqld=--transaction-isolation=REPEATABLE-READ \
 --validator=RepeatableRead \
 --mysqld=--falcon-consistent-read=1 \
 --threads=4 \
 --reporters=Deadlock,ErrorLog,Backtrace,Recovery,WinPackage \
 --duration=1200 \
 --vardir=/tmp/vardir \
 --mysqld=--log-output=file \
 --queries=100000 \
 --engine=falcon

a couple of time and you will see;

# 22:12:06 Query: SELECT `int` FROM `table10_falcon_int_autoinc` WHERE 1 = 1 returns different result when executed with additional predicate 'AND `pk` > -16777216' (30 vs. 37 rows).
# 22:12:07 Query: SELECT `pk` FROM `table10_falcon_int_autoinc` WHERE `int_key` BETWEEN 3 and 8 returns different result when executed with additional predicate 'AND `pk` > -16777216' (6 vs. 6 rows).

No LIMIT is involved. An assertion in the new Scavenger may get in the way, however this is a separate issue.

Gosh, Philip, I just cannot repeat this.  I ran the RQG you specified for 1 hour yesterday and the test finished successfully.  During the test, I issued this query over and over again.  Each time, all 5 columns returned the same count.

select 
(SELECT count(*) FROM table10_falcon_int_autoinc WHERE 1 = AS 'One', 
(SELECT count(*) FROM table10_falcon_int_autoinc WHERE 1 = 1 AND pk > -1) AS 'Two', 
(SELECT count(*) FROM table10_falcon_int_autoinc WHERE 'int' > -1 LIMIT 2000) AS 'Three', 
(SELECT count(*) FROM table10_falcon_int_autoinc WHERE int_key > -1 AND pk > -1  LIMIT 2000) AS 'Four', 
(SELECT count(*) FROM table10_falcon_int_autoinc WHERE 1 = 1 AND pk > -1  LIMIT 2000) AS 'Five';

So if you can still repeat this, maybe I should assign it to another engineer to try to debug.

It turns out only INSERTs are required to reproduce the bug. Please proceed as follows:

1. Edit RepeatableRead.pm as follows:

my @predicates = (
        'AND `pk` > -16777216'
};

2. Run:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team \
  --mysqld=--loose-falcon-lock-wait-timeout=1 \
  --mysqld=--loose-falcon-debug-mask=2 \
  --mysqld=--skip-safemalloc \
  --grammar=conf/repeatable_read.yy \
  --gendata=conf/transactions.zz \
  --mysqld=--falcon-consistent-read=1 \
  --mysqld=--transaction-isolation=REPEATABLE-READ \
  --validator=RepeatableRead \
  --threads=10 \
  --duration=1200 \
  --mysqld=--log-output=file \
  --queries=100000 \
  --engine=Falcon \
  --mem \
  --queries=10000000 \
  --mask=18

The --mask parameter will cause only INSERTs and SELECTs to be used.

During the run of this script, I was able to create a table that reported more records from a index retrieval than from a sequential scan;

perl runall.pl --threads=4 --queries=100000 --duration=1200 --basedir=C:\Work\bzr\Chg-09\mysql-6.0-falcon-team --engine=Falcon --grammar=conf/repeatable_read.yy --gendata=conf/transactions.zz --validator=RepeatableRead --reporters=ErrorLog,Backtrace,Recovery,WinPackage

# Return all record from a sequential scan (15) found
mysql> SELECT * FROM `table10_falcon_int_autoinc` order by `int`;
+---------+------+------+
| int_key | pk   | int  |
+---------+------+------+
|       6 |    9 |    0 |
|       6 | 3072 |    0 |
|       5 | 3115 |    0 |
|       0 | 3080 |    0 |
|       6 | 3078 |    1 |
|       1 | 3127 |    2 |
|       8 | 3126 |    3 |
|       6 | 3077 |    3 |
|       7 | 3112 |    4 |
|       5 | 3117 |    4 |
|       6 | 3113 |    6 |
|       6 | 3133 |    6 |
|       8 | 3109 |    7 |
|       5 | 3116 |    9 |
|       5 | 3125 |    9 |
+---------+------+------+
15 rows in set (0.09 sec)

# Return all record from an index scan (16) found
mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE `INT_KEY` > -1 order by `int`;
+---------+------+------+
| int_key | pk   | int  |
+---------+------+------+
|       6 |    9 |    0 |
|       5 | 3115 |    0 |
|       0 | 3080 |    0 |
|       6 | 3072 |    0 |
|       6 | 3078 |    1 |
|       7 | 3123 |    1 |
|       1 | 3127 |    2 |
|       6 | 3077 |    3 |
|       8 | 3126 |    3 |
|       7 | 3112 |    4 |
|       5 | 3117 |    4 |
|       6 | 3133 |    6 |
|       6 | 3113 |    6 |
|       8 | 3109 |    7 |
|       5 | 3125 |    9 |
|       5 | 3116 |    9 |
+---------+------+------+
16 rows in set (0.00 sec)

# The missing record, int_key=7, pk=3123, int=1, is never found
# unless an index is used.
mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE  `int_key` = 7;
+---------+------+------+
| int_key | pk   | int  |
+---------+------+------+
|       7 | 3112 |    4 |
|       7 | 3123 |    1 |
+---------+------+------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE  `pk` = 3123;
+---------+------+------+
| int_key | pk   | int  |
+---------+------+------+
|       7 | 3123 |    1 |
+---------+------+------+
1 row in set (0.00 sec)

mysql> SELECT * FROM `table10_falcon_int_autoinc` WHERE  `int` = 1;
+---------+------+------+
| int_key | pk   | int  |
+---------+------+------+
|       6 | 3078 |    1 |
+---------+------+------+
1 row in set (0.02 sec)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67593

3043 Kevin Lewis	2009-02-25
      Bug #41742  	Falcon REPEATABLE READ does not return repeatable results
      Suspecting that the recordBitmap gets out of sync with the Record tree, 
      I rearranged Table::insertIntoTree so that;
        1) It gets and exclusive or shared lock up front.
        2) When deleting a record from the tree, it clears recordBitmap first because 
           other threads doing fetchNext() will check the recordBitmap first. 
           We don't want them to find a set bit and then not find the record in the tree.
        3) When inserting a record into the tree, it now sets the recordBitmap last,
           after assuring that the store() succeeded.
        4) If the store of a NULL failed, the recordBitmap is now re-set.  
           This may be the real cause of the inconsistency.

> Checking assumptions here. Is is always true that:
> 
> 1) If 'record' == NULL that a record exists in the tree?
> 2) That the bitmap was set to begin with?
>     
>> +    else 
>> +        recordBitmap->set(recordNumber);  // Reset the bit
>> +
>>      return false;
>>  }

Chis,  

These are very good questions.  You got me thinking and looking in the right direction.

It is important to know that the only user of the recordBitmap is Table::fetchNext().  This is a sequential scan.  It is not using an index, but it might have a LIMIT.  So the highWater mark for the bitmap (below which the bitmap is assumed to be reliable) may be half way through the file.  InsertIntoTree does not concern itself with the highwater mark.  It tries to keep the bitmap updated for all records.  So it insertIntoTree came in for a record above the highwater mark, it may be unset before. And if the store() happens to fail, this code will set it for the first time.  All that is OK above the highwater mark, since fetchNext will move that up carefully, setting the bits based on what is really in the tree.

But below the highWater mark is another story.  Let's look at the places where a NULL is put into the record tree;

1) In the catch for a failed
    void Table::insert(Transaction *transaction, int count, Field **fieldVector, Value **values)

2) In the catch for a failed
    uint Table::insert(Transaction *transaction, Stream *stream)

3) In int32 Table::backlogRecord(RecordVersion* record)  where it allows 
Table::insertIntoTree to unset the bit before it immediately resets it.

4) In RecordLeaf::retireRecords() there is a very subtle {*ptr = NULL;} if the call to Record::retire() was successful.

This last one is most disturbing.  If a record is old and get's retired, the slot in the record tree will disappear, but there is no attempt at all to update recordBitmap.  Looking at the consumer of recordBitmap, Table::fetchNext(), it seems that if a bit below the highWater mark is set when it should not be, the code will just skip it and keep looking.
Table::fetchNext does not try to un-set the bit.  So it keeps trying these bits over and over again.  I saw this yesterday on my tests using the new RECORD_HISTORY debugging code. (To be pushed soon).

But if the a bit below the highWater mark is NOT set when it SHOULD be, Table::fetchNext() will miss it.  This can result in queries that do not return as much data as it should.  The previous code in InsertIntotree both set or unset the recordBitmap first, and then called store().  Now, at least, it tries to re-set the bit if it was cleared.  But there can be a gap where a at least a backlogged record is not seen.

Based on this observation, I think I need to adjust the latest patch for this bug.  I no longer think that Table::insertIntoTree should clear ANY bits in recordBitmap.  Even if the bit to be cleared is above the highWater mark, insertIntoTree is not the code that changes that mark.  So it could be above it one moment and below it the next.  

Table::insertIntoTree should continue to set bits in recordBitmap, just in case the bit is below the 'trusted' highWater mark.

This change will mean that the highWater mark will be lowered dramatically whenever table::fetchNext() fails to find a record in the recordLocator page with databaseFetch().  

The reason the recordBitmap is used is because it is faster than finding the next record number from the recordLocater pages.  Even though using dbb->findNextRecord() is slower than the bitmap, a bunch of calls to fetch for retired or deleted records is also a performance slowdown.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67628

3038 Kevin Lewis	2009-02-25
      Bug#41742 - Since Table::fetchNext seems to do OK if a bit below the higWater mark
      is set incorrectly, but does not do OK if a bit below highWater is cleared incorrectly, 
      I am changing Table::insertIntoTree so that it sets bits but does not clear them.
      Bits are already not cleared by RecordLeaf::retireRecords.
      
      So it will be the responsibility of Table::fetchRecords to clear bits below the highWater 
      mark if the record is not found in either the record tree or the database.  This is done 
      under an exclusive lock of Table::syncObject. After the lock is made and before we clear 
      that bit, another fetch is attempted just in case another thread just added that record.
      
      Since Table::insertIntoTree no longer clears bits in recordBitmap, we no longer need to 
      re-set the bit in Table::backlogRecord().

Error still occurs periodically.  And the difference goes both ways.

Query: SELECT `int` FROM `table10_falcon_int_autoinc` WHERE 1 = 1 returns different result when executed with additional predicate 'AND `pk` > -16777216' (5 vs. 11 rows).

Query: SELECT `pk` FROM `table10_falcon_int_autoinc` WHERE 1 = 1 returns different result when executed with additional predicate 'AND `pk` > -16777216' (17 vs. 9 rows).

I was able to debug this further by tagging individual queries with unique identifiers. After numerous test runs, the following picture emerges (autocommit=OFF)

1. Thread #1 inserts a new record or updates an old record to a new value, without commiting

2. Thread #2 issues a full-table-scan SELECT that sees the new record or value, even though it has not been committed yet.

3. Thread #2 issues a index-scan SELECT that no longer sees the new record or value.

The inconsistency is only temporary, however is sometimes observed by the SELECTS from two threads.

Step #2 is at fault here. At that time, the phantom record created by thread #1 has not been committed yet, so it must be in some in-memory data structure (deffered index?) that becomes temporarily visible to thread #2.

A simplifed test case with additional instrumentation will be uploaded shortly.

YY file for bug 41742

Attachment: bug41742.yy (application/octet-stream, text), 590 bytes.

ZZ file for bug 41742

Attachment: bug41742.zz (text/plain), 203 bytes.

To reproduce with the minimum number of columns and the simplest queries:

pull freshly from the test-extra-6.0 tree, take the files attached to this bug and then execute:

$ perl runall.pl \
  --basedir=/build/bzr/6.0-falcon-team \
  --mysqld=--loose-innodb-lock-wait-timeout=5 \
  --mysqld=--table-lock-wait-timeout=5 \
  --mysqld=--loose-falcon-lock-wait-timeout=5 \
  --mysqld=--loose-falcon-debug-mask=2 \
  --mysqld=--skip-safemalloc \
  --grammar=conf/bug41742.yy \
  --gendata=conf/bug41742.zz \
  --mysqld=--falcon-consistent-read=1 \
  --mysqld=--transaction-isolation=REPEATABLE-READ \
  --validator=RepeatableRead \
  --mysqld=--falcon-consistent-read=1 \
  --duration=1200 \
  --mysqld=--log-output=file \
  --queries=100000 \
  --engine=falcon \
  --threads=10 \
  --mysqld=--falcon-record-memory-max=1Gb \
  --mysqld=--falcon-page-cache-size=1Gb

This should fail shortly after takeoff and produce a diff of the two SELECTs that returned different results. Please note the following:

* no LIMIT is used at all;

* Each row is tagged with a unique ID. By grepping for this ID in the query log, you can observe which queries acted on this row;

* Each SELECT is also tagged with a unique ID. If you grep this ID in the query log, you will see the locations of the two SELECTS that are being compared. You will observe that the rows wrongly returned by the first select have not been committed yet;

* Even if the two consequtive SELECTS are entirely identical, they still return different results. Therefore, this bug is not dependent on the table access method;

Philip reported that this problem cannot be repeated after the changes to create a
TransactionState object and the addition of the CycleManager to protect doomed records until after all temporary stack pointers to them have gone away.

Pushed into 6.0.11-alpha (revid:hky@sun.com-20090402144811-yc5kp8g0rjnhz7vy) (version source revid:kevin.lewis@sun.com-20090226055724-zq1qqzgpervfdmk4) (merge vers: 6.0.11-alpha) (pib:6)

A note have been added to the 6.0.11 changelog: 

The Falcon CycleManager has been updated, which addresses a number of issues when examining records in various transaction states and their visisbility/isolation in relation to other threads.