Bug #44946 Falcon asserts on vector having wrong size in Record::getEncodedValue()
Submitted: 19 May 2009 6:31    Modified: 26 May 2010 17:47
Reporter: Olav Sandstå
Status: Unsupported
Category: MySQL Server: Falcon storage engine    Severity: S2 (Serious)
Version: 6.0.12-alpha    OS: Linux
Assigned to: Kevin Lewis    CPU Architecture: Any
Tags: F_RECORD TREE

[19 May 2009 6:31] Olav Sandstå
Description:
When running the RQG test falcon_backlog we sometimes hit the following assert in Record::getEncodedValue():

  ASSERT(vector[index] < size);

The call stack from the core file looks like:

#8  0x08569c50 in Error::error (
    string=0x89f0f90 "assertion (%s) failed at line %d in file %s\n")
    at Error.cpp:71
#9  0x08569cf5 in Error::assertionFailed (
    text=0x89f1652 "vector[index] < size", fileName=0x89f155c "Record.cpp", 
    line=682) at Error.cpp:78
#10 0x0856f254 in Record::getEncodedValue (this=0x9c4fb9c8, 
    recordData=0xa6141178 "4", fieldId=8, value=0x9eac3b48) at Record.cpp:682
#11 0x08570528 in Record::getRawValue (this=0x9c4fb9c8, fieldId=8, 
    value=0x9eac3b48) at Record.cpp:377
#12 0x08570937 in Record::getValue (this=0x9c4fb9c8, fieldId=8, 
    value=0x9eac3b48) at Record.cpp:330
#13 0x0857c9d9 in StorageTable::compareKey (this=0x9b57ba00, 
    key=0x9ac04e50 "½ç", keyLength=4) at StorageTable.cpp:496
#14 0x0855d3f8 in StorageInterface::index_next (this=0xb93f9e8, 
    buf=0xb93fdd0 "ÿÿ") at ha_falcon.cpp:1874
#15 0x0855c180 in StorageInterface::read_range_first (this=0xb93f9e8, 
    start_key=0xb93fa88, end_key=0xb93fa98, eq_range_arg=true, sorted=true)
    at ha_falcon.cpp:1757
#16 0x084060cd in handler::multi_range_read_next (this=0xb93f9e8, 
    range_info=0x9eac3d00) at handler.cc:4351
#17 0x0855cfe0 in StorageInterface::multi_range_read_next (this=0xb93f9e8, 
    rangeInfo=0x9eac3d00) at ha_falcon.cpp:1985
#18 0x083e7baf in QUICK_RANGE_SELECT::get_next (this=0x9ac03648)
    at opt_range.cc:8579
#19 0x083ff895 in rr_quick (info=0x9eac3ed8) at records.cc:322
#20 0x0838ffe1 in mysql_update (thd=0xb90d210, table_list=0xb91eee8, 
    fields=@0xb90e5f4, values=@0xb90e828, conds=0xb91f788, order_num=0, 
    order=0x0, limit=18446744073709551615, handle_duplicates=DUP_ERROR, 
    ignore=false) at sql_update.cc:584
#21 0x082d5ef5 in mysql_execute_command (thd=0xb90d210) at sql_parse.cc:3127
#22 0x082dc117 in mysql_parse (thd=0xb90d210, 
    inBuf=0xb91eb90 "UPDATE `table65536_falcon_int_autoinc` AS X SET `varchar_255_latin1_not_null` = 'rvrvxtxtacacececkxkxojojsxsxjljlbvbvlvlvhmhmlolonsnsadadreressssdodohrhrceceqkqkyyyygcgcngngskskplplytytuquqsisihmhmkdk"..., length=293, 
    found_semicolon=0x9eac4e80) at sql_parse.cc:5964
#23 0x082dd299 in dispatch_command (command=COM_QUERY, thd=0xb90d210, 
    packet=0xb912cd1 "UPDATE `table65536_falcon_int_autoinc` AS X SET `varchar_255_latin1_not_null` = 'rvrvxtxtacacececkxkxojojsxsxjljlbvbvlvlvhmhmlolonsnsadadreressssdodohrhrceceqkqkyyyygcgcngngskskplplytytuquqsisihmhmkdk"..., 
    packet_length=293) at sql_parse.cc:1049
#24 0x082de52a in do_command (thd=0xb90d210) at sql_parse.cc:731
#25 0x082caecf in handle_one_connection (arg=0xb90d210) at sql_connect.cc:1146
#26 0x0089a45b in start_thread () from /lib/libpthread.so.0
#27 0x007f1c4e in clone () from /lib/libc.so.6

How to repeat:
Seen sometimes after running falcon_backlog.
[19 May 2009 6:39] Olav Sandstå
See also Bug#42131. The call stack is the same for both bugs, but this is believed to be a different issue: Bug#42131 has already been fixed, and the code where the new crash occurs is different from the earlier one.
[19 May 2009 6:43] Olav Sandstå
Kevin sent the following excellent description by email:
========================================================

The assert says that either Record::size is too small or the vector array is messed up, causing vector[index] to be too big.  The previous fix was focused on making sure the record itself was thawed before this assert.  But the current code guarantees that!  It now thaws the record buffer into data.record, but sends the pointer to that data.record buffer back to Record::getRawValue() which sends that pointer into Record::getEncodedValue().  It does not matter if the transaction immediately chilled the record again.  The vector in the assert is not using data.record.  And the CycleManager keeps the buffer around.  So the record buffer must be good. 
Maybe the vector of offsets is being filled in by multiple reading threads at the same time.  After a thaw, which is serialized by syncThaw, it seems like both threads could call Record::getEncodedValue() at the same time.  If two threads executed the following loop at the same time, some vector offsets may get skipped.

void Record::getEncodedValue(char* recordData, int fieldId, Value *value)
{
. . .
    while (highWater < index)
        {
        const UCHAR *q = EncodedDataStream::skip(p);
        vector[++highWater] = (USHORT) (q - (UCHAR*) recordData);
        p = q;
        }
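
To make the race concrete, here is one hypothetical interleaving (an illustration, not an actual trace) of two threads entering this loop with highWater == 0, each holding its own private pointer p at the start of the record data:

  T1: q = skip(p)                      // q = end of field 0
  T2: q = skip(p)                      // same q, computed from the same p
  T1: vector[++highWater] = q - data   // highWater 0 -> 1, slot 1 correct
  T1: p = q;  q = skip(p)              // T1's q = end of field 1
  T2: vector[++highWater] = q - data   // highWater 1 -> 2, but slot 2 gets
                                       // the end of field 0, not field 1

Slot 2 now holds the wrong offset and is never corrected, so later decoding starts from a misaligned position in the encoded stream, and offsets computed from there can land past the end of the buffer and trip ASSERT(vector[index] < size).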
[20 May 2009 3:59] Kevin Lewis
I implemented a change that allows multiple threads to update the field offset vector simultaneously, each using a copy of the highWater value.  They then use CAS to update the highWater index count if their copy is higher than what is currently there.  This required expanding RecordVersion::highWater from a short to an INTERLOCK_TYPE.

The reason this can work is that all threads updating the vector are using the same recordData, so the vector offsets are all the same.  Multiple threads MAY therefore be trying to write the same 2-byte offsets to the same cache lines.  The question is whether this happens often enough that cache line bouncing is more of a performance hit than the cost of serializing this code through a critical section or a syncObject protecting the vector.  We still do not want to use a unique SyncObject per record, but it is possible to use an exclusive lock on one of the syncThaw objects for the table.
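
Here is a minimal sketch of that scheme, with std::atomic standing in for Falcon's INTERLOCK_TYPE and CAS primitive; all names are illustrative, not Falcon's actual code.  Strictly speaking the racing stores into vector[] are a data race in standard C++, tolerated here on the grounds quoted above that every thread writes identical values.  (Note also the 20 May code-review commit below, which concluded the CAS was not actually necessary.)

#include <atomic>
#include <cstdint>

// Sketch only -- illustrative structure, not Falcon's actual Record class.
struct RecordSketch
{
    std::atomic<int32_t> highWater{0};   // widened from short so CAS works
    uint16_t vector[128] = {0};          // end offset of each field

    // Stand-in for EncodedDataStream::skip(): advance past one field.
    static const unsigned char *fieldEnd(const unsigned char *p)
    {
        return p + 1 + p[0];             // placeholder encoding
    }

    void fillOffsets(const unsigned char *recordData, int index)
    {
        // Each thread works from its own copy of highWater, so its
        // private pointer p and the slots it writes always agree.
        int n = highWater.load();
        const unsigned char *p = recordData + vector[n];

        while (n < index)
        {
            const unsigned char *q = fieldEnd(p);
            vector[++n] = (uint16_t)(q - recordData);
            p = q;
        }

        // Publish progress only if it is higher than the current value;
        // retry the CAS if another thread got further first.
        for (int cur = highWater.load(); cur < n; )
            if (highWater.compare_exchange_weak(cur, n))
                break;
    }
};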
[20 May 2009 4:23] Kevin Lewis
After implementing the previously mentioned code change for using CAS to update Record::highWater, I hit a crash at the same assert as this bug.  I was using RQG combinations with a very low chill threshold.  The problem I found was that a chilled record is not thawable.  Since it cannot be thawed, any garbage collection on it will fail.

Here is the record stack; each entry is a version of record number 26.
POSITION  TRANSID COMMITID  SAVEPOINT SUPERSEDED   VIRTUAL_OFFSET  THAWED
base       964      967        0        false        17922085     deleted
 1         952      962        0        false        16876114       yes
 2         921      924        14       false            0          yes
 3         921      924        0        true         12582986       no
 4         901      918        0        false            0          yes
 5         832      855        0        false            0          yes
 6         800      807        2        false         6088067       yes

The call stack indicated that record versions 3, 4, 5 & 6 are being pruned by the scavenger.  It has determined that version 2 is the oldest visible version and correctly started pruning record 3.  Notice that versions 2 and 3 both belong to the same transaction.  The scavenger disconnects the prior link between 2 and 3.  Index::garbageCollect is called with the 'leaving' record as 3 and the 'staying' record as the base record.
 
The problem is that record version 3 CANNOT be thawed.  The transaction is fully completed.  The Transaction object associated with 921 is no longer around, only the TransactionState.  So Transaction::thaw() cannot be called.  RecordVersion::thaw() normally reads the active record from the page in this case.  But that would be a mistake if it succeeded, since record version 3 would now have the current record attached to it.  Index::garbageCollect() would leave orphan key nodes in the index.

But in this case, the base record is a deleted record, so nothing can be thawed by going to the page.  Since this leaving record is not thawed, garbageCollect fails in Index::makeKey -> Record::getValue -> Record::getRawValue -> Record::getEncodedValue() at the same assert as this bug.  It is trying to access a recordData buffer == (-1).

There are two ways to fix this problem. Either we (a) make sure there is only one version of each record number associated with every committed transaction,  or (b) make sure every old record version is thawed before the transaction is completed.

   a) Transaction::commit calls Transaction::releaseSavepoints, which does not call Transaction::releaseSavepoint for each savepoint.  Instead of collapsing the record version chain down to one record version per transaction, it leaves them all linked at the savepoint they were created under and just gets rid of the savepoint objects.  This can be changed to call Transaction::releaseSavepoint() for each lower savepoint.

   b) In Transaction::fullyCommitted(), before calling writeComplete(), I could add a new routine that thaws each record in the transaction, if needed.  This is done by the gopher thread, so it is probably a better location than Transaction::commit().

I think (b) is the better choice since it keeps statement latency low by not thawing records at commit time; letting the gopher threads do the extra required thawing avoids that cost.
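
As a rough sketch of option (b), matching the Transaction::thawAll() function named in the commit message below (the record-chain and chill-test member names are assumptions, not Falcon's actual members):

// Sketch of option (b).  'firstRecord', 'nextInTrans', and 'isChilled()'
// are illustrative names.
void Transaction::thawAll(void)
{
    // Walk every record version created by this transaction and thaw any
    // that are still chilled, while the serial log data is still reachable.
    for (RecordVersion *record = firstRecord; record; record = record->nextInTrans)
        if (record->isChilled())
            record->thaw();
}

// Called by the gopher thread from Transaction::fullyCommitted(), before
// writeComplete(), so later garbage collection can never encounter an
// unthawable record version.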

This problem was caused by an unthawable record. We have had this discussion before, but I believe that Falcon should prevent any case of an unthawable record.  So I will also add new ASSERTs into RecordVersion::hasRecord() and RecordVersion::getRecordData() so that they can no longer silently fail to thaw anything that needs to be thawed.
[20 May 2009 4:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74535

2708 Kevin Lewis	2009-05-19
      Bug#44946  - ASSERT(vector[index] < size); is hit.
      --------------------------------
      Record.cpp & Record.h
      
      Protect Record::highWater from being incremented concurrently by another thread also running Record::getEncodedValue().  Do this by using a local copy of highWater and updating it by CAS only if it is higher than what is there.  We allow multiple threads to update the vector at the same time since they are all writing the same values to it.  If multiple threads rarely thaw the same record, cache line bounces will not cost much performance, and this patch makes the concurrent updates safe.  In order to use CAS, Record::highWater must change from a short to an INTERLOCK_TYPE.
      
      Change the assert to a FATAL message if (vector[index] > size - getSize()).  The size includes the bytes used by the Record or RecordVersion object.
      
      In Record::setEncodedRecord(), 'size' only needs to be adjusted for non-interlocked calls.  The interlocked calls are done during a thaw. When a record is chilled, the size variable is left the same.
      --------------------------------
      RecordVersion.cpp & SRLUpdateRecords.cpp
      
      RecordVersion::thaw() should also check the virtualOffset before calling Transaction::thaw().  And if the Transaction cannot thaw, the virtualOffset should be immediately set to zero.
      
      Add ASSERTs to RecordVersion::hasRecord() and RecordVersion::getRecordData() to make sure that there are no more 'unthawable' records.
      --------------------------------
      Transaction.cpp & Transaction.h
      
      In order to prevent 'unthawable' records, make sure all records attached to a transaction are thawed by the gopher thread when the transaction is fully complete.  Do this by adding a new function: Transaction::thawAll().
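
In outline, the RecordVersion::thaw() change described in this patch might look like the following; the control flow and return convention are assumptions based on the commit comment:

// Illustrative outline only.  The point of the change: consult
// virtualOffset before asking the Transaction to thaw, and clear it as
// soon as that thaw fails so later callers fall through to the data
// page instead of retrying a dead serial-log offset.
bool RecordVersion::thaw(void)
{
    if (virtualOffset)                 // record was chilled to the serial log
    {
        if (transaction && transaction->thaw(this))
            return true;               // thawed from the serial log

        virtualOffset = 0;             // transaction gone; never retry
    }

    // ... fall back to reading the active record from the data page ...
    return false;
}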
[20 May 2009 17:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74629

2709 Kevin Lewis	2009-05-20
      Bug#44946 Code review changes;
      It is not necessary to use CAS to update Record::highWater.  Explained in comments.
      Added other comments and compiler warning fixes.