MySQL Bugs: #43344: falcon_recovery assert "dup->state == recDeleted" in Table::checkUniqueRecordVer

Bug #43344	falcon_recovery assert "dup->state == recDeleted" in Table::checkUniqueRecordVer
Submitted:	3 Mar 2009 22:14	Modified:	26 May 2010 17:47
Reporter:	Olav Sandstå	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S2 (Serious)
Version:	6.0.10-alpa	OS:	Any
Assigned to:	Kevin Lewis	CPU Architecture:	Any
Tags:	F_RECORD TREE

Description:
The RQG test falcon_recovery failed with the following assert in Table::checkUniqueRecordVersion:

// The record has been deleted.
ASSERT(dup->state == recDeleted);

Call stack:

#7  0x08555838 in Error::debugBreak () at Error.cpp:94
#8  0x085558b4 in Error::error (
   string=0x89dac38 "assertion (%s) failed at line %d in file %s\n")
   at Error.cpp:71
#9  0x08555959 in Error::assertionFailed (
   text=0x89ddc70 "dup->state == recDeleted", fileName=0x89dcc8d "Table.cpp",    line=2574) at Error.cpp:78
#10 0x085777e1 in Table::checkUniqueRecordVersion (this=0xb73c5d68,    recordNumber=727, index=0x95f24918, transaction=0x8b727328,    record=0x885fa0b0, syncUnique=0x976669f8) at Table.cpp:2574
#11 0x08577c99 in Table::checkUniqueIndex (this=0xb73c5d68, index=0x95f24918,    transaction=0x8b727328, record=0x885fa0b0, sync=0x976669f8)
   at Table.cpp:2513
#12 0x08577e26 in Table::insertIndexes (this=0xb73c5d68,    transaction=0x8b727328, record=0x885fa0b0) at Table.cpp:1289
#13 0x08578027 in Table::insert (this=0xb73c5d68, transaction=0x8b727328,    stream=0x8c67b3a4) at Table.cpp:3061
#14 0x0855ee09 in StorageDatabase::insert (this=0xb7068158,    connection=0xb709eb98, table=0xb73c5d68, stream=0x8c67b3a4)
   at StorageDatabase.cpp:266
#15 0x08564fc0 in StorageTable::insert (this=0x8c675e78)
   at StorageTable.cpp:109
#16 0x08551b37 in StorageInterface::write_row (this=0x96a90430,    buff=0x96a90818 "�\a") at ha_falcon.cpp:1176
#17 0x083fee7b in handler::ha_write_row (this=0x96a90430, buf=0x96a90818 "�\a")
   at handler.cc:5500
#18 0x0836ed2c in write_record (thd=0xa5f44e0, table=0x96a8fd78,    info=0xa606b58) at sql_insert.cc:1610
#19 0x0836f0f9 in select_insert::send_data (this=0xa606b38, values=@0xa5f58a8)
   at sql_insert.cc:3167
#20 0x08337ddd in end_send (join=0x96a27bb0, join_tab=0xa607070,    end_of_records=false) at sql_select.cc:17218
#21 0x0834101a in evaluate_join_record (join=0x96a27bb0, join_tab=0xa606ec0,    error=0) at sql_select.cc:16490
#22 0x083413fa in sub_select (join=0x96a27bb0, join_tab=0xa606ec0,    end_of_records=false) at sql_select.cc:16257
#23 0x0834d89b in do_select (join=0x96a27bb0, fields=0xa5f58a8, table=0x0,    procedure=0x0) at sql_select.cc:15786
#24 0x083682ca in JOIN::exec (this=0x96a27bb0) at sql_select.cc:2881
#25 0x08362e73 in mysql_select (thd=0xa5f44e0, rref_pointer_array=0xa5f5918,    tables=0xa606690, wild_num=0, fields=@0xa5f58a8, conds=0x0, og_num=0,    order=0x0, group=0x0, having=0x0, proc_param=0x0,    select_options=3490466304, result=0xa606b38, unit=0xa5f557c,    select_lex=0xa5f5814) at sql_select.cc:3062
#26 0x083685da in handle_select (thd=0xa5f44e0, lex=0xa5f5520,    result=0xa606b38, setup_tables_done_option=1073741824) at sql_select.cc:314
#27 0x082d14a1 in mysql_execute_command (thd=0xa5f44e0) at sql_parse.cc:3179
#28 0x082d6b27 in mysql_parse (thd=0xa5f44e0,    inBuf=0xa605e30 "INSERT INTO `table10_falcon_int_autoinc` ( `int` ) SELECT `int` FROM `table100_falcon_int_autoinc` LIMIT 108", length=108,    found_semicolon=0x97667e80) at sql_parse.cc:5752
#29 0x082d7acb in dispatch_command (command=COM_QUERY, thd=0xa5f44e0,    packet=0xa5f9f71 "INSERT INTO `table10_falcon_int_autoinc` ( `int` ) SELECT `int` FROM `table100_falcon_int_autoinc` LIMIT 108", packet_length=108)
   at sql_parse.cc:1009
#30 0x082d8d96 in do_command (thd=0xa5f44e0) at sql_parse.cc:691
#31 0x082c5f07 in handle_one_connection (arg=0xa5f44e0) at sql_connect.cc:1146
#32 0x0089a45b in start_thread () from /lib/libpthread.so.0
#33 0x007f1c4e in clone () from /lib/libc.so.6 

How to repeat:
Happend when running the RQG falcon_recovery.

Suggested fix:
Falcon should not assert

if (!dup->hasRecord())
			{
			// If the record is a lock record, keep looking for a dup.

			if (dup->state == recLock)
				continue;  // Next record version.

			if (dup->state == recRollback)
				continue;  // Next record version.
			
			// The record has been deleted.
			ASSERT(dup->state == recDeleted);

If this record is not recLock, recRollback or recDeleted, then what is it?

(gdb) print *dup
$3 = {_vptr.Record = 0x89ebda8, data = {record = 0x0}, useCount = 1, format = 0xb708eb38, recordNumber = 727, size = 88, generation = 31, highWater = 0,
  encoding = 204 'л', state = 0 '\0'}

In other words dup->state = 0

This assert occurred because of a race condition between hasRecord and chill().  notice the order in which the Record::state and data.record variables are accessed inthe two functions below;

	inline int hasRecord(bool forceThaw = true)
		{
		if (state == recChilled && forceThaw)
			thaw();
			
		return (data.record != NULL);
		}

bool SRLUpdateRecords::chill(. . .
{
. . .
	record->deleteData();
	record->state = recChilled;
. . .

The race is that client#1 checks for (state == recChilled) and finds that (state == 0).  Then client#2 calls deleteData() but does not set the state yet.  Then client#1 returns false for hasData() without thawing the record.  The caller, checkUniqueRecordVersion(), hits the assert in the section of code shown earlier in this bug while the state is still 0 but the record is chilled.

The fix is for hasRecord() to return whether data.record existed before it checked for recChilled.  If next, (state == recChilled), then hasData() should thaw and return true.

The new addition of data.record to the Cyclemanager when it is chilled will add consistency to any thread that reads data.record.  But all other uses of hasRecord need to be re-evaluated with the knowledge that it could be chilled immediately after the check if that record is pending by another transaction.  If hasrecord is called by the same transaction, then it would be safe since only the thread for that transaction can chill it.

Any reference to hasData in the previous note should be hasRecord.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73693

2702 Kevin Lewis	2009-05-08
      Bug#43344 & Bug#44721  
      This series of changes closes many gaps in which a data buffer can get 
      thawed and then re-chilled almost immediatly which can cause the use 
      of a bad data buffer pointer.  For 44344, it makes the act of chilling
      and thawing records an atomic operation which uses CAS.  Each call to 
      thaw returns a valid pointer to the buffer that was thawed so that it 
      can be used during the current cycle as long as the caller has a 
      CycleLock.
      
      Record.h, Record.cpp, RecordVersion.h & RecordVersion.cpp;
      
      The most significant change is that when a data buffer is chilled, 
      it recieves the value (-1) and this happens atomically.  
      Record::state == recChilled is no longer used.  
      The record.data pointer is now always read once, checked, and 
      returned up the call stack to where it is used.
      
      hasRecord() is changed from an inline function to a virtual function 
      for both Record and RecordVersion. hasRecord() and isChilled() are 
      added so that the record can check data.record itself. 
      
      setEncodedRecord() and thaw() return the pointer to the buffer
      that they just created and put into data.record so that the caller 
      does not have to rely on that buffer being there later.
      
      getAllocatedSize() is added so that the number of bytes in a buffer 
      holding the record can be consistently known.  It is used in several 
      places.  
      
      chillData() is similar to and replaces deleteData() during 
      SRLUpdateRecords::chill
      
      isDeleted() is also added so that the common (state == recDeleted) 
      check can be done by the record itself.  This is not fully 
      implemented yet because we need to get away from the over-use of 
      these state flags.  They are somewhat unreliable on multiple CPUs 
      unless they are set with CAS.
      
      getRecordData() is also changed from an inline to a virtual function.
      
      findRecordData() is like getRecordData but without the thaw.  It is 
      used in three places along the call stack for thawing a record to 
      check if another thread has already thawed that record.  This is 
      not likely now because of the use of syncThaw.  But syncThaw may 
      no longer be necessary.
      
      SRLUpdateRecords.h & SRLUpdateRecords.cpp;
      
      thaw() returns the data buffer thawed in addition to the number of 
      bytes reallocated.
      
      chill() is not sent the dataLength anymore since it calculates that 
      itself.
      
      SerialLog.h, SerialLog.cpp, transaction.h, transaction.cpp;
      
      Since the number of bytes chilled and thawed stored in the SerialLog 
      and Transaction were being maintained but not used, I deleted those
      variables and code used to maintain them.  The number of allocated 
      bytes associated with a Transaction is stored in totalRecorddata and 
      is still maintained.
      
      transaction.cpp
      Even though backlogging is not currently active, the choice whether 
      to backlog any particular record needed to be enhanced.  Along with 
      replacing hasRecord(false) with isChilled(), it should avoid 
      backlogging record chains that do not start at the base record.  
      That is checked with isSuperceded(). Also, backlogging should occur 
      for deleted records as well as newly chilled records.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73828

2702 Kevin Lewis	2009-05-12
      Bug#43344 & Bug#44721  
      Second patch (comprehensive) with review changes
      
      This series of changes closes many gaps in which a data buffer can get 
      thawed and then re-chilled almost immediatly which can cause the use 
      of a bad data buffer pointer.  For 44344, it makes the act of chilling
      and thawing records an atomic operation which uses CAS.  Each call to 
      thaw returns a valid pointer to the buffer that was thawed so that it 
      can be used during the current cycle as long as the caller has a 
      CycleLock.
      
      Record.h, Record.cpp, RecordVersion.h & RecordVersion.cpp;
      
      The most significant change is that when a data buffer is chilled, 
      it recieves the value (-1) and this happens atomically.  
      Record::state == recChilled is no longer used.  
      The record.data pointer is now always read once, checked, and 
      returned up the call stack to where it is used.
      
      hasRecord() is changed from an inline function to a virtual function 
      for both Record and RecordVersion. hasRecord() and isChilled() are 
      added so that the record can check data.record itself. 
      
      setEncodedRecord() and thaw() return the pointer to the buffer
      that they just created and put into data.record so that the caller 
      does not have to rely on that buffer being there later.
      
      getEncodedValue() uses the recordData pointer of the caller instead
      of getting it again from data.record.
      
      getAllocatedSize() is added so that the number of bytes in a buffer 
      holding the record can be consistently known.  It is used in several 
      places.  
      
      chillData() is similar to and replaces deleteData() during 
      SRLUpdateRecords::chill
      
      isDeleted() is also added so that the common (state == recDeleted) 
      check can be done by the record itself.  This is not fully 
      implemented yet because we need to get away from the over-use of 
      these state flags.  They are somewhat unreliable on multiple CPUs 
      unless they are set with CAS.
      
      getRecordData() is also changed from an inline to a virtual function.
      
      findRecordData() is like getRecordData but without the thaw.  It is 
      used in three places along the call stack for thawing a record to 
      check if another thread has already thawed that record.  This is 
      not likely now because of the use of syncThaw.  But syncThaw may 
      no longer be necessary.
      
      isAllocated() is an inline function used about 6 times to determine 
      if a pointer is real or not.
      
      SRLUpdateRecords.h & SRLUpdateRecords.cpp;
      
      thaw() returns the data buffer thawed in addition to the number of 
      bytes reallocated.
      
      chill() is not sent the dataLength anymore since it calculates that 
      itself.
      
      SerialLog.h, SerialLog.cpp, transaction.h, transaction.cpp;
      
      Since the number of bytes chilled and thawed stored in the SerialLog 
      and Transaction were being maintained but not used, I deleted those
      variables and code used to maintain them.  The number of allocated 
      bytes associated with a Transaction is stored in totalRecorddata and 
      is still maintained.
      
      transaction.cpp
      Even though backlogging is not currently active, the choice whether 
      to backlog any particular record needed to be enhanced.  Along with 
      replacing hasRecord(false) with isChilled(), it should avoid 
      backlogging record chains that do not start at the base record.  
      That is checked with isSuperceded(). Also, backlogging should occur 
      for deleted records as well as newly chilled records.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73882

2704 Kevin Lewis	2009-05-12
      Code Cleanup and compile warnings
      Fix the previous patch for Bug#43344 & Bug#44721 that made a call to getAllocatedSize()
      from inside a thaw.  This caused a recursive loop and a stack overflow.

The assertion happenned again after the pathes above.  Those patches helped to distinguish between a record with no data buffer and a chilled record.

The recent crash log is here;
[Falcon] Error: assertion (dup->state == recDeleted) failed at line 2589 in file Table.cpp

http://pb2.norway.sun.com/web.py?action=archive_download&archive_id=537483&pretty=please

Since Table::checkUniqueRecordVersion() can be called for any record along a prior record chain at almost any time, it is possible that valid lock or deleted record versions have other temporary states such as recNoChill, recDeleting & recEndChain.  This assertion is not valid and should be deleted.  The after "getRelativeState(dup, DO_NOT_WAIT)" and "if (!dup->hasRecord()"
should be changed to only check for a deleted record if the relative state is Us or CommittedVisible.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74975

2712 Kevin Lewis	2009-05-26
      Bug#43344 - This assert can still happen even after the chance of it being a chilled record is eliminated.  
      Since Table::checkUniqueRecordVersion() can be called for any record along a prior record chain at almost any time, it is possible that valid lock or deleted record versions have
      other temporary states such as recNoChill, recDeleting & recEndChain.  So this assertion is not valid and should be deleted.  The code after "getRelativeState(dup, DO_NOT_WAIT)" and "if (!dup->hasRecord()" should be changed to only check the state for a deleted record if the relative state is Us or CommittedVisible.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75080

2712 Kevin Lewis	2009-05-27
      Bug#43344 - Thsi patch is just like the previous one, except that a bug is fixed.  Instead of checking if (dup->state != recDeleted) it correctly checks if (dup->state == recDeleted).
      
      The assert is that after hasRecord() returns false, the state is neither recLock, recRollback, nor recDeleted.  It is not recChilled anymore since that state is no longer used. 
      
      Since Table::checkUniqueRecordVersion() can be called for any record along a prior record chain at almost any time, it is possible that valid lock or deleted record versions have other temporary states such as recNoChill, recDeleting & recEndChain.  So this assertion is not valid and should be deleted.  The code after "getRelativeState(dup, DO_NOT_WAIT)" and "if (!dup->hasRecord()" should be changed to only check the state for a deleted record if the relative state is Us or CommittedVisible.

I have been messing with Table::checkUniqueRecordVersion() again for Bug#43344, which we thought was fixed, but still occurs. The assert is that after hasRecord() returns false, the state is neither recLock, recRollback, nor recDeleted. I suspect that the state is one of the 'temporary' states such as recNoChill or recDeleteing on a deleted record, or recEndChain which I think could happen on both recLock and recDeleted.

The previous patch avoids this particular error, but there is a mor pervasive problem that should be addressed.

The idea of changing the state of a record temporarily, hiding the intrinsic type of the record, is a bad idea. If multiple threads are going to be changing these states, then they should be done atomically, or else the algorithm of the threads must guarantee that only one thread changes the state at a time. Algorithmic thread synchronization is hard to grasp, much less document, much less prove.

But instead of making these state changes atomic, I would like to first separate out information that is not needed in that state variable. For example; recChilled, recDeleted, and recLocked all refer to 'why doesn't this record have a data buffer?' Recently, I moved recChilled out of the state by making Record::data.record == recordIsChilled, which is -1. So I have worked this past few days on a patch that does same with recDeleted and recLocked. In this patch, the only reason for Record::data.record == NULL is that the record data is not yet allocated. And there will be no reason for that when a record is found on the tree or in the prior chain.

With this patch, we quit checking the state for recLocked and recDeleted. It uses RecordVersion::isALock() and isDeleted() intead. The temporary states do not change what a RecordVersion object represents. It is easier to prove that the temporary states do not overwrite each other. The code is more deterministic. And many of the other states become useless. (If a state is not checked, why set it?) This patch will be done soon.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/75493

2715 Kevin Lewis 2009-06-02
Bug#43344 - Reduce Falcon's dependency on Record::state since some of them are temporary states that overwrite other states.

Record.cpp, Record.h, RecordVersion.cpp & RecordVersion.h
- - - - - - - - - - - - - - -
Replace states recDeleted and recLock with values for Record::data.record of recordIsALock (-2) and recordIsDeleted(-3). This makes it much more deterministic to tell what kind of record this is. Other states were found to be not needed since they were not being checked. They include recOnDisk, recRollback, recUnlocked, recDeleting,and recPruning. The initial state of zero previously called recData is now called recNormal. A new value is used for data.record when the data buffer has been deleted prior to the object being scavenged. It is called recordDataIsDeleted (-4).

New functions are added to determine intrinsic record types and status including isAllocated, isALock(), findRecordData(), hasNoData(), setLock() & setDeleted(). ExchangeData is separated from deleteData() so that it can be used more flexibly. validateData is deleted since it is not used.

In several places, deleteData() is used before putting something else into the data buffer. Now, it is rearranged so that the ne data buffer is created first and the two are exhanged atomically. This continues the effort to change data.record atomically, even where it contains the non-pointer values.

In the process of analyzing this code, several new ASSERTS are added. These were tested as much as possible, but pushbuild and other DBT2 runs will need to completely verify them.

Index.cpp, RecordScavenge.cpp, SRLUpdateRecords.cpp, StorageDatabase.cpp, Table.cpp, Transaction.cpp
- - - - - - - - - - - - - - -
The functions isALock() and isDeleted() are used everywhere the state was checked for recLocked and recDeleted.

Table.cpp
- - - - - - - - - - - - - - -
One of the two flavors of Table::update() had two sections of code that tried to set 'oldRecord' These were combined into more compact code and comments were added.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75552

2716 Kevin Lewis	2009-06-03
      Bug#43344 Code review fix.  The previous patch lost the negative on isALock().  
      This bug would limit the effectiveness of pruning old versions.
      
      In addition, this patch limits the use of Record::state even further by asserting that it is always recNormal before setting it to something, and setting it back to recNormal when that state is completed.  It assures that these temporary states do not overlap. And since recChilled overlaps with recInserting, recChilled is not used anymore.  Instead recInserting and a new state, recUpdating, are checked to prevent the current record from being chilled.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/75609

2719 Kevin Lewis	2009-06-04
      Bug#43344 - One more cleanup.  RecordVersion::hasData() should act like RecordVersion::getRecordData() by testing the recordData returned from thaw() instead of calling isChilled() which tests data.record again.  The record can immediately get re-chilled.  And that actually happened in a pushbuld test.