Bug #42424 Serious performance degradation after new scavenger
Submitted: 28 Jan 2009 14:20 Modified: 7 May 2009 16:24
Reporter: Hakan Küçükyılmaz
Status: Closed
Category: MySQL Server: Falcon storage engine Severity: S2 (Serious)
Version: mysql-6.0-falcon-team OS: Linux
Assigned to: Christopher Powers CPU Architecture: Any
Tags: F_SCAVENGER, performance

[28 Jan 2009 14:20] Hakan Küçükyılmaz
Description:
Our DBT2 numbers on caneland in the 100 warehouses scenario degraded from
    1300 - 2000 NOTPM

to
    400 - 2100 NOTPM

with most runs being around 400 NOTPM. OProfile reveals:
samples  %        symbol name
68897    15.7878  RecordScavenge::inventoryRecord(Record*)
68759    15.7561  MemMgr::blockSize(void*)
56134    12.8631  SyncObject::lock(Sync*, LockType, int)
42384     9.7123  RecordScavenge::canBeRetired(Record*)
33621     7.7043  SyncObject::unlock(Sync*, LockType)
20272     4.6453  RecordLeaf::pruneRecords(Table*, int, RecordScavenge*)
16485     3.7775  Sync::lock(LockType)

The old OProfile output was:
samples  %        symbol name
18987     9.8125  SyncObject::lock(Sync*, LockType, int)
10268     5.3065  SyncObject::unlock(Sync*, LockType)
5904      3.0512  MYSQLparse(void*)
5351      2.7654  Cache::fetchPage(Dbb*, int, PageType, LockType)
4587      2.3706  Record::addRef()
4521      2.3365  Record::getEncodedRecord()
3987      2.0605  StorageInterface::decodeRecord(unsigned char*)
3745      1.9354  SyncObject::grantLocks()

Also, while running DBT2 we get these error messages:
090127 18:40:32 [ERROR] Got error 305 when reading table './test/district'
090127 18:40:32 [ERROR] Got error 305 when reading table './test/district'
090127 18:40:33 [ERROR] Got error 305 when reading table './test/item'
090127 18:40:33 [ERROR] Got error 305 when reading table './test/customer'
090127 18:40:33 [ERROR] mysqld: Sort aborted

Note: I did not change Falcon's memory settings!

How to repeat:
Run DBT2 with 100 warehouses on caneland, lu0009, or a comparable system with a software RAID disk subsystem. The problem does not show up as dramatically on my other machine, which has a battery-backed hardware RAID controller.
[28 Jan 2009 20:25] Hakan Küçükyılmaz
Verified by several caneland runs.
[29 Jan 2009 14:54] Hakan Küçükyılmaz
Tested with
-static const uint64 AGE_GROUPS_IN_CACHE = 50;
+static const uint64 AGE_GROUPS_IN_CACHE = 20;

Result: the performance degradation is still there. OProfile says:
samples  %        symbol name
69593    15.6850  MemMgr::blockSize(void*)
67084    15.1195  RecordScavenge::inventoryRecord(Record*)
59314    13.3683  SyncObject::lock(Sync*, LockType, int)
44424    10.0124  RecordScavenge::canBeRetired(Record*)
33808     7.6197  SyncObject::unlock(Sync*, LockType)
18593     4.1905  RecordLeaf::pruneRecords(Table*, int, RecordScavenge*)
16222     3.6561  Sync::lock(LockType)
13717     3.0916  RecordLeaf::retireRecords(Table*, int, RecordScavenge*)
6394      1.4411  RecordGroup::pruneRecords(Table*, int, RecordScavenge*)
5710      1.2869  RecordGroup::retireRecords(Table*, int, RecordScavenge*)
4127      0.9302  Record::getMemUsage()
3983      0.8977  RecordVersion::committedBefore(unsigned int)
3481      0.7846  MYSQLparse(void*)
3208      0.7230  Table::getSyncPrior(Record*)
2927      0.6597  Cache::fetchPage(Dbb*, int, PageType, LockType)
2604      0.5869  RecordVersion::clearPriorVersion()
2495      0.5623  Record::getSyncPrior()
2330      0.5251  Record::addRef()
2330      0.5251  Sync::~Sync()
[29 Jan 2009 15:33] Hakan Küçükyılmaz
Stacktrace of a 100 warehouses DBT2 run

Attachment: scavenger_deadlock.txt.gz (application/x-gzip, text), 142.80 KiB.

[30 Jan 2009 10:27] Hakan Küçükyılmaz
The performance is back to "normal" after trying Chris' patch:

=== modified file 'storage/falcon/RecordScavenge.cpp'
--- storage/falcon/RecordScavenge.cpp   2009-01-14 22:33:44 +0000
+++ storage/falcon/RecordScavenge.cpp   2009-01-29 22:22:40 +0000
@@ -251,7 +251,7 @@
                        scavengeGeneration = baseGeneration - n;
                }

-       return scavengeGeneration;
+       return (scavengeGeneration ? scavengeGeneration : baseGeneration);
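
For context, a minimal self-contained sketch of what this one-line change achieves (the loop shape and every name except scavengeGeneration and baseGeneration are simplified stand-ins, not the actual Falcon source). Before the patch, computeThreshold() could return 0 when even retiring all tracked age groups would not bring active memory down to the floor, so the scavenger retired nothing at all, hence the "reluctant scavenger". Falling back to baseGeneration guarantees a usable cutoff:

#include <cstdint>

static const uint64_t AGE_GROUPS_IN_CACHE = 50;

// Simplified sketch; assumes baseGeneration >= AGE_GROUPS_IN_CACHE and that
// ageGroupBytes[i] holds the memory used by generation baseGeneration - i.
uint64_t computeThreshold(uint64_t baseGeneration,
                          const uint64_t ageGroupBytes[AGE_GROUPS_IN_CACHE],
                          uint64_t activeMemory,
                          uint64_t scavengeFloor)
{
    uint64_t scavengeGeneration = 0;
    uint64_t retired = 0;

    // Accumulate age groups from oldest to newest until retiring them
    // would bring active memory down to the scavenge floor.
    for (uint64_t i = AGE_GROUPS_IN_CACHE; i-- > 0; )
    {
        retired += ageGroupBytes[i];

        if (activeMemory - retired <= scavengeFloor)
        {
            scavengeGeneration = baseGeneration - i;
            break;
        }
    }

    // The fix: if no cutoff reaches the floor, fall back to the base
    // generation so the scavenger still retires something instead of
    // silently doing no work.
    return (scavengeGeneration ? scavengeGeneration : baseGeneration);
}

OProfile after the patch: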

samples  %        symbol name
15133     8.3317  SyncObject::lock(Sync*, LockType, int)
11174     6.1520  MemMgr::blockSize(void*)
8857      4.8764  RecordScavenge::inventoryRecord(Record*)
7631      4.2014  SyncObject::unlock(Sync*, LockType)
6294      3.4653  MYSQLparse(void*)
4683      2.5783  Cache::fetchPage(Dbb*, int, PageType, LockType)
3365      1.8527  RecordLeaf::pruneRecords(Table*, int, RecordScavenge*)
3348      1.8433  Sync::lock(LockType)
2865      1.5774  Record::getEncodedRecord()
2499      1.3759  Record::addRef()
2328      1.2817  lex_one_token(void*, void*)
2319      1.2768  StorageInterface::decodeRecord(unsigned char*)
2169      1.1942  Record::release()
1891      1.0411  IndexPage::findNodeInLeaf(IndexKey*, IndexKey*)
1864      1.0263  EncodedDataStream::decode()
1660      0.9139  Table::fetch(int)
1593      0.8771  IO::readPage(Bdb*)
1587      0.8737  Bitmap::nextSet(int)
1320      0.7267  IndexPage::findSupernode(int, unsigned char*, unsigned long, int, Btn*, bool*)
1317      0.7251  MemFreeBlock::findNextLargest(int)
1254      0.6904  IndexNode::IndexNode(Btn*)
1146      0.6309  open_table(THD*, TABLE_LIST*, st_mem_root*, enum_open_table_action*, unsigned int)
1109      0.6106  SyncObject::grantLocks()
1098      0.6045  Bdb::release()
1081      0.5952  IndexPage::findNodeInBranch(IndexKey*, int)
1072      0.5902  Bdb::incrementUseCount()
1062      0.5847  RecordGroup::pruneRecords(Table*, int, RecordScavenge*)
981       0.5401  RecordLeaf::fetch(int)
964       0.5307  get_hash_symbol(char const*, unsigned int, bool)
918       0.5054  alloc_root
[30 Jan 2009 18:44] Kevin Lewis
After investigating what Chris called a 'reluctant scavenger' and discussing it with Kevin, Chris has a patch that allows the scavenger to retire records even if it cannot retire the whole target amount that would bring active memory down to the floor. That calculation is done in RecordScavenge::computeThreshold().

In addition to the patch tested by Hakan, the RecordScavenge constructor should read and store the current generation from the database and immediately increment it, so that records created during the scavenge cycle are never part of the current scavenge cycle.
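
A minimal sketch of what that constructor change could look like, assuming a database-wide generation counter (the class shape and member names here are illustrative, not the actual Falcon source):

#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the database-wide generation counter.
struct Database
{
    std::atomic<uint64_t> currentGeneration{1};
};

class RecordScavenge
{
public:
    // Read the current generation and bump it in one atomic step, so any
    // record created while this scavenge cycle runs belongs to a newer
    // generation and is never considered by this cycle.
    explicit RecordScavenge(Database *db)
        : database(db),
          baseGeneration(db->currentGeneration.fetch_add(1))
    {
    }

    bool belongsToThisCycle(uint64_t recordGeneration) const
    {
        return recordGeneration <= baseGeneration;
    }

private:
    Database *database;
    uint64_t  baseGeneration;
};

Because fetch_add() returns the pre-increment value, the cycle captures its base generation and bumps the global counter in a single step, so concurrent record creation cannot race into the running cycle.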
[1 Feb 2009 6:36] Christopher Powers
There are a number of other issues with the Scavenger, particularly in a low-memory state:

1. In a low-memory state, the Scavenger should not run updateCardinalities(), because it will also signal the Scavenger for more space.

2. A load-based (formerly 'forced') scavenge should probably call retireRecords() before pruneRecords().

3. In some cases, ten iterations of a 10ms sleep in Record::allocRecordData() are not long enough to allow the Scavenger to free sufficient memory.

4. If allocRecordData() signals a scavenge because there is insufficient memory in the record data pool, Database::retireRecords() will fail to scavenge given a high enough record cache threshold:

void Database::retireRecords()
{
   if (recordDataPool->activeMemory < recordScavengeThreshold)
      return;
   [...]
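
One possible shape for a fix to issue 4, sketched under the assumption that the scavenge request can carry a 'forced' flag; apart from the threshold check quoted above, the names and plumbing here are hypothetical:

#include <cstdint>

// Heavily simplified stand-ins; only the threshold check mirrors the excerpt.
struct MemoryPool { uint64_t activeMemory; };

struct Database
{
    MemoryPool *recordDataPool;
    uint64_t    recordScavengeThreshold;

    void scavengeRecords() { /* stand-in for the retire/prune work */ }

    void retireRecords(bool forced)
    {
        // A forced scavenge is signalled because a record allocation has
        // already failed, so the threshold test must not short-circuit it:
        // active memory can be below the cache threshold while the pool is
        // still too fragmented to satisfy the allocation.
        if (!forced && recordDataPool->activeMemory < recordScavengeThreshold)
            return;

        scavengeRecords();
    }
};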
[1 Feb 2009 8:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64768

2991 Christopher Powers	2009-02-01
      Bug #42424, "Serious performance degradation after new scavenger"
      
      Several fixes for the Scavenger:
      
      1. Don't run updateCardinalities() in a low memory state
      2. Distinguish between scheduled and load-based or 'forced' scavenges
      3. Progressively increase the time that record alloc waits for a scavenge
      4. Retire records during a low memory state, even if active memory is
         below the cache threshold
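
A compact sketch of how fixes 1 and 2 might fit together; the ScavengeType enum and the scavenge() wrapper are illustrative assumptions, not the committed code:

#include <cstdio>

enum ScavengeType { stScheduled, stForced };

struct Database
{
    // Trivial stubs standing in for the real Falcon routines.
    void updateCardinalities() { std::printf("update index cardinalities\n"); }
    void pruneRecords()        { std::printf("prune old record versions\n"); }
    void retireRecords()       { std::printf("retire records, free memory\n"); }

    void scavenge(ScavengeType type)
    {
        // Fix 1: a load-based (forced) scavenge runs because memory is
        // already exhausted; updateCardinalities() would only signal the
        // scavenger for yet more space, so skip it.
        if (type == stScheduled)
            updateCardinalities();

        // Fix 2: on a forced scavenge, retire first so memory is actually
        // released before the cheaper pruning pass runs.
        if (type == stForced)
        {
            retireRecords();
            pruneRecords();
        }
        else
        {
            pruneRecords();
            retireRecords();
        }
    }
};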
[1 Feb 2009 9:03] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/64769

2992 Christopher Powers	2009-02-01
      Bug #42424, "Serious performance degradation after new scavenger"
      
      Increased scavenger wait increment from 10 to 20ms.
      Use progressive scavenger wait timeout for record allocation methods.
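
A minimal sketch of the progressive wait this commit describes; the retry loop, tryAllocRecordData(), and signalScavenger() are assumptions for illustration, and only the 20ms increment and the progressive timeout come from the commit message:

#include <chrono>
#include <cstdlib>
#include <thread>

// Hypothetical stand-ins for the record data pool and the scavenger signal.
static void  signalScavenger() { /* wake the scavenger thread */ }
static void *tryAllocRecordData(std::size_t bytes) { return std::malloc(bytes); }

void *allocRecordData(std::size_t bytes)
{
    // Progressive backoff: instead of ten fixed 10ms sleeps, grow the wait
    // by 20ms per retry, giving the scavenger increasingly more time to
    // free record cache memory under heavy load.
    for (int attempt = 1; attempt <= 10; ++attempt)
    {
        if (void *data = tryAllocRecordData(bytes))
            return data;

        signalScavenger();
        std::this_thread::sleep_for(std::chrono::milliseconds(20 * attempt));
    }

    return nullptr;   // still no memory: the caller reports the failure
}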
[2 Feb 2009 16:27] Kevin Lewis
The changes look good. As long as there is a possibility of fragmentation in the record data pool, we will need to know whether a scavenge cycle is forced, so that a forced cycle can do more work than a normal one.
[13 Feb 2009 7:24] Bugs System
Pushed into 6.0.10-alpha (revid:alik@sun.com-20090211182317-uagkyj01fk30p1f8) (version source revid:hky@sun.com-20090203204730-3k0pc0uoxp6wqr1i) (merge vers: 6.0.10-alpha) (pib:6)
[7 May 2009 16:24] MC Brown
Internal/test fix. No changelog entry required.