MySQL Bugs: #40342: Falcon; big idle spike when running sysbench with 50 million table size

Bug #40342	Falcon; big idle spike when running sysbench with 50 million table size
Submitted:	27 Oct 2008 9:26	Modified:	15 May 2009 12:56
Reporter:	Xiang Fang	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S5 (Performance)
Version:	6.0-falcon	OS:	Linux (RHEL 5.2 x64)
Assigned to:	Kevin Lewis	CPU Architecture:	Any
Tags:	F_SCAVENGER, idle spike

Description:
when running sysbech with 50 million table size, I find system will have big idle spike for about 5 seconds, most of CPU are idle(there are 16 logical CPUs, 1 is busy and the others are almost idle).

Here is the summary of gdb call stack info:
1. 1 thread was executing the memMgr::releaseBlock and other 15 thread were blocking at Sync::Lock() from Database::retireRecords().
2. the memMgr::releaseBlock() was also called by Database::retireRecords().

Please see the 2 call stacks below:
Thread 12 (Thread 1274538304 (LWP 3068)):
#0  0x00002b0a8f2b1687 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000973139 in Synchronize::sleep (this=0x2aab2b0a8490,
    milliseconds=10000, callersMutex=0x2aab2ae83c68) at Synchronize.cpp:183
#2  0x0000000000892467 in SyncObject::wait (this=0x2aab2ae83c58,
    type=Exclusive, thread=0x2aab2b0a8490, sync=0x4bf7b550, timeout=0)
    at SyncObject.cpp:658
#3  0x00000000008929df in SyncObject::lock (this=0x2aab2ae83c58,
    sync=0x4bf7b550, type=Exclusive, timeout=0) at SyncObject.cpp:449
#4  0x0000000000890edd in Sync::lock (this=0x4bf7b550, type=Exclusive)
    at Sync.cpp:58
#5  0x00000000008d294e in Database::retireRecords (this=0x2aab2ae82bb8,
    forced=true) at Database.cpp:1760
#6  0x00000000008d2e44 in Database::forceRecordScavenge (this=0x2aab2ae82bb8)
    at Database.cpp:2464
#7  0x0000000000894059 in Table::allocRecord (this=0x2aab2b060900,
    recordNumber=20186783, stream=0x4bf7b6c0) at Table.cpp:3666
#8  0x00000000008969ec in Table::databaseFetch (this=0x2aab2b060900,
    recordNumber=20186783) at Table.cpp:665
#9  0x0000000000899a75 in Table::fetch (this=0x2aab2b060900,
    recordNumber=20186783) at Table.cpp:920
#10 0x0000000000883828 in StorageDatabase::nextIndexed (this=0x2aab2b0831b8,
    storageTable=0x2aad43732c10, recordBitmap=0x2aab2b087cb8,
    recordNumber=20186783, lockForUpdate=false) at StorageDatabase.cpp:423
#11 0x000000000088a8a5 in StorageTable::nextIndexed (this=0x2aad43732c10,
    recordNumber=20186783, lockForUpdate=false) at StorageTable.cpp:171
#12 0x000000000087b4fb in StorageInterface::index_next (this=0x2aac48a5cef8,
    buf=0x2aac48a5d2d8 "\237\0064\001") at ha_falcon.cpp:1653
#13 0x0000000000767be9 in handler::read_range_next (this=0x2aac48a5cef8)
    at handler.cc:4885
#14 0x0000000000765ffb in handler::multi_range_read_next (this=0x2aac48a5cef8,
    range_info=0x4bf7bdb0) at handler.cc:4172
#15 0x0000000000748b99 in QUICK_RANGE_SELECT::get_next (this=0x2aadafe00ea0)
    at opt_range.cc:8522
#16 0x00000000007603ac in rr_quick (info=0x2aad4c02b3d8) at records.cc:317
#17 0x00000000006c1033 in sub_select (join=0x2aad58004968,
    join_tab=0x2aad4c02b350, end_of_records=false) at sql_select.cc:13724
#18 0x00000000006c8b7d in do_select (join=0x2aad58004968, fields=0x0,
    table=0x2aad4c02cfd8, procedure=0x0) at sql_select.cc:13453
#19 0x00000000006e352c in JOIN::exec (this=0x2aad58004968)
    at sql_select.cc:2414
....

Thread 13 (Thread 1274272064 (LWP 3067)):
#0  0x0000000000991832 in MemFreeBlock::remove (this=0x2aad59c33cf0)
    at MemFreeBlock.cpp:234
#1  0x00000000009073df in MemMgr::remove (this=0x12bcd40, block=0x2aad59c33cf0)
    at MemMgr.cpp:715
#2  0x000000000090873f in MemMgr::releaseBlock (this=0x12bcd40,
    block=0x2aad59c57f20) at MemMgr.cpp:663
#3  0x00000000009088bc in MemMgr::release (object=0x2aad59c57f30)
    at MemMgr.cpp:591
#4  0x0000000000908920 in MemMgrRelease (object=0x2aad59c57f30)
    at MemMgr.cpp:190
#5  0x00000000009220a7 in ~Record (this=0x2aad59c57f30) at MemoryManager.h:127
#6  0x0000000000921eb7 in Record::release (this=0x2aad59c57f30)
    at Record.cpp:534
#7  0x00000000009210d8 in RecordLeaf::retireRecords (this=0x2aacf901db88,
    table=0x2aab2b060900, base=272372, recordScavenge=0x4bf3a2e0)
    at RecordLeaf.cpp:202
#8  0x00000000009941aa in RecordGroup::retireRecords (this=0x2aacde2b0560,
    table=0x2aab2b060900, base=2723, recordScavenge=0x4bf3a2e0)
    at RecordGroup.cpp:124
#9  0x00000000009941aa in RecordGroup::retireRecords (this=0x2aacde2a2290,
    table=0x2aab2b060900, base=27, recordScavenge=0x4bf3a2e0)
    at RecordGroup.cpp:124
#10 0x00000000009941aa in RecordGroup::retireRecords (this=0x35554b78,
    table=0x2aab2b060900, base=0, recordScavenge=0x4bf3a2e0)
    at RecordGroup.cpp:124
#11 0x0000000000895436 in Table::retireRecords (this=0x2aab2b060900,
    recordScavenge=0x4bf3a2e0) at Table.cpp:1870
#12 0x00000000008d2b1d in Database::retireRecords (this=0x2aab2ae82bb8,
    forced=true) at Database.cpp:1806
#13 0x00000000008d2e44 in Database::forceRecordScavenge (this=0x2aab2ae82bb8)
    at Database.cpp:2464
#14 0x0000000000921d38 in Record::allocRecordData (this=0x2aadbf200120,
    length=69) at Record.cpp:928
#15 0x0000000000923d01 in Record::setEncodedRecord (this=0x2aadbf200120,
    stream=0x4bf3a6c0, interlocked=false) at Record.cpp:707
#16 0x000000000092426f in Record (this=0x2aadbf200120, table=0x2aab2b060900,
    recordNum=29637687, stream=0x4bf3a6c0) at Record.cpp:97
#17 0x0000000000893f60 in Table::allocRecord (this=0x2aab2b060900,
    recordNumber=29637687, stream=0x4bf3a6c0) at Table.cpp:3659
#18 0x00000000008969ec in Table::databaseFetch (this=0x2aab2b060900,
    recordNumber=29637687) at Table.cpp:665
#19 0x0000000000899a75 in Table::fetch (this=0x2aab2b060900,
    recordNumber=29637687) at Table.cpp:920
#20 0x0000000000883828 in StorageDatabase::nextIndexed (this=0x2aab2b0831b8,
    storageTable=0x2aad43953c10, recordBitmap=0x2aab2b086d68,
    recordNumber=29637687, lockForUpdate=false) at StorageDatabase.cpp:423
#21 0x000000000088a8a5 in StorageTable::nextIndexed (this=0x2aad43953c10,
    recordNumber=29637687, lockForUpdate=false) at StorageTable.cpp:171
#22 0x000000000087b4fb in StorageInterface::index_next (this=0x2aac492d5458,
    buf=0x2aac483f5b98 "7<\001") at ha_falcon.cpp:1653
#23 0x0000000000767be9 in handler::read_range_next (this=0x2aac492d5458)
    at handler.cc:4885
#24 0x0000000000765ffb in handler::multi_range_read_next (this=0x2aac492d5458,
    range_info=0x4bf3adb0) at handler.cc:4172
#25 0x0000000000748b99 in QUICK_RANGE_SELECT::get_next (this=0x2aad484224c0)
    at opt_range.cc:8522
#26 0x00000000007603ac in rr_quick (info=0x2aad49990e38) at records.cc:317
#27 0x00000000006c1033 in sub_select (join=0x2aad4db3fc58,
    join_tab=0x2aad49990db0, end_of_records=false) at sql_select.cc:13724
#28 0x00000000006c8b7d in do_select (join=0x2aad4db3fc58,
    fields=0x2aad4db416a0, table=0x0, procedure=0x0) at sql_select.cc:13453
#29 0x00000000006e49ca in JOIN::exec (this=0x2aad4db3fc58)
    at sql_select.cc:2827
...

How to repeat:
MySQL 6.0 falcon  

Sysbench 0.4.8 

Use 16GB memory as ram disk

Running parameter: --num-threads=16 --max-time=200 --max-requests=0 --mysql-engine-trx=yes --mysql-table-engine=falcon --test=oltp --oltp-read-only=on --mysql-user=mysql --oltp-table-size=50000000 --mysql-socket=/tmp/mysql.sock --oltp-point-selects=10 --oltp-simple-ranges=1 --oltp-sum-ranges=1 --oltp-order-ranges=1 --oltp-distinct-ranges=1

I verified this with my own testing a couple weeks ago. If the system if very busy with a lot of threads and one of them calls forceRecordScavenge it cannot get done fast enough to prevent the other threads from doing the same. A forced record scavenge cycle is too long. The forced scavenge needs to be redesigned so that it signals the scavenger thread to start scavenging and then it should loop until it can get the memory it needs.  It should not do the scavenge itself.

See the ideas in Worklogs 4606, 4477 & 4478 for some other improvements to scavenging.

These idle spikes are caused by all the foreground threads waiting on one foreground thread to do a complete scavenge.  The new scavenger will fix this.  Each foreground thread that needs record cache memory will signal the background scavenger thread, sleep for 10 miliseconds, and then try again to get memory. It will repeat this 5 times before giving up and returning an out-of-memory error.  The background scavenger will prune records if it can and retire records down to the scavenge floor.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/62571

2958 Kevin Lewis	2009-01-07
Bug#34893, Bug#36700, Bug#40342, Bug#40801
New scavenger code, See WL#4606, WL#4477 & WL#4478 for more info.
      
This is the second full patch of the new scavenger.  
1) Separate the two tasks involved in scavenging into
   A. Pruning old invisible record versions.
      1. As part of this task, a survey is taken of records
         in age groups so that the next step can be done in 
         one pass
   B. Retiring least-recently-used records from the record cache
2) Watch how much record space is added to the record cache 
   and start a new age group at regular intervals. 
3) Put the scavenger into a separate thread that can be 
   signaled when needed.

This patch contains many improvements over the last patch
including;
1) forcedRecordScavenge is now signalScavenge and the loop is 
   increased to 5 times attempting to get memory from the 
   record cache.
2) Unused code is deleted
3) Integrated with new dependency manager code so that it now
   uses Transaction::commitId to determine if a record is old
   enough to be scavenged.

Pushed into 6.0.10-alpha (revid:alik@sun.com-20090211182317-uagkyj01fk30p1f8) (version source revid:olav@sun.com-20090113103017-41jbad7qlvlwpwxh) (merge vers: 6.0.10-alpha) (pib:6)

An entry has been added to the 6.0.10 changelog: 

The Falcon storage engine has been updated to incorporate new code for the built-in scavenger service, which handles the caching of records in memory. This fixes a number of different issues related to the visibility of different records during certain operations and improves the memory usage.