MySQL Bugs: #41840: Unbounded tablespace growth during recovery

Bug #41840	Unbounded tablespace growth during recovery
Submitted:	3 Jan 2009 13:57	Modified:	15 May 2009 14:11
Reporter:	Philip Stoev	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	6.0-falcon-team	OS:	Any
Assigned to:	Vladislav Vaintroub	CPU Architecture:	Any
Tags:	F_RECOVERY

Description:
When recovering from the attached tablespace and logs, the tablespace proceeds to grow to many times its original size. The recovery thread is in the following position:

#0  0x000000315b00e6d7 in fsync () from /lib64/libpthread.so.0
#1  0x00000000008bcfac in IO::pwrite (buffer=<value optimized out>, length=<value optimized out>, offset=<value optimized out>, this=<value optimized out>)
    at IO.cpp:651
#2  IO::writePages (this=0x7f862de94108, pageNumber=<value optimized out>, length=<value optimized out>, data=<value optimized out>, type=1) at IO.cpp:364
#3  0x0000000000883480 in Cache::writePage (this=0x7f862e0b2ff0, bdb=0x7f862e0c8bd8, type=1) at Cache.cpp:556
#4  0x000000000088395c in Cache::findBuffer (this=0x7f862e0b2ff0, dbb=0x7f862de94100, pageNumber=67226, lockType=Exclusive) at Cache.cpp:445
---Type <return> to continue, or q <return> to quit---
#5  0x0000000000883b5b in Cache::fakePage (this=0x7f862e0b2ff0, dbb=0x7f862de94100, pageNumber=67226, type=PAGE_data_overflow, transId=117987)
    at Cache.cpp:365
#6  0x000000000089c131 in Dbb::allocPage (this=0x6, pageType=769507328, transId=2) at Dbb.cpp:249
#7  0x00000000008e63ab in Section::storeTail (this=0x7f862a4c2780, stream=0x7fff375a2080, maxRecord=15, pLength=0x7fff375a1f6c, transId=117987,
    earlyWrite=false) at Section.cpp:755
#8  0x0000000000924607 in DataPage::updateRecord (this=0x7f862dc44000, section=0x7f862a4c2780, lineNumber=111, stream=0x7fff375a2080, transId=117987,
    earlyWrite=false) at DataPage.cpp:118
#9  0x00000000008e7831 in Section::updateRecord (this=0x7f862a4c2780, recordNumber=<value optimized out>, stream=0x7fff375a2080, transId=117987,
    earlyWrite=false) at Section.cpp:564
#10 0x000000000089be85 in Dbb::updateRecord (this=0x7f862de94100, sectionId=<value optimized out>, recordId=13430, stream=0x7fff375a2080, transId=117987,
    earlyWrite=false) at Dbb.cpp:350
#11 0x000000000090d762 in SRLUpdateRecords::redo (this=0x7fff375a2b98) at SRLUpdateRecords.cpp:337
#12 0x00000000008f2cb7 in SerialLog::recover (this=0x7f862e1ccb48) at SerialLog.cpp:368
#13 0x0000000000897202 in Database::openDatabase (this=0x7f862e08abb8, filename=0x7fff375a2fd0 "/build/bugs/master-data-copy/falcon_master.fts")
    at Database.cpp:733
#14 0x000000000088b463 in Connection::getDatabase (this=0x7f862de89700, dbName=0x7f862de8937c "FALCON_MASTER",
    dbFileName=0x7fff375a2fd0 "/build/bugs/master-data-copy/falcon_master.fts", threads=0x7f862de893d8) at Connection.cpp:1651
#15 0x000000000088cab4 in Connection::openDatabase (this=0x7f862de89700, dbName=0x7f862de8937c "FALCON_MASTER", filename=<value optimized out>,
    account=0xba8331 "mysql", password=0xba8331 "mysql", address=0x0, parent=0x7f862de893d8) at Connection.cpp:934
#16 0x0000000000850e27 in StorageDatabase::getOpenConnection (this=0x7f862de891f0) at StorageDatabase.cpp:136
#17 0x0000000000853a06 in StorageHandler::initialize (this=0x7f862e08a040) at StorageHandler.cpp:987
#18 0x0000000000848c76 in StorageInterface::falcon_init (p=<value optimized out>) at ha_falcon.cpp:245
#19 0x00000000007535af in ha_initialize_handlerton (plugin=0x350b778) at handler.cc:440
#20 0x00000000007daa02 in plugin_initialize (plugin=0x6) at sql_plugin.cc:1006
#21 0x00000000007dd974 in plugin_init (argc=0x10ac688, argv=0x34eab78, flags=<value optimized out>) at sql_plugin.cc:1213
#22 0x0000000000657f79 in init_server_components () at mysqld.cc:4112
#23 0x000000000065b008 in main (argc=<value optimized out>, argv=<value optimized out>) at mysqld.cc:4574

How to repeat:
The tablespace will be provided shortly.

Here is the log of one such recovery:

# 20:48:05 New table space FALCON_USER, id 1, type 0, filename falcon_user.fts
# 20:48:05 New table space FALCON_TEMPORARY, id 2, type 0, filename falcon_temporary.fts
# 20:48:05 Recovering database /tmp/vardir/master-data_recovery/falcon_master.fts ...
# 20:48:06 /nFirst recovery block is 5696
# 20:48:06 Last recovery block is 7582
# 20:48:06 Recovery read block is 6886
# 20:48:06 Recovery phase 1...
# 20:48:06 Processed:    50913
# 20:48:06
# 20:48:06 Recovery phase 2...
# 20:48:06 Processed:    50913
# 20:48:06
# 20:48:06 Recovery phase 3...
# 20:48:06 === Scavenge Cycle 1 - FALCON_MASTER - 0 seconds
# 20:48:06 4: Activity on /tmp/vardir/master-data_recovery/falcon_master.fts: 10 fetches, 5 reads, 0 writes, 1 flushWrites
# 20:48:06 4: Activity on /tmp/vardir/master-data_recovery/falcon_temporary.fts: 0 fetches, 1 reads, 0 writes, 1 flushWrites
# 20:48:06 4: Activity on /tmp/vardir/master-data_recovery/falcon_user.fts: 8878 fetches, 32 reads, 0 writes, 113 flushWrites
# 20:55:26 Processed:    50913
# 20:55:26 Recovery complete
# 20:55:26 443: Cache flush: 106 pages, 32 writes in 0 seconds (106 pps)
# 20:55:26 443: Commit System Transaction 2
# 20:55:26 443: Commit System Transaction 4
# 20:55:26 443: CommitNoUpdates transaction 6
# 20:55:26 === Scavenge Cycle 2 - FALCON_MASTER - 0 seconds
# 20:55:26 443: Transactions: 3 committed, 0 rolled back, 1 active, 8 available, 2 post-commit, oldest 0 seconds
# 20:55:26 443: SerialLog: 0 reads, 4 writes, 0 transactions, 0 completed, 0 stalls, 2 blocks, 1 windows
# 20:55:26 443: Activity on /tmp/vardir/master-data_recovery/falcon_master.fts: 2668 fetches, 71 reads, 0 writes, 9 flushWrites
# 20:55:26 443: Activity on /tmp/vardir/master-data_recovery/falcon_user.fts: 897340 fetches, 1155 reads, 0 writes, 38880 flushWrites

Notice "897340 fetches, 1155 reads, 0 writes, 38880 flushWrites". Is this OK?

You did not say what you have done before you killed mysqld. The answer to "is this OK?" is then maybe it is ok and it depends on your workload. If you have not killed the app but just stopped the activity of the client application, it alsocould take hours before mysqld stops writing. This happens when gopher threads lag behind the client threads, which is presumably in your case. recovery has to recover ~51000 log records, so it might take both time and space

This was observed when running this RQG command:

$ perl runall.pl \
  --mysqld=--falcon-page-size=8K \
  --vardir=/tmp/vardir \
  --rows=1000 \
  --threads=32 \
  --mask=8 \
  --queries=1000000 \
  --duration=300 \
  --basedir=/build/bzr/6.0-falcon-team \
  --engine=Falcon \
  --grammar=conf/combinations.yy \
  --gendata=conf/combinations.zz \
  --validator=DatabaseConsistency \
  --reporter=Deadlock,ErrorLog,Backtrace,Recovery \
  --mysqld=--loose-falcon-lock-wait-timeout=1 \
  --mysqld=--log-output=none

Note that the --mask parameter means that only DELETEs are being generated.

Data directory before recovery looks like this:

-rw-rw----  1 philips philips  75492352 2009-01-03 18:29 falcon_master.fl1
-rw-rw----  1 philips philips  86007808 2009-01-03 18:29 falcon_master.fl2
-rw-rw----  1 philips philips   1835008 2009-01-03 18:29 falcon_master.fts
-rw-rw----  1 philips philips     65536 2009-01-03 18:29 falcon_temporary.fts
-rw-rw----  1 philips philips 144130048 2009-01-03 18:29 falcon_user.fts

the tablespace is 144 mb, log files are about 80mb each. After recovery:

-rw-rw----  1 philips philips   75492352 2009-01-03 18:28 falcon_master.fl1
-rw-rw----  1 philips philips   86007808 2009-01-03 18:28 falcon_master.fl2
-rw-rw----  1 philips philips    1835008 2009-01-03 18:28 falcon_master.fts
-rw-rw----  1 philips philips      65536 2009-01-03 18:29 falcon_temporary.fts
-rw-rw----  1 philips philips 3103191040 2009-01-03 18:29 falcon_user.fts

Log files unchanged, the falcon_user tablespace is now 3.1 gigabytes.

Another result from a grammar that uses transaction, update, select, replace, delete:

$ perl runall.pl \
  --mysqld=--falcon-page-size=8K \
  --vardir=/tmp/vardir \
  --rows=100 \
  --threads=16 \
  --mask=59 \
  --queries=1000000 \
  --duration=300 \
  --basedir=/build/bzr/6.0-falcon-team \
  --engine=Falcon \
  --grammar=conf/combinations.yy \
  --gendata=conf/combinations.zz \
  --reporter=Deadlock,ErrorLog,Backtrace,Recovery \
  --mysqld=--loose-falcon-lock-wait-timeout=1 \
  --mysqld=--log-output=none

Before recovery, the tablespace is 309 mb.
 -rw-rw---- 1 philips philips  73779712 2009-01-03 17:15 falcon_master.fl1
 -rw-rw---- 1 philips philips  70248960 2009-01-03 17:15 falcon_master.fl2
 -rw-rw---- 1 philips philips   1835008 2009-01-03 17:15 falcon_master.fts
 -rw-rw---- 1 philips philips     65536 2009-01-03 17:15 falcon_temporary.fts
 -rw-rw---- 1 philips philips 309608448 2009-01-03 17:15 falcon_user.fts

After recovfery, it is 2.9 gigabytes

-rw-rw---- 1 philips philips   73779712 2009-01-03 17:14 falcon_master.fl1
-rw-rw---- 1 philips philips   70248960 2009-01-03 17:14 falcon_master.fl2
-rw-rw---- 1 philips philips    1835008 2009-01-03 17:14 falcon_master.fts
-rw-rw---- 1 philips philips      65536 2009-01-03 17:15 falcon_temporary.fts
-rw-rw---- 1 philips philips 2973466624 2009-01-03 17:15 falcon_user.fts

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72525

3129 Vladislav Vaintroub	2009-04-20
      Bug #41840 Unbounded tablespace growth during recovery
      Sometimes, after recovery the tablespaces get huge, orders of
      magnitude larger than original tablespace.
      
      The problem was that page overflow pages were not deleted 
      during recovery. If a record was modified several times during 
      recovery, the overflow pages that were created during recovery
      also leaked.
      
      The fix is to remove overflow pages , when it is possible.
      - remove all SRLOverflowPages found in course of recovery
      - just like during normal runtime,when record gets modified in 
      recovery,  remove overflow pages if the removal is safe.
      Safe means,  that page referenced is a valid dataoverflow page.
      Validity is guaranteed, if page is either not touched by recovery, 
      or the page was allocated during recovery and not yet freed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72569

3131 Vladislav Vaintroub	2009-04-21
      Addendum to Bug #41840 Unbounded tablespace growth during recovery
      
      Problem: blob recovery might access invalid (free) page
      The reason is that page allocations are not recorded during recovery and recovery might assume that the last state (incarnation) of the page was "free", even if a piece of blob was written to it.
      Solution is to record blob pages (data or overflow pages or written with earlyWrite flag) but do nothing in recovery except incrementing the page incarnation. This way the page cannot not be accidentially freed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72572

3131 Vladislav Vaintroub	2009-04-21
      Addendum to Bug #41840 Unbounded tablespace growth during recovery
      
      Problem: blob recovery might access invalid (free) page
      The reason is that page allocations are not recorded during recovery and recovery might assume that the last state (incarnation) of the page was "free", even if a piece of blob was written to it.
      Solution is to record blob pages (data or overflow pages or written with earlyWrite flag) but do nothing in recovery except incrementing the page incarnation. This way the page cannot not be accidentially freed.

Pushed into 6.0.11-alpha (revid:alik@sun.com-20090423071213-afmyrzvolemph7mz) (version source revid:hky@sun.com-20090421195958-j33v1cuo3yer9niu) (merge vers: 6.0.11-alpha) (pib:6)

An entry has been added to the 6.0.11 changelog: 

When creating an index on a Falcon table with a very large dataset, mysqld would crash.