Bug #39130 Unbounded serial log growth with online ALTER
Submitted: 29 Aug 2008 18:08 Modified: 15 May 2009 15:50
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S2 (Serious)
Version:6.0-falcon-team OS:Any
Assigned to: Vladislav Vaintroub CPU Architecture:Any
Tags: F_DDL
Triage: Triaged: D2 (Serious)

[29 Aug 2008 18:08] Philip Stoev
Description:
When executing the falcon_online_alter Random Query Generator test, it is observed that one of the Falcon serial logs grows to at least 500Mb of size. No log rotation or log truncation are observed.

The test operates on tables with a maximum size of 1000 records. Due to the frequent use of ALTER TABLE, which causes an implicit transaction commit, there are no long-running transactions in on the server.

Once the clients are disconnected, the log file remains as it is. At that time, GDB shows that all threads, including the gophers, are sleeping.

How to repeat:
Instructions will be provided shortly.

Suggested fix:
Prevent unbounded serial log file growth. No combination of statements, except for long-running transactions, should prevent the serial log from being rotated and truncated.
[29 Aug 2008 18:17] Philip Stoev
To reproduce, please clone the mysql-test-extra-6.0 tree and run:

$ cd mysql-test/gentest
$ perl runall.pl \
  --basedir=/path/to/6.0-falcon-team \ 
  --engine=falcon \
  --grammar=conf/falcon_online_alter.yy \
  --mysqld=--falcon-lock-wait-timeout=1 \
  --queries=100000

You will observe that after a few initial serial log rotations, one of the serial logs proceeds to grow to sizes over 20 times the size of the other log file and over 10 times the size of the falcon tablespace.

Stopping the clients, leaving the server idle for a few min and then restarting the test does not trigger rotation. The same log continues to grow.
[29 Aug 2008 18:28] Philip Stoev
* This was observed both on Linux and on Sparc
* Removing all ALTERs from the query mix removes the problem
* Making all ALTERs OFFLINE causes the problem to go away.
[29 Aug 2008 18:43] Philip Stoev
mysql> select * from FALCON_SERIAL_LOG_INFO;
+--------------+--------+---------+---------+
| TRANSACTIONS | BLOCKS | WINDOWS | BUFFERS |
+--------------+--------+---------+---------+
|         1789 |   1739 |     132 |      20 |
+--------------+--------+---------+---------+
1 row in set (0.00 sec)

mysql> select * from FALCON_TRANSACTION_SUMMARY;
+-----------+-------------+--------+----------------+--------------------+
| COMMITTED | ROLLED_BACK | ACTIVE | PENDING_COMMIT | PENDING_COMPLETION |
+-----------+-------------+--------+----------------+--------------------+
|     85051 |        4212 |      5 |              0 |                  0 |
+-----------+-------------+--------+----------------+--------------------+
1 row in set (0.04 sec)
[29 Aug 2008 19:10] Philip Stoev
Recovery on a 100Mb serial log file produced with this method asserted shortly after takeoff with the stack trace below. Further recovery attempts crashed the same way.

./mysqld(Error::debugBreak()+0x12)[0x858a73c]
./mysqld(Error::error(char const*, ...)+0x7b)[0x858a7b9]
./mysqld(Cache::fetchPage(Dbb*, int, PageType, LockType)+0x3cc)[0x855f36c]
./mysqld(Dbb::fetchPage(int, PageType, LockType)+0x48)[0x85803b6]
./mysqld(RecordLocatorPage::deleteDataPages(Dbb*, unsigned int)+0x4f)[0x8631725]
./mysqld(Section::deleteSectionLevel(Dbb*, int, unsigned int)+0x1b5)[0x85d8261]
./mysqld(Section::deleteSection(Dbb*, int, unsigned int)+0x193)[0x85d98cb]
./mysqld(Dbb::deleteSection(int, unsigned int)+0xac)[0x857fc8e]
./mysqld(SRLDropTable::redo()+0x75)[0x85ff509]
./mysqld(SerialLog::recover()+0x9eb)[0x85e8819]
./mysqld(Database::openDatabase(char const*)+0x417)[0x857a6bf]
./mysqld(Connection::getDatabase(char const*, char const*, Threads*)+0x14a)[0x856be16]
./mysqld(Connection::openDatabase(char const*, char const*, char const*, char const*, char const*, Threads*)+0x1c1)[0x856d059]
./mysqld(StorageDatabase::getOpenConnection()+0x86)[0x852a6d4]
./mysqld(StorageHandler::initialize()+0xa7)[0x852cfbd]
./mysqld(StorageInterface::falcon_init(void*)+0x1db)[0x8524a51]
./mysqld(ha_initialize_handlerton(st_plugin_int*)+0xb3)[0x83dfc77]

Messages in log:
[Falcon] Error: page 367/1 wrong page type, expected 7 got 5
Bugcheck: page 367/1 wrong page type, expected 7 got 5

and then

[Falcon] Error: assertion (lockState == 0) failed at line 131 in file SyncObject.cpp
Bugcheck: assertion (lockState == 0) failed at line 131 in file SyncObject.cpp
[18 Dec 2008 11:53] Vladislav Vaintroub
There is  currently a valid reason for serial log files to grow beyond limits (this might be an architectural flaw, but it is what it is). If Falcon backend(gopher threads) are too slow compared to the Falcon frontend("client" threads), serial log will grow until gophers catch up. I'm not sure how to solve this, one way might be just to monitor file sizes and the number of committed not write complete transactions and when serial log grows beyond limits, just wait until gophers are finished.
[17 Mar 2009 0:58] Kevin Lewis
Assigning this back to Vlad to fix.  Please document how it is that the log file cannot be switched while there is still work to do by the gophers.  This is what you implied in the last comments.
[20 Mar 2009 22:30] Ann Harrison
The serial logs can switch once, even if the gophers get terribly behind.  That leaves the gophers working in fl2 (for example) while new transactions write to fl1.  No further switch can occur until the gophers finish with fl2.
[23 Mar 2009 14:15] Ann Harrison
For what it's worth, there is a situation where we will get
unbounded serial log growth.  When two-phase commit is enabled,
the information necessary to commit a prepared transaction is
stored in the serial log and must be kept until that transaction
commits or rolls back.  So, by preparing a transaction that did
updates, quitting, and never reconnecting, a connection can
block rotation of the serial logs. 

What will happen is that the log to which the the prepared
transaction was written will move from active to reserve when
the prior reserve has been emptied.  The old reserve log file
will then remain active until the prepared transaction is
resolved.

Suggestions:  Easy - emit an error periodically that there is
an prepared and unresolved transaction.  Harder:  after the 
reserve log has been emptied of everything but prepared transactions
extract them to a separate file and return the previous reserved
log to active rotation.  On commit or rollback of the prepared
transaction(s) remove them from the "prepared" log.  On recovery,
read the prepared log with everything else to restore those
transactions to their indeterminate state. (Does this work?
I think so...)
[9 Apr 2009 7:32] Philip Stoev
I am still able to repeat this bug, as follows:

1. Run a DDL scenario for 5 min:

perl runall.pl --basedir=/build/bzr/6.0-falcon-team --engine=falcon   --grammar=conf/falcon_online_alter.yy   --mysqld=--falcon-lock-wait-timeout=1   --queries=10000000 --duration=300

2. Observe that only one of the serial logs will grow

3. Hit Ctrl+C, wait any ammount of time, observe that mysqld is idle, gdb shows all threads are idle, SHOW PROCESSLIST shows no active transactions and yet the serial logs will not shrink.

4. Run a DML scenario agains the same server for 1 hour:

perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=19306:user=root:database=test --grammar=conf/combinations.yy --queries=100000000 --threads=10 --duration=3600

6. Observe that the same serial log continues to grow and no rotation takes place.

7. Hit Ctrl+C to terminate the test, wait, observe that mysqld is idle, gdb shows all threads are idle, SHOW PROCESSLIST shows no active transactions, I_S.FALCON_TRANSACTIONS is empty and yet the serial logs did not get rotated or truncated.
[10 Apr 2009 14:22] Philip Stoev
The tablespace and serial logs that were produced by the following procedure:

5 min of ONLINE ALTER queries  -> server is idle -> clean shutdown

have been uploaded here:

http://mysql-systemqa.s3.amazonaws.com/var-bug39130.zip
[13 Apr 2009 18:04] Kevin Lewis
Ann wrote; The problem appears to be in these lines of code:

    Transaction *sysTransaction = database->getSystemTransaction();
    table->dropIndex(name, sysTransaction);
-    database->commitSystemTransaction();

    Index::deleteIndex (database, schema, name);
    database->commitSystemTransaction(); 

 The first commit is what keeps the serial log from rotating after a delete index. The SRLDeleteIndex record goes into the log with a transaction id, but no commit SRL entry is ever included for that operation because its done by a system transaction that is committed without having been marked as having done updates.  Without the first commit, everything seems to work in terms of
rotating the log.
[15 Apr 2009 16:59] Bugs System
Pushed into 6.0.11-alpha (revid:hky@sun.com-20090415164923-9arx29ye5pzxd4zf) (version source revid:hky@sun.com-20090414213212-ibcjs12ic1v2h0e5) (merge vers: 6.0.11-alpha) (pib:6)
[15 May 2009 15:50] MC Brown
An entry has been added to the 6.0.11 changelog: 

Performing an online ALTER TABLE statement against a Falcon table, the Falcon serial log could grow beyond the maximum permitted size for a serial log, ignoring both the rotation and truncation