Bug #39575 Falcon crashes during loading of large dataset
Submitted: 22 Sep 2008 7:27 Modified: 9 Jan 2009 14:05
Reporter: Olav Sandstå Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S2 (Serious)
Version:6.0.7 OS:Solaris (Solaris 10)
Assigned to: Olav Sandstå CPU Architecture:Any
Tags: F_PLATFORM

[22 Sep 2008 7:27] Olav Sandstå
Description:
During loading of a large data set into Falcon the following crash occures:

/home1/dists/falc6.0.7early/libexec/mysqld'my_print_stacktrace+0x29 [0x13473a1]
/home1/dists/falc6.0.7early/libexec/mysqld'handle_segfault+0x1ee [0xb1de9e]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7ffed63086]
/lib/amd64/libc.so.1'call_user_handler+0x2ff [0xfffffd7ffed57b6f]
/lib/amd64/libc.so.1'_lwp_kill+0xa [0xfffffd7ffed6abfa]
/lib/amd64/libc.so.1'raise+0x19 [0xfffffd7ffed19969]
/lib/amd64/libc.so.1'abort+0x64 [0xfffffd7ffecf5aa4]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cH__CimplRdefault_terminate6F_v_+0x9 [0xfffffd7fff0d8a59]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cH__CimplMex_terminate6F_v_+0x2e [0xfffffd7fff0d88fe]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cG__CrunIex_throw6Fpvpkn0AQstatic_type_info_pF1_v_v_+0x26 [0xfffffd7fff0d9ae6]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cJSerialLogLstartRecord6M_v_+0xe0 [0xcfc7b0]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cPSerialLogRecordLstartRecord6M_v_+0xd [0xd86d1d]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cUSRLSavepointRollbackGappend6MIi_v_+0x4a [0xd0722a]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cLTransactionRrollbackSavepoint6Mi_v_+0xac [0xd294ec]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cPStorageDatabaseRsavepointRollback6MpnKConnection_i_i_+0x1d [0xd0c47d]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRStorageConnectionRsavepointRollback6Mi_i_+0x19 [0xd0b119]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRStorageConnectionMrollbackVerb6M_v_+0x23 [0xd0b3a3]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cQStorageInterfaceIrollback6FpnKhandlerton_pnDTHD_b_i_+0x47 [0xcde757]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRha_rollback_trans6FpnDTHD_b_i_+0xe5 [0xbfd305]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cZha_autocommit_or_rollback6FpnDTHD_i_i_+0x25 [0xbfd445]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cQdispatch_command6FnTenum_server_command_pnDTHD_pcI_b_+0xe88 [0xb29598]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cKdo_command6FpnDTHD__b_+0xbc [0xb2861c]
/home1/dists/falc6.0.7early/libexec/mysqld'handle_one_connection+0xcc [0xb2763c]
/lib/amd64/libc.so.1'_thr_setup+0x89 [0xfffffd7ffed62d09]
/lib/amd64/libc.so.1'_lwp_start+0x0 [0xfffffd7ffed62fb0]
080918  2:31:21 - mysqld got signal 6 ;

Some observations:

1. The crash occurred with binaries compiled with Sun Studio. It seems not to happen when compiling with gcc.

2. When the crash occurs the Falcon serial log file is at 4 GB. This is repeatable so it seems like it is related to some 4 GB boundary. (when running with gcc it seems like the serial log file do not grow to 4 GB?).

How to repeat:
Load a large amount of data, most likely without committing during the load. More details will be provided later as this crash was reported by someone else.

Suggested fix:
Falcon should not crash during loading of large data. Preferable it should succeed loading data, but if error occurs it should handle this more gracefully.
[22 Sep 2008 7:38] Olav Sandstå
By looking at the call stack and the source code it seems like there migth be several different problems/bugs here:

1. The crash seems to occur due to Falcon throwing an exception that seems to not be handled/catched anywhere in the following code in falcon/storage/SerialLog.cpp:

void SerialLog::startRecord()
{
   ASSERT(!recovering);

   if (writeError)
       throw SQLError(IO_ERROR, "Previous I/O error on serial log prevents further processing");

   if (writePtr == writeBlock->data)
       putVersion();

   recordStart = writePtr;
   recordIncomplete = true;
}

2. This happens since the writeError varialble have been set to true. This happens probably as a result of a write operation returning error. Which again probably (based on seeing the size of the serial log file being 4 GB) is due to some "4 GB file limitation". I think this is a bug too (unless we have a hard limit here - I have not checked yet).

3. Difference in serial log file size on Sun Studio and gcc: sounds strange and should be investigated
[26 Sep 2008 9:37] Olav Sandstå
While trying to reproduce this crash I run into an ASSERT that also occurred just about when the serial log file had reached 4 GB. This crash is reported as bug 39661. It may be caused by the same problem as in this crash.
[30 Sep 2008 7:16] Olav Sandstå
I am able to reproduce this crash by configuring Falcon with a 14 GB record cache and 8 GB page cache (it can probably also be reproduced with lower values than this) and then having a client inserting records of 1004 bytes as a single transaction until the crash occurs. The crash happens when the serial log file has  reached 4 GB. 

Verified running on Solaris 10 with Falcon compiled with Sun Studio 12.

I will attach my test client and the call stack from the crash.
[30 Sep 2008 7:23] Olav Sandstå
Test client that inserts 100 million records of 1004 bytes in a single transaction

Attachment: LoadData.java (text/x-java), 4.03 KiB.

[30 Sep 2008 7:25] Olav Sandstå
Call stack from the crash:

current thread: t@24
  [1] __lwp_kill(0x0, 0x6, 0xffffffff7dceef40, 0x1015bbb1c, 0xffffffff7d205400, 0x4), at 0xffffffff7dbce4b0
=>[2] my_write_core(sig = 6), line 307 in "stacktrace.c"
  [3] handle_segfault(sig = 6), line 2660 in "mysqld.cc"
  [4] __sighndlr(0x6, 0x0, 0xffffffff7aa3ddf0, 0x1003e5748, 0x0, 0x0), at 0xffffffff7dbcd3b4
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [5] __lwp_kill(0x0, 0x6, 0xffffffff7f60512c, 0xffffffff7db48a04, 0x5, 0xffffffff7dcf2e78), at 0xffffffff7dbce4b0
  [6] raise(0x6, 0x6, 0x5, 0xffffffff7dcef1c0, 0xffffffff7d205400, 0x5), at 0xffffffff7db6b25c
  [7] abort(0xffffffff7df11e98, 0xffffffff7df0cc38, 0xffffffff7de06a10, 0x19b6c0, 0x2, 0x0), at 0xffffffff7db48a04
  [8] __Cimpl::default_terminate(0xffffffff7df11e98, 0xffffffff7df0cc38, 0xffffffff7df11e78, 0x10604c, 0x104e40, 0xffffffff7de064d0), at 0xffffffff7de064d4
  [9] __Cimpl::ex_terminate(0xffffffff7df0de40, 0x0, 0x0, 0xffffffff7df0de40, 0xffffffff7df0c978, 0x1), at 0xffffffff7de062b4
  [10] _ex_throw_body(0xffffffff7df0de40, 0x0, 0x0, 0x105af4, 0x3c, 0x100f741a8), at 0xffffffff7de06f04
  [11] __Crun::ex_throw(0xffffffff7df0dec8, 0x1010450d0, 0x100776318, 0xffffffff7df0de40, 0xb8, 0xffffffff7df0de40), at 0xffffffff7de06e58
  [12] SerialLog::startRecord(this = 0x101723298), line 709 in "SerialLog.cpp"
  [13] SerialLogRecord::startRecord(this = 0x1017280a8), line 224 in "SerialLogRecord.cpp"
  [14] SRLSavepointRollback::append(this = 0x1017280a8, transactionId = 23U, savepointId = 4353646), line 31 in "SRLSavepointRollback.cpp"
  [15] Transaction::rollbackSavepoint(this = 0x3365db570, savePointId = 4353646), line 1233 in "Transaction.cpp"
  [16] StorageDatabase::savepointRollback(this = 0x1017eb318, connection = 0x101815dd8, savePoint = 4353646), line 519 in "StorageDatabase.cpp"
  [17] StorageConnection::savepointRollback(this = 0x101815d40, savePoint = 4353646), line 213 in "StorageConnection.cpp"
  [18] StorageConnection::rollbackVerb(this = 0x101815d40), line 346 in "StorageConnection.cpp"
  [19] StorageInterface::rollback(hton = 0x1022186b0, thd = 0x338e29040, all = false), line 1254 in "ha_falcon.cpp"
  [20] ha_rollback_trans(thd = 0x338e29040, all = false), line 1241 in "handler.cc"
  [21] trans_rollback_stmt(thd = 0x338e29040), line 214 in "transaction.cc"
  [22] dispatch_command(command = COM_QUERY, thd = 0x338e29040, packet = 0x3392f76e1 "INSERT INTO TESTDATA VALUES(4353646, 'Olav was here Olav was here Olav was here Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav was here Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav was here Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav was here Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav was here Olav was here Olav was here Olav was here Ol" ..., packet_length = 1010U), line 1367 in "sql_parse.cc"
  [23] do_command(thd = 0x338e29040), line 688 in "sql_parse.cc"
  [24] handle_one_connection(arg = 0x338e29040), line 1153 in "sql_connect.cc"
[30 Sep 2008 9:02] Olav Sandstå
Initial analysis/guessing on what happens: 

1. During insert of data the serial log is written. When the size of the serial log file hits 4 GB an "error condition" (will investigate this further) occurs when calling (in SerialLog.cpp):

   flushWindow->write(flushBlock);

and we set the status of the SerialLog object to:

   writeError = true;

2. This information is likely returned to the server which decides that we should rollback to the latest savepoint (see call stack).

3. When Falcon does the rollback to savepoint it tries to allocate a SRLSavepointRollback on the SerialLog. This fails due to the status of the SerialLog object is "writeError == true", see the following code in SerialLog::startRecord():

    	if (writeError)
		throw SQLError(IO_ERROR, "Previous I/O error on serial log prevents further processing");

4. The excpection is never caught anywhere so the server crashes.

Will update with information about what causes the status of the SerialLog object to be set to state "writeError == true" when hitting 4 GB file size. Need to added tracing of this to the code that do the writing of the serial log file.
[1 Oct 2008 13:47] Olav Sandstå
The initial cause for the problem writing to the serial log (and for the serial log object getting status "writeError == true") is the following code in the start of SerialLogWindow::write:

  uint32 length = ROUNDUP(block->length, sectorSize);
  uint32 offset = (int) (origin + ((UCHAR*) block - buffer));

The offset variable (which is the offset we should write to in the serial log file) wraps around when reaching 4 GB. Changing the type to a 64 bit variable should hopefully fix this.

Based on this we can also rule out that this was a "Sun Studio and Solaris" only problem. We would have hit this bug on any system where the serial log file grows past 4 GB.
[2 Oct 2008 9:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/55023

2845 Olav Sandstaa	2008-10-02
      Fix for Bug#39575 Falcon crashes during loading of large dataset
      
      Changed datatype for the offset within the serial log file from 32 bit to 64 bit
      in order to handle that the serial log file can grow beyond 4 GB.
[2 Oct 2008 9:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/55027

2845 Olav Sandstaa	2008-10-02
      Fix for Bug#39575 Falcon crashes during loading of large dataset
      
      Changed datatype for the offset within the serial log file from 32 bit to 64 bit
      in order to handle that the serial log file can grow beyond 4 GB.
[2 Oct 2008 9:23] Olav Sandstå
With the patch my test client (attached as file to this bug) is able to run much further than without the patch. When I finally killed it the serial log file had reached 12 GB.
[9 Jan 2009 14:05] MC Brown
A note has been added to the 6.0.8 changelog: 

When loading very large datasets into a Falcon table, MySQL may crash because the size of the Falcon serial log exceeds 4GB. The maximum supported size of the serial log file has been increased from a 32-bit to a 64-bit integer to handle larger log file sizes.