Bug #39575 Falcon crashes during loading of large dataset
Submitted: 22 Sep 2008 9:27 Modified: 9 Jan 15:05
Reporter: Olav Sandstaa
Status: Closed
Category:Server: Falcon Severity:S2 (Serious)
Version:6.0.7 OS:Sun Solaris (Solaris 10)
Assigned to: Olav Sandstaa Target Version:6.0.8
Tags: F_PLATFORM
Triage: Triaged: D2 (Serious)

[22 Sep 2008 9:27] Olav Sandstaa
Description:
During loading of a large data set into Falcon the following crash occures:

/home1/dists/falc6.0.7early/libexec/mysqld'my_print_stacktrace+0x29 [0x13473a1]
/home1/dists/falc6.0.7early/libexec/mysqld'handle_segfault+0x1ee [0xb1de9e]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfffffd7ffed63086]
/lib/amd64/libc.so.1'call_user_handler+0x2ff [0xfffffd7ffed57b6f]
/lib/amd64/libc.so.1'_lwp_kill+0xa [0xfffffd7ffed6abfa]
/lib/amd64/libc.so.1'raise+0x19 [0xfffffd7ffed19969]
/lib/amd64/libc.so.1'abort+0x64 [0xfffffd7ffecf5aa4]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cH__CimplRdefault_terminate6F_v_+0x9
[0xfffffd7fff0d8a59]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cH__CimplMex_terminate6F_v_+0x2e
[0xfffffd7fff0d88fe]
/opt/SUNWspro/prod/usr/lib/amd64/libCrun.so.1'__1cG__CrunIex_throw6Fpvpkn0AQstatic_type_in
fo_pF1_v_v_+0x26 [0xfffffd7fff0d9ae6]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cJSerialLogLstartRecord6M_v_+0xe0
[0xcfc7b0]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cPSerialLogRecordLstartRecord6M_v_+0xd
[0xd86d1d]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cUSRLSavepointRollbackGappend6MIi_v_+0x4a
[0xd0722a]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cLTransactionRrollbackSavepoint6Mi_v_+0xac
[0xd294ec]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cPStorageDatabaseRsavepointRollback6MpnKConn
ection_i_i_+0x1d [0xd0c47d]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRStorageConnectionRsavepointRollback6Mi_i_+
0x19 [0xd0b119]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRStorageConnectionMrollbackVerb6M_v_+0x23
[0xd0b3a3]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cQStorageInterfaceIrollback6FpnKhandlerton_p
nDTHD_b_i_+0x47 [0xcde757]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cRha_rollback_trans6FpnDTHD_b_i_+0xe5
[0xbfd305]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cZha_autocommit_or_rollback6FpnDTHD_i_i_+0x2
5 [0xbfd445]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cQdispatch_command6FnTenum_server_command_pn
DTHD_pcI_b_+0xe88 [0xb29598]
/home1/dists/falc6.0.7early/libexec/mysqld'__1cKdo_command6FpnDTHD__b_+0xbc [0xb2861c]
/home1/dists/falc6.0.7early/libexec/mysqld'handle_one_connection+0xcc [0xb2763c]
/lib/amd64/libc.so.1'_thr_setup+0x89 [0xfffffd7ffed62d09]
/lib/amd64/libc.so.1'_lwp_start+0x0 [0xfffffd7ffed62fb0]
080918  2:31:21 - mysqld got signal 6 ;

Some observations:

1. The crash occurred with binaries compiled with Sun Studio. It seems not to happen when
compiling with gcc.

2. When the crash occurs the Falcon serial log file is at 4 GB. This is repeatable so it
seems like it is related to some 4 GB boundary. (when running with gcc it seems like the
serial log file do not grow to 4 GB?).

How to repeat:
Load a large amount of data, most likely without committing during the load. More details
will be provided later as this crash was reported by someone else.

Suggested fix:
Falcon should not crash during loading of large data. Preferable it should succeed loading
data, but if error occurs it should handle this more gracefully.
[22 Sep 2008 9:38] Olav Sandstaa
By looking at the call stack and the source code it seems like there migth be several
different problems/bugs here:

1. The crash seems to occur due to Falcon throwing an exception that seems to not be
handled/catched anywhere in the following code in falcon/storage/SerialLog.cpp:

void SerialLog::startRecord()
{
   ASSERT(!recovering);

   if (writeError)
       throw SQLError(IO_ERROR, "Previous I/O error on serial log prevents further
processing");

   if (writePtr == writeBlock->data)
       putVersion();

   recordStart = writePtr;
   recordIncomplete = true;
}

2. This happens since the writeError varialble have been set to true. This happens
probably as a result of a write operation returning error. Which again probably (based on
seeing the size of the serial log file being 4 GB) is due to some "4 GB file limitation".
I think this is a bug too (unless we have a hard limit here - I have not checked yet).

3. Difference in serial log file size on Sun Studio and gcc: sounds strange and should be
investigated
[26 Sep 2008 11:37] Olav Sandstaa
While trying to reproduce this crash I run into an ASSERT that also occurred just about
when the serial log file had reached 4 GB. This crash is reported as bug 39661. It may be
caused by the same problem as in this crash.
[30 Sep 2008 9:16] Olav Sandstaa
I am able to reproduce this crash by configuring Falcon with a 14 GB record cache and 8 GB
page cache (it can probably also be reproduced with lower values than this) and then
having a client inserting records of 1004 bytes as a single transaction until the crash
occurs. The crash happens when the serial log file has  reached 4 GB. 

Verified running on Solaris 10 with Falcon compiled with Sun Studio 12.

I will attach my test client and the call stack from the crash.
[30 Sep 2008 9:23] Olav Sandstaa
Test client that inserts 100 million records of 1004 bytes in a single transaction

Attachment: LoadData.java (text/x-java), 4.03 KiB.

[30 Sep 2008 9:25] Olav Sandstaa
Call stack from the crash:

current thread: t@24
  [1] __lwp_kill(0x0, 0x6, 0xffffffff7dceef40, 0x1015bbb1c, 0xffffffff7d205400, 0x4), at
0xffffffff7dbce4b0
=>[2] my_write_core(sig = 6), line 307 in "stacktrace.c"
  [3] handle_segfault(sig = 6), line 2660 in "mysqld.cc"
  [4] __sighndlr(0x6, 0x0, 0xffffffff7aa3ddf0, 0x1003e5748, 0x0, 0x0), at
0xffffffff7dbcd3b4
  ---- called from signal handler with signal 6 (SIGABRT) ------
  [5] __lwp_kill(0x0, 0x6, 0xffffffff7f60512c, 0xffffffff7db48a04, 0x5,
0xffffffff7dcf2e78), at 0xffffffff7dbce4b0
  [6] raise(0x6, 0x6, 0x5, 0xffffffff7dcef1c0, 0xffffffff7d205400, 0x5), at
0xffffffff7db6b25c
  [7] abort(0xffffffff7df11e98, 0xffffffff7df0cc38, 0xffffffff7de06a10, 0x19b6c0, 0x2,
0x0), at 0xffffffff7db48a04
  [8] __Cimpl::default_terminate(0xffffffff7df11e98, 0xffffffff7df0cc38,
0xffffffff7df11e78, 0x10604c, 0x104e40, 0xffffffff7de064d0), at 0xffffffff7de064d4
  [9] __Cimpl::ex_terminate(0xffffffff7df0de40, 0x0, 0x0, 0xffffffff7df0de40,
0xffffffff7df0c978, 0x1), at 0xffffffff7de062b4
  [10] _ex_throw_body(0xffffffff7df0de40, 0x0, 0x0, 0x105af4, 0x3c, 0x100f741a8), at
0xffffffff7de06f04
  [11] __Crun::ex_throw(0xffffffff7df0dec8, 0x1010450d0, 0x100776318, 0xffffffff7df0de40,
0xb8, 0xffffffff7df0de40), at 0xffffffff7de06e58
  [12] SerialLog::startRecord(this = 0x101723298), line 709 in "SerialLog.cpp"
  [13] SerialLogRecord::startRecord(this = 0x1017280a8), line 224 in
"SerialLogRecord.cpp"
  [14] SRLSavepointRollback::append(this = 0x1017280a8, transactionId = 23U, savepointId =
4353646), line 31 in "SRLSavepointRollback.cpp"
  [15] Transaction::rollbackSavepoint(this = 0x3365db570, savePointId = 4353646), line
1233 in "Transaction.cpp"
  [16] StorageDatabase::savepointRollback(this = 0x1017eb318, connection = 0x101815dd8,
savePoint = 4353646), line 519 in "StorageDatabase.cpp"
  [17] StorageConnection::savepointRollback(this = 0x101815d40, savePoint = 4353646), line
213 in "StorageConnection.cpp"
  [18] StorageConnection::rollbackVerb(this = 0x101815d40), line 346 in
"StorageConnection.cpp"
  [19] StorageInterface::rollback(hton = 0x1022186b0, thd = 0x338e29040, all = false),
line 1254 in "ha_falcon.cpp"
  [20] ha_rollback_trans(thd = 0x338e29040, all = false), line 1241 in "handler.cc"
  [21] trans_rollback_stmt(thd = 0x338e29040), line 214 in "transaction.cc"
  [22] dispatch_command(command = COM_QUERY, thd = 0x338e29040, packet = 0x3392f76e1
"INSERT INTO TESTDATA VALUES(4353646, 'Olav was here Olav was here Olav was here Olav was
here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav was here
Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was here Olav
was here Olav was here Olav was here Olav was here Olav was hereOlav was here Olav was
here Olav was here Olav was here Olav was here Olav was here Olav was hereOlav was here
Olav was here Olav was here Olav was here Olav was here Olav was here Ol" ...,
packet_length = 1010U), line 1367 in "sql_parse.cc"
  [23] do_command(thd = 0x338e29040), line 688 in "sql_parse.cc"
  [24] handle_one_connection(arg = 0x338e29040), line 1153 in "sql_connect.cc"
[30 Sep 2008 11:02] Olav Sandstaa
Initial analysis/guessing on what happens: 

1. During insert of data the serial log is written. When the size of the serial log file
hits 4 GB an "error condition" (will investigate this further) occurs when calling (in
SerialLog.cpp):

   flushWindow->write(flushBlock);

and we set the status of the SerialLog object to:

   writeError = true;

2. This information is likely returned to the server which decides that we should rollback
to the latest savepoint (see call stack).

3. When Falcon does the rollback to savepoint it tries to allocate a SRLSavepointRollback
on the SerialLog. This fails due to the status of the SerialLog object is "writeError ==
true", see the following code in SerialLog::startRecord():

    	if (writeError)
		throw SQLError(IO_ERROR, "Previous I/O error on serial log prevents further
processing");

4. The excpection is never caught anywhere so the server crashes.

Will update with information about what causes the status of the SerialLog object to be
set to state "writeError == true" when hitting 4 GB file size. Need to added tracing of
this to the code that do the writing of the serial log file.
[1 Oct 2008 15:47] Olav Sandstaa
The initial cause for the problem writing to the serial log (and for the serial log object
getting status "writeError == true") is the following code in the start of
SerialLogWindow::write:

  uint32 length = ROUNDUP(block->length, sectorSize);
  uint32 offset = (int) (origin + ((UCHAR*) block - buffer));

The offset variable (which is the offset we should write to in the serial log file) wraps
around when reaching 4 GB. Changing the type to a 64 bit variable should hopefully fix
this.

Based on this we can also rule out that this was a "Sun Studio and Solaris" only problem.
We would have hit this bug on any system where the serial log file grows past 4 GB.
[2 Oct 2008 11:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/55023

2845 Olav Sandstaa	2008-10-02
      Fix for Bug#39575 Falcon crashes during loading of large dataset
      
      Changed datatype for the offset within the serial log file from 32 bit to 64 bit
      in order to handle that the serial log file can grow beyond 4 GB.
[2 Oct 2008 11:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/55027

2845 Olav Sandstaa	2008-10-02
      Fix for Bug#39575 Falcon crashes during loading of large dataset
      
      Changed datatype for the offset within the serial log file from 32 bit to 64 bit
      in order to handle that the serial log file can grow beyond 4 GB.
[2 Oct 2008 11:23] Olav Sandstaa
With the patch my test client (attached as file to this bug) is able to run much further
than without the patch. When I finally killed it the serial log file had reached 12 GB.
[9 Jan 15:05] MC Brown
A note has been added to the 6.0.8 changelog: 

When loading very large datasets into a Falcon table, MySQL may crash because the size of
the Falcon serial log exceeds 4GB. The maximum supported size of the serial log file has
been increased from a 32-bit to a 64-bit integer to handle larger log file sizes.