Bug #10058 ndb_select_count crashes cluster (in dbtup) after system restart
Submitted: 21 Apr 2005 12:13
Modified: 13 Jun 2005 15:02
Reporter: Johan Andersson
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 4.1, 5.0
OS: Linux (RHEL 4 (64-bit Opteron))
Assigned to: Jonas Oreland
CPU Architecture: Any

[21 Apr 2005 12:13] Johan Andersson
Description:
After a system restart (following a "shutdown" from ndb_mgm), ndb_select_count crashes all ndb nodes.
The cluster had 8 ndbd nodes and was populated with 1M rows.

How to repeat:
Run the attached test program to recreate the table and populate it with data (1M rows).

The script go7 recreates the table.
The script go8 populates the table.

ndbtest.cpp has the source that recreates and populates the table.

ndb_mgm -e "shutdown"

[system restart]

ndb_select_count

Suggested fix:
-
[21 Apr 2005 12:21] Johan Andersson
Mailing test program separately due to silly 200K limit.
[21 Apr 2005 12:25] Johan Andersson
trace

Attachment: ndb_5_dbtup-bug-1-2.zip (application/x-zip-compressed, text), 91.19 KiB.

[21 Apr 2005 12:35] Johan Andersson
Has the system restart corrupted the data?
[21 Apr 2005 13:56] Martin Skold
The same crash was seen in bug #10001.
Was this also after a system restart?

Date/Time: Thursday 21 April 2005 - 05:37:41
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 604) 0x0000000a
ProgramName: /opt/atse/cluster/mysql/bin/ndbd
ProcessID: 10876
TraceFile: /opt/atse/cluster/data_ndb/ndb_2_trace.log.1
***EOM***
[21 Apr 2005 14:19] Martin Skold
Also, is this related to the large number of records?
Have you tried with a smaller database?
[21 Apr 2005 14:22] Martin Skold
    } else if ((loopOpPtr.p->optype == ZDELETE) &&
               (loopOpPtr.p->prevActiveOp == RNIL)) {
      jam();
//----------------------------------------------------------------------
// There was only a delete. The original tuple still is ok.
//----------------------------------------------------------------------
    } else {
      jam();
//----------------------------------------------------------------------
// There was another operation after the delete, this must be an insert
// and we have found our copy tuple there.
//----------------------------------------------------------------------
      loopOpPtr.i = loopOpPtr.p->prevActiveOp;
      ptrCheckGuard(loopOpPtr, cnoOfOprec, operationrec); // <== crashes here

Could it be that it is a DELETE, but prevActiveOp is not set to RNIL
correctly during system restart?
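
To make the hypothesis above concrete, here is a minimal, self-contained sketch of the suspected failure mode. It is not NDB source code: OperationRec, kNil, checkedGet and findCopyTupleOp are hypothetical stand-ins invented for illustration, the sentinel value is arbitrary, and the bounds check throws an exception here where the real guard stops the node. The branch structure is simplified from the excerpt quoted above.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only; not the NDB definitions.
constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // sentinel: "no previous operation"

enum class OpType { Insert, Delete };

struct OperationRec {
  OpType optype;
  std::uint32_t prevActiveOp;  // index into the record pool, or kNil
};

// Bounds-checked lookup, loosely modelled on the idea of a pointer guard.
// This sketch throws; it does not try to reproduce the real guard's behaviour.
const OperationRec& checkedGet(const std::vector<OperationRec>& pool,
                               std::uint32_t i) {
  if (i >= pool.size()) {
    throw std::out_of_range("Pointer too large");
  }
  return pool[i];
}

// Returns the index of the operation assumed to hold the copy tuple,
// or kNil when there was only a delete and the original tuple is still valid.
std::uint32_t findCopyTupleOp(const std::vector<OperationRec>& pool,
                              std::uint32_t opIndex) {
  const OperationRec& op = checkedGet(pool, opIndex);
  if (op.optype == OpType::Delete && op.prevActiveOp == kNil) {
    return kNil;  // only a delete: nothing to follow
  }
  // Otherwise follow the link. If a restart left a stale value in
  // prevActiveOp instead of the sentinel, the check below fails.
  checkedGet(pool, op.prevActiveOp);
  return op.prevActiveOp;
}
```

In this model, a DELETE record whose prevActiveOp is kNil returns early, while a DELETE record carrying a leftover out-of-range index reaches the second checkedGet call and fails there, which would match a crash at the guard rather than in the delete-only branch.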
[22 Apr 2005 5:51] Martin Skold
Need table definition
[22 Apr 2005 8:09] Johan Andersson
We have reproduced this with 1M rows in db (small) and 50M rows in db (large).
[22 Apr 2005 8:35] Johan Andersson
Yes, it was after a system restart.
Whether the data is corrupt is one of the questions...
[9 Jun 2005 5:26] Jonas Oreland
Pushed to 4.1.13 and 5.0.8
[13 Jun 2005 15:02] Jon Stephens
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

Documented in Change History for versions 4.1.13, 5.0.8.