MySQL Bugs: #10058: ndb_select_count crashes cluster (in dbtup) after system restart

Bug #10058	ndb_select_count crashes cluster (in dbtup) after system restart
Submitted:	21 Apr 2005 12:13	Modified:	13 Jun 2005 15:02
Reporter:	Johan Andersson
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	4.1,5.0	OS:	Linux (RHEL 4 (64-bit opteron))
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
After a system restart (following a "shutdown" from ndb_mgm), ndb_select_count crashes all ndb nodes.
There were 8 ndbd nodes.
Populated with 1M rows.

How to repeat:
Run the attached test program to recreate the table and populate it with data (1M rows).

The script go7 recreates the table.
The script go8 populates the table.

ndbtest.cpp has the source that recreates and populates the table.

ndb_mgm -e "shutdown"

[system restart]

ndb_select_count

Suggested fix:
-

Mailing test program separately due to silly 200K limit.

trace

Attachment: ndb_5_dbtup-bug-1-2.zip (application/x-zip-compressed, text), 91.19 KiB.

Has the system restart corrupted the data?

The same crash was seen in bug#10001.
Was that also after a system restart?

Date/Time: Thursday 21 April 2005 - 05:37:41
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 604) 0x0000000a
ProgramName: /opt/atse/cluster/mysql/bin/ndbd
ProcessID: 10876
TraceFile: /opt/atse/cluster/data_ndb/ndb_2_trace.log.1
***EOM***

Also, is this related to the large number of records?
Have you tried with a smaller database?

    } else if ((loopOpPtr.p->optype == ZDELETE) &&
               (loopOpPtr.p->prevActiveOp == RNIL)) {
      jam();
//----------------------------------------------------------------------
// There was only a delete. The original tuple still is ok.
//----------------------------------------------------------------------
    } else {
      jam();
//----------------------------------------------------------------------
// There was another operation after the delete, this must be an insert
// and we have found our copy tuple there.
//----------------------------------------------------------------------
      loopOpPtr.i = loopOpPtr.p->prevActiveOp;
      ptrCheckGuard(loopOpPtr, cnoOfOprec, operationrec); // <== crashes here

Could it be that it is a DELETE, but prevActiveOp is not set to RNIL
correctly during system restart?
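
To make that hypothesis concrete, here is a minimal standalone sketch of how a stale prevActiveOp could end up in a "Pointer too large" style failure. It is not the DBTUP code. OperationRec, checkedLookup and followPrev are hypothetical names, the RNIL value is a stand-in chosen for illustration, and the bounds check only approximates what a guarded pointer lookup such as ptrCheckGuard presumably does.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in constants. The names mirror the snippet above; the values are
// assumptions for this sketch, not copied from the NDB headers.
constexpr std::uint32_t RNIL = 0xffffff00u;
enum OpType : std::uint32_t { ZREAD, ZINSERT, ZUPDATE, ZDELETE };

// Hypothetical, heavily reduced operation record.
struct OperationRec {
  OpType optype;
  std::uint32_t prevActiveOp;  // index of the previous op, or RNIL if none
};

// Simplified model of a guarded lookup: reject any index outside the pool.
const OperationRec& checkedLookup(const std::vector<OperationRec>& pool,
                                  std::uint32_t i) {
  if (i >= pool.size()) {
    throw std::out_of_range("Pointer too large");
  }
  return pool[i];
}

// Returns false when the operation is a lone delete (original tuple is fine),
// true when a previous operation was found. Throws if prevActiveOp holds an
// index that does not refer to a record in the pool.
bool followPrev(const std::vector<OperationRec>& pool, std::uint32_t start) {
  assert(!pool.empty());
  const OperationRec& op = checkedLookup(pool, start);
  if (op.optype == ZDELETE && op.prevActiveOp == RNIL) {
    return false;  // only a delete, nothing to follow
  }
  // Any other case follows prevActiveOp; a stale value fails the guard here.
  checkedLookup(pool, op.prevActiveOp);
  return true;
}
```

In this model a delete whose prevActiveOp was left at some out-of-range value, instead of being reset to RNIL, skips the "only a delete" branch and fails in the guarded lookup, which is the shape of failure the question above is asking about.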

Need table definition

We have reproduced this with 1M rows in db (small) and 50M rows in db (large).

Yes, it was after a system restart.
Whether the data is corrupt is one of the questions...

Pushed to 4.1.13 and 5.0.8

Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

Documented in Change History for versions 4.1.13, 5.0.8.