MySQL Bugs: #16872: DD: Unable to restart after shutdown

Bug #16872	DD: Unable to restart after shutdown
Submitted:	28 Jan 2006 19:21	Modified:	14 Feb 2006 8:56
Reporter:	Jonathan Miller	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.6-alpha	OS:	Linux (Linux 32 Bit OS)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
I shutdown cluster to restart for new set of Performance tests. Upon restart recieved:

Cluster Log
2006-01-28 20:07:20 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
2006-01-28 20:07:20 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
2006-01-28 20:10:06 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 10.100.1.160, m_reserved_nodes 0000000000000006.
2006-01-28 20:10:07 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-01-28 20:10:08 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000002.
2006-01-28 20:10:13 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 10.100.1.161, m_reserved_nodes 000000000000000a.
2006-01-28 20:10:15 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2006-01-28 20:10:15 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.

NDB Error logs:
Time: Saturday 28 January 2006 - 20:07:20
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 943) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 28725
Trace: /space/run/ndb_2_trace.log.9
Version: Version 5.1.6 (alpha)
***EOM***

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 76580318 gsn: 164 "CONTINUEB" prio: 1
s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 76580315 length: 2 trace: 0 #sec: 0 fragInf: 0
 H'00000006 H'00000000
--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 76580317 gsn: 164 "CONTINUEB" prio: 1
s.bn: 260 "LGMAN", s.proc: 2, s.sigId: 76580313 length: 2 trace: 0 #sec: 0 fragInf: 0
 H'00000002 H'00000000
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 76580316 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 76580311 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 76580315 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 76580311 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 1
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 76580314 gsn: 272 "FSWRITEREQ" prio: 0
s.bn: 260 "LGMAN", s.proc: 2, s.sigId: 76580313 length: 7 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 0
 FilePointer: 22
 UserReference: H'01040002 Operation flag: H'00000003 (No sync, Format=List of global pages)
 varIndex: 3861
 numberOfPages: 11

Time: Saturday 28 January 2006 - 20:12:10
Status: Temporary error, restart node
Message: Arbitrator shutdown, please investigate error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 3826) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 27466
Trace: /space/run/ndb_3_trace.log.8
Version: Version 5.1.6 (alpha)
***EOM***
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 9497631 gsn: 343 "PREP_FAILCONF" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 9497630 length: 2 trace: 8 #sec: 0 fragInf: 0
 H'00000003 H'00000002
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 9497630 gsn: 126 "CLOSE_COMCONF" prio: 0
s.bn: 254 "CMVMI", s.proc: 3, s.sigId: 9497629 length: 19 trace: 8 #sec: 0 fragInf: 0
 xxxBlockRef = (252, 3) failNo = 2 noOfNodes = 1
 Nodes:  2
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 3, r.sigId: 9497629 gsn: 127 "CLOSE_COMREQ" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 9497628 length: 5 trace: 8 #sec: 0 fragInf: 0
 xxxBlockRef = (252, 3) failNo = 2 noOfNodes = 1
 Nodes:  2
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 9497628 gsn: 345 "PREP_FAILREQ" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 9497627 length: 5 trace: 8 #sec: 0 fragInf: 0
 xxxBlockRef = (252, 3) failNo = 2 noOfNodes = 1
 Nodes:  2
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 9497627 gsn: 153 "COMMIT_FAILCONF" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 9497626 length: 1 trace: 8 #sec: 0 fragInf: 0
 H'00000003

Second Try:
Time: Saturday 28 January 2006 - 20:12:09
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 943) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 28814
Trace: /space/run/ndb_2_trace.log.10
Version: Version 5.1.6 (alpha)

--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 9489435 gsn: 164 "CONTINUEB" prio: 1
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 9489433 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000001
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 9489434 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9489430 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 9489433 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 9489430 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 3455
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 9489432 gsn: 164 "CONTINUEB" prio: 0
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 9489429 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000002
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 2, r.sigId: 9489431 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 9489428 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------

How to repeat:
Create 6 million row database, run performance test, restart cluster.

I have done the same thing many times wo/ problem.
Please try to narrow down to a more reproducable test case.
Alterativly upload filesystem.

File system is on ndb13, ndb14 go help yourself

I fixed a bunch of bugs in this area...probably this has been fixed

pushed into 5.1.7

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 5.1.7 changelog. Bug closed.