Bug #18782 Restart of failed data node can cause cluster to crash (START_FRAGREF)
Submitted: 4 Apr 2006 20:37
Modified: 3 Jul 2006 15:42
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.9
OS: Linux (Linux 32 Bit OS)
Assigned to: Jonas Oreland
CPU Architecture: Any

[4 Apr 2006 20:37] Jonathan Miller
Description:
Trying to restart the failed data node from Bug #18781 caused the remaining data node to fail as well, which brought down the cluster.

Time: Tuesday 4 April 2006 - 22:26:30
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Unhandled node failure during restart
Error object: NDBCNTR (Line: 1462) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 19036
Trace: /space/run/ndb_3_trace.log.4
Version: Version 5.1.9 (beta)
***EOM***

Time: Tuesday 4 April 2006 - 22:26:29
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: Illegal signal received (GSN 374 not added)
Error object: Illegal signal received (GSN 374 not added)
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 18466
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.9 (beta)
***EOM***
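
The two error log entries above are easier to compare once they are put in time order. The following is a minimal sketch, assuming each entry is a run of "Key: value" lines closed by `***EOM***`, as in the excerpts above. The function name `parse_error_log` and the abbreviated sample are illustrative only and are not part of any NDB tooling.

```python
from datetime import datetime

def parse_error_log(text):
    """Split an ndbd error log into one dict per entry, keyed by field name."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":          # end-of-message marker closes an entry
            if current:
                entries.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return entries

def entry_time(entry):
    # Timestamp layout taken from the excerpts above (English day/month names).
    return datetime.strptime(entry["Time"], "%A %d %B %Y - %H:%M:%S")

# Abbreviated copy of the two entries in this report.
sample = """\
Time: Tuesday 4 April 2006 - 22:26:30
Error: 2308
Error data: Unhandled node failure during restart
Trace: /space/run/ndb_3_trace.log.4
***EOM***
Time: Tuesday 4 April 2006 - 22:26:29
Error: 2301
Error data: Illegal signal received (GSN 374 not added)
Trace: /space/run/ndb_2_trace.log.2
***EOM***
"""

entries = sorted(parse_error_log(sample), key=entry_time)
for e in entries:
    print(e["Time"], e["Error"], e["Trace"])
```

Sorted this way, the 2301 entry (trace file ndb_2_trace.log.2) comes one second before the 2308 entry (ndb_3_trace.log.4).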

--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620793 gsn: 374 "START_FRAGREF" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: -1 length: 3 trace: 2 #sec: 0 fragInf: 0
 H'00000000 H'00000000 H'00000003
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620792 gsn: 164 "CONTINUEB" prio: 0
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620791 length: 2 trace: 8 #sec: 0 fragInf: 0
 Check Tc Counter from place 9628
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 2, r.sigId: 12620790 gsn: 247 "EVENT_REP" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620789 length: 5 trace: 2 #sec: 0 fragInf: 0
 H'00000013 H'00000003 H'00009ac1 H'0000a10b H'0000a20a
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620789 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620788 length: 2 trace: 2 #sec: 0 fragInf: 0
 Default system error lab...
 H'0000002c H'00000030
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620788 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620787 length: 2 trace: 2 #sec: 0 fragInf: 0
 Default system error lab...
 H'0000002c H'00000030
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620787 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620786 length: 2 trace: 2 #sec: 0 fragInf: 0
 Default system error lab...
 H'0000002c H'00000030
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620786 gsn: 164 "CONTINUEB" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620785 length: 2 trace: 2 #sec: 0 fragInf: 0
 Default system error lab...
 H'0000002c H'00000030
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620785 gsn: 387 "START_TOCONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: -1 length: 3 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00000003 H'00000003
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620784 gsn: 387 "START_TOCONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620782 length: 3 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00000002 H'00000003
--------------- Signal ----------------
r.bn: 254 "CMVMI", r.proc: 2, r.sigId: 12620783 gsn: 247 "EVENT_REP" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620781 length: 2 trace: 0 #sec: 0 fragInf: 0
 H'00000005 H'0000a20a
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620782 gsn: 388 "START_TOREQ" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620781 length: 5 trace: 2 #sec: 0 fragInf: 0
 H'00000030 H'00f60002 H'00000003 H'00000003 H'bfb11f01
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620781 gsn: 172 "COPY_GCICONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 183673 length: 1 trace: 2 #sec: 0 fragInf: 0
 H'00000003
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 2, r.sigId: 12620780 gsn: 409 "TIME_SIGNAL" prio: 1
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 12620779 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 2, r.sigId: 12620779 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 12620777 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 12620778 gsn: 164 "CONTINUEB" prio: 0
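
In the trace above, the signal IDs decrease down the listing, so the first entry (START_FRAGREF) is the most recent one. To pick out the signals that arrived from another node, the header lines can be parsed as in the sketch below. It assumes the two-line `r.bn ... / s.bn ...` header layout shown in this trace; the `Signal` tuple and `parse_signals` are hypothetical names, not part of the NDB source.

```python
import re
from typing import NamedTuple

class Signal(NamedTuple):
    gsn: int
    name: str
    receiver: str
    recv_node: int
    sender: str
    send_node: int

# Receiver line: block name, node (proc), signal id, then GSN number and name.
RECV = re.compile(r'r\.bn: \d+ "(\w+)", r\.proc: (\d+), r\.sigId: -?\d+ gsn: (\d+) "(\w+)"')
# Sender line: block name and node (proc); the remaining fields are ignored here.
SEND = re.compile(r's\.bn: \d+ "(\w+)", s\.proc: (\d+)')

def parse_signals(text):
    """Pair each receiver line with the sender line that follows it."""
    signals, pending = [], None
    for line in text.splitlines():
        line = line.strip()
        r = RECV.match(line)
        if r:
            pending = r
            continue
        s = SEND.match(line)
        if s and pending:
            signals.append(Signal(int(pending[3]), pending[4], pending[1],
                                  int(pending[2]), s[1], int(s[2])))
            pending = None
    return signals

# Three entries copied from the trace in this report.
sample = '''
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620793 gsn: 374 "START_FRAGREF" prio: 1
s.bn: 247 "DBLQH", s.proc: 3, s.sigId: -1 length: 3 trace: 2 #sec: 0 fragInf: 0
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620792 gsn: 164 "CONTINUEB" prio: 0
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 12620791 length: 2 trace: 8 #sec: 0 fragInf: 0
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 12620785 gsn: 387 "START_TOCONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: -1 length: 3 trace: 2 #sec: 0 fragInf: 0
'''

signals = parse_signals(sample)
remote = [s for s in signals if s.send_node != s.recv_node]
for s in remote:
    print(f"{s.name} (GSN {s.gsn}): {s.sender}@{s.send_node} -> {s.receiver}@{s.recv_node}")
```

Applied to these entries, the filter keeps START_FRAGREF (GSN 374), sent by DBLQH on node 3 to DBDIH on node 2, which is the signal named in the 2301 error, and the START_TOCONF sent from node 3.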

How to repeat:
Not sure
[21 May 2006 17:27] Jonas Oreland
Only happens once.
The fix will instead crash the starting node.
[22 Jun 2006 13:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8071
[29 Jun 2006 9:50] Tomas Ulin
Pushed to 5.1.12.
[3 Jul 2006 15:42] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://www.mysql.com/doc/en/Installing_source_tree.html

Documented bugfix in 5.1.12 changelog.