Bug #68964 MySQL Cluster Forced node shutdown internal error
Submitted: 15 Apr 2013 19:00 Modified: 10 Jan 2014 3:02
Reporter: Matthew Boehm
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: ndb-7.2.12 OS: Linux (CentOS 6.4)
Assigned to: CPU Architecture: Any
Tags: forced shutdown, Internal error, MySQL Cluster

[15 Apr 2013 19:00] Matthew Boehm
Description:
We have a 2-node NDB setup. Node1 was powered off to move the hardware. After we powered it back on and restarted ndbmtd, the process ran for over 36 hours and the node still had not come online. We killed the process and added

BuildIndexThreads=16
TwoPassInitialNodeRestartCopy=true

to help speed up the recovery of the node.

Twice now, the node has died with this error:

2013-04-15 16:38:21 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Any ideas?

How to repeat:
Restart node with attached config

Suggested fix:
Unknown

[10 Jan 2014 3:00] MySQL Verification Team
Supplied logs show:
....
2013-04-15 18:33:50 [ndbd] INFO     -- Killed by node 3 as copyfrag failed, error: 1501
2013-04-15 18:33:50 [ndbd] INFO     -- NDBCNTR (Line: 277) 0x00000006
2013-04-15 18:33:50 [ndbd] INFO     -- Error handler shutting down system
2013-04-15 18:33:50 [ndbd] INFO     -- Error handler shutdown completed - exiting
2013-04-15 18:34:06 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

The "copyfrag failed" message identifies error 1501 as the underlying cause of the forced shutdown.
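
For anyone triaging similar logs, here is a minimal Python sketch that pulls the relevant codes out of the two log lines quoted above. The regular expressions are assumptions derived only from the wording of those lines (including the log's own "Occured" spelling being skipped over); other NDB versions may phrase the messages differently. The function name summarize_shutdown is hypothetical and not part of any MySQL tool.

```python
import re

# Patterns modelled on the log lines quoted in this report (assumption:
# other NDB versions may word these messages differently).
COPYFRAG_RE = re.compile(r"Killed by node (\d+) as copyfrag failed, error: (\d+)")
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\."
    r".*?startphase (\d+)\..*?Caused by error (\d+)"
)


def summarize_shutdown(lines):
    """Collect the copyfrag error and forced-shutdown details from log lines."""
    summary = {}
    for line in lines:
        match = COPYFRAG_RE.search(line)
        if match:
            summary["killed_by_node"] = int(match.group(1))
            summary["copyfrag_error"] = int(match.group(2))
        match = SHUTDOWN_RE.search(line)
        if match:
            summary["node"] = int(match.group(1))
            summary["start_phase"] = int(match.group(2))
            summary["shutdown_error"] = int(match.group(3))
    return summary


# Sample input copied from the log excerpt in this report.
SAMPLE = [
    "2013-04-15 18:33:50 [ndbd] INFO     -- Killed by node 3 as copyfrag failed, error: 1501",
    "2013-04-15 18:34:06 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. "
    "Occured during startphase 5. Caused by error 2303: 'System error, node killed "
    "during node restart by other node(Internal error, programming error or missing "
    "error message, please report a bug). Temporary error, restart node'.",
]

if __name__ == "__main__":
    print(summarize_shutdown(SAMPLE))
```

The copyfrag_error value is the one worth looking up, since the 2303 in the ALERT line only reports that the node was killed during restart.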

$ perror --ndb 1501
 
NDB error code 1501: Out of undo space: Temporary error: Temporary Resource error

This is usually caused by configuration settings that are too small for the Cluster (here, the undo space), or by transactions that are not being committed appropriately.

Marking this as Not a Bug, since it is not an issue with Cluster itself.