MySQL Bugs: #28215: MySQL Cluster version 5.0.37 is unable to start due to file system inconsistency

Bug #28215	MySQL Cluster version 5.0.37 is unable to start due to file system inconsistency
Submitted:	3 May 2007 8:24	Modified:	3 May 2007 12:57
Reporter:	Nir Simionovich
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	5.0.37	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
After a cluster failure whose cause has not yet been determined, the following is observed in the ndb_1_cluster log:

2007-05-03 10:59:44 [MgmSrvr] INFO     -- Node 3: DICT: index 6 rebuild done
2007-05-03 10:59:44 [MgmSrvr] INFO     -- Node 3: DICT: index 7 rebuild done
2007-05-03 10:59:44 [MgmSrvr] INFO     -- Node 3: DICT: index 9 rebuild done
2007-05-03 10:59:45 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2007-05-03 10:59:45 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed, restarting. Occured during startphase 8. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.
2007-05-03 10:59:45 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 192.114.69.36, m_reserved_nodes 000000000000000a.
2007-05-03 10:59:45 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2007-05-03 10:59:46 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 opened
2007-05-03 10:59:46 [MgmSrvr] INFO     -- Node 3: Waiting 30 sec for nodes 0000000000000004 to connect, nodes [ all: 000000000000000c connected: 0000000000000008 no-wait: 0000000000000000 ]
2007-05-03 10:59:46 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.
2007-05-03 10:59:49 [MgmSrvr] INFO     -- Node 3: Waiting 27 sec for nodes 0000000000000004 to connect, nodes [ all: 000000000000000c connected: 0000000000000008 no-wait: 0000000000000000 ]
2007-05-03 10:59:52 [MgmSrvr] INFO     -- Node 3: Waiting 24 sec for nodes 0000000000000004 to connect, nodes [ all: 000000000000000c connected: 0000000000000008 no-wait: 0000000000000000 ]
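
The hexadecimal values in the "Waiting ... for nodes" and m_reserved_nodes lines look like node bitmasks. Below is a minimal Python sketch for decoding them, assuming bit N set means node ID N. That reading is an assumption, although it is consistent with the log (node 3 waiting for node 2). The helper name decode_node_mask is made up for illustration and is not part of any MySQL tool.

```python
def decode_node_mask(hex_mask):
    """Return the node IDs whose bits are set in a hexadecimal node mask.

    Assumption: bit N of the mask corresponds to node ID N.
    """
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Masks copied from the management log excerpt above.
masks = {
    "waiting for": "0000000000000004",
    "all": "000000000000000c",
    "connected": "0000000000000008",
    "no-wait": "0000000000000000",
}

for label, mask in masks.items():
    print(label, decode_node_mask(mask))
```

Under that assumption, the data nodes are 2 and 3, node 3 is the only one connected, and it is waiting for node 2.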

After reviewing the data nodes, I removed one of them to see whether the cluster would come up with only a single node, but the problem persists. In addition, I observed the following in the data node log files:

2007-05-03 10:49:19 [ndbd] INFO     -- NDB Cluster -- DB node 3
2007-05-03 10:49:19 [ndbd] INFO     -- Version 5.0.37 --
2007-05-03 10:49:19 [ndbd] INFO     -- Configuration fetched at 192.114.69.34 port 1186
2007-05-03 10:49:19 [ndbd] INFO     -- Start initiated (version 5.0.37)
2007-05-03 10:59:44 [ndbd] INFO     -- Error handler restarting system
2007-05-03 10:59:45 [ndbd] INFO     -- Error handler shutdown completed - exiting
2007-05-03 10:59:45 [ndbd] ALERT    -- Node 3: Forced node shutdown completed, restarting. Occured during startphase 8. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.
2007-05-03 10:59:45 [ndbd] INFO     -- Ndb has terminated (pid 26699) restarting
2007-05-03 10:59:45 [ndbd] INFO     -- Angel pid: 26660 ndb pid: 26741
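
To check quickly whether other nodes report the same failure, the forced-shutdown lines can be pulled out of the logs with a short script. This is only a sketch: the regular expression is inferred from the two ALERT lines quoted in this report rather than from any documented log format, it assumes each log entry is on a single unwrapped line, and the function name find_forced_shutdowns is hypothetical.

```python
import re

# Pattern inferred from the ALERT lines in this report (an assumption, not a
# documented format). "Occured" is matched loosely, so its spelling is irrelevant.
SHUTDOWN_RE = re.compile(
    r"Node (?P<node>\d+): Forced node shutdown completed.*?"
    r"startphase (?P<phase>\d+)\. Caused by error\s+(?P<code>\d+): '(?P<message>.*)'"
)


def find_forced_shutdowns(log_text):
    """Return one dict per forced-shutdown line found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.search(line)
        if match:
            events.append({
                "node": int(match["node"]),
                "phase": int(match["phase"]),
                "code": int(match["code"]),
                "message": match["message"],
            })
    return events


# Sample line copied from the data node log above, unwrapped.
sample = (
    "2007-05-03 10:59:45 [ndbd] ALERT    -- Node 3: Forced node shutdown "
    "completed, restarting. Occured during startphase 8. Caused by error 2815: "
    "'File not found(Ndbd file system inconsistency error, please report a bug). "
    "Ndbd file system error, restart node initial'."
)

events = find_forced_shutdowns(sample)
for event in events:
    print(event["node"], event["phase"], event["code"])
```

Running the same function over each data node's log would show whether every node fails in start phase 8 with error 2815.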

How to repeat:
Unknown at this point, as I have no idea what caused this to happen.

Can't repeat, as the original cause is not known.
Restarting node 3 with --initial should resolve
the current situation if this is the only node
reporting file system problems.

Well, the situation is identical on both nodes in the cluster, leaving the entire cluster non-working. I've tried bringing up the cluster with node3 only, then with node4 only; both showed the exact same issue.