MySQL Bugs: #15182: Error 2310 when starting up the cluster

Bug #15182	Error 2310 when starting up the cluster
Submitted:	23 Nov 2005 13:18	Modified:	10 Feb 2006 14:08
Reporter:	Chris Kennedy
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	5.0.15	OS:	Linux (Red Hat Enterprise Linux)
Assigned to:		CPU Architecture:	Any

Description:
I have a 2-node cluster with 2 replicas. I have one simple table (an 8-byte numeric key and a 2-byte numeric data field). I was attempting to load 33,000,000 rows. The config.ini file had these memory settings:

DataMemory= 3000M 
IndexMemory=1400M 

When roughly 31,000,000 rows had been loaded, the cluster would not accept any more. I increased DataMemory to 3200M, stopped and restarted the management node, and then restarted one of the data nodes. After it had restarted, I attempted to do the same with the other data node, but it failed with: 

2002: Stop failed 
Node shutdown would cause system crash. 

I shut down the cluster and attempted to restart it, which failed: 

2005-11-23 09:51:12 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2310: 'Error while reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'. 

Is there any way I can recover from this without reloading all my data? Is this a known bug? 

Platform: 
2 HP Proliant DL585 servers: 
4x Dual Core 2.2GHz Opteron CPUs 
16GB RAM 
RHEL4 AS 64bit [Red Hat Enterprise Linux AS release 4 (Nahant Update 1)] 

MySQL release: 
mysql-max-5.0.15-linux-x86_64-glibc23.tar.gz

How to repeat:
Once in this state, trying to bring up the cluster always fails.

We would need your logs and filesystem to analyze this.  All ndb_* files and directories.
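A minimal sketch of how the requested files could be bundled for attachment. The function name `collect_ndb_files` and both paths are hypothetical; the only assumption taken from the request above is that the relevant files and directories in the node's data directory match `ndb_*`.

```shell
# Hypothetical helper: archive every ndb_* file and directory found in a
# data node's data directory.
#   $1 = data directory of the node (assumed location of the ndb_* files)
#   $2 = archive to create; use an absolute path, because the function
#        changes directory before running tar
collect_ndb_files() {
  datadir=$1
  archive=$2
  (
    cd "$datadir" || exit 1
    # The glob is expanded inside the data directory, so the archive
    # holds relative names such as ndb_3_error.log or ndb_3_fs/.
    tar -czf "$archive" ndb_*
  )
}
```

It would be called once per node, for example `collect_ndb_files /path/to/datadir /tmp/node3.tar.gz`, before any attempt to reinitialize the node.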

Moreover, did the restart of the first node really work? You should not have received the "Node shutdown would cause system crash" error in that case. It indicates that the first node either failed to restart or had not finished restarting.
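One way to guard against stopping the second node too early is to poll the first node's status until it reports that it has started. The sketch below is only an illustration: the helper name `node_is_started` is hypothetical, and it assumes the status line printed by the management client has the form `Node 3: started (Version 5.0.15)`, which may differ between versions.

```shell
# Hypothetical helper: succeeds only if a status line reports a fully
# started node. Assumes lines of the form "Node 3: started (...)";
# "Node 3: starting (...)" and "Node 3: not connected" do not match.
node_is_started() {
  case "$1" in
    *": started"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible use during a rolling restart (needs a running cluster, so it
# is left commented out here):
#   until node_is_started "$(ndb_mgm -e '3 status')"; do sleep 5; done
#   ndb_mgm -e '4 restart'
```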

About your later "filesystem" error: it is recoverable under certain conditions. The error message states "Ndbd file system error, restart node initial". That is, if the other node has an intact filesystem, this node can recover from it by being started with "ndbd --initial". However, to determine whether that is possible in this case, we would need the logs mentioned above. We would also like to see your filesystem before you do this, to try to find out what the problem is.
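The recovery path described above could look roughly like the following. This is a sketch under stated assumptions, not a verified procedure for this case: the wrapper `run` and the function `reinit_data_node` are hypothetical, the other data node is assumed to be running with an intact file system, and the failed node's files are assumed to have been saved first. By default the commands are only printed; set `DRY_RUN=0` to actually execute them.

```shell
# Hypothetical wrapper: print each command, and execute it only when
# DRY_RUN=0 is set in the environment.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# To be run on the host of the failed data node.
reinit_data_node() {
  # Check that the other data node is up before discarding this node's
  # file system.
  run ndb_mgm -e "all status"
  # Start this node with an empty file system so that it rebuilds its
  # data from the other node.
  run ndbd --initial
}
```

Keeping the dry run as the default reflects the request above to inspect the file system before `ndbd --initial` discards it.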

BR,

Tomas

files from node showing the error

Attachment: 20051123.tar.gz (application/x-gzip-compressed, text), 44.10 KiB.

ndb_mgm reported the node restart was complete:

inm_mgr@jabba1.vfl.vodafone> ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @127.0.0.1  (Version: 5.0.15, Nodegroup: 0)
id=4    @10.15.1.172  (Version: 5.0.15, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=9    @127.0.0.1  (Version: 5.0.15)

[mysqld(API)]   11 node(s)
id=20   @10.15.1.171  (Version: 5.0.15)
id=21 (not connected, accepting connect from any host)
id=22 (not connected, accepting connect from any host)
id=23 (not connected, accepting connect from any host)
id=24 (not connected, accepting connect from any host)
id=25 (not connected, accepting connect from any host)
id=26 (not connected, accepting connect from any host)
id=27 (not connected, accepting connect from any host)
id=28 (not connected, accepting connect from any host)
id=29 (not connected, accepting connect from any host)
id=30 (not connected, accepting connect from any host)

ndb_mgm> 4 stop
Node 4: Node shutdown aborted
Shutdown failed.
*  2002: Stop failed
*        Node shutdown would cause system crash
I am attaching log and error files from the systems. I am afraid it will not be possible at this time to give you access to the file system. I will be able to forward any information you might need from it, though.

files from second node

Attachment: 20051123_a.tar.gz (application/x-gzip-compressed, text), 55.31 KiB.

I have the same bug in version 5.1.2-drop5p5 on 64-bit SUSE. 
In detail: 
I was doing some tests with the replication feature. For that I built 2 clusters, configured one mysqld in the first cluster as master and one mysqld in the second cluster as slave, and put some load on the cluster with the master.

After some time, both ndbd nodes on the computer where the slave runs crashed.
I was able to restart one of them, but the other does not restart: 

2005-11-25 09:11:17 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2809: 'Temporary on access to file(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'

I tried ndbd -d and ndbd --initial. Both fail.

Jörg, can you add the node's error log and trace log files, too? The error message from the cluster log alone is not sufficient for further investigation, as it is missing some information ...

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".