MySQL Bugs: #38648: Node wont restart even after ndbd --initial error 2341

Bug #38648	Node wont restart even after ndbd --initial error 2341
Submitted:	8 Aug 2008 1:50	Modified:	14 Oct 2008 7:11
Reporter:	Farhad Shakeri
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.37 & 5.0.51	OS:	Linux (Fedora 6)
Assigned to:		CPU Architecture:	Any

Description:
Our cluster has been up for over 450 days. Right now, 5 out of 6 nodes are running fine. When we try to start node 4, it gets through start phases 0, 3 and 4 fine, but after entering start phase 5 it hangs for about 10 minutes and then dies with the following error. We tried twice with ndbd --initial and got a similar error each time.

Time: Thursday 7 August 2008 - 15:47:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Dbdict.cpp
Error object: DBDICT (Line: 2611) 0x0000000e
Program: ndbd
Pid: 6256
Trace: /var/lib/mysql/data/ndb_4_trace.log.10
Version: Version 5.0.37
***EOM***

How to repeat:

Here is the only goof-up: the NDB Manager was restarted with a bad config :-(

A 2nd NDB Manager was added to config.ini without updating the IPs in my.cnf on the ndb servers.

The 2nd NDB Manager has since been removed and the daemon restarted.
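
The mismatch described above (a management server present in config.ini but absent from the connect string in my.cnf) can be checked with a small script. This is only a sketch: the function names are made up for illustration, the second address 192.168.1.2 is hypothetical, and the connect string is assumed to use the usual comma-separated host[:port] form with an optional nodeid= entry. It does not claim that this mismatch caused the error reported here.

```python
def hosts_in_connectstring(connectstring):
    """Return the set of hosts named in an ndb-connectstring value."""
    hosts = set()
    for part in connectstring.split(","):
        part = part.strip()
        # Skip empty pieces and the optional "nodeid=N" entry.
        if not part or part.startswith("nodeid="):
            continue
        # Drop an optional ":port" suffix.
        hosts.add(part.split(":", 1)[0])
    return hosts


def missing_management_hosts(mgm_hosts, connectstring):
    """List management hosts from config.ini that my.cnf does not mention."""
    return sorted(set(mgm_hosts) - hosts_in_connectstring(connectstring))


# 192.168.1.1 appears in the node 4 log; 192.168.1.2 is a hypothetical 2nd manager.
config_ini_mgm_hosts = ["192.168.1.1", "192.168.1.2"]
my_cnf_connectstring = "192.168.1.1:1186"

print(missing_management_hosts(config_ini_mgm_hosts, my_cnf_connectstring))
```

An empty list means every management server in config.ini is also listed in the connect string.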

Many thanks for writing a bug report.

MySQL 5.0.37 is really old. Please try a newer version (the current version is MySQL 5.0.51b) and let us know if you still have this problem with the newer version.

Sure will do asap.

We are just wondering: will it be OK to upgrade one node to 5.0.51b while the
rest are on 5.0.37?

Since this cluster has 2 node groups, should we upgrade one node in each
node group? The documentation only talks about a single node group.

Thanks

> We are just wondering: will it be OK to upgrade one node to 5.0.51b while the
> rest are on 5.0.37?

Yes, these versions are upgrade compatible, see

  http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-upgrade-downgrade-compatibility.html

> Since this cluster has 2 node groups, should we upgrade one node in each
> node group? The documentation only talks about a single node group.

As soon as you have the failing node working again, you should do a rolling restart as documented in

  http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-rolling-restart.html

as, even though the versions are upgrade compatible, you should not run different versions on the nodes for too long.

Hi again,

We just upgraded the NDB management server to 5.0.51 and also upgraded node 4 to 5.0.51,
but we are still getting the exact same error:
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed 
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (initial node restart)
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=0d46000eb433ef5f]
2008-08-11 17:01:04 [MgmSrvr] INFO     -- Node 2: DICT: locked by node 4 for NodeRestart
2008-08-11 17:01:04 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 4L
2008-08-11 17:01:37 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (initial node restart)
2008-08-11 17:02:56 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (initial node restart)
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 6: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 7: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 6: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: DICT: remove lock by failed node 4 for NodeRestart
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

From Node 4:

2008-08-11 17:00:48 [ndbd] INFO     -- Angel pid: 3052 ndb pid: 3053
2008-08-11 17:00:48 [ndbd] INFO     -- NDB Cluster -- DB node 4
2008-08-11 17:00:48 [ndbd] INFO     -- Version 5.0.51 --
2008-08-11 17:00:48 [ndbd] INFO     -- Configuration fetched at 192.168.1.1 port 1186
2008-08-11 17:00:48 [ndbd] INFO     -- Start initiated (version 5.0.51)
2008-08-11 17:08:39 [ndbd] INFO     -- Error handler startup shutting down system
2008-08-11 17:08:40 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-08-11 17:08:40 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-08-11 17:08:40 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
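
When going through long cluster or node logs like the ones above, the forced-shutdown lines can be pulled out with a short script. This is a minimal sketch: the regular expression is written against the ALERT lines pasted in this report only (including the "Occured" spelling they use), and the function and field names are made up for illustration.

```python
import re

# Pattern for the "Forced node shutdown completed" ALERT lines shown in this report.
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<source>\w+)\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Occured during startphase (?P<phase>\d+)\. "
    r"Caused by error (?P<error>\d+): '(?P<message>.*)'\.$"
)


def find_forced_shutdowns(lines):
    """Return one dict per forced-shutdown line found in the given log lines."""
    events = []
    for line in lines:
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            events.append({
                "timestamp": match.group("ts"),
                "source": match.group("source"),
                "node": int(match.group("node")),
                "start_phase": int(match.group("phase")),
                "error": int(match.group("error")),
                "message": match.group("message"),
            })
    return events


# Two lines copied from the cluster log in this report.
SAMPLE_LOG = [
    "2008-08-11 17:02:56 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (initial node restart)",
    "2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. "
    "Occured during startphase 5. Caused by error 2341: 'Internal program error "
    "(failed ndbrequire)(Internal error, programming error or missing error message, "
    "please report a bug). Temporary error, restart node'.",
]

for event in find_forced_shutdowns(SAMPLE_LOG):
    print(event["timestamp"], "node", event["node"],
          "start phase", event["start_phase"], "error", event["error"])
```

To run it against a real log file, pass the open file object instead of SAMPLE_LOG, since the function only needs an iterable of lines.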

If you need the trace log, please let me know where to send it.

Thanks

trace log for node 4

Attachment: ndb_4_trace.log.11.gz (application/x-gzip, text), 52.56 KiB.

This problem has been solved. The cause was pinpointed to a lack of memory.
We increased the hardware memory by 25% and DataMemory by 35%, and the system seems
stable. Upgrading to 5.0.51 alone was not enough.
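
For anyone applying the same percentages to their own setup, the arithmetic looks like this. The starting values below are hypothetical (the actual sizes for this cluster are not given in this report), and the helper name is made up for illustration.

```python
def increase_by_percent(value_mb, percent):
    """Return value_mb increased by percent, rounded up to a whole MB."""
    return (value_mb * (100 + percent) + 99) // 100


# Hypothetical starting points; the real values are not stated in this report.
OLD_RAM_MB = 4096
OLD_DATA_MEMORY_MB = 2048

new_ram_mb = increase_by_percent(OLD_RAM_MB, 25)                  # hardware memory +25%
new_data_memory_mb = increase_by_percent(OLD_DATA_MEMORY_MB, 35)  # DataMemory +35%

print(f"RAM: {OLD_RAM_MB} MB -> {new_ram_mb} MB")
# Line in the form config.ini accepts for the data node section.
print(f"DataMemory={new_data_memory_mb}M")
```

Note that DataMemory grows faster than the hardware memory here, so it is worth confirming that the new value still leaves room for everything else the data node needs.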

I am closing this report since the issue has evidently been solved.