Bug #38648 Node won't restart even after ndbd --initial (error 2341)
Submitted: 8 Aug 2008 1:50
Modified: 14 Oct 2008 7:11
Reporter: Farhad Shakeri
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.0.37 & 5.0.51
OS: Linux (Fedora 6)
Assigned to:
CPU Architecture: Any

[8 Aug 2008 1:50] Farhad Shakeri
Description:
Our cluster has been up for over 450 days. Right now, 5 out of 6 nodes are running fine. When we try to start node 4, it starts fine in phases 0, 3 and 4, but after starting phase 5 it hangs for about 10 minutes and dies with the following error. We tried twice with ndbd --initial and got a similar error.

Time: Thursday 7 August 2008 - 15:47:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Dbdict.cpp
Error object: DBDICT (Line: 2611) 0x0000000e
Program: ndbd
Pid: 6256
Trace: /var/lib/mysql/data/ndb_4_trace.log.10
Version: Version 5.0.37
***EOM***
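
For anyone triaging several of these reports, the block above is regular enough to be read by a script. The following is only a minimal sketch: it assumes every entry is a single "Key: value" line terminated by ***EOM***, as in the report quoted here, and the function name is made up for illustration.

```python
def parse_ndbd_error_report(text):
    """Turn the 'Key: value' lines of an ndbd error report into a dict.

    Assumption: one entry per line, report terminated by '***EOM***'.
    """
    report = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "***EOM***":
            continue
        # Split on the first ': ' only, so values such as
        # 'DBDICT (Line: 2611) 0x0000000e' stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            report[key] = value
    return report


# The error report exactly as quoted in this bug.
sample = """\
Time: Thursday 7 August 2008 - 15:47:26
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Dbdict.cpp
Error object: DBDICT (Line: 2611) 0x0000000e
Program: ndbd
Pid: 6256
Trace: /var/lib/mysql/data/ndb_4_trace.log.10
Version: Version 5.0.37
***EOM***
"""

report = parse_ndbd_error_report(sample)
print(report["Error"], "|", report["Error object"], "|", report["Trace"])
```

The fields most useful in a bug report are the error code, the error object (block and source line) and the trace file path.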

How to repeat:

Here is the only goof-up: the NDB manager was restarted with a bad config :-( .

A 2nd NDB manager was added to config.ini without updating the IPs in my.cnf on the ndb servers.

The 2nd NDB manager has since been removed and the daemon restarted.
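
The kind of mismatch described here (a management server listed in config.ini but missing from the connect string in my.cnf) can be checked mechanically. The sketch below is an illustration only, not a tool shipped with MySQL Cluster. It assumes the management hosts appear as HostName entries under [ndb_mgmd] sections and that my.cnf uses the ndb-connectstring option; 192.168.1.1 is the management address seen in the node log, while 192.168.1.2 and 192.168.1.14 are hypothetical.

```python
def mgm_hosts_from_config_ini(text):
    """Collect HostName values from every [ndb_mgmd] section of config.ini.

    Simplified: only 'key=value' lines and '#' comments are handled.
    """
    hosts, section = [], None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        key, sep, value = line.partition("=")
        if sep and section == "ndb_mgmd" and key.strip().lower() == "hostname":
            hosts.append(value.strip())
    return hosts


def mgm_hosts_from_my_cnf(text):
    """Collect the hosts named in the ndb-connectstring option of my.cnf."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition("=")
        if sep and key.strip().replace("_", "-") == "ndb-connectstring":
            parts = [p.strip() for p in value.split(",")]
            # Skip an optional 'nodeid=N' element and drop any ':port' suffix.
            return [p.split(":", 1)[0] for p in parts
                    if p and not p.startswith("nodeid=")]
    return []


def missing_from_connectstring(config_ini, my_cnf):
    """Management hosts present in config.ini but absent from my.cnf."""
    known = set(mgm_hosts_from_my_cnf(my_cnf))
    return [h for h in mgm_hosts_from_config_ini(config_ini) if h not in known]


# Hypothetical files: the second manager and the data node IP are made up.
config_ini = """\
[ndb_mgmd]
HostName=192.168.1.1

[ndb_mgmd]
HostName=192.168.1.2

[ndbd]
HostName=192.168.1.14
"""

my_cnf = """\
[mysqld]
ndbcluster
ndb-connectstring=192.168.1.1
"""

missing = missing_from_connectstring(config_ini, my_cnf)
print(missing)  # prints ['192.168.1.2']
```

An empty list means every management server in config.ini is also reachable through the connect string on that host.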
[8 Aug 2008 6:45] Susanne Ebrecht
Many thanks for writing a bug report.

MySQL 5.0.37 is really old. Please try a newer version (the current version is MySQL 5.0.51b) and let us know if you still have this problem with the newer version.
[8 Aug 2008 23:27] Farhad Shakeri
Sure, will do ASAP.

We are just wondering will it be OK to upgrade one node to 5.0.51b while the
rest are 5.0.37 ?

Since this cluster has 2 Nodegroups   should we upgrade one Node in each
Nodegroup?  Documents only talk about a single Nodegroup.

Thanks
[9 Aug 2008 5:31] Hartmut Holzgraefe
> We are just wondering will it be OK to upgrade one node to 5.0.51b while the
> rest are 5.0.37 ?

Yes, these versions are upgrade compatible, see

  http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-upgrade-downgrade-compatibility.html

> Since this cluster has 2 Nodegroups   should we upgrade one Node in each
> Nodegroup?  Documents only talk about a single Nodegroup.

As soon as you have the failing node working again, you should do a rolling restart as documented in

  http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-rolling-restart.html

because, even though the versions are upgrade compatible, you should not run different versions on the nodes for too long.
[12 Aug 2008 0:20] Farhad Shakeri
Hi again,

We just upgraded the ndbd_manager to 5.0.51 and also upgraded node 4 to 5.0.51,
but we are still getting the exact same error:
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed 
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (initial node restart)
2008-08-11 17:01:03 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=0d46000eb433ef5f]
2008-08-11 17:01:04 [MgmSrvr] INFO     -- Node 2: DICT: locked by node 4 for NodeRestart
2008-08-11 17:01:04 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 4 ops: 0 poll: 0 cnt: 0 queue: 4L
2008-08-11 17:01:37 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (initial node restart)
2008-08-11 17:02:56 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (initial node restart)
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 6: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 7: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 6: Communication to Node 4 closed
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: DICT: remove lock by failed node 4 for NodeRestart
2008-08-11 17:08:40 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue: 
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

From Node 4:

2008-08-11 17:00:48 [ndbd] INFO     -- Angel pid: 3052 ndb pid: 3053
2008-08-11 17:00:48 [ndbd] INFO     -- NDB Cluster -- DB node 4
2008-08-11 17:00:48 [ndbd] INFO     -- Version 5.0.51 --
2008-08-11 17:00:48 [ndbd] INFO     -- Configuration fetched at 192.168.1.1 port 1186
2008-08-11 17:00:48 [ndbd] INFO     -- Start initiated (version 5.0.51)
2008-08-11 17:08:39 [ndbd] INFO     -- Error handler startup shutting down system
2008-08-11 17:08:40 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-08-11 17:08:40 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-08-11 17:08:40 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
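
To make the timing in these logs easier to read, the relevant lines can be pulled out with a short script. This is a rough sketch only: the regular expressions are written against the lines pasted above (including the log's own spelling "Occured"), not against any documented log format, and the function name is hypothetical.

```python
import re
from datetime import datetime

# Patterns inferred from the pasted log lines; treat them as assumptions.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\w+)\s+-- (.*)$"
)
PHASE_RE = re.compile(r"Node (\d+): Start phase (\d+) completed")
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occurr?ed during startphase (\d+)\. Caused by error (\d+)"
)


def summarize_failed_start(log_text, node_id):
    """Report the last completed start phase and the forced shutdown of a node."""
    last_phase = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        message = m.group(4)
        phase = PHASE_RE.search(message)
        if phase and int(phase.group(1)) == node_id:
            last_phase = (int(phase.group(2)), when)
        down = SHUTDOWN_RE.search(message)
        if down and int(down.group(1)) == node_id:
            return {
                "failed_in_phase": int(down.group(2)),
                "error": int(down.group(3)),
                "last_completed_phase": last_phase[0] if last_phase else None,
                "seconds_since_last_phase": (
                    (when - last_phase[1]).total_seconds() if last_phase else None
                ),
            }
    return None


# Four lines copied from the management server log above.
sample_log = """\
2008-08-11 17:01:37 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (initial node restart)
2008-08-11 17:02:56 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (initial node restart)
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2008-08-11 17:08:40 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
"""

summary = summarize_failed_start(sample_log, 4)
print(summary)
```

For the lines above, this shows node 4 completing phase 4 at 17:02:56 and being shut down in phase 5 at 17:08:40, that is, 344 seconds later.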

If you need the trace log, please let me know where to send it.

Thanks
[12 Aug 2008 1:21] Farhad Shakeri
trace log for node 4

Attachment: ndb_4_trace.log.11.gz (application/x-gzip, text), 52.56 KiB.

[23 Sep 2008 0:00] Farhad Shakeri
This problem has been solved. The cause was pinpointed as a lack of memory.
We increased the hardware memory by 25% and DataMemory by 35%, and the system now
seems stable. Upgrading to 5.0.51 alone was not enough.
[14 Oct 2008 7:11] Bernd Ocklin
I am closing this report since the issue has evidently been solved.