MySQL Bugs: #25623: MySQL Cluster data node (ndbd) hangs in start phase 4

Bug #25623	MySQL Cluster data node (ndbd) hangs in start phase 4
Submitted:	15 Jan 2007 11:50	Modified:	13 Apr 2009 8:41
Reporter:	Ryusuke Kajiyama
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.27	OS:	Any (n/a)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
MySQL Cluster data node (ndbd) hangs in start phase 4 with the following error.

2007-01-03 11:42:08 [MgmSrvr] INFO     -- Node 4: Cluster shutdown initiated
2007-01-03 11:42:08 [MgmSrvr] INFO     -- Node 3: Cluster shutdown initiated
2007-01-03 11:42:15 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2007-01-03 11:42:15 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2007-01-03 11:42:15 [MgmSrvr] INFO     -- Node 3: Node shutdown completed.
2007-01-03 11:42:16 [MgmSrvr] INFO     -- Node 4: Node shutdown completed.
2007-01-03 11:42:16 [MgmSrvr] INFO     -- Shutting down server...
2007-01-03 11:42:18 [MgmSrvr] INFO     -- Mgmt server state: nodeid 1 freed, m_reserved_nodes 0000000000000004.
2007-01-03 11:42:18 [MgmSrvr] INFO     -- Shutdown complete
2007-01-03 12:16:51 [MgmSrvr] INFO     -- NDB Cluster Management Server. Version 5.0.27
2007-01-03 12:16:51 [MgmSrvr] INFO     -- Id: 2, Command port: 1186
2007-01-03 12:32:02 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2007-01-03 12:32:05 [MgmSrvr] INFO     -- Node 3: Waiting 28 sec for nodes 0000000000000010 to connect, nodes [ all: 0000000000000018 connected: 0000000000000008 no-wait: 0000000000000000 ]
<snip>
2007-01-03 12:33:32 [MgmSrvr] INFO     -- Node 3: Waiting 1 sec for non partitioned start, nodes [ all: 0000000000000018 connected: 0000000000000008 missing: 0000000000000010 no-wait: 0000000000000000 ]
2007-01-03 12:33:35 [MgmSrvr] INFO     -- Node 3: Start potentially partitioned with nodes 0000000000000008  [ missing: 0000000000000010 no-wait: 0000000000000000 ]
2007-01-03 12:33:35 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 1
2007-01-03 12:33:35 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed 
2007-01-03 12:33:35 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (system restart)
2007-01-03 12:33:36 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (system restart)
2007-01-03 12:34:36 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2007-01-03 12:34:36 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.

Before this start, the ndbd had run normally and had been shut down cleanly with "ndb_mgm -e shutdown". The same configuration files and boot scripts were used for every ndb_mgmd and ndbd.
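
The hexadecimal node masks in the log above (all, connected, missing) can be decoded by hand or with a few lines of Python. This is a minimal sketch, assuming that bit n of the mask stands for node ID n; that assumption is consistent with the data node IDs 3 and 4 in this cluster, but it is not stated in the report. The function name nodes_in_mask is made up for illustration.

```python
def nodes_in_mask(hex_mask):
    """Return the node IDs whose bits are set in a hex bitmask string.

    Assumption: bit n of the mask corresponds to node ID n.
    """
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


# Masks copied from the "Start potentially partitioned" log lines.
print("all:      ", nodes_in_mask("0000000000000018"))  # [3, 4]
print("connected:", nodes_in_mask("0000000000000008"))  # [3]
print("missing:  ", nodes_in_mask("0000000000000010"))  # [4]
```

Read this way, node 3 waited for node 4, gave up, and started alone as a potentially partitioned cluster before failing in start phase 4.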

How to repeat:
Env: 2 sets of ndbd, ndb_mgmd and mysqld on 2 physical servers with RHEL4.

1) Shut down normally with "ndb_mgm -e shutdown"
2) Boot all ndb_mgmd and mysqld
3) Boot ndbd without --initial
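
The three steps above can be written down as a command sequence. The sketch below only prints the commands and does not run them. The config.ini path is a hypothetical placeholder (the report does not give one), and the helper name restart_commands is made up for illustration; how each process finds its connect string is assumed to come from the existing my.cnf.

```python
# Hypothetical path; the report does not say where config.ini lives.
CONFIG_INI = "/path/to/config.ini"


def restart_commands(initial=False):
    """Build the command sequence for the reported restart steps.

    With initial=False, ndbd is started without --initial, as in step 3.
    """
    ndbd = ["ndbd"] + (["--initial"] if initial else [])
    return [
        ["ndb_mgm", "-e", "shutdown"],   # step 1: clean cluster shutdown
        ["ndb_mgmd", "-f", CONFIG_INI],  # step 2: management server
        ["mysqld_safe"],                 # step 2: SQL node
        ndbd,                            # step 3: data node
    ]


for cmd in restart_commands():
    print(" ".join(cmd))
```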

Log of cluster with error 2350

Attachment: err.txt (text/plain), 13.02 KiB.

current environment (IP addresses are masked with x)

$ ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: x.x.x.7:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @x.x.x.7  (Version: 5.0.27, Nodegroup: 0, Master)
id=4    @x.x.x.8  (Version: 5.0.27, Nodegroup: 0)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @x.x.x.7  (Version: 5.0.27)
id=2    @x.x.x.8  (Version: 5.0.27)

[mysqld(API)]   2 node(s)
id=5    @x.x.x.7  (Version: 5.0.27)
id=6    @x.x.x.8  (Version: 5.0.27)

Hi,

My guess is that you changed NoOfFragmentLogFiles;
changing it is only supported with an initial (node) restart.

Looking at the error logs from the data nodes would confirm this.
Can you please upload or paste them?

/Jonas

error log from datanode (Jan 3rd is the date of test)

Attachment: ndb_3_error.log (text/plain), 9.81 KiB.

I have not changed NoOfFragmentLogFiles at all. Are you suggesting I should set a larger value for it?

The following error was in the error log of the data node.
I don't understand "while creating table 261". Is there a table with ID or name 261?
----
Time: Wednesday 3 January 2007 - 12:34:36
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to restart, fail while creating table 261 error: 721. Most likely change of configuration
Error object: DBDICT (Line: 2560) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 30996
Trace: /DB/ndb_3_trace.log.5
Version: Version 5.0.27
***EOM***
----
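
For comparing reports like this one, the fields of an error log entry can be pulled apart mechanically. The following is a minimal sketch, assuming each field sits on its own line as "Key: value"; the function name parse_entry is made up for illustration. The number after "creating table" is presumably an internal dictionary object ID rather than a table name, but the report does not confirm that.

```python
import re

# Fields copied from the error log entry above.
ERROR_LOG = """\
Time: Wednesday 3 January 2007 - 12:34:36
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to restart, fail while creating table 261 error: 721. Most likely change of configuration
Error object: DBDICT (Line: 2560) 0x0000000a
"""


def parse_entry(text):
    """Split 'Key: value' lines into a dict, keeping the first ': ' as separator."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


entry = parse_entry(ERROR_LOG)
match = re.search(r"creating table (\d+) error: (\d+)", entry["Error data"])
table_id, inner_error = (int(group) for group in match.groups())
print(entry["Error"], table_id, inner_error)  # 2350 261 721
```

The outer error (2350) is the generic configuration error; the inner error (721) is the one raised while the dictionary block was recreating the table.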

I have the same(?) problem.
mysql-5.1.19

2007-06-05 04:48:04 [ndbd] INFO     -- Unable to restart, fail while creating table 6716 error: 721. Most likely change of configuration
2007-06-05 04:48:04 [ndbd] INFO     -- DBDICT (Line: 2996) 0x0000000a
2007-06-05 04:48:04 [ndbd] INFO     -- Error handler startup shutting down system
2007-06-05 04:48:05 [ndbd] INFO     -- Error handler shutdown completed - exiting
2007-06-05 04:48:05 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2007-06-05 04:48:05 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.

Nothing was changed on the management server.

I have the same:

Time: Thursday 3 January 2008 - 12:41:56
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to restart, fail while creating table 255 error: 721. Most likely change of configuration
Error object: DBDICT (Line: 2975) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 12299
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.9
Version: Version 5.1.18 (beta)
***EOM***

2008-01-03 13:05:56 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2008-01-03 13:05:57 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.
2008-01-03 13:05:57 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-01-03 13:05:57 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
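
When several nodes go down together, it helps to separate the node that failed first from the ones that only followed it. The sketch below scans management log lines for forced shutdowns and treats error 2308 ("Another node failed during system restart") as a follow-on failure. The regular expression and the name forced_shutdowns are assumptions based only on the log lines quoted in this report, not on a documented log format.

```python
import re

# The two ALERT lines quoted above, copied verbatim (including "Occured").
LOG = """\
2008-01-03 13:05:57 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 4. Caused by error 2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.
2008-01-03 13:05:57 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
"""

PATTERN = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occured during startphase (\d+)\..*?Caused by error ?(\d+)"
)


def forced_shutdowns(log_text):
    """Return (node_id, start_phase, error_code) for each forced shutdown line."""
    return [tuple(int(g) for g in m.groups()) for m in PATTERN.finditer(log_text)]


shutdowns = forced_shutdowns(LOG)
# Error 2308 means "another node failed", so the remaining entries are the ones to investigate.
primary = [s for s in shutdowns if s[2] != 2308]
print(primary)  # [(4, 4, 2350)]
```

Here node 4 is the one to investigate; node 3 shut down only because node 4 failed during the system restart.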

Is this bug still active?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I'm not sure what I can add that hasn't been said. I installed MySQL from the Ubuntu repos. I can't get it to work with more than 2 databases, and not necessarily the same two. I get this same cryptic error message.

Node 2: Forced node shutdown completed. Occured during startphase 1. Caused by error2350: 'Invalid configuration received from Management Server(Configuration error). Permanent error, external action needed'.

I tried looking through the documentation, and there's very little of it. I'd like to think that it is my noobness causing it. At the same time, clustering needs to work out of the box. It's flaky at best.

Any chance for a troubleshooting guide?

--Paul