Bug #53115: When starting cluster, ndbmtd fails with error 6050 in WatchDog.cpp
Submitted: 23 Apr 2010 14:48 Modified: 26 Apr 2010 5:58
Reporter: Geir Green
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.1
OS: Solaris (Solaris 10, SPARC)
Assigned to:
CPU Architecture: Any
Tags: "7.1.3", "ndbmtd"

[23 Apr 2010 14:48] Geir Green
Description:
In a test case for MCM we have created a cluster:

"create cluster --package=mypkg --processhosts=ndb_mgmd@techra15,ndbmtd@techra14,ndbmtd@techra14,ndbmtd@techra14,ndbmtd@techra14,mysqld@techra14 mycluster"

"set portnumber:ndb_mgmd=25201,port:mysqld=25300,socket:mysqld=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/mysql.sock.1 mycluster"

But when we try to start the cluster we get:

 ERROR 7006 (00MGR) at line 1: 2010-04-18 21:54:40 [ndbd] INFO     -- Job Handling
 2010-04-18 21:54:40 [ndbd] INFO     -- WatchDog.cpp
 error=6050
 2010-04-18 21:54:40 [ndbd] INFO     -- Watchdog shutting down system
 2010-04-18 21:54:40 [ndbd] INFO     -- Watchdog shutdown completed - exiting
 sphase=0
 exit=-1
----
From ndb_3_error.log:

Current byte-offset of file-pointer is: 568                       

Time: Sunday 18 April 2010 - 21:54:38
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/local/cluster-mgt/cluster-7.1.3/bin/ndbmtd
Pid: 21930
Version: mysql-5.1.44 ndb-7.1.3
Trace: /export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/m
----
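Background on error 6050: each data node runs a watchdog thread that expects the worker threads to report progress within TimeBetweenWatchDogCheck (default 6000 ms), or within TimeBetweenWatchDogCheckInitial (default 60000 ms) during the memory-allocation phase of an initial start. If a thread stays in one state (here "Job Handling") longer than that, for instance because the host cannot schedule it, the watchdog shuts the node down with error 6050. For illustration only, the corresponding config.ini settings at their 7.1 defaults would be:

[NDBD DEFAULT]
# Defaults, in milliseconds; can be raised on slow or heavily loaded hosts
TimeBetweenWatchDogCheck=6000
TimeBetweenWatchDogCheckInitial=60000
----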
From config.ini:

# Generated cfgfile for MySQL Cluster 
# Do not edit 
# Date Sun Apr 18 21:54:07 2010
# Config version 42

[NDBD]
NodeId=5
DataDir=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/manager/clusters/mycluster/5/data
HostName=techra14

[NDBD]
NodeId=4
DataDir=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/manager/clusters/mycluster/4/data
HostName=techra14

[NDBD]
NodeId=3
DataDir=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/manager/clusters/mycluster/3/data
HostName=techra14

[NDBD]
NodeId=2
DataDir=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/manager/clusters/mycluster/2/data
HostName=techra14

[MYSQLD]
NodeId=6
HostName=techra14

[NDB_MGMD]
NodeId=1
HostName=techra15
PortNumber=25201
DataDir=/export/home/tmp/jagtmp/ndbdevMGT_Nightly/funcCLIProcessCommands_9/run/manager/clusters/mycluster/1/data
----
Excerpt from mgmt log:

2010-04-18 21:54:40 [MgmtSrvr] WARNING  -- Node 4: Node 1 missed heartbeat 2
2010-04-18 21:54:40 [MgmtSrvr] INFO     -- Node 4: Initial start, waiting for 2 and 3 to connect,  nodes [ all: 2, 3, 4 and 5 connected: 4 and 5 no-wait:  ]
2010-04-18 21:54:41 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2010-04-18 21:54:41 [MgmtSrvr] INFO     -- Node 4: Node 2 Connected
2010-04-18 21:54:42 [MgmtSrvr] INFO     -- Node 5: Node 2 Connected
2010-04-18 21:54:42 [MgmtSrvr] INFO     -- Node 1: Node 2 Connected

How to repeat:
n/a
[23 Apr 2010 15:03] Geir Green
Config files, error log, ndb trace log, ndbd stdout logs

Attachment: bug_53115.tar.gz (application/x-gzip, text), 38.18 KiB.

[26 Apr 2010 5:58] Bernd Ocklin
The error message says it: "WatchDog terminate, internal error or massive overload on the machine running this node", and the management log shows heartbeat failures as well. techra14 is one of those slow Solaris machines, and you are starting 4 x ndbmtd on it in parallel. Most likely the processes are being swapped out.

Setting this to "Not a Bug", since this is 99.9% a set-up issue rather than a bug.
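
If the host is simply overloaded, one possible mitigation (untested here, and assuming the 7.1 defaults are in effect) would be to relax the data nodes' watchdog timing through mcm, in the same way the ports were set above, e.g.:

"set TimeBetweenWatchDogCheckInitial:ndbmtd=120000 mycluster"

which doubles the initial-start watchdog interval from its 60000 ms default. Reducing the number of ndbmtd processes per host would address the root cause more directly.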