Bug #43709 Node will not restart after upgrade
Submitted: 17 Mar 2009 17:09 Modified: 4 Sep 2009 10:51
Reporter: Phil Bayfield Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:mysql-5.1-telco-6.3 OS:Linux
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: mysql-5.1-telco-6.3.23

[17 Mar 2009 17:09] Phil Bayfield
Description:
ndbd will not restart after upgrade:

# ndbd
2009-03-17 17:04:25 [ndbd] INFO     -- Unable to alloc node id
2009-03-17 17:04:25 [ndbd] INFO     -- Error : Could not alloc node id at 192.168.10.2 port 1186: No free node id found for ndbd(NDB).
error=2350
2009-03-17 17:04:25 [ndbd] INFO     -- Error handler restarting system
2009-03-17 17:04:25 [ndbd] INFO     -- Error handler shutdown completed - exiting
sphase=0
exit=-1

Attempting to force node id makes no difference.
Restarting management servers makes no difference.

ndb_mgm reports that the node id is free:

# ndb_mgm -e show
Connected to Management Server at: 192.168.10.3:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @192.168.10.6  (mysql-5.1.32 ndb-6.3.23, Nodegroup: 0, Master)
id=4 (not connected, accepting connect from 192.168.10.7)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @192.168.10.2  (mysql-5.1.32 ndb-6.3.23)
id=2    @192.168.10.3  (mysql-5.1.32 ndb-6.3.23)

[mysqld(API)]   5 node(s)
id=5    @192.168.10.4  (mysql-5.1.32 ndb-6.3.23)
id=6    @192.168.10.5  (mysql-5.1.32 ndb-6.3.23)
id=7    @192.168.10.4  (mysql-5.1.32 ndb-6.3.23)
id=8 (not connected, accepting connect from 192.168.10.6)
id=9 (not connected, accepting connect from 192.168.10.7)

How to repeat:
Upgraded cluster from previous version.
[17 Mar 2009 17:45] Phil Bayfield
Here is my config.ini

[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/var/lib/mysql-cluster

[NDB_MGMD]
Id=1
Hostname=192.168.10.2

[NDB_MGMD]
Id=2
Hostname=192.168.10.3

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/var/lib/mysql-cluster
DataMemory=3584M
IndexMemory=256M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=200000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
NoOfFragmentLogFiles=24
RedoBuffer=32M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=3600
BackupReportFrequency=2

### Realtime Extensions
RealTimeScheduler=1
SchedulerExecutionTimer=80
SchedulerSpinTimer=40

### Timeouts
TransactionDeadlockDetectionTimeout=5000
HeartbeatIntervalDbDb=3000
HeartbeatIntervalDbApi=3000

### Disk data related
DiskPageBufferMemory=64M
SharedGlobalMemory=256M

[NDBD]
Id=3
Hostname=192.168.10.6
LockExecuteThreadToCPU=2
LockMaintThreadsToCPU=3

[NDBD]
Id=4
Hostname=192.168.10.7
LockExecuteThreadToCPU=2
LockMaintThreadsToCPU=3

[MYSQLD]
Hostname=192.168.10.4

[MYSQLD]
Hostname=192.168.10.5

[MYSQLD]
Hostname=192.168.10.4

[MYSQLD]
Hostname=192.168.10.6

[MYSQLD]
Hostname=192.168.10.7
[18 Mar 2009 1:41] Phil Bayfield
I have managed to recover the node but only by performing a complete cluster shutdown and restart causing over 90 minutes of downtime to my websites and apps.

The purge stale sessions command had no effect in recovering the stale connection that appeared to be stopping node 4 from starting.

I restarted all nodes except node 3 (the working NDBD node) to no avail.

It appeared that the stale connection was something to do with the working NDB node, after restarting it the problem was resolved.
[18 Mar 2009 7:48] Sveta Smirnova
Thank you for the report.

What is the version number you upgrade from?
[18 Mar 2009 11:48] Phil Bayfield
I upgraded from 6.3.21
[19 Mar 2009 8:57] Bernd Ocklin
Could be maybe duplicate of http://bugs.mysql.com/bug.php?id=43224.

Can you please upload all your logs? mysql_error_reporter is a good tool to collect all logs.

Did you try to reproduce this?
[19 Mar 2009 11:03] Phil Bayfield
This is the management server log (same messages repeated):

2009-03-17 17:04:25 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. Connection from ip 192.168.10.7. Returned error string "No free node id found for ndbd(NDB)."
2009-03-17 17:04:25 [MgmSrvr] INFO     -- Mgmt server state: node id's 2 3 5 6 7 connected but not reserved

Or when I've tried to start ndbd with --ndb-nodeid=4 it would display Allocate nodeid (4) failed.

I've got multiple logs on the node itself as well which have the following:

Current byte-offset of file-pointer is: 568

Time: Tuesday 17 March 2009 - 23:17:25
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to alloc node id
Error object: Error : Could not alloc node id at 192.168.10.2 port 1186: No free node id found for ndbd(NDB).
Program: ndbd
Pid: 8368
Trace: <no tracefile>
Version: mysql-5.1.32 ndb-6.3.23-GA
***EOM***

I dont have the mysql_error_reporter but these are all I've found searching the logs manually.

My issue does look very much like http://bugs.mysql.com/bug.php?id=43224
[25 Mar 2009 10:29] Jonathan Miller
Please try version 6.3.24 and see if this address issue.
[25 Mar 2009 10:32] Phil Bayfield
Is there a release date for 6.3.24?
[10 Aug 2009 12:26] Jonas Oreland
6.3.24 (and 25) has been release, please retest
[4 Sep 2009 10:32] Phil Bayfield
Yep, this was fixed with 6.3.24 :)