MySQL Bugs: #43709: Node will not restart after upgrade

Bug #43709	Node will not restart after upgrade
Submitted:	17 Mar 2009 17:09	Modified:	4 Sep 2009 10:51
Reporter:	Phil Bayfield	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-6.3	OS:	Linux
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	mysql-5.1-telco-6.3.23

Description:
ndbd will not restart after upgrade:

# ndbd
2009-03-17 17:04:25 [ndbd] INFO     -- Unable to alloc node id
2009-03-17 17:04:25 [ndbd] INFO     -- Error : Could not alloc node id at 192.168.10.2 port 1186: No free node id found for ndbd(NDB).
error=2350
2009-03-17 17:04:25 [ndbd] INFO     -- Error handler restarting system
2009-03-17 17:04:25 [ndbd] INFO     -- Error handler shutdown completed - exiting
sphase=0
exit=-1

Attempting to force node id makes no difference.
Restarting management servers makes no difference.

ndb_mgm reports that the node id is free:

# ndb_mgm -e show
Connected to Management Server at: 192.168.10.3:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @192.168.10.6  (mysql-5.1.32 ndb-6.3.23, Nodegroup: 0, Master)
id=4 (not connected, accepting connect from 192.168.10.7)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @192.168.10.2  (mysql-5.1.32 ndb-6.3.23)
id=2    @192.168.10.3  (mysql-5.1.32 ndb-6.3.23)

[mysqld(API)]   5 node(s)
id=5    @192.168.10.4  (mysql-5.1.32 ndb-6.3.23)
id=6    @192.168.10.5  (mysql-5.1.32 ndb-6.3.23)
id=7    @192.168.10.4  (mysql-5.1.32 ndb-6.3.23)
id=8 (not connected, accepting connect from 192.168.10.6)
id=9 (not connected, accepting connect from 192.168.10.7)

How to repeat:
Upgraded cluster from previous version.

Here is my config.ini

[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/var/lib/mysql-cluster

[NDB_MGMD]
Id=1
Hostname=192.168.10.2

[NDB_MGMD]
Id=2
Hostname=192.168.10.3

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/var/lib/mysql-cluster
DataMemory=3584M
IndexMemory=256M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=200000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
NoOfFragmentLogFiles=24
RedoBuffer=32M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=3600
BackupReportFrequency=2

### Realtime Extensions
RealTimeScheduler=1
SchedulerExecutionTimer=80
SchedulerSpinTimer=40

### Timeouts
TransactionDeadlockDetectionTimeout=5000
HeartbeatIntervalDbDb=3000
HeartbeatIntervalDbApi=3000

### Disk data related
DiskPageBufferMemory=64M
SharedGlobalMemory=256M

[NDBD]
Id=3
Hostname=192.168.10.6
LockExecuteThreadToCPU=2
LockMaintThreadsToCPU=3

[NDBD]
Id=4
Hostname=192.168.10.7
LockExecuteThreadToCPU=2
LockMaintThreadsToCPU=3

[MYSQLD]
Hostname=192.168.10.4

[MYSQLD]
Hostname=192.168.10.5

[MYSQLD]
Hostname=192.168.10.4

[MYSQLD]
Hostname=192.168.10.6

[MYSQLD]
Hostname=192.168.10.7

I have managed to recover the node but only by performing a complete cluster shutdown and restart causing over 90 minutes of downtime to my websites and apps.

The purge stale sessions command had no effect in recovering the stale connection that appeared to be stopping node 4 from starting.

I restarted all nodes except node 3 (the working NDBD node) to no avail.

It appeared that the stale connection was something to do with the working NDB node, after restarting it the problem was resolved.

Thank you for the report.

What is the version number you upgrade from?

I upgraded from 6.3.21

Could be maybe duplicate of http://bugs.mysql.com/bug.php?id=43224.

Can you please upload all your logs? mysql_error_reporter is a good tool to collect all logs.

Did you try to reproduce this?

This is the management server log (same messages repeated):

2009-03-17 17:04:25 [MgmSrvr] WARNING  -- Allocate nodeid (0) failed. Connection from ip 192.168.10.7. Returned error string "No free node id found for ndbd(NDB)."
2009-03-17 17:04:25 [MgmSrvr] INFO     -- Mgmt server state: node id's 2 3 5 6 7 connected but not reserved

Or when I've tried to start ndbd with --ndb-nodeid=4 it would display Allocate nodeid (4) failed.

I've got multiple logs on the node itself as well which have the following:

Current byte-offset of file-pointer is: 568

Time: Tuesday 17 March 2009 - 23:17:25
Status: Permanent error, external action needed
Message: Invalid configuration received from Management Server (Configuration error)
Error: 2350
Error data: Unable to alloc node id
Error object: Error : Could not alloc node id at 192.168.10.2 port 1186: No free node id found for ndbd(NDB).
Program: ndbd
Pid: 8368
Trace: <no tracefile>
Version: mysql-5.1.32 ndb-6.3.23-GA
***EOM***

I dont have the mysql_error_reporter but these are all I've found searching the logs manually.

My issue does look very much like http://bugs.mysql.com/bug.php?id=43224

Please try version 6.3.24 and see if this address issue.

Is there a release date for 6.3.24?

6.3.24 (and 25) has been release, please retest

Yep, this was fixed with 6.3.24 :)