Bug #47316  Using two management servers does not work
Submitted: 15 Sep 2009 9:14    Modified: 16 Sep 2009 13:18
Reporter: Johan Andersson    Email Updates:
Status: Closed    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: mysql-5.1-telco-7.0.8 (bzr)    OS: Linux
Assigned to: Magnus Blåudd    CPU Architecture: Any
Tags: ndb_mgmd

[15 Sep 2009 9:14] Johan Andersson
Description:
Hi,

I pulled mysql-5.1-telco-7.0.8 last night and I have now built it.
I have two management servers that I want to start, but they fail to start with:

ndb_1_cluster.log:

2009-09-15 10:53:17 [MgmSrvr] INFO     -- Starting initial configuration change
2009-09-15 10:53:17 [MgmSrvr] ERROR    -- The file '/etc/mysql//ndb_1_config.bin.1' already exist while preparing
2009-09-15 10:53:17 [MgmSrvr] WARNING  -- Node 1 refused configuration change, error: 6
2009-09-15 10:53:17 [MgmSrvr] WARNING  -- Node 2 refused configuration change, error: 6
2009-09-15 10:53:17 [MgmSrvr] ERROR    -- Configuration change failed! error: 6 'Prepare of config change failed'

ndb_2_cluster.log:

2009-09-15 10:53:16 [MgmSrvr] INFO     -- Got initial configuration from '/etc/mysql/config.ini', will try to set it when all ndb_mgmd(s) started
2009-09-15 10:53:16 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 192.9.73.12, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000004.
2009-09-15 10:53:16 [MgmSrvr] INFO     -- Node 2: Node 2 Connected
2009-09-15 10:53:16 [MgmSrvr] INFO     -- Id: 2, Command port: *:1186
2009-09-15 10:53:16 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
2009-09-15 10:53:16 [MgmSrvr] INFO     -- Node 1 connected
2009-09-15 10:53:17 [MgmSrvr] ERROR    -- The file '/etc/mysql//ndb_2_config.bin.1' already exist while preparing
2009-09-15 10:53:17 [MgmSrvr] ALERT    -- Node 2: Node 1 Disconnected

What I do:
* start ndb_mgmd (id=1) with --initial --reload (this config was loaded successfully with 7.0.7)
* start ndb_mgmd (id=2) with --initial --reload

What do you want me to do now?
Does it work for you with two management servers?

Config.ini below:

[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/data1/mysqlcluster/

[NDB_MGMD]
Id=1
Hostname=ps-ndb01
ArbitrationRank=1

[NDB_MGMD]
Id=2
Hostname=ps-ndb02
ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/data1/mysqlcluster/
FileSystemPathDD=/data1/mysqlcluster/
#FileSystemPathUndoFiles=/data1/mysqlcluster/
#FileSystemPathDataFiles=/data1/mysqlcluster/
DataMemory=2048M
IndexMemory=256M
LockPagesInMainMemory=0

MaxNoOfConcurrentOperations=100000

StringMemory=25
MaxNoOfTables=20000
MaxNoOfOrderedIndexes=10000
MaxNoOfUniqueHashIndexes=2500
MaxNoOfAttributes=120000
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=12
RedoBuffer=32M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=30
BackupReportFrequency=10

### Params for setting logging 
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput 
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M
#Reports indicate that ODirect=1 can cause I/O errors (OS error code 5) on some systems. You must test.
#ODirect=1

### Watchdog 
TimeBetweenWatchdogCheckInitial=60000

### TransactionInactiveTimeout  - should be enabled in Production 
#TransactionInactiveTimeout=30000
### CGE 6.3 - REALTIME EXTENSIONS 
#RealTimeScheduler=1
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=40

### DISK DATA 
#SharedGlobalMemory=384M
#read my blog how to set this:
#DiskPageBufferMemory=3072M

### Multithreading 
MaxNoOfExecutionThreads=8

### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=32M

BatchSizePerLocalScan=512
[NDBD]
Id=3
Hostname=ps-ndb05

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPTS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR LockExecuteThreadToCPU, BUT YOU CAN SET
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ON ONE COMPUTER).
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=4
Hostname=ps-ndb06

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPTS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR LockExecuteThreadToCPU, BUT YOU CAN SET
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ON ONE COMPUTER).
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

##	BELOW ARE TWO (INACTIVE) SLOTS FOR DATA NODES TO ALLOW FOR GROWTH
#[NDBD]
#Id=5
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPTS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR LockExecuteThreadToCPU, BUT YOU CAN SET
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ON ONE COMPUTER).
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

#[NDBD]
#Id=6
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPTS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR LockExecuteThreadToCPU, BUT YOU CAN SET
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ON ONE COMPUTER).
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[MYSQLD DEFAULT]
BatchSize=512
#BatchByteSize=2048K
#MaxScanBatchSize=2048K

[MYSQLD]
Id=7
Hostname=ps-ndb01
[MYSQLD]
Id=8
Hostname=ps-ndb01
[MYSQLD]
Id=9
Hostname=ps-ndb01
[MYSQLD]
Id=10
Hostname=ps-ndb01

How to repeat:
Cluster with two management servers:

node 1: 
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/ --initial --reload

node 2:
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/ --initial --reload

On node1:
ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Unable to connect with connect string: nodeid=0,localhost:1186
Retrying every 5 seconds. Attempts left: 2 ^C

Suggested fix:
-
[15 Sep 2009 9:45] Sveta Smirnova
Thank you for the report.

I cannot repeat the described behavior if I start ndb_mgmd with --ndb-nodeid:

$$BASEDIR/libexec/ndb_mgmd -f etc/ndb_mgmd.cfg 
2009-09-15 11:42:15 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.37 ndb-7.0.8
2009-09-15 11:42:15 [MgmSrvr] INFO     -- The default config directory '/users/ssmirnova/blade12/build/mysql-5.1-telco-7.0//mysql-cluster' does not exist. Trying to create it...
2009-09-15 11:42:15 [MgmSrvr] INFO     -- Sucessfully created config directory
2009-09-15 11:42:15 [MgmSrvr] INFO     -- Reading cluster configuration from 'etc/ndb_mgmd.cfg'
2009-09-15 11:42:15 [MgmSrvr] ERROR    -- Could not determine which nodeid to use for this node. Specify it with --ndb-nodeid=<nodeid> on command line

$$BASEDIR/libexec/ndb_mgmd -f etc/ndb_mgmd.cfg --ndb-nodeid=1
2009-09-15 11:42:21 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.37 ndb-7.0.8
2009-09-15 11:42:21 [MgmSrvr] INFO     -- Reading cluster configuration from 'etc/ndb_mgmd.cfg'

$$BASEDIR/libexec/ndb_mgmd -f etc/ndb_mgmd.cfg --ndb-nodeid=2
2009-09-15 11:42:25 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.37 ndb-7.0.8
2009-09-15 11:42:25 [MgmSrvr] INFO     -- Reading cluster configuration from 'etc/ndb_mgmd.cfg'

...

$$BASEDIR/bin/ndb_mgm --ndb-mgmd-host=127.0.0.1:35128
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: 127.0.0.1:35128
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @127.0.0.1  (mysql-5.1.37 ndb-7.0.8, Nodegroup: 0, Master)
id=4    @127.0.0.1  (mysql-5.1.37 ndb-7.0.8, Nodegroup: 0)

[ndb_mgmd(MGM)] 2 node(s)
id=1    @127.0.0.1  (mysql-5.1.37 ndb-7.0.8)
id=2    @127.0.0.1  (mysql-5.1.37 ndb-7.0.8)

[mysqld(API)]   4 node(s)
id=5 (not connected, accepting connect from localhost)
id=6 (not connected, accepting connect from localhost)
id=7 (not connected, accepting connect from localhost)
id=8 (not connected, accepting connect from localhost)

Please check if this solves the problem in your case as well.
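
For the setup in the original report, the equivalent start commands with explicit node IDs would presumably be (hostnames and paths taken from the config.ini above):

On ps-ndb01:
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/ --initial --reload --ndb-nodeid=1

On ps-ndb02:
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/ --initial --reload --ndb-nodeid=2

Then check from either host:
ndb_mgm -e show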
[15 Sep 2009 12:17] Bernd Ocklin
Same happens with only one ndb_mgmd:

ndb_mgmd -f file
stop ndb_mgmd
restart ndb_mgmd -f same-file 

It fails with or without --reload and --initial. And --ndb-nodeid was not required previously.
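
A concrete sketch of that sequence (assuming the same config file and configdir as in the report):

ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/
ndb_mgm -e "1 stop"        (node id 1 assumed; killing the ndb_mgmd process works too)
ndb_mgmd -f /etc/mysql/config.ini --configdir=/etc/mysql/

The second start then fails in the same way with "The file ... already exist while preparing".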
[15 Sep 2009 19:07] Magnus Blåudd
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/83321
[15 Sep 2009 19:09] Magnus Blåudd
Regression caused by new version of NdbDir::next_file iterator. Never released.
[16 Sep 2009 8:23] Magnus Blåudd
Pushed to 7.0 and 7.1.
[16 Sep 2009 8:31] Johan Andersson
thanks!
It works now!
-johan
[16 Sep 2009 13:18] Jon Stephens
Per the developer's comments, the regression never appeared in a release, so no changelog entry is needed.

Closed without further action.