MySQL Bugs: #46156: Cluster with 2 Management Nodes

Bug #46156	Cluster with 2 Management Nodes
Submitted:	13 Jul 2009 20:04	Modified:	14 Sep 2009 16:23
Reporter:	Sajjad Tariq
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.0	OS:	Windows (XP)
Assigned to:	jack andrews	CPU Architecture:	Any
Tags:	NDB_MGMD nodes 5.1.34-ndb-7.0.6-cluster-gpl, NDB_MGMD nodes 5.1.35-ndb-7.0.7-cluster-gpl

Description:
When ndb_mgm is started on a Cluster with 2 management (ndb_mgmd) nodes, ndb_mgm cannot get the config and ndb_mgmd crashes. The same setup works with a single management (ndb_mgmd) node.

=====================================================================

I start both management nodes with the config.ini shown under "How to repeat" below.

C:\Program Files\MySQL\MySQL Server 7.0\bin>ndb_mgmd --initial
2009-07-13 13:47:21 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.34 ndb-7.0.6
2009-07-13 13:47:22 [MgmSrvr] INFO     -- Reading cluster configuration from 'C:/mysql/mysql-cluster/ndb_mgmd/config.ini'

=====================================================================

When I start the management console, I get an unhandled exception error, which the Visual Studio Just-In-Time Debugger offers to debug. The message states:

An unhandled win32 exception occurred in ndb_mgmd.exe[3564]

When I debug it in VS2005, I get the following message in the Immediate window:

Unhandled exception at 0x7c911780 in ndb_mgmd.exe: 0xC0000005: Access violation reading location 0x41414131.

When I break into the disassembly, execution stops at the line marked "Break":

7C911753  call        7C95F13F 
7C911758  jmp         7C910A45 
7C91175D  mov         esi,dword ptr [eax+4] 
7C911760  sub         esi,8 
7C911763  mov         dword ptr [ebp-38h],esi 
7C911766  mov         al,byte ptr [esi+5] 
7C911769  mov         byte ptr [ebp-1Dh],al 
7C91176C  lea         ecx,[esi+8] 
7C91176F  mov         edi,dword ptr [ecx] 
7C911771  mov         dword ptr [ebp-1B8h],edi 
7C911777  mov         edx,dword ptr [esi+0Ch] 
7C91177A  mov         dword ptr [ebp-88h],edx 
7C911780  mov         edx,dword ptr [edx]        <========== Break
7C911782  cmp         edx,dword ptr [edi+4] 
7C911785  jne         7C936AD6 
7C91178B  cmp         edx,ecx 
7C91178D  jne         7C936AD6 
7C911793  mov         ecx,dword ptr [ebp-88h] 
7C911799  mov         dword ptr [ecx],edi 
7C91179B  mov         dword ptr [edi+4],ecx 
7C91179E  cmp         edi,ecx 
7C9117A0  jne         7C9117D1 
7C9117A2  movzx       ecx,word ptr [esi] 
7C9117A5  mov         edx,ecx 
7C9117A7  shr         edx,3 
7C9117AA  mov         dword ptr [ebp-1C0h],edx 
7C9117B0  and         ecx,7 
7C9117B3  xor         edi,edi 
7C9117B5  inc         edi  
7C9117B6  shl         edi,cl 
7C9117B8  mov         dword ptr [ebp-0B0h],edi 
7C9117BE  lea         edi,[edx+ebx+158h] 
7C9117C5  xor         ecx,ecx 
7C9117C7  mov         cl,byte ptr [edi] 
7C9117C9  xor         ecx,dword ptr [ebp-0B0h]

=================================================================
On the management console I get this output:

C:\Program Files\MySQL\MySQL Server 7.0\bin>ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Failed to unpack buffer
Could not get configuration
*     0: No error
*        Executing: ndb_mgm_get_configuration
ndb_mgm> show
Connected to Management Server at: localhost:1186
Could not get configuration
*   145: Error
*        Time out talking to management server
ndb_mgm>

How to repeat:
Config.ini (for Both MGMD)
==========================

[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=C:/mysql/mysql-cluster/ndb_mgmd/

[NDB_MGMD]
Id=1
Hostname=MGMD1
ArbitrationRank=1

[NDB_MGMD]
Id=2
Hostname=MGMD2
ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=C:/mysql/mysql-cluster/ndbd/
DataMemory=1280M
IndexMemory=150M
LockPagesInMainMemory=0

MaxNoOfConcurrentOperations=100000

StringMemory=25
#MaxNoOfTables=4096
MaxNoOfOrderedIndexes=10000
MaxNoOfAttributes=10000
#MaxNoOfUniqueHashIndexes=512

DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
InitFragmentLogFiles=FULL
NoOfFragmentLogFiles=6
RedoBuffer=32M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=30
BackupReportFrequency=10

### Params for setting logging 
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for increasing Disk throughput 
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M
#Reports indicates that odirect=1 can cause io errors (os err code 5) on some systems. You must test.
#ODirect=1

### Watchdog 
TimeBetweenWatchdogCheckInitial=30000

### TransactionInactiveTimeout  - should be enabled in Production 
#TransactionInactiveTimeout=30000
### CGE 6.3 - REALTIME EXTENSIONS 
#RealTimeScheduler=1
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=40

### DISK DATA 
#SharedGlobalMemory=384M
#read my blog how to set this:
#DiskPageBufferMemory=3072M
BatchSizePerLocalScan=512
[NDBD]
Id=3
Hostname=ndb204

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=4
Hostname=ndb203

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=5
Hostname=ndb202

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[NDBD]
Id=6
Hostname=ndb201

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

##	BELOW ARE TWO (INACTIVE) SLOTS FOR DATA NODES TO ALLOW FOR GROWTH
#[NDBD]
#Id=7
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

#[NDBD]
#Id=8
#Hostname=

### CGE 6.3 - REALTIME EXTENSIONS 
### PLEASE NOTE THAT THE BELOW ONLY WORKS IF YOU HAVE >1 CORE.
### YOU SHOULD CHECK cat /proc/interrupts AND CHOOSE THE CPUs
### THAT GENERATE THE LEAST INTERRUPS. TYPICALLY THE CPU HANDLING
### THE INTERRUPTS FOR THE COMMUNICATION INTERFACE USED FOR THE DATA NODE SHOULD
### BE AVOIDED FOR THE LockExecuteThreadToCPU, BUT YOU CAN
### LockMaintThreadsToCPU TO THAT CPU SINCE IT DOES NOT AFFECT THE
### REALTIME ASPECTS (THIS IS TRUE FOR UP TO TWO DATA NODES ONE ONE COMPUTER.
#LockExecuteThreadToCPU=X
#LockMaintThreadsToCPU=Y

[MYSQLD DEFAULT]
BatchSize=512
#BatchByteSize=2048K
#MaxScanBatchSize=2048K

[MYSQLD]
Id=9
#Hostname=api1
[MYSQLD]
Id=10
#Hostname=api1
[MYSQLD]
Id=11
#Hostname=api1

[MYSQLD]
Id=12
#Hostname=api2
[MYSQLD]
Id=13
#Hostname=api2
[MYSQLD]
Id=14
#Hostname=api2
[MYSQLD]
Id=15
#Hostname=api2
[MYSQLD]
Id=16
#Hostname=api2

=====================================================================

my.ini
==========
# All files in this package is subject to the GPL v2 license
# More information is in the COPYING file in the top directory of this package.
# Copyright (C) 2009 severalnines.com
[ndb_mgmd]
config-file="C:/mysql/mysql-cluster/ndb_mgmd/config.ini"
configdir="C:/mysql/mysql-cluster/ndb_mgmd/"

[MYSQLD]
user=mysql
basedir="C:/Program Files/MySQL/MySQL Server 7.0/"
datadir=C:/mysql/mysql-cluster/data/
#socket=/data/mysql//mysql.sock
pid-file=mysqld.pid
port=3306
ndb-cluster-connection-pool=4
ndbcluster
ndb-connectstring="MGMD1:1186;MGMD2:1186"
ndb-force-send=1
ndb-use-exact-count=0
ndb-extra-logging=1
ndb-autoincrement-prefetch-sz=256
engine-condition-pushdown=1

#REPLICATION SPECIFIC - GENERAL
#server-id must be unique across all mysql servers participating in replication.
server-id=254
#REPLICATION SPECIFIC - MASTER
log-bin=binlog

#OTHER THINGS, BUFFERS ETC
key_buffer = 256M
max_allowed_packet = 16M
sort_buffer_size = 512K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
thread_cache_size=1024
myisam_sort_buffer_size = 8M
init_connect='set autocommit=0'
memlock
sysdate_is_now
max-connections=200
thread-cache-size=64 
query-cache-type = 0
query-cache-size = 0
table-open_cache=1024
table-cache=512
lower-case-table-names=0

[MYSQL]
#socket=/data/mysql//mysql.sock

After installing the 7.0.7 patch, I used the same config as above, but ndb_mgmd now cannot confirm the configuration.

===============================================================================
I start mgmd1 with "--initial" and it crashes when I start mgmd2 with "--initial".

C:\Program Files\MySQL\MySQL Server 7.0\bin>ndb_mgmd --initial
2009-07-17 10:15:04 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.35 ndb-7.0.7
2009-07-17 10:15:05 [MgmSrvr] INFO     -- Reading cluster configuration from 'C:/mysql/mysql-cluster/ndb_mgmd/config.ini'

C:\Program Files\MySQL\MySQL Server 7.0\bin>
===============================================================================
I start mgmd1 again and it crashes because mgmd2, started with "--initial", is already up.

C:\Program Files\MySQL\MySQL Server 7.0\bin>ndb_mgmd
2009-07-17 10:21:31 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.35 ndb-7.0.7
2009-07-17 10:21:32 [MgmSrvr] INFO     -- Reading cluster configuration from 'C:/mysql/mysql-cluster/ndb_mgmd/config.ini'

C:\Program Files\MySQL\MySQL Server 7.0\bin>
===============================================================================
Output from the log file.

2009-07-17 10:20:51 [MgmSrvr] INFO     -- Got initial configuration from 'C:\mysql\mysql-cluster\ndb_mgmd\config.ini', will try to set it when all ndb_mgmd(s) started
2009-07-17 10:20:51 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 10.1.23.120, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000004.
2009-07-17 10:20:51 [MgmSrvr] INFO     -- Node 2: Node 2 Connected
2009-07-17 10:20:51 [MgmSrvr] INFO     -- Id: 2, Command port: *:1186
2009-07-17 10:20:51 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
2009-07-17 10:20:52 [MgmSrvr] INFO     -- Node 1 connected
2009-07-17 10:20:52 [MgmSrvr] WARNING  -- Refusing to start initial config change when nodes have different config
This is the actual diff:
[SYSTEM]
-Name=MC_20090717101505
+Name=MC_20090717102051

[ndbd(DB)]
NodeId=3
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=4
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=5
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=6
-DiskIOThreadPool=2
+DiskIOThreadPool=8

2009-07-17 10:20:52 [MgmSrvr] ALERT    -- Node 2: Node 1 Disconnected
2009-07-17 10:21:33 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
2009-07-17 10:21:33 [MgmSrvr] WARNING  -- Refusing to start initial config change when nodes have different config
This is the actual diff:
[SYSTEM]
-Name=MC_20090717102132
+Name=MC_20090717102051

[ndbd(DB)]
NodeId=3
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=4
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=5
-DiskIOThreadPool=2
+DiskIOThreadPool=8

[ndbd(DB)]
NodeId=6
-DiskIOThreadPool=2
+DiskIOThreadPool=8

2009-07-17 10:21:33 [MgmSrvr] ALERT    -- Node 2: Node 1 Disconnected
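
The diff blocks in the log above repeat the same few parameters for every data node. As a rough aid for reading such logs, the Python sketch below collects the distinct parameters whose values differ. It assumes the log always prints a changed parameter as a "-Name=old" line immediately followed by a "+Name=new" line, as in the excerpt above; the function name changed_parameters is hypothetical and not part of MySQL Cluster.

```python
def changed_parameters(log_text):
    """Return sorted, de-duplicated (parameter, old, new) tuples.

    Assumes each change appears as a '-Name=old' line directly
    followed by a '+Name=new' line, as in the log excerpt above.
    """
    changes = set()
    removed = None  # [name, value] of the most recent '-' line, if any
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("-") and "=" in line:
            removed = line[1:].split("=", 1)
        elif line.startswith("+") and "=" in line and removed is not None:
            name, new_value = line[1:].split("=", 1)
            if name == removed[0]:
                changes.add((name, removed[1], new_value))
            removed = None
        else:
            # Any other line (section header, NodeId, timestamped entry)
            # breaks a pending '-' / '+' pair.
            removed = None
    return sorted(changes)
```

Applied to the log above, this would reduce the output to the generated [SYSTEM] Name values and the DiskIOThreadPool setting.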

See the warning:

>2009-07-17 10:21:33 [MgmSrvr] WARNING -- Refusing to start initial config change when nodes have different config

You must either have the same config.ini on both hosts, or not use a config.ini for the second mgmd and have it fetch its config from the first.

To have the second mgmd fetch the first one's config, start it with
  --ndb-connectstring=host:port
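
For the first option (the same config.ini on both hosts), a quick pre-flight comparison of the two files may help. The Python sketch below is only an illustration: the function names and the file labels are hypothetical, and it assumes comments start with "#", as they do in the config.ini posted above. It compares the text of the files only, so values that come from built-in defaults (for example DiskIOThreadPool, which the posted config.ini does not set) would not show up here.

```python
import difflib


def normalized_lines(text):
    """Strip whitespace and drop blank lines and '#' comment lines."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        lines.append(line)
    return lines


def config_diff(text_a, text_b):
    """Return unified-diff lines for two config.ini contents.

    An empty list means the files match once comments and blank
    lines are ignored.
    """
    return list(difflib.unified_diff(
        normalized_lines(text_a),
        normalized_lines(text_b),
        fromfile="mgmd1/config.ini",  # label only, hypothetical
        tofile="mgmd2/config.ini",    # label only, hypothetical
        lineterm="",
        n=0,
    ))
```

To use it, read the config.ini from each management host into a string and pass both to config_diff; any returned lines point at settings to reconcile before starting the nodes with --initial.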

The first problem mentioned was fixed by a patch for the related bug #46061. The second problem is not a bug (see my previous comment).

Jack,

Sorry it took this long to respond. With the 7.0.7 GA release I am able to set up 2 management nodes with the exact same config on both mgmd.

Thank you,

Sajjad