Bug #11776: MySQL Cluster unable to restart after clean shutdown
Submitted: 6 Jul 2005 14:48 Modified: 22 Jul 2005 14:28
Reporter: Fabio Coatti
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 4.1.12    OS: Linux (AMD64)
Assigned to: Assigned Account    CPU Architecture: Any

[6 Jul 2005 14:48] Fabio Coatti
Description:
see also:
http://forums.mysql.com/read.php?25,31822,32556#msg-32556

If I shut the cluster down cleanly, for example with the "shutdown" command from ndb_mgm, I'm unable to restart it: I have to launch ndbd with the --initial option and recover from a backup.
I have 2 storage nodes, 2 active API nodes (mysqld) plus several more configured (for restoring backups), and one management node.
Each of the two dual-Opteron machines hosts one storage node and one API node; the management node runs on a separate i386 machine.
All installed MySQL components are 4.1.12.

If I shut down one storage node, I can restart it and the cluster is back up within a few seconds. If I shut down both storage nodes, the restart sequence stops at phase 4; the log is below:

2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 111 activated done
2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 112 activated done
2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 113 activated done
2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 114 activated done
2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 115 activated done
2005-06-27 23:31:26 [MgmSrvr] INFO -- Node 3: DICT: index 116 activated done
2005-06-27 23:31:42 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2005-06-27 23:31:42 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000016.

and the sequence restarts from the first phase; nothing changes if both nodes are restarted together.

The configuration seems OK, given that everything works fine in normal operation and that if I kill one storage node I can restart it without problems.

Note on the systems: the storage nodes are two AMD64 machines with 8 GB of RAM; each hosts one storage node and one API node. The storage nodes are connected via Gigabit Ethernet (crossover cable).
The management node is on an x86 server on a different network (i.e. communication between the management node and the storage nodes crosses a router, as can be seen from config.ini).
All nodes use version 4.1.12 (mysqld, ndbd, management).
Backup, restore, and normal operations work just fine; I can kill one storage node and restart it without any problem.

config.ini:

[NDBD DEFAULT]
NoOfReplicas=2
DataDir=/var/lib/mysql-cluster
DataMemory=5632M
IndexMemory=1700M
MaxNoOfAttributes=10000
MaxNoOfTables=1024
MaxNoOfOrderedIndexes=1024
MaxNoOfUniqueHashIndexes=512
MaxNoOfConcurrentTransactions=131072
MaxNoOfConcurrentOperations=1048576
TimeBetweenLocalCheckpoints=23
TimeBetweenGlobalCheckpoints=32000
NoOfFragmentLogFiles=200
# TimeBetweenWatchDogCheck=60000000
TimeBetweenWatchDogCheck=10000
TimeBetweenInactiveTransactionAbortCheck=2000
TransactionDeadlockDetectionTimeout=6000
NoOfDiskPagesToDiskAfterRestartTUP=160
NoOfDiskPagesToDiskAfterRestartACC=80
MaxNoOfTriggers=1536
UndoIndexBuffer=6MB
UndoDataBuffer=48MB
RedoBuffer=128MB
StopOnError=false
#LOG LEVEL
LogLevelStartup=15
LogLevelShutdown=15
LogLevelStatistic=15
LogLevelCheckpoint=15
LogLevelNodeRestart=15
LogLevelConnection=15
LogLevelError=15
LogLevelInfo=15
#OTRAS
StartFailureTimeout=1800000
# StartFailureTimeout=0
# LockPagesInMainMemory=1
[MYSQLD DEFAULT]
ArbitrationRank=0
[NDB_MGMD DEFAULT]
[TCP DEFAULT]
# Managment Server
[NDB_MGMD]
Id=1
HostName=xxx.yyy.zzz.106 # the IP of THIS SERVER
DataDir=/var/lib/mysql-cluster
# Storage Engines
[NDBD]
Id=2
HostName=xxx.yyy.kkk.2 # the IP of the SECOND SERVER
#
[NDBD]
Id=3
HostName=xxx.yyy.kkk.3 # the IP of the SECOND SERVER
# 2 MySQL Clients
# I personally leave this blank to allow rapid changes of the mysql clients;
# you can enter the hostnames of the above two servers here. I suggest you dont.
[MYSQLD]
#HostName=xxx.yyy.kkk.2 # the IP of the FIRST API SERVER
#Id=4
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
#HostName=xxx.yyy.kkk.3 # the IP of the SECOND API SERVER
#Id=5
[TCP]
NodeId1=2
NodeId2=3
HostName1=192.168.50.2
HostName2=192.168.50.3

Node 3 error log:

Date/Time: Monday 27 June 2005 - 23:31:41
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 443) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 28734
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.19
Version 4.1.12
***EOM*** 

excerpt from trace.log:
-------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6919125 gsn: 89 "ACC_SRCONF" prio: 1
s.bn: 248 "DBACC", s.proc: 3, s.sigId: 6919123 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000000
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 6919124 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919122 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 3, r.sigId: 6919123 gsn: 255 "FSCLOSECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919122 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 1
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 6919122 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919118 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 6922843 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 6922841 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 249 "DBTUP", r.proc: 3, r.sigId: 6919120 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919118 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 1
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 3, r.sigId: 6919119 gsn: 89 "ACC_SRCONF" prio: 1
s.bn: 248 "DBACC", s.proc: 3, s.sigId: 6919117 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000001
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 6919118 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919115 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 248 "DBACC", r.proc: 3, r.sigId: 6919117 gsn: 255 "FSCLOSECONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6919115 length: 1 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 0
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 6922842 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 3, s.sigId: 6922840 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 3, r.sigId: 6922838 gsn: 264 "FSREADREQ" prio: 0
s.bn: 248 "DBACC", s.proc: 3, s.sigId: 6922837 length: 8 trace: 0 #sec: 0 fragInf: 0
 UserPointer: 1
 FilePointer: 1135
 UserReference: H'00f80003 Operation flag: H'00000000 (No sync, Format=List of pairs)
 varIndex: 1
 numberOfPages: 1
 pageData:  H'00000010, H'00000000

--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 3, r.sigId: 6922836 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 3, s.sigId: 6922834 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004

If the whole trace is needed, just ask and I'll send it by e-mail (1 MB).

How to repeat:
To repeat this, I only have to log in on the management node and invoke a shutdown of all nodes. After this, the only way to restart the storage nodes is to use the --initial option.
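In command form, the failing cycle and the workaround are roughly the following (a sketch only; it assumes each ndbd host already has its management connectstring configured and uses the DataDir from the config.ini above):

# from the management client on the mgm node: clean shutdown of all data nodes
ndb_mgm> SHUTDOWN

# normal restart attempt on each storage node host -- this is what gets stuck at start phase 4
ndbd

# the only workaround found so far: clear the node file systems and start from scratch,
# which then requires restoring the data with ndb_restore from the latest backup
ndbd --initial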
[6 Jul 2005 16:17] Fabio Coatti
Attached a file with the full trace dump.
[6 Jul 2005 20:09] Jonas Oreland
Hi,

This is a duplicate of http://bugs.mysql.com/bug.php?id=8918
(among others...there were 5 bug reports for the same bug :-))

This happens when using more than 4 GB of memory, or more precisely when writing data to disk that was mapped at an address above 4 GB.

This has been fixed in the source and the fix will be in 4.1.13.
[7 Jul 2005 13:57] Fabio Coatti
I've just reinstalled from source downloaded with BitKeeper, and the behaviour is exactly the same.
I've done this:
backup (start backup)
shutdown all
compiled and installed sources (both storage nodes and mgm node)
started storage nodes with --initial
recovered from backup
issued a shutdown from mgm node
started storage nodes (without --initial)

... and the storage nodes failed with the same behaviour, i.e.:

2005-07-07 15:46:54 [MgmSrvr] INFO     -- Node 2: DICT: index 117 activated done
2005-07-07 15:46:54 [MgmSrvr] INFO     -- Node 2: DICT: index 118 activated done
2005-07-07 15:46:54 [MgmSrvr] INFO     -- Node 2: DICT: index 119 activated done
2005-07-07 15:46:54 [MgmSrvr] INFO     -- Node 2: DICT: index 120 activated done
2005-07-07 15:47:10 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-07-07 15:47:10 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected

Am I doing something wrong? I'm pretty sure the installed binaries are the right ones (just compiled), but which version should be visible on the management node? It says 4.1.12 even for the BK-pulled version, right?
[18 Jul 2005 13:03] Fabio Coatti
I've just tried with the nightly build as of 14/07, but no changes to this behaviour...
[20 Jul 2005 8:18] Jonas Oreland
Hi

Can you attach the error/trace log from the latest build?

The old error/trace has been fixed... but you might have found another error...
[20 Jul 2005 17:28] Fabio Coatti
I've compiled the BK version downloaded today;

(tomorrow I can try with tar.gz of nightly build, if needed)

The behaviour is the same; the errors are below:
node 2:
Date/Time: Wednesday 20 July 2005 - 19:17:28
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 443) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 25533
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.3
Version 4.1.12
***EOM***

Node 3:

Date/Time: Wednesday 20 July 2005 - 19:17:28
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 443) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 5177
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.19
Version 4.1.12
***EOM***

I'll attach the trace files ASAP.
[21 Jul 2005 13:47] Fabio Coatti
I've also tried with the 2005-07-16 nightly build; the behaviour is slightly different. The storage nodes reached and passed phase 9 and, according to the management node, the status was "started", but the tables were not OK.

Steps:
Compiled the snapshot (16/07/05)
backed up the whole database
stopped and relaunched the storage, API, and mgm nodes (--initial)
restored the backup (backup/restore commands are sketched below)
issued a shutdown from the mgm node
restarted the system, of course without --initial
when I tried to access the tables, the database was corrupt (no tables)
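The backup and restore steps above were done roughly as follows (again only a sketch; backup id 1, the node ids, and the default backup directory under DataDir are illustrative, and the exact ndb_restore invocation may differ slightly between builds):

# from the management client: online backup of the whole cluster
ndb_mgm> START BACKUP

# after restarting the data nodes with --initial, restore the metadata once (-m)
# and then the data (-r) for each data node, running from the backup directory
cd /var/lib/mysql-cluster/BACKUP/BACKUP-1
ndb_restore -n 2 -b 1 -m -r
ndb_restore -n 3 -b 1 -r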

excerpts from mgm log:

2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: DICT: index 126 rebuild done
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: DICT: index 127 rebuild done
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: DICT: index 128 rebuild done
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 6 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 7 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 8 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 9 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 10 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Communication to Node 0 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Start phase 8 completed (system restart)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Start phase 9 completed (system restart)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Started (version 4.1.12)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 7 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 8 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 9 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 10 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Communication to Node 0 opened
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Start phase 8 completed (system restart)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Start phase 9 completed (system restart)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 3: Started (version 4.1.12)
2005-07-21 11:11:07 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 2: Node 4: API version 4.1.12
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 3: Node 4: API version 4.1.12
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 3: Node 1: API version 4.1.12
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 2: Node 1: API version 4.1.12
2005-07-21 11:11:08 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=3f2e000138a90f28]
2005-07-21 11:11:09 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=3f2e000138a90f28]
2005-07-21 11:11:53 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 reserved for ip X.Y.Z.W, m_reserved_nodes 000000000000002e.
2005-07-21 11:11:53 [MgmSrvr] INFO     -- Node 3: Node 5 Connected
2005-07-21 11:11:53 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2005-07-21 11:11:53 [MgmSrvr] INFO     -- Node 2: Node 5: API version 4.1.12
2005-07-21 11:11:53 [MgmSrvr] INFO     -- Node 3: Node 5: API version 4.1.12
2005-07-21 11:12:06 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2005-07-21 11:12:06 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2005-07-21 11:12:06 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2005-07-21 11:12:06 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: GCP Take over started
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: LCP Take over started
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: GCP Take over completed
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: LCP Take over completed (state = 4)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: ParticipatingDIH = 0000000000000000
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: ParticipatingLQH = 0000000000000000
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 1
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 0: TabCopyStatus: 4 TabUpdateStatus: 2 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 1: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 2: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle) 1(on 2)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 3: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle) 1(on 2)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 4: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle) 1(on 2)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 5: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle) 1(on 2)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3: Table 6: TabCopyStatus: 0 TabUpdateStatus: 0 TabLcpStatus: 3
[CUT]
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 0: noLcpReplicas==0  0(on 2)=5(Idle) 1(on 3)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Node 3:  Fragment 1: noLcpReplicas==0  0(on 3)=5(Idle) 1(on 2)=5(Idle)
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000002a.
2005-07-21 11:12:06 [MgmSrvr] WARNING  -- Allocate nodeid (2) failed. Connection from ip X.Y.Z.W. Returned error string "Id 2 already allocated by another node."
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Mgmt server state: node id's  2 4 connected but not reserved
2005-07-21 11:12:06 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 not connected but reserved
2005-07-21 11:12:07 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-07-21 11:12:07 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000022.
2005-07-21 11:12:07 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip X.Y.Z.Q, m_reserved_nodes 000000000000002a.
2005-07-21 11:12:08 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-07-21 11:12:09 [MgmSrvr] WARNING  -- Allocate nodeid (2) failed. Connection from ip X.Y.Z.W. Returned error string "Id 2 already allocated by another node."
2005-07-21 11:12:09 [MgmSrvr] INFO     -- Mgmt server state: node id's  2 4 connected but not reserved
2005-07-21 11:12:09 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 not connected but reserved
2005-07-21 11:12:12 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip X.Y.Z.W, m_reserved_nodes 000000000000002e.
2005-07-21 11:12:13 [MgmSrvr] INFO     -- Node 1: Node 2 Connected

error:

Date/Time: Thursday 21 July 2005 - 11:12:06
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 604) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 10204
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.22
Version 4.1.12
***EOM***

Date/Time: Thursday 21 July 2005 - 11:12:05
Type of error: error
Message: Pointer too large
Fault ID: 2306
Problem data: DbtupExecQuery.cpp
Object of reference: DBTUP (Line: 604) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 16174
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.13
Version 4.1.12
***EOM***

And this, if it can be useful:
Date/Time: Thursday 21 July 2005 - 11:12:53
Type of error: error
Message: Arbitrator shutdown
Fault ID: 2305
Problem data: Arbitrator decided to shutdown this node
Object of reference: QMGR (Line: 3796) 0x00000008
ProgramName: /usr/sbin/ndbd
ProcessID: 16342
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.14
Version 4.1.12
***EOM***
[21 Jul 2005 13:48] Fabio Coatti
Trace files for nightly build 16-07-05

Attachment: ndb_2_trace.log.14.gz (application/x-gzip, text), 55.69 KiB.

[21 Jul 2005 13:49] Fabio Coatti
Trace files for nightly build 16-07-05

Attachment: ndb_3_trace.log.22.gz (application/x-gzip, text), 29.69 KiB.

[21 Jul 2005 14:22] Jonas Oreland
Hi,

I'm a bit confused... this resembles the bug report that I referred to earlier, which has been fixed as of _4.1.13_.

But you say that you downloaded and compiled the latest... and still the error logs show version 4.1.12.

/Jonas
[21 Jul 2005 14:35] Fabio Coatti
I understand, but I've downloaded this tar file:
mysql-4.1.14-nightly-20050716.tar.gz

and yesterday I installed the latest BK tree. The ndbd executable was compiled today; I'll try once more to be sure I compiled the right package, just to be on the safe side (I really hope this is my fault).

I'll report back ASAP.
[22 Jul 2005 14:22] Fabio Coatti
I've tried again after cleaning up the whole tree, this time with 4.1.14-nightly-20050716.
The problem seems to have almost vanished, so I suppose it was some problem on my side. Anyway, I've seen only one issue in cluster restarting, as shown here:

2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Communication to Node 0 opened
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Start phase 8 completed (system restart)
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Start phase 9 completed (system restart)
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Started (version 4.1.14)
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (node restart)
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Node 4: API version 4.1.14
2005-07-22 14:30:00 [MgmSrvr] INFO     -- Node 2: Node 5: API version 4.1.14
2005-07-22 14:30:01 [MgmSrvr] INFO     -- Node 2: Node 1: API version 4.1.14
2005-07-22 14:30:02 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-07-22 14:30:02 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000036.
2005-07-22 14:30:02 [MgmSrvr] WARNING  -- Allocate nodeid (3) failed. Connection from ip X.Y.Z.W. Returned error string "Id 3 already allocated by another node."
2005-07-22 14:30:02 [MgmSrvr] INFO     -- Mgmt server state: node id's  3 connected but not reserved
2005-07-22 14:30:02 [MgmSrvr] INFO     -- Mgmt server state: node id's  1 not connected but reserved

I'm still investigating this, but it's hard to reproduce, so I suppose this bug can be closed. Thanks for the help.
[22 Jul 2005 14:28] Jonas Oreland
Closed, as after some investigation it turned out to have already been reported and solved.