Description:
We have a database with 6 millions profiles (90 % page usage) in it on a 2 machine cluster with 16 GB RAM on each maschine (E 2900).
We first started the cluster with --initial option. Then filled the cluster database with some 6million records.
Then we killed the ndb_mgmd and all the ndb nodes so as to restart the cluster.
The cluster is not starting there after.
After the ndb nodes finishes phase 3, the starting procedure ends with a message "node disconnected" and "node freed" in the log file.
The details of the cluster logs, ndbd log files, trace files are listed down here.
Cluster Log
-----------
2005-03-07 16:36:26 [MgmSrvr] INFO -- Id: 1, Command port: 2200
2005-03-07 16:36:26 [MgmSrvr] INFO -- Node 1: Node 4 Connected
2005-03-07 16:36:46 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 10.1.2.10, m_reserved_nodes 0000000000 00000a.
2005-03-07 16:36:47 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2005-03-07 16:36:50 [MgmSrvr] INFO -- Mgmt server state: nodeid 5 reserved for ip 10.1.2.10, m_reserved_nodes 0000000000 00002a.
2005-03-07 16:36:51 [MgmSrvr] INFO -- Node 1: Node 5 Connected
2005-03-07 16:36:55 [MgmSrvr] INFO -- Node 3: Node 5 Connected
2005-03-07 16:36:55 [MgmSrvr] INFO -- Node 5: Node 3 Connected
2005-03-07 16:37:07 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 reserved for ip 10.1.2.11, m_reserved_nodes 0000000000 00002e.
2005-03-07 16:37:07 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO -- Node 3: Node 2 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO -- Node 2: Node 3 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO -- Node 2: Node 5 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO -- Node 5: Node 2 Connected
2005-03-07 16:37:23 [MgmSrvr] ALERT -- Node 1: Node 4 Disconnected
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 1: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 2: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 3: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 5: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 2: Start phase 1 completed
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 3: CM_REGCONF president = 2, own Node = 3, our dynamic id = 2
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 2: Node 3: API version 4.1.10
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 3: Node 2: API version 4.1.10
2005-03-07 16:37:25 [MgmSrvr] INFO -- Node 3: Start phase 1 completed
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 5: CM_REGCONF president = 2, own Node = 5, our dynamic id = 3
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 2: Node 5: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 3: Node 5: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 5: Node 2: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 5: Node 3: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO -- Node 5: Start phase 1 completed
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: CM_REGCONF president = 2, own Node = 4, our dynamic id = 4
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 2: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 5: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 3: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Node 2: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Node 3: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Node 5: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Start phase 1 completed
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 5: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 3: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 2: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 3: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 5: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO -- Node 4: Start phase 3 completed (system restart)
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 3 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 4 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 5 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 6 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 7 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 8 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 10 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 11 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 13 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 14 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 16 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 17 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 18 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 20 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 21 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 23 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO -- Node 2: DICT: index 24 activated done
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 5: UNDO DBACC 0 [0 0 0 0 0 0 0 0 0]
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 2: UNDO DBACC 0 [0 0 0 0 0 0 0 0 0]
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 3: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 3: We are Node 3 with dynamic ID 2, our left neighbour is Node 4, our right i s Node 5
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 4: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 4: We are Node 4 with dynamic ID 4, our left neighbour is Node 5, our right i s Node 3
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 1: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 3: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 3: We are Node 3 with dynamic ID 2, our left neighbour is Node 4, our right i s Node 4
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 3: Communication to Node 2 closed
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 3: Communication to Node 5 closed
2005-03-07 16:40:14 [MgmSrvr] ALERT -- Node 4: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 4: We are Node 4 with dynamic ID 4, our left neighbour is Node 3, our right i s Node 3
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 4: Communication to Node 2 closed
2005-03-07 16:40:14 [MgmSrvr] INFO -- Node 4: Communication to Node 5 closed
2005-03-07 16:40:15 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2005-03-07 16:40:15 [MgmSrvr] ALERT -- Node 1: Node 4 Disconnected
2005-03-07 16:40:20 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000002a.
2005-03-07 16:40:20 [MgmSrvr] INFO -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 000000000000000a.
2005-03-07 16:40:20 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.
Node 2 Log file
---------------
Date/Time: Monday 7 Mars 2005 - 16:41:17
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupFixAlloc.cpp
Object of reference: DBTUP (Line: 288) 0x0000000a
ProgramName: /usr/local/mysql/bin/ndbd
ProcessID: 20545
TraceFile: ./ndb_2_trace.log.22
***EOM***
Other Nodes just has the following message in their log files
---------------------------------------------------------------
Date/Time: Monday 7 Mars 2005 - 16:41:18
Type of error: error
Message: Arbitrator shutdown
Fault ID: 2305
Problem data: Arbitrator decided to shutdown this node
Object of reference: QMGR (Line: 3796) 0x0000000a
ProgramName: /usr/local/mysql/bin/ndbd
ProcessID: 20523
TraceFile: ./ndb_4_trace.log.23
***EOM***
Fragment of lines from the tracefile (From 269 to 304, as you may need line 288 as listed in the ndbd node 2 log file)
----------------------------------------------------------------------------------------------------------------------
269 --------------- Signal ----------------
270 r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 329905 gsn: 264 "FSREADREQ" prio: 0
271 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329904 length: 8 trace: 1 #sec: 0 fragInf: 0
272 UserPointer: 1
273 FilePointer: 167
274 UserReference: H'00f90002 Operation flag: H'00000001 (No sync, Format=Array of pages)
275 varIndex: 2
276 numberOfPages: 1
277 pageData: H'00000000, H'00000000
278
279 --------------- Signal ----------------
280 r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 329904 gsn: 164 "CONTINUEB" prio: 1
281 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329902 length: 2 trace: 1 #sec: 0 fragInf: 0
282 H'00000000 H'00000000
283 --------------- Signal ----------------
284 r.bn: 248 "DBACC", r.proc: 2, r.sigId: 329903 gsn: 164 "CONTINUEB" prio: 1
285 s.bn: 248 "DBACC", s.proc: 2, s.sigId: 329901 length: 1 trace: 1 #sec: 0 fragInf: 0
286 H'00000003
287 --------------- Signal ----------------
288 r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 329902 gsn: 386 "START_RECREQ" prio: 1
289 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 329900 length: 2 trace: 1 #sec: 0 fragInf: 0
290 receivingNodeId: 16318466 senderRef: (2, 247)
291 keepGci: 1 lastCompletedGci: 1 newestGci: 2
292 --------------- Signal ----------------
293 r.bn: 248 "DBACC", r.proc: 2, r.sigId: 329901 gsn: 386 "START_RECREQ" prio: 1
294 s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 329900 length: 2 trace: 1 #sec: 0 fragInf: 0
295 receivingNodeId: 16252930 senderRef: (2, 247)
296 keepGci: 1 lastCompletedGci: 1 newestGci: 2
297 --------------- Signal ----------------
298 r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 329900 gsn: 428 "TUP_SRCONF" prio: 1
299 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329898 length: 2 trace: 1 #sec: 0 fragInf: 0
300 H'00000001 H'00000003
301 --------------- Signal ----------------
302 r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 329899 gsn: 164 "CONTINUEB" prio: 1
303 s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 329896 length: 1 trace: 0 #sec: 0 fragInf: 0
304 Scanning the memory channel again with no delay
As the size of the entire trace file is larger i can't attach the whole trace file. I am not sure which lines from the trace file you may need. If you need any more info from the trace files, please do let me know. i will upload those lines from the trace file.
How to repeat:
use mysql-max 4.1.10 64 bit version. Built a cluster on Sun Solaris 9 64bit and fill it with data records in it. Then kill the ndb processes and try to restart the cluster nodes. so simple at the look but doesn't start.
You will hit the bug.