Bug #9025 Cluster doesn 't start (nodes crashes after phase 3)
Submitted: 7 Mar 2005 16:51 Modified: 13 Jun 2005 14:41
Reporter: Jörg Nowak Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:4.1.10 OS:Sun Solaris 64 bit
Assigned to: Jonas Oreland CPU Architecture:Any

[7 Mar 2005 16:51] Jörg Nowak
Description:
We have a database with 6 millions profiles (90 % page usage) in it on a 2 machine cluster with 16 GB RAM on each maschine (E 2900). 

We first started the cluster with --initial option. Then filled the cluster database with some 6million records.

Then we killed the ndb_mgmd and all the ndb nodes so as to restart the cluster.
The cluster is not starting there after.

After the ndb nodes finishes phase 3, the starting procedure ends with a message "node disconnected" and "node freed" in the log file. 

The details of the cluster logs, ndbd log files, trace files are listed down here.

Cluster Log
-----------

2005-03-07 16:36:26 [MgmSrvr] INFO     -- Id: 1, Command port: 2200
2005-03-07 16:36:26 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2005-03-07 16:36:46 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 reserved for ip 10.1.2.10, m_reserved_nodes 0000000000 00000a.
2005-03-07 16:36:47 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2005-03-07 16:36:50 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 reserved for ip 10.1.2.10, m_reserved_nodes 0000000000 00002a.
2005-03-07 16:36:51 [MgmSrvr] INFO     -- Node 1: Node 5 Connected
2005-03-07 16:36:55 [MgmSrvr] INFO     -- Node 3: Node 5 Connected
2005-03-07 16:36:55 [MgmSrvr] INFO     -- Node 5: Node 3 Connected
2005-03-07 16:37:07 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 10.1.2.11, m_reserved_nodes 0000000000 00002e.
2005-03-07 16:37:07 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO     -- Node 3: Node 2 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO     -- Node 2: Node 3 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO     -- Node 2: Node 5 Connected
2005-03-07 16:37:11 [MgmSrvr] INFO     -- Node 5: Node 2 Connected
2005-03-07 16:37:23 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 5: Node 4 Connected
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 2: Start phase 1 completed
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 3: CM_REGCONF president = 2, own Node = 3, our dynamic id = 2
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 2: Node 3: API version 4.1.10
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 3: Node 2: API version 4.1.10
2005-03-07 16:37:25 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 5: CM_REGCONF president = 2, own Node = 5, our dynamic id = 3
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 2: Node 5: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 3: Node 5: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 5: Node 2: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 5: Node 3: API version 4.1.10
2005-03-07 16:37:27 [MgmSrvr] INFO     -- Node 5: Start phase 1 completed
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: CM_REGCONF president = 2, own Node = 4, our dynamic id = 4
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 2: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 5: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 3: Node 4: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Node 2: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Node 3: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Node 5: API version 4.1.10
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 5: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 5: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (system restart)
2005-03-07 16:37:28 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (system restart)
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 3 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 4 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 5 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 6 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 7 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 8 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 10 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 11 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 13 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 14 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 16 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 17 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 18 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 20 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 21 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 23 activated done
2005-03-07 16:37:41 [MgmSrvr] INFO     -- Node 2: DICT: index 24 activated done
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 5:  UNDO DBACC 0 [0 0 0 0 0 0 0 0 0]
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 2:  UNDO DBACC 0 [0 0 0 0 0 0 0 0 0]
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 3: We are Node 3 with dynamic ID 2, our left neighbour is Node 4, our right i s Node 5
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 4: Node 2 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 4: We are Node 4 with dynamic ID 4, our left neighbour is Node 5, our right i s Node 3
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 1: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 3: We are Node 3 with dynamic ID 2, our left neighbour is Node 4, our right i s Node 4
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2005-03-07 16:40:14 [MgmSrvr] ALERT    -- Node 4: Node 5 Disconnected
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 4: We are Node 4 with dynamic ID 4, our left neighbour is Node 3, our right i s Node 3
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 4: Communication to Node 2 closed
2005-03-07 16:40:14 [MgmSrvr] INFO     -- Node 4: Communication to Node 5 closed
2005-03-07 16:40:15 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2005-03-07 16:40:15 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2005-03-07 16:40:20 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 000000000000002a.
2005-03-07 16:40:20 [MgmSrvr] INFO     -- Mgmt server state: nodeid 5 freed, m_reserved_nodes 000000000000000a.
2005-03-07 16:40:20 [MgmSrvr] INFO     -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000002.

Node 2 Log file
---------------

Date/Time: Monday 7 Mars 2005 - 16:41:17
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupFixAlloc.cpp
Object of reference: DBTUP (Line: 288) 0x0000000a
ProgramName: /usr/local/mysql/bin/ndbd
ProcessID: 20545
TraceFile: ./ndb_2_trace.log.22
***EOM***

Other Nodes just has the following message in their log files
---------------------------------------------------------------

Date/Time: Monday 7 Mars 2005 - 16:41:18
Type of error: error
Message: Arbitrator shutdown
Fault ID: 2305
Problem data: Arbitrator decided to shutdown this node
Object of reference: QMGR (Line: 3796) 0x0000000a
ProgramName: /usr/local/mysql/bin/ndbd
ProcessID: 20523
TraceFile: ./ndb_4_trace.log.23
***EOM***

Fragment of lines from the tracefile (From 269 to 304, as you may need line 288 as listed in the ndbd node 2 log file)
----------------------------------------------------------------------------------------------------------------------

   269  --------------- Signal ----------------
   270  r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 329905 gsn: 264 "FSREADREQ" prio: 0
   271  s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329904 length: 8 trace: 1 #sec: 0 fragInf: 0
   272   UserPointer: 1
   273   FilePointer: 167
   274   UserReference: H'00f90002 Operation flag: H'00000001 (No sync, Format=Array of pages)
   275   varIndex: 2
   276   numberOfPages: 1
   277   pageData:  H'00000000, H'00000000
   278
   279  --------------- Signal ----------------
   280  r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 329904 gsn: 164 "CONTINUEB" prio: 1
   281  s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329902 length: 2 trace: 1 #sec: 0 fragInf: 0
   282   H'00000000 H'00000000
   283  --------------- Signal ----------------
   284  r.bn: 248 "DBACC", r.proc: 2, r.sigId: 329903 gsn: 164 "CONTINUEB" prio: 1
   285  s.bn: 248 "DBACC", s.proc: 2, s.sigId: 329901 length: 1 trace: 1 #sec: 0 fragInf: 0
   286   H'00000003
   287  --------------- Signal ----------------
   288  r.bn: 249 "DBTUP", r.proc: 2, r.sigId: 329902 gsn: 386 "START_RECREQ" prio: 1
   289  s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 329900 length: 2 trace: 1 #sec: 0 fragInf: 0
   290   receivingNodeId: 16318466 senderRef: (2, 247)
   291   keepGci: 1 lastCompletedGci: 1 newestGci: 2
   292  --------------- Signal ----------------
   293  r.bn: 248 "DBACC", r.proc: 2, r.sigId: 329901 gsn: 386 "START_RECREQ" prio: 1
   294  s.bn: 247 "DBLQH", s.proc: 2, s.sigId: 329900 length: 2 trace: 1 #sec: 0 fragInf: 0
   295   receivingNodeId: 16252930 senderRef: (2, 247)
   296   keepGci: 1 lastCompletedGci: 1 newestGci: 2
   297  --------------- Signal ----------------
   298  r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 329900 gsn: 428 "TUP_SRCONF" prio: 1
   299  s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 329898 length: 2 trace: 1 #sec: 0 fragInf: 0
   300   H'00000001 H'00000003
   301  --------------- Signal ----------------
   302  r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 329899 gsn: 164 "CONTINUEB" prio: 1
   303  s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 329896 length: 1 trace: 0 #sec: 0 fragInf: 0
   304   Scanning the memory channel again with no delay

   As the size of the entire trace file is larger i can't attach the whole trace file. I am not sure which lines from the trace file you may need. If you need any more info from the trace files, please do let me know. i will upload those lines from the trace file.

   

How to repeat:
use mysql-max 4.1.10 64 bit version. Built a cluster on Sun Solaris 9 64bit and fill it with data records in it. Then kill the ndb processes and try to restart the cluster nodes. so simple at the look but doesn't start.

You will hit the bug.
[8 Mar 2005 8:20] Tomas Ulin
If it is is possible could you please upload the full filesystem as bug9025.tar.gz

ftp://ftp.mysql.com:/pub/mysql/upload/

To recover it is also possible it may work with a partial restart...  since it seems it is nr 2 that seems to have the problem, exclude that from the restart...
[10 Mar 2005 10:49] Jörg Nowak
Hello Tomas Ulin,
         I even tried to do a partial restart. but no use. I could not successfully start the cluster. i tried many partial restart combinations..like node2 and node4, node3 and node5, node4 and node5, node2 and node3. nothing works.
         
        Do you want the whole of the filesystem. i suppose..it is very big. which nodes file system do you want. please do let me know.

Jörg Nowak.
[10 Mar 2005 10:49] Jörg Nowak
Hello Tomas Ulin,
         I even tried to do a partial restart. but no use. I could not successfully start the cluster. i tried many partial restart combinations..like node2 and node4, node3 and node5, node4 and node5, node2 and node3. nothing works.
         
        Do you want the whole of the filesystem. i suppose..it is very big. which nodes file system do you want. please do let me know.

Jörg Nowak.
[8 Jun 2005 22:25] Tomas Ulin
we believe this is the same bug as #11186 which has been fixed

patch will be included in 4.1.14 and 5.0.8
[8 Jun 2005 22:32] Tomas Ulin
sorry ignore the above w.r.t. duplicate bug, wrong bug,
[8 Jun 2005 22:34] Tomas Ulin
time to go to bed :), should be 4.1.13 and 5.0.8 and nothing else
[9 Jun 2005 5:28] Jonas Oreland
pushed to 4.1.13 and 5.0.8
[13 Jun 2005 14:41] Jon Stephens
Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

Documented in Change History for versions 4.1.13, 5.0.8. Closed.