Description:
We have a 6-node cluster with 3 replicas.
I stopped one of the 6 data nodes (node 5) from the management client.
When I restart that data node, the restart fails and causes another data node (node 6) to crash as well.
Mgmt-client output:
ndb_mgm> all status
Node 1: started (Version 5.0.22)
Node 2: started (Version 5.0.22)
Node 3: started (Version 5.0.22)
Node 4: started (Version 5.0.22)
Node 5: started (Version 5.0.22)
Node 6: started (Version 5.0.22)
ndb_mgm> 5 stop
Node 5: Node shutdown initiated
Node 5 has shutdown.
ndb_mgm> Node 5: Node shutdown completed.
Node 6: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 5: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
Cluster log:
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 1: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 2: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 3: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 4: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Node 6: API version 5.0.22
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 1 completed
Jul 21 13:58:40 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 2 completed (node restart)
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 3 completed (node restart)
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Receive arbitrator node 7 [ticket=39e8000186cc0ea3]
Jul 21 13:58:41 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Start phase 4 completed (node restart)
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 6 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 7 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 8 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 9 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 10 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 11 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 12 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 13 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 14 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 15 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: DICT: index 16 activated
Jul 21 13:59:30 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Local checkpoint 1783 started. Keep GCI = 16582 oldest restorable GCI = 16592
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 7: Node 6 Connected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Node 6 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 6: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 7: Node 5 Connected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Node 5 Disconnected
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 5 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 closed
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Arbitration check won - node group majority
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6]
Jul 21 13:59:32 nl2-db4 NDB[10219]: [MgmSrvr] Node 5: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 6. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 2: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 4: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 1: Communication to Node 6 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 5 opened
Jul 21 13:59:36 nl2-db4 NDB[10219]: [MgmSrvr] Node 3: Communication to Node 6 opened
Node 5:
/var/db/6-nodes/log/ndb_5_error.log
Time: Friday 21 July 2006 - 13:59:32
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 6 disconected
Error object: QMGR (Line: 2481) 0x0000000a
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 14876
Trace: /var/db/6-nodes/log/ndb_5_trace.log.3
Version: Version 5.0.22
***EOM***
Node 6:
/var/db/6-nodes/log/ndb_6_error.log
Time: Friday 21 July 2006 - 13:59:31
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 3657) 0x0000000a
Program: /opt/mysqlcluster/libexec/ndbd
Pid: 13729
Trace: /var/db/6-nodes/log/ndb_6_trace.log.4
Version: Version 5.0.22
***EOM***
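To compare the two error logs side by side, the entries can be parsed into fields. The following is only a minimal sketch: it assumes the "Key: value" layout and the ***EOM*** terminator seen in the excerpts above, and the function name parse_ndb_error_log is hypothetical, not part of any MySQL Cluster tool.

```python
def parse_ndb_error_log(text):
    """Split the text of an ndb_<id>_error.log into one dict per entry.

    Assumes each entry is a run of "Key: value" lines closed by ***EOM***,
    as in the excerpts quoted in this report.
    """
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":
            if current:
                entries.append(current)
            current = {}
            continue
        # Split on the first colon only, so timestamps stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


# Entry copied from the node 6 error log quoted in this report.
SAMPLE = """\
Time: Friday 21 July 2006 - 13:59:31
Status: Temporary error, restart node
Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 3657) 0x0000000a
***EOM***
"""

for entry in parse_ndb_error_log(SAMPLE):
    print(entry["Time"], entry["Error"], entry["Error object"])
```

Run against both files, this makes it easy to see that node 6 failed first (13:59:31, error 2341 in DBLQH) and node 5 failed one second later as a consequence (error 2308).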
How to repeat:
Stop one of the data nodes from the management client (for example "5 stop"), then restart that data node.
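The management client part of these steps could be scripted roughly as follows. This is a sketch under assumptions: it relies on ndb_mgm being on the PATH and able to reach the management server with its default connection settings, and the helper names mgm_command and stop_sequence are hypothetical. Starting ndbd again has to be done on the data node's own host and is not covered here.

```python
import subprocess


def mgm_command(command):
    """Build the argv for running one management client command via 'ndb_mgm -e'."""
    return ["ndb_mgm", "-e", command]


def stop_sequence(node_id):
    """Commands used in this report: check status, then stop one data node."""
    return [
        mgm_command("all status"),
        mgm_command(f"{node_id} stop"),
    ]


def run_stop_sequence(node_id):
    """Run the sequence; requires a reachable management server (assumption)."""
    for argv in stop_sequence(node_id):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_stop_sequence(5) on a real cluster.
    for argv in stop_sequence(5):
        print(" ".join(argv))
```

After the node has shut down, starting ndbd again on that node's host is the step that triggered the failures shown above.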