Description:
I start with a 2 replica 4 node cluster.
id=1 @192.168.171.4 (Version: 5.0.27, Nodegroup: 0)
id=2 @192.168.171.5 (Version: 5.0.27, Nodegroup: 0)
id=3 @192.168.171.6 (Version: 5.0.27, Nodegroup: 1)
id=4 @192.168.171.7 (Version: 5.0.27, Nodegroup: 1, Master)
Cluster log:
Nov 1 17:26:08 nl2-db4 NDB[12871]: [MgmSrvr] Node 4: Local checkpoint 206 started. Keep GCI = 450 oldest restorable GCI = 387 Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 5: Node 4 Connected Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 4 Disconnected Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Node 4 Disconnected Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: Node 4 Disconnected Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: Communication to Node 4 closed Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Communication to Node 4 closed Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Communication to Node 4 closed Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: Arbitration check won - node group majority Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: President restarts arbitration thread [state=6] Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: GCP Take over started Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 4: Forced node shutdown completed. Initiated by signal 9.
Nov 1 17:26:12 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: GCP Take over completed Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 5: Node 1 Connected Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Node 1 Disconnected Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 1 Disconnected Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Communication to Node 1 closed Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Communication to Node 1 closed Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Network partitioning - arbitration required Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: President restarts arbitration thread [state=7] Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 1: Forced node shutdown completed. Initiated by signal 9.
Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Arbitration won - positive reply from node 5 Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: GCP Take over started Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: GCP Take over completed Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Prepare arbitrator node 5 [ticket=62a70006a457963a] Nov 1 17:26:13 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Started arbitrator node 5 [ticket=62a70006a457963a] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: LCP Take over started Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: ParticipatingDIH = 000000000000000c Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: ParticipatingLQH = 000000000000000c Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=2 000000000000000c] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=2 000000000000000c] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: LCP Take over completed (state = 5) Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: ParticipatingDIH = 000000000000000c Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: ParticipatingLQH = 000000000000000c Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=2 000000000000000c] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=2 000000000000000c] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=2 000000000000000c] Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: m_LCP_COMPLETE_REP_From_Master_Received = 0 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 2: started: table=3 fragment=0 replica=25 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 2: started: table=3 fragment=2 replica=29 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 2: queued: table=4 fragment=0 replica=33 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 2: queued: table=4 fragment=2 replica=37 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 3: started: table=3 fragment=1 replica=27 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 3: started: table=3 fragment=3 replica=31 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 3: queued: table=4 fragment=1 replica=35 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Node 3: queued: table=4 fragment=3 replica=39 Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 2: Communication to Node 1 opened Nov 1 17:26:53 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Communication to Node 1 opened Nov 1 17:27:10 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Failure handling of node 4 has not completed in 1 min. - state = 3
How to repeat:
How to:
Node 4 is Master.
Node 4 is killed "kill -9".
Node 1 starts becoming Master.
Node 1 is killed.
Node 3 becomes Master.
I end up with this
Nov 1 17:27:10 nl2-db4 NDB[12871]: [MgmSrvr] Node 3: Failure handling of node 4 has not completed in 1 min. - state = 3
Which makes it impossible to restart node 4 without stopping the cluster.