Description:
When SUMA hands ownership of bucket 1 over to another node, the switchover completes with bucket state 100, immediately followed by "shutdown handover" and "Shutdown initiated". The node then terminates and leaves the cluster, which triggers node failure handling on the remaining nodes.
How to repeat:
I found this while checking the logs of a four-data-node cluster (2 replicas, 2 node groups) consisting of node2, node3, node4, and node5. While restarting the nodes, an unexpected node failure occurred; fortunately, the surviving nodes completed failure handling successfully.
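For reproduction, a rolling restart of the data nodes can be driven through the ndb_mgm client. The sketch below is only an approximation of such a restart, not the exact procedure from this report; the management server address 192.8.0.8:1186 and node IDs 2-5 are taken from the logs below.
```
# Sketch of a rolling restart via the management client.
# Restart one data node at a time; after each command, wait until the node
# reports "started" again ("<id> STATUS" in ndb_mgm, or "Node <id> has
# completed its restart" in the cluster log) before restarting the next one.
MGM="192.8.0.8:1186"

ndb_mgm --ndb-connectstring="$MGM" -e "2 RESTART"
ndb_mgm --ndb-connectstring="$MGM" -e "3 RESTART"
ndb_mgm --ndb-connectstring="$MGM" -e "4 RESTART"
ndb_mgm --ndb-connectstring="$MGM" -e "5 RESTART"
```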
node2, node3, and node5 performed the failure handling:
```
2024-10-18 21:37:07 [ndbd] INFO -- Node 2 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 3 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 5 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 4 has completed its restart
2024-10-18 21:37:08 [ndbd] INFO -- Allocate event buffering page chunk in SUMA, 16 pages, first page ref = 7060
2024-10-18 21:37:37 [ndbd] INFO -- prepare to takeover bucket: 1
2024-10-18 21:37:41 [ndbd] INFO -- 22826/0 (22825/4294967295) switchover complete bucket 1 state: 200
2024-10-18 21:37:41 [ndbd] INFO -- shutdown handover takeover
2024-10-18 21:37:43 [ndbd] INFO -- Transporter 3 to node 4 disconnected in state: 0
2024-10-18 21:37:43 [ndbd] INFO -- findNeighbours from: 5587 old (left: 5 right: 3) new (5 3)
2024-10-18 21:37:43 [ndbd] ALERT -- Arbitration check won - node group majority
2024-10-18 21:37:43 [ndbd] INFO -- President restarts arbitration thread [state=6]
2024-10-18 21:37:43 [ndbd] INFO -- NR Status: node=4,OLD=Initial state,NEW=Node failed, fail handling ongoing
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Started failure handling for node 4
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Starting take over of node 4
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 4 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION, NF_BLOCK_HANDLE.
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 4 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION.
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: GCP completion 22826/10 waiting for node failure handling (1) to complete. Seizing record for GCP.
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 4 waiting for NF_TAKEOVER.
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Completed take over of failed node 4
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 4 complete.
2024-10-18 21:37:43 [ndbd] INFO -- DBTC 0: Completing GCP 22826/10 on node failure takeover completion.
2024-10-18 21:37:43 [ndbd] INFO -- NR Status: node=4,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2024-10-18 21:37:43 [ndbd] INFO -- Node 4 has completed node fail handling
2024-10-18 21:37:43 [ndbd] INFO -- Alloc node id for node 4 succeeded
2024-10-18 21:37:43 [ndbd] INFO -- NR Status: node=4,OLD=Node failure handling complete,NEW=Allocated node id
2024-10-18 21:37:44 [ndbd] INFO -- Adjusting disk write speed bounds due to : Node restart ongoing
2024-10-18 21:37:50 [ndbd] INFO -- Cluster shutdown durable gci : 22830
2024-10-18 21:37:51 [ndbd] INFO -- The data node run-time environment has been stopped
2024-10-18 21:37:51 [ndbd] INFO -- Shutdown initiated
2024-10-18 21:37:51 [ndbd] INFO -- Shutdown completed - exiting
2024-10-18 21:37:51 [ndbd] INFO -- Angel shutting down
2024-10-18 21:37:51 [ndbd] INFO -- Node 2: Node shutdown completed.
2024-10-18 21:37:51 [ndbd] INFO -- Angel pid: 119131 started child: 119132
2024-10-18 21:37:51 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
2024-10-18 21:37:51 [ndbd] INFO -- Configuration fetched from '192.8.0.8:1186', generation: 1
2024-10-18 21:37:51 [ndbd] INFO -- Changing directory to '/usr/local/mysql-cluster/data'
```
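After node 4 dropped out, the state of the remaining nodes can be confirmed from the management client. This is just a minimal check, assuming the management server address shown in the log above:
```
# Show which data nodes are still connected/started after the failure
# (management server address taken from the "Configuration fetched" line above).
ndb_mgm --ndb-connectstring="192.8.0.8:1186" -e "SHOW"
ndb_mgm --ndb-connectstring="192.8.0.8:1186" -e "ALL STATUS"
```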
The failure was on node4:
```
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 67 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 69 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 71 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 75 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 76 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 78 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): index id 80 rebuild done
2024-10-18 21:37:04 [ndbd] INFO -- LDM(0): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
2024-10-18 21:37:04 [ndbd] INFO -- Start phase 4 completed
2024-10-18 21:37:04 [ndbd] INFO -- Phase 4 restored all fragments from local disk up to a consistent global checkpoint id
2024-10-18 21:37:04 [ndbd] INFO -- Start NDB start phase 4
2024-10-18 21:37:04 [ndbd] INFO -- NDB start phase 4 completed
2024-10-18 21:37:04 [ndbd] INFO -- During cluster start/restart only master runs phase 5 of NDB start phases
2024-10-18 21:37:04 [ndbd] INFO -- Report to master node our state and wait for master
2024-10-18 21:37:04 [ndbd] INFO -- Master node 2 have reached completion of NDB start phase 5
2024-10-18 21:37:04 [ndbd] INFO -- Start NDB start phase 6
2024-10-18 21:37:04 [ndbd] INFO -- NDB start phase 6 completed
2024-10-18 21:37:04 [ndbd] INFO -- Start phase 5 completed
2024-10-18 21:37:04 [ndbd] INFO -- Phase 5 waited for local checkpoint to complete
2024-10-18 21:37:04 [ndbd] INFO -- Start phase 6 completed
2024-10-18 21:37:04 [ndbd] INFO -- Phase 6 updated blocks about that we've now reached the started state.
2024-10-18 21:37:04 [ndbd] INFO -- Activating bucket 1 in SUMA
2024-10-18 21:37:04 [ndbd] INFO -- Start phase 7 completed
2024-10-18 21:37:04 [ndbd] INFO -- Phase 7 mainly activated the asynchronous change events process, and some other background processes
2024-10-18 21:37:04 [ndbd] INFO -- Start NDB start phase 7
2024-10-18 21:37:04 [ndbd] INFO -- Foreign Key enabling Starting
2024-10-18 21:37:05 [ndbd] INFO -- Foreign key enabling Completed
2024-10-18 21:37:05 [ndbd] INFO -- NDB start phase 7 completed
2024-10-18 21:37:05 [ndbd] INFO -- Start phase 8 completed
2024-10-18 21:37:05 [ndbd] INFO -- Phase 8 enabled foreign keys and waited forall nodes to complete start up to this point
2024-10-18 21:37:05 [ndbd] INFO -- Prepare arbitrator node 1 [ticket=d112585f6a11580b]
2024-10-18 21:37:07 [ndbd] INFO -- Restart complete, updated local sysfile
2024-10-18 21:37:07 [ndbd] INFO -- Start phase 9 completed
2024-10-18 21:37:07 [ndbd] INFO -- Phase 9 enabled APIs to start connecting
2024-10-18 21:37:07 [ndbd] INFO -- Start phase 50 completed
2024-10-18 21:37:07 [ndbd] INFO -- Start phase 101 completed
2024-10-18 21:37:07 [ndbd] INFO -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2024-10-18 21:37:07 [ndbd] INFO -- Node started
2024-10-18 21:37:07 [ndbd] INFO -- Node 2 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 4 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 3 has completed its restart
2024-10-18 21:37:07 [ndbd] INFO -- Node 5 has completed its restart
2024-10-18 21:37:08 [ndbd] INFO -- Allocate event buffering page chunk in SUMA, 16 pages, first page ref = 7060
2024-10-18 21:37:37 [ndbd] INFO -- Suma: handover to node 2 gci: 22826 buckets: 00000002 (2)
2024-10-18 21:37:41 [ndbd] INFO -- 22826/0 (22825/4294967295) switchover complete bucket 1 state: 100
2024-10-18 21:37:41 [ndbd] INFO -- shutdown handover
2024-10-18 21:37:42 [ndbd] INFO -- The data node run-time environment has been stopped
2024-10-18 21:37:42 [ndbd] INFO -- Shutdown initiated
```
I will upload all the log files I have.
Suggested fix:
SUMA appears to have hit a problem while switching over bucket 1, which caused node 4 to shut itself down. SUMA is responsible for distributing asynchronous change events in the cluster, so the problem is likely in asynchronous event delivery or in the bucket switchover/handover logic.
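Until the full logs are uploaded, the relevant SUMA handover lines can be pulled from each data node's out log with something like the sketch below. The ndb_<nodeid>_out.log file names assume the default layout in the data directory shown in the log above.
```
# Collect the SUMA bucket takeover / switchover / handover messages from each
# node's out log to correlate the bucket states (100 vs 200) across nodes.
cd /usr/local/mysql-cluster/data
grep -En "prepare to takeover bucket|switchover complete bucket|shutdown handover|handover to node" \
    ndb_*_out.log
```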