Description:
The error logs report the following. I have a cluster with 2 replicas and 2 slices, whose data nodes are node2, node3, node4, and node5. The error logs of node2, node4, and node5 contain an entry like this:
```
Current byte-offset of file-pointer is: 1067
Time: Friday 18 October 2024 - 21:18:58
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 3 disconnected in phase: 2
Error object: QMGR (Line: 3917) 0x00000002
Program: ndbd
Pid: 104211
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.1 [t1..t1]
***EOM***
```
The error log of node3 contains this entry instead:
```
Current byte-offset of file-pointer is: 1067
Time: Friday 18 October 2024 - 21:18:58
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 103342
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_3_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_3_trace.log.1 [t1..t1]
***EOM***
```
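To compare entries like the two above across nodes, a small script can help. The following is only a sketch: `parse_error_entry` is a hypothetical helper, not part of any NDB tool, and it assumes each entry is a sequence of `Key: value` lines ending in `***EOM***`. The sample text is abbreviated from the entries above. It treats error 2308 as a secondary failure, because its message points at another node, and lists every other entry as a candidate origin.

```python
def parse_error_entry(text):
    """Parse one 'Key: value' error log entry into a dict (hypothetical helper)."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the ***EOM*** end-of-message marker
        if not line or line.startswith("***"):
            continue
        # Split on the first ': ' only, so values may contain colons
        key, sep, value = line.partition(": ")
        if sep:
            entry[key] = value
    return entry


# Abbreviated copies of the entries quoted in this report
NODE4 = """\
Status: Temporary error, restart node
Error: 2308
Error data: Node 3 disconnected in phase: 2
Trace file name: ndb_4_trace.log.1
***EOM***
"""
NODE3 = """\
Status: Temporary error, restart node
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Trace file name: ndb_3_trace.log.1
***EOM***
"""

entries = [parse_error_entry(NODE4), parse_error_entry(NODE3)]

# Error 2308 says "another node failed", so it is treated as secondary here
secondary = [e for e in entries if e["Error"] == "2308"]
candidates = [e for e in entries if e["Error"] != "2308"]

for e in candidates:
    print(e["Trace file name"], e["Error"], e["Error data"])
```

With the two entries above, the only candidate is the node3 entry with error 2339, which is the entry worth investigating first.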
How to repeat:
The node 2 log below suggests that the problem occurs during the first phase of cluster startup (startphase 1). Node 3 disconnected at 2024-10-18 21:18:58, after which the QMGR block on node 2 invoked the error handler, which forced node 2 to shut down.
```
2024-10-18 21:18:54 [ndbd] INFO -- Started thread, index = 44, id = 117758, type = NdbfsThread
2024-10-18 21:18:54 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 3, name = DBINFO
2024-10-18 21:18:54 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 4, name = DBTUP
2024-10-18 21:18:54 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 5, name = DBACC
2024-10-18 21:18:54 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 6, name = DBTC
2024-10-18 21:18:54 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 7, name = DBLQH
2024-10-18 21:18:55 [ndbd] INFO -- LCPFragWatchdog : WarnElapsed : 60000(ms) MaxElapsed 180000(ms) : MaxGcpWaitLimit 0(ms) period millis : 1000
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 8, name = DBTUX
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 9, name = DBDICT
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 10, name = DBDIH
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 11, name = NDBCNTR
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 12, name = QMGR
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 13, name = TRIX
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 14, name = BACKUP
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 15, name = DBUTIL
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 16, name = SUMA
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 17, name = TSMAN
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 18, name = LGMAN
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 19, name = PGMAN
2024-10-18 21:18:55 [ndbd] INFO -- pgman: page buffer entries = 10
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 20, name = RESTORE
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 21, name = DBSPJ
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 22, name = THRMAN
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 23, name = DBQLQH
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 24, name = DBQACC
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 25, name = DBQTUP
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 26, name = QBACKUP
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 27, name = DBQTUX
2024-10-18 21:18:55 [ndbd] INFO -- Sending READ_CONFIG_REQ to index = 28, name = QRESTORE
2024-10-18 21:18:55 [ndbd] INFO -- READ_CONFIG_REQ phase completed, this phase is used to read configuration and to calculate various sizes and allocate almost all memory needed by the data node in its lifetime
2024-10-18 21:18:55 [ndbd] INFO -- Not initial start
2024-10-18 21:18:55 [ndbd] INFO -- Local sysfile: Node restorable on its own, gci: 0, version: 70603
2024-10-18 21:18:55 [ndbd] INFO -- Start phase 0 completed
2024-10-18 21:18:55 [ndbd] INFO -- Phase 0 has made some file system initialisations
2024-10-18 21:18:55 [ndbd] INFO -- We are running with 0 LDM workers and 4 REDO log parts. This means that we can avoid using a mutex to access REDO log parts
2024-10-18 21:18:55 [ndbd] INFO -- Watchdog KillSwitch off.
2024-10-18 21:18:55 [ndbd] INFO -- Starting QMGR phase 1
2024-10-18 21:18:55 [ndbd] INFO -- Starting with m_restart_seq set to 26
2024-10-18 21:18:55 [ndbd] INFO -- DIH reported normal start, now starting the Node Inclusion Protocol
2024-10-18 21:18:58 [ndbd] INFO -- Transporter 2 to node 3 disconnected in state: 0
2024-10-18 21:18:58 [ndbd] INFO -- Node 3 disconnected in phase: 3
2024-10-18 21:18:58 [ndbd] INFO -- QMGR (Line: 3917) 0x00000002
2024-10-18 21:18:58 [ndbd] INFO -- Error handler shutting down system
2024-10-18 21:18:58 [ndbd] INFO -- Error handler shutdown completed - exiting
2024-10-18 21:18:59 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
```
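The forced-shutdown line can also be pulled out of a node log automatically when checking several nodes. This is a minimal sketch that assumes the line layout seen in the excerpt above (`timestamp [program] LEVEL -- message`). The regular expressions and the `find_forced_shutdowns` helper are illustrative and do not come from any NDB utility.

```python
import re

# Layout assumed from the excerpt: "<date> <time> [<program>] <LEVEL> -- <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\w+)\s+-- (.*)$"
)
SHUTDOWN_RE = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occurred during startphase (\d+)\. Caused by error (\d+)"
)


def find_forced_shutdowns(lines):
    """Return one dict per forced-shutdown message found in the log lines."""
    results = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        timestamp, _program, _level, message = m.groups()
        s = SHUTDOWN_RE.search(message)
        if s:
            results.append({
                "time": timestamp,
                "node": int(s.group(1)),
                "startphase": int(s.group(2)),
                "error": int(s.group(3)),
            })
    return results


# Two lines copied from the log excerpt in this report
sample = [
    "2024-10-18 21:18:58 [ndbd] INFO -- Node 3 disconnected in phase: 3",
    "2024-10-18 21:18:59 [ndbd] ALERT -- Node 2: Forced node shutdown completed. "
    "Occurred during startphase 1. Caused by error 2308: 'Another node failed "
    "during system restart, please investigate error(s) on other node(s)"
    "(Restart error). Temporary error, restart node'.",
]

shutdowns = find_forced_shutdowns(sample)
print(shutdowns)
```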
Suggested fix:
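Consider adding fault tolerance to the startup process, so that the failure of one node during a system restart does not force the other data nodes to shut down.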