MySQL Bugs: #116424: Node failed during system restart in NDB cluster

Bug #116424	Node failed during system restart in NDB cluster
Submitted:	19 Oct 2024 17:31	Modified:	28 Oct 2024 14:50
Reporter:	CunDi Fang	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.0.35-cluster MySQL Cluster Community S	OS:	Any
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
Here is the description in error log. I have a 2 replica 2 slice cluster of 2 data nodes as node2, node3, node4, node5. While node2, node4, node5 describe like this:
```
Current byte-offset of file-pointer is: 1067

Time: Friday 18 October 2024 - 21:18:58
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 3 disconnected in phase: 2
Error object: QMGR (Line: 3917) 0x00000002
Program: ndbd
Pid: 104211
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.1 [t1..t1]
***EOM***
```
While node3 describe like this:
```
Current byte-offset of file-pointer is: 1067

Time: Friday 18 October 2024 - 21:18:58
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 103342
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_3_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_3_trace.log.1 [t1..t1]
***EOM***
```

How to repeat:
From the log, it appears that the problem occurs during the first phase of cluster startup (startphase 1). The log shows that during the startup process, node 3 disconnected at "2024-10-18 21:18:58", and then the QMGR module of node 2 triggered the error handling mechanism, which led to the forced shutdown of node 2.

```
2024-10-18 21:18:54 [ndbd] INFO     -- Started thread, index = 44, id = 117758, type = NdbfsThread
2024-10-18 21:18:54 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 3, name = DBINFO
2024-10-18 21:18:54 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 4, name = DBTUP
2024-10-18 21:18:54 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 5, name = DBACC
2024-10-18 21:18:54 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 6, name = DBTC
2024-10-18 21:18:54 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 7, name = DBLQH
2024-10-18 21:18:55 [ndbd] INFO     -- LCPFragWatchdog : WarnElapsed : 60000(ms) MaxElapsed 180000(ms) : MaxGcpWaitLimit 0(ms) period millis : 1000
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 8, name = DBTUX
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 9, name = DBDICT
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 10, name = DBDIH
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 11, name = NDBCNTR
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 12, name = QMGR
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 13, name = TRIX
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 14, name = BACKUP
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 15, name = DBUTIL
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 16, name = SUMA
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 17, name = TSMAN
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 18, name = LGMAN
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 19, name = PGMAN
2024-10-18 21:18:55 [ndbd] INFO     -- pgman: page buffer entries = 10
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 20, name = RESTORE
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 21, name = DBSPJ
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 22, name = THRMAN
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 23, name = DBQLQH
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 24, name = DBQACC
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 25, name = DBQTUP
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 26, name = QBACKUP
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 27, name = DBQTUX
2024-10-18 21:18:55 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 28, name = QRESTORE
2024-10-18 21:18:55 [ndbd] INFO     -- READ_CONFIG_REQ phase completed, this phase is used to read configuration and to calculate various sizes and allocate almost all memory needed by the data node in its lifetime
2024-10-18 21:18:55 [ndbd] INFO     -- Not initial start
2024-10-18 21:18:55 [ndbd] INFO     -- Local sysfile: Node restorable on its own, gci: 0, version: 70603
2024-10-18 21:18:55 [ndbd] INFO     -- Start phase 0 completed
2024-10-18 21:18:55 [ndbd] INFO     -- Phase 0 has made some file system initialisations
2024-10-18 21:18:55 [ndbd] INFO     -- We are running with 0 LDM workers and 4 REDO log parts. This means that we can avoid using a mutex to access REDO log parts
2024-10-18 21:18:55 [ndbd] INFO     -- Watchdog KillSwitch off.
2024-10-18 21:18:55 [ndbd] INFO     -- Starting QMGR phase 1
2024-10-18 21:18:55 [ndbd] INFO     -- Starting with m_restart_seq set to 26
2024-10-18 21:18:55 [ndbd] INFO     -- DIH reported normal start, now starting the Node Inclusion Protocol
2024-10-18 21:18:58 [ndbd] INFO     -- Transporter 2 to node 3 disconnected in state: 0
2024-10-18 21:18:58 [ndbd] INFO     -- Node 3 disconnected in phase: 3
2024-10-18 21:18:58 [ndbd] INFO     -- QMGR (Line: 3917) 0x00000002
2024-10-18 21:18:58 [ndbd] INFO     -- Error handler shutting down system
2024-10-18 21:18:58 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-18 21:18:59 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
```

Suggested fix:
How about add fault tolerance to the startup process?

I cannot reproduce this crash. Give us more details about your setup (hardware, software) and what were you doing before the crash. Why did the node restart?

I am running MySQL NDB cluster in a docker container, whose image is ubuntu 22.04. Node 5 restarted because I was using the command 'ndb_mgm -e 'restart node 5''.

I was trying to do this two commands:
restartDataNode5:ndb_mgm -e 'restart node 5' 
It reported: Failed to restart node 5. 

After I tried to:
backupData: 'ndb_mgm -e 'START BACKUP''
It reported: Backup failed.

I hope these can help you.

Hi,

The part of the log you reported (2 nodes in "other node..") are not a bug.

As for the crash of the "other node" I see in logs I cannot reproduce that.

In any case, before "start backup" you should check that all nodes are "stable" no starting/stopping/restarting etc.. when you are trying to start backup.