MySQL Bugs: #116509: Temporary error, restart node in NDB cluster

Bug #116509	Temporary error, restart node in NDB cluster
Submitted:	30 Oct 2024 16:00	Modified:	28 Nov 2024 13:31
Reporter:	CunDi Fang
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.0.40-cluster MySQL Cluster Community S	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
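In a cluster with 1 management node (node 1), 4 ndbd data nodes (nodes 2-5), and 2 SQL nodes (nodes 6 and 7), data node 5 was forcibly shut down during start phase 1 with the following error: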

Time: Saturday 26 October 2024 - 01:58:21
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 4383
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_5_trace.log.2
Trace file path: /usr/local/mysql-cluster/data/ndb_5_trace.log.2 [t1..t1]
***EOM***

How to repeat:
The node aborted with a send-signal error raised at line 809 of SimulatedBlock.cpp. The error data identifies the failing signal as GSN 12 with length 1 and receiver block number 0. The error occurred during start phase 1 and forced a shutdown of node 5, after which the angel process restarted the node.

Here is the log:
```
2024-10-26 01:58:20 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 26, name = QBACKUP
2024-10-26 01:58:20 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 27, name = DBQTUX
2024-10-26 01:58:20 [ndbd] INFO     -- Sending READ_CONFIG_REQ to index = 28, name = QRESTORE
2024-10-26 01:58:20 [ndbd] INFO     -- READ_CONFIG_REQ phase completed, this phase is used to read configuration and to calculate various sizes and allocate almost all memory needed by the data node in its lifetime
2024-10-26 01:58:20 [ndbd] INFO     -- Not initial start
2024-10-26 01:58:20 [ndbd] INFO     -- Local sysfile: Node restorable on its own, gci: 0, version: 70603
2024-10-26 01:58:20 [ndbd] INFO     -- Start phase 0 completed
2024-10-26 01:58:20 [ndbd] INFO     -- Phase 0 has made some file system initialisations
2024-10-26 01:58:20 [ndbd] INFO     -- We are running with 0 LDM workers and 4 REDO log parts. This means that we can avoid using a mutex to access REDO log parts
2024-10-26 01:58:20 [ndbd] INFO     -- Watchdog KillSwitch off.
2024-10-26 01:58:20 [ndbd] INFO     -- Starting QMGR phase 1
2024-10-26 01:58:20 [ndbd] INFO     -- Starting with m_restart_seq set to 26
2024-10-26 01:58:20 [ndbd] INFO     -- DIH reported normal start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x6322fd065000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x6322fd065000 ...
atos -s 0x6322fd065000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x6322fd5f2bd1]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x6322fd54b773]
ndbd(+0x578b5b) [0x6322fd5ddb5b]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel) const+0xa4) [0x6322fd5ddf34]
ndbd(DbUtil::runOperation(Signal*, Ptr<DbUtil::Transaction>&, Ptr<DbUtil::Operation>&, unsigned int)+0x276) [0x6322fd41c166]
ndbd(DbUtil::runTransaction(Signal*, Ptr<DbUtil::Transaction>)+0x114) [0x6322fd41c524]
ndbd(+0x5f4680) [0x6322fd659680]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x6322fd5dd825]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x6322fd5f4627]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x7e1) [0x6322fd1a4361]
ndbd(real_main(int, char**)+0x4f5) [0x6322fd1a49c5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x6322fd1a6055]
ndbd(real_main(int, char**)+0x87e) [0x6322fd1a4d4e]
ndbd(+0x129982) [0x6322fd18e982]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7d1d779dfd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7d1d779dfe40]
ndbd(_start+0x25) [0x6322fd197ff5]
2024-10-26 01:58:21 [ndbd] INFO     -- Signal (GSN: 12, Length: 1, Rec Block No: 0)
2024-10-26 01:58:21 [ndbd] INFO     -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
2024-10-26 01:58:21 [ndbd] INFO     -- Error handler shutting down system
2024-10-26 01:58:21 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-26 01:58:21 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2339: 'Send signal error(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2024-10-26 01:58:24 [ndbd] INFO     -- Angel pid: 4456 started child: 4457
2024-10-26 01:58:24 [ndbd] INFO     -- Normal start of data node using checkpoint and log info if existing
2024-10-26 01:58:24 [ndbd] INFO     -- Configuration fetched from '192.168.10.8:1186', generation: 1
2024-10-26 01:58:24 [ndbd] INFO     -- Changing directory to '/usr/local/mysql-cluster/data'
```
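The stack trace header says to subtract the base address from each frame address before passing it to addr2line or a similar tool. A minimal sketch of that arithmetic, using the base address and three frames copied from the log above; the function name `ndbd_offsets` and the regular expression are illustrative assumptions, not part of any MySQL tooling:

```python
import re

# Base address/slide exactly as printed in the crash log.
BASE = 0x6322FD065000

# Three frames copied verbatim from the stack trace, plus one libc frame.
FRAMES = """\
ndbd(+0x578b5b) [0x6322fd5ddb5b]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel) const+0xa4) [0x6322fd5ddf34]
ndbd(DbUtil::runOperation(Signal*, Ptr<DbUtil::Transaction>&, Ptr<DbUtil::Operation>&, unsigned int)+0x276) [0x6322fd41c166]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7d1d779dfd90]
"""

# Hypothetical pattern: module name, parenthesised symbol info, bracketed address.
FRAME_RE = re.compile(r"^(?P<module>[^(\s]+)\(.*\)\s+\[0x(?P<addr>[0-9a-fA-F]+)\]$")


def ndbd_offsets(trace, base):
    """Return file-relative offsets (hex strings) for frames inside ndbd."""
    offsets = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match is None:
            continue
        # The printed base applies to the ndbd binary only; libc is loaded elsewhere.
        if match.group("module") != "ndbd":
            continue
        offsets.append(hex(int(match.group("addr"), 16) - base))
    return offsets


for offset in ndbd_offsets(FRAMES, BASE):
    print(offset)
```

The first offset matches the `+0x578b5b` that the trace already prints for the unnamed frame, which is a quick sanity check on the subtraction. Each offset can then be given to addr2line together with an ndbd binary that still has its symbols, for example `addr2line -f -C -e <path to ndbd> 0x578f34`.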

Suggested fix:
As the stack trace shows, the error is raised in SimulatedBlock::sendSignal, which was called from DbUtil::runOperation via DbUtil::runTransaction. The error data reports a receiver block number of 0 (Rec Block No: 0), so the signal (GSN 12, length 1) appears to have been addressed to a block reference that was not valid at that point. Possible causes are that the target block was not yet ready to receive the signal, or that some internal state was inconsistent this early in startup.

According to the log, the READ_CONFIG_REQ phase and start phase 0 had already completed. The failure occurred after the node entered QMGR phase 1 and DIH reported a normal start, just as the Node Inclusion Protocol was starting. It may therefore be related to the configuration or state read during the earlier phases, but the log alone does not confirm this.

Hi,

Can you please share the full logs from all nodes, and tell us whether you did anything before this happened? Also, is this bare metal or Docker?

Thanks

I run these on Docker: 5 containers within one Docker network.