Bug #116427 Strange node crashes at the same time in NDB cluster
Submitted: 19 Oct 18:49 Modified: 30 Oct 10:43
Reporter: CunDi Fang
Status: Can't repeat  Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S3 (Non-critical)
Version: 8.0.40-cluster MySQL Cluster Community Server  OS: Any
Assigned to: MySQL Verification Team  CPU Architecture: Any

[19 Oct 18:49] CunDi Fang
Description:
I have a cluster with 2 replicas and 2 node groups, i.e. four data nodes: node2, node3, node4, and node5. The error logs on node2, node3, and node4 look like this:
node2:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:46:48
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 185381
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_2_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.1 [t1..t1]
***EOM***
```

node3:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:46:52
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 105312
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_3_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_3_trace.log.1 [t1..t1]
***EOM***
```

node4:
```
Current byte-offset of file-pointer is: 2065

Time: Saturday 19 October 2024 - 07:46:53
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 107163
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.1 [t1..t1]
***EOM***

Time: Saturday 19 October 2024 - 07:48:31
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 107471
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.2
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.2 [t1..t1]
***EOM***
```

and on node5:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:48:31
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 4 disconnected in phase: 2
Error object: QMGR (Line: 3917) 0x00000002
Program: ndbd
Pid: 108704
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_5_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_5_trace.log.1 [t1..t1]
***EOM***
```
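
For reference, a minimal config.ini matching this topology might look like the sketch below. The hostnames, the management and SQL node entries, and the node IDs other than 2-5 are assumptions; only the DataDir path appears in the logs above.
```
# Hedged sketch of a config.ini for a 2-replica, 2-node-group cluster;
# hostnames and the management/SQL node entries are assumptions.
[ndbd default]
NoOfReplicas=2                          # 2 replicas x 2 node groups = 4 data nodes
DataDir=/usr/local/mysql-cluster/data   # matches the trace file paths above

[ndb_mgmd]
NodeId=1
HostName=mgm-host

[ndbd]
NodeId=2
HostName=host-2

[ndbd]
NodeId=3
HostName=host-3

[ndbd]
NodeId=4
HostName=host-4

[ndbd]
NodeId=5
HostName=host-5

[mysqld]
NodeId=6
```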

How to repeat:
From the logs, this appears to have been triggered by an attempted troubleshooting operation, which then led to a single-node crash and an eventual cluster-wide service outage.

Node 2, node 3, and node 4 sent multiple READ_CONFIG_REQ requests during startup and completed the configuration read phase.
Node 3 produced a stack trace during startup showing that the ndbd process hit an internal error while processing a signal, with error code 2339 and the message "Send signal error (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node".
Due to this error, node 3 was forced to shut down.

Node 4 also failed during startup, disconnecting in start phase 2.
Node 5 was then forced to shut down after detecting the failure of the other nodes (node 3 and node 4), with error code 2308: "Another node failed during system restart, please investigate error(s) on other node(s) (Restart error). Temporary error, restart node".
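
While reproducing, the per-node start phases can be watched from the management client with something along these lines; the management server address is an assumption:
```
# Hedged sketch: watch node states and start phases during the restart.
ndb_mgm --ndb-connectstring=mgm-host:1186 -e 'show'        # topology and node status
ndb_mgm --ndb-connectstring=mgm-host:1186 -e 'all status'  # per-node start phase
```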

Suggested fix:
This appears to be an error-handling failure induced by the forced shutdown.
[21 Oct 14:29] MySQL Verification Team
Hi,

This looks like a hardware problem, but it might be a bug on our side; however, I do not see how to reproduce this. Why was this node restarting in the first place?

2024-10-19 07:46:48 [ndbd] INFO     -- DIH reported normal start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x63a025560000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x63a025560000 ...
atos -s 0x63a025560000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x63a025ab6761]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x63a025a13ef3]
ndbd(+0x540cfb) [0x63a025aa0cfb]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel) const+0xa4) [0x63a025aa3904]
ndbd(DbUtil::runOperation(Signal*, Ptr<DbUtil::Transaction>&, Ptr<DbUtil::Operation>&, unsigned int)+0x276) [0x63a02590e0a6]
ndbd(DbUtil::runTransaction(Signal*, Ptr<DbUtil::Transaction>)+0x114) [0x63a02590e464]
ndbd(+0x5ba5d0) [0x63a025b1a5d0]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x63a025a9e865]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x63a025ab43b7]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x1248) [0x63a0256978e8]
ndbd(real_main(int, char**)+0x4f5) [0x63a0256980b5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x63a025699745]
ndbd(real_main(int, char**)+0x87e) [0x63a02569843e]
ndbd(+0x1223a2) [0x63a0256823a2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7e5892b34d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7e5892b34e40]
ndbd(_start+0x25) [0x63a02568b665]
2024-10-19 07:46:48 [ndbd] INFO     -- Signal (GSN: 12, Length: 1, Rec Block No: 0)
2024-10-19 07:46:48 [ndbd] INFO     -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
2024-10-19 07:46:48 [ndbd] INFO     -- Error handler shutting down system
2024-10-19 07:46:48 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-19 07:46:48 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2339: 'Send signal error(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
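
As a worked example of the symbolizing step described in that output, the SimulatedBlock::sendSignal frame could be resolved like this; the ndbd binary path is an assumption:
```
# Resolve one frame from the trace using the printed base address/slide.
# The ndbd binary location is assumed; addresses are from the trace above.
llvm-symbolizer --obj=/usr/local/mysql-cluster/bin/ndbd \
    --adjust-vma=0x63a025560000 0x63a025aa3904

# Equivalent with addr2line, subtracting the slide by hand:
# 0x63a025aa3904 - 0x63a025560000 = 0x543904
addr2line -f -C -e /usr/local/mysql-cluster/bin/ndbd 0x543904
```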
[21 Oct 14:34] MySQL Verification Team
Bug #116426 is marked as a duplicate of this one
[22 Oct 8:51] CunDi Fang
Thank you for your reply. I reread my log file and found that I was attempting three actions.

Starting at 2024-10-19T07:46:27, I first tried to remove node 5 from the cluster; the command I used was "ndb_mgm -e node 5 stop". This action succeeded.

Then I tried to restart the whole cluster: I ran "ndb_mgm -e 'all stop'" and then started "ndbd" again. This also worked.

Finally, I tried to back up the data using the command ndb_mgm -e "START BACKUP", which failed.

Then, after 2024-10-19T07:46:48, I checked data consistency across nodes by executing "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE TABLE_SCHEMA = 'mytest4' AND TABLE_NAME = 'test9' AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION;", and the results were correct.

I don't think it's a hardware-level issue.
[24 Oct 9:35] MySQL Verification Team
I am not sure I understand. How do you expect a backup to run if the cluster is down? There must have been something between the cluster shutdown and the backup?
[24 Oct 16:20] CunDi Fang
Hi, I apologize for the confusion my description caused. From my logs, my complete sequence of operations looks like this:
``` 
time: 2024-10-19T07:46:27.843235Z 
received buf_3: unexpected-error 
removeDataNode: Node 5 removed successfully. 
restartCluster: Cluster restarted successfully. 
backupData: Backup failed. Error code: 1 
``` 
Starting at 07:46:27, I first tried to remove node 5 and then tried to restart the whole cluster. Here is how I did it:
First, I stopped all nodes with "ndb_mgm -e 'ALL STOP'", as you already know.
Then I tried to restart the cluster by running ndbd on each data node, node2 through node5. This step succeeded.
Finally, I tried to back up the data, and it failed at this step.
Also, I checked my hardware and found no faults; I don't think this is a hardware-level problem.
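
Putting that sequence together, a repro script would look roughly like the following; the connect string and the exact stop syntax are assumptions based on the description above:
```
#!/bin/sh
# Hedged repro sketch of the reported sequence; the connect string and
# host layout are assumptions, not taken verbatim from my tooling.
MGMD='mgm-host:1186'

ndb_mgm --ndb-connectstring="$MGMD" -e '5 stop'      # remove node 5 (succeeded)
ndb_mgm --ndb-connectstring="$MGMD" -e 'all stop'    # stop all data nodes (succeeded)

# Restart ndbd on each data node host (node2..node5):
ndbd --ndb-connectstring="$MGMD"

# The backup step that failed with error code 1:
ndb_mgm --ndb-connectstring="$MGMD" -e 'start backup'
```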
[28 Oct 14:54] MySQL Verification Team
I am failing to reproduce this.

Are you sure you are using 8.0.40? Your description states 8.0.35.
[30 Oct 9:36] CunDi Fang
I'm sorry for the confusion. Just to be clear, I am using 8.0.40; the version information is stated in the error log. Maybe it really is a version issue? I did confirm that this error condition exists. Do I need to try to reproduce it again? Though I think the log may not contain any new information.
[30 Oct 9:56] MySQL Verification Team
If you can set up a fresh cluster and reproduce the issue, that would be awesome, but please record "everything" you do to reproduce the problem.