Bug #116427 Strange node crashes at the same time in NDB cluster
Submitted: 19 Oct 18:49 Modified: 30 Oct 10:43
Reporter: CunDi Fang
Status: Can't repeat  Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S3 (Non-critical)
Version: 8.0.40-cluster MySQL Cluster Community Server  OS: Any
Assigned to: MySQL Verification Team  CPU Architecture: Any

[19 Oct 18:49] CunDi Fang
Description:
I have a cluster with 2 replicas and 2 node groups, i.e. four data nodes: node2, node3, node4, and node5. The error logs on node2, node3, and node4 look like this:
node2:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:46:48
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 185381
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_2_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_2_trace.log.1 [t1..t1]
***EOM***
```

node3:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:46:52
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 105312
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_3_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_3_trace.log.1 [t1..t1]
***EOM***
```

node4:
```
Current byte-offset of file-pointer is: 2065

Time: Saturday 19 October 2024 - 07:46:53
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 107163
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.1 [t1..t1]
***EOM***

Time: Saturday 19 October 2024 - 07:48:31
Status: Temporary error, restart node
Message: Send signal error (Internal error, programming error or missing error message, please report a bug)
Error: 2339
Error data: Signal (GSN: 12, Length: 1, Rec Block No: 0)
Error object: /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
Program: ndbd
Pid: 107471
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_4_trace.log.2
Trace file path: /usr/local/mysql-cluster/data/ndb_4_trace.log.2 [t1..t1]
***EOM***
```

and on node5:
```
Current byte-offset of file-pointer is: 1067

Time: Saturday 19 October 2024 - 07:48:31
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 4 disconnected in phase: 2
Error object: QMGR (Line: 3917) 0x00000002
Program: ndbd
Pid: 108704
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_5_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_5_trace.log.1 [t1..t1]
***EOM***
```
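
For reference, a minimal config.ini matching this topology might look like the sketch below. The hostnames, the management and SQL node entries, and the node IDs other than 2-5 are assumptions; only the DataDir path appears in the logs above.
```
# Hedged sketch of a config.ini for a 2-replica, 2-node-group cluster;
# hostnames and the management/SQL node entries are assumptions.
[ndbd default]
NoOfReplicas=2                          # 2 replicas x 2 node groups = 4 data nodes
DataDir=/usr/local/mysql-cluster/data   # matches the trace file paths above

[ndb_mgmd]
NodeId=1
HostName=mgm-host

[ndbd]
NodeId=2
HostName=host-2

[ndbd]
NodeId=3
HostName=host-3

[ndbd]
NodeId=4
HostName=host-4

[ndbd]
NodeId=5
HostName=host-5

[mysqld]
NodeId=6
```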

How to repeat:
From the logs, this appears to have been triggered by an attempted troubleshooting operation, which then led to a single-node crash and an eventual cluster-wide service outage.

Node 2, node 3, and node 4 sent multiple READ_CONFIG_REQ requests during startup and completed the configuration read phase.
Node 3 produced a stack trace during startup showing that the ndbd process hit an internal error while processing a signal, with error code 2339 and the message "Send signal error (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node".
Due to this error, node 3 was forced to shut down.

Node 4 also failed during startup, disconnecting in start phase 2.
Node 5 was then forced to shut down after detecting the failure of the other nodes (node 3 and node 4), with error code 2308: "Another node failed during system restart, please investigate error(s) on other node(s) (Restart error). Temporary error, restart node".
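
While reproducing, the per-node start phases can be watched from the management client with something along these lines; the management server address is an assumption:
```
# Hedged sketch: watch node states and start phases during the restart.
ndb_mgm --ndb-connectstring=mgm-host:1186 -e 'show'        # topology and node status
ndb_mgm --ndb-connectstring=mgm-host:1186 -e 'all status'  # per-node start phase
```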

Suggested fix:
This appears to be an error-handling failure induced by the forced shutdown.
[21 Oct 14:29] MySQL Verification Team
Hi,

This looks like a hardware problem, but it might be a bug on our side; however, I do not see how to reproduce this. Why was this node restarting in the first place?

2024-10-19 07:46:48 [ndbd] INFO     -- DIH reported normal start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x63a025560000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x63a025560000 ...
atos -s 0x63a025560000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x63a025ab6761]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x63a025a13ef3]
ndbd(+0x540cfb) [0x63a025aa0cfb]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel) const+0xa4) [0x63a025aa3904]
ndbd(DbUtil::runOperation(Signal*, Ptr<DbUtil::Transaction>&, Ptr<DbUtil::Operation>&, unsigned int)+0x276) [0x63a02590e0a6]
ndbd(DbUtil::runTransaction(Signal*, Ptr<DbUtil::Transaction>)+0x114) [0x63a02590e464]
ndbd(+0x5ba5d0) [0x63a025b1a5d0]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x63a025a9e865]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x63a025ab43b7]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x1248) [0x63a0256978e8]
ndbd(real_main(int, char**)+0x4f5) [0x63a0256980b5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x63a025699745]
ndbd(real_main(int, char**)+0x87e) [0x63a02569843e]
ndbd(+0x1223a2) [0x63a0256823a2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7e5892b34d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7e5892b34e40]
ndbd(_start+0x25) [0x63a02568b665]
2024-10-19 07:46:48 [ndbd] INFO     -- Signal (GSN: 12, Length: 1, Rec Block No: 0)
2024-10-19 07:46:48 [ndbd] INFO     -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
2024-10-19 07:46:48 [ndbd] INFO     -- Error handler shutting down system
2024-10-19 07:46:48 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-19 07:46:48 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2339: 'Send signal error(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
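
As a worked example of the symbolizing step described in that output, the SimulatedBlock::sendSignal frame could be resolved like this; the ndbd binary path is an assumption:
```
# Resolve one frame from the trace using the printed base address/slide.
# The ndbd binary location is assumed; addresses are from the trace above.
llvm-symbolizer --obj=/usr/local/mysql-cluster/bin/ndbd \
    --adjust-vma=0x63a025560000 0x63a025aa3904

# Equivalent with addr2line, subtracting the slide by hand:
# 0x63a025aa3904 - 0x63a025560000 = 0x543904
addr2line -f -C -e /usr/local/mysql-cluster/bin/ndbd 0x543904
```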
[21 Oct 14:34] MySQL Verification Team
Bug #116426 is marked as a duplicate of this one
[22 Oct 8:51] CunDi Fang
Thank you for your reply. I reread my log file and found that I was attempting three actions.

Starting at 2024-10-19T07:46:27, I first tried to remove node 5 from the cluster; the command I used was "ndb_mgm -e node 5 stop". This action succeeded.

Then I tried to restart the whole cluster: I ran "ndb_mgm -e 'all stop'" and then started "ndbd" again. This also worked.

Finally, I tried to back up the data using the command ndb_mgm -e "START BACKUP", which failed.

Then, after 2024-10-19T07:46:48, I checked data consistency across nodes by executing "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE TABLE_SCHEMA = 'mytest4' AND TABLE_NAME = 'test9' AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION;", and the results were correct.

I don't think it's a hardware-level issue.
[24 Oct 9:35] MySQL Verification Team
I am not sure I understand. How do you expect a backup to run if the cluster is down? There must have been something between the cluster shutdown and the backup?
[24 Oct 16:20] CunDi Fang
Hi, I apologize for the confusion my description caused. From my logs, my complete sequence of operations looks like this:
``` 
time: 2024-10-19T07:46:27.843235Z 
received buf_3: unexpected-error 
removeDataNode: Node 5 removed successfully. 
restartCluster: Cluster restarted successfully. 
backupData: Backup failed. Error code: 1 
``` 
Starting at 07:46:27, I first tried to remove node 5 and then tried to restart the whole cluster. Here is how I did it:
First, I stopped all nodes with "ndb_mgm -e 'ALL STOP'", as you already know.
Then I tried to restart the cluster by running ndbd on each data node, node2 through node5. This step succeeded.
Finally, I tried to back up the data, and it failed at this step.
Also, I checked my hardware and found no faults; I don't think this is a hardware-level problem.
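
Putting that sequence together, a repro script would look roughly like the following; the connect string and the exact stop syntax are assumptions based on the description above:
```
#!/bin/sh
# Hedged repro sketch of the reported sequence; the connect string and
# host layout are assumptions, not taken verbatim from my tooling.
MGMD='mgm-host:1186'

ndb_mgm --ndb-connectstring="$MGMD" -e '5 stop'      # remove node 5 (succeeded)
ndb_mgm --ndb-connectstring="$MGMD" -e 'all stop'    # stop all data nodes (succeeded)

# Restart ndbd on each data node host (node2..node5):
ndbd --ndb-connectstring="$MGMD"

# The backup step that failed with error code 1:
ndb_mgm --ndb-connectstring="$MGMD" -e 'start backup'
```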
[28 Oct 14:54] MySQL Verification Team
I am failing to reproduce this.

Are you sure you are using 8.0.40? Your description states 8.0.35.
[30 Oct 9:36] CunDi Fang
I'm sorry for the confusion. Just to be clear, I am using 8.0.40; the version information is stated in the error log. Maybe it really is a version issue? I did confirm that this error condition exists. Do I need to try to reproduce it again? Though I think the log may not contain any new information.
[30 Oct 9:56] MySQL Verification Team
If you can set up a fresh cluster and reproduce the issue, that would be awesome, but please record "everything" you do to reproduce the problem.