Bug #116427 | A strange node crashes at the same time in NDB cluster | ||
---|---|---|---|
Submitted: | 19 Oct 18:49 | Modified: | 30 Oct 10:43 |
Reporter: | CunDi Fang | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | 8.0.40-cluster MySQL Cluster Community Server | OS: | Any
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[19 Oct 18:49]
CunDi Fang
[21 Oct 14:29]
MySQL Verification Team
Hi,

This looks like a hardware problem, but it might be a bug on our side; I do not see how to reproduce this. Why was this node restarting in the first place?

2024-10-19 07:46:48 [ndbd] INFO -- DIH reported normal start, now starting the Node Inclusion Protocol
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x63a025560000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x63a025560000 ...
atos -s 0x63a025560000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x63a025ab6761]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x63a025a13ef3]
ndbd(+0x540cfb) [0x63a025aa0cfb]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel) const+0xa4) [0x63a025aa3904]
ndbd(DbUtil::runOperation(Signal*, Ptr<DbUtil::Transaction>&, Ptr<DbUtil::Operation>&, unsigned int)+0x276) [0x63a02590e0a6]
ndbd(DbUtil::runTransaction(Signal*, Ptr<DbUtil::Transaction>)+0x114) [0x63a02590e464]
ndbd(+0x5ba5d0) [0x63a025b1a5d0]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x63a025a9e865]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x63a025ab43b7]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x1248) [0x63a0256978e8]
ndbd(real_main(int, char**)+0x4f5) [0x63a0256980b5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x63a025699745]
ndbd(real_main(int, char**)+0x87e) [0x63a02569843e]
ndbd(+0x1223a2) [0x63a0256823a2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7e5892b34d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7e5892b34e40]
ndbd(_start+0x25) [0x63a02568b665]
2024-10-19 07:46:48 [ndbd] INFO -- Signal (GSN: 12, Length: 1, Rec Block No: 0)
2024-10-19 07:46:48 [ndbd] INFO -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:809
2024-10-19 07:46:48 [ndbd] INFO -- Error handler shutting down system
2024-10-19 07:46:48 [ndbd] INFO -- Error handler shutdown completed - exiting
2024-10-19 07:46:48 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occurred during startphase 1. Caused by error 2339: 'Send signal error(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
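For reference, a minimal sketch of how one of those frames could be symbolized following the hints in the log; the binary path /usr/sbin/ndbd is a placeholder assumption, substitute the actual unstripped ndbd from this 8.0.40 build.

```
# Hypothetical symbolization of one frame from the trace above.
# Base address/slide reported by the log: 0x63a025560000
# Frame: ndbd(SimulatedBlock::sendSignal(...)+0xa4) [0x63a025aa3904]

# Option 1: let llvm-symbolizer apply the slide itself
llvm-symbolizer --obj=/usr/sbin/ndbd --adjust-vma=0x63a025560000 0x63a025aa3904

# Option 2: subtract the base address manually and use addr2line
# 0x63a025aa3904 - 0x63a025560000 = 0x543904
addr2line -f -C -e /usr/sbin/ndbd 0x543904
```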
[21 Oct 14:34]
MySQL Verification Team
Bug #116426 is marked as duplicate of this one
[22 Oct 8:51]
CunDi Fang
Thank you for your reply. I read my log file and found that I performed three actions. Starting at 2024-10-19T07:46:27, I first tried to remove node 5 from the cluster with "ndb_mgm -e 'node 5 stop'". This action succeeded. Then I tried to restart the whole cluster using "ndb_mgm -e 'all stop'" followed by "ndbd". This also worked. Finally I tried to back up the data with "ndb_mgm -e 'START BACKUP'", which failed. Then, after 2024-10-19T07:46:48, I checked data consistency across nodes by executing "SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE TABLE_SCHEMA = 'mytest4' AND TABLE_NAME = 'test9' AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION;", and the results were correct. I don't think it's a hardware-level issue.
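As a side note, a minimal sketch of how that consistency check can be repeated against each SQL node; the host names and credentials below are placeholders, not the actual values from my setup.

```
# Hypothetical loop running the same check on every SQL node.
# sqlnode1/sqlnode2 and the root account are placeholder assumptions.
QUERY="SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
       WHERE TABLE_SCHEMA = 'mytest4' AND TABLE_NAME = 'test9'
         AND CONSTRAINT_NAME = 'PRIMARY' ORDER BY ORDINAL_POSITION;"
for host in sqlnode1 sqlnode2; do
  echo "== $host =="
  mysql -h "$host" -u root -p -e "$QUERY"
done
```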
[24 Oct 9:35]
MySQL Verification Team
I am not sure I understand. How do you expect the backup to run if the cluster is down? There must be something you did between shutting the cluster down and starting the backup?
[24 Oct 16:20]
CunDi Fang
Hi, I apologize for the confusion my description caused. From my logs, my complete operation looks like this:

```
time: 2024-10-19T07:46:27.843235Z
received buf_3: unexpected-error
removeDataNode: Node 5 removed successfully.
restartCluster: Cluster restarted successfully.
backupData: Backup failed. Error code: 1
```

Starting at 07:46:27, I first removed node 5 and then restarted the whole cluster. Here is how I did it: first I stopped all data nodes with "ndb_mgm -e 'ALL STOP'", as you already know; then I restarted the cluster by running ndbd on each data node (nodes 2-5). This step succeeded. Finally, I tried to back up the data, and it failed at this step. Also, I checked my hardware and found no corruption, so I don't think this is a hardware-level problem.
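In shell terms, the sequence corresponds roughly to the sketch below; the management-node connect string and the data-node host names are placeholders, not the real values from my environment.

```
# Rough sketch of the sequence described above; mgmhost and the
# datanodeN host names are placeholder assumptions.
ndb_mgm --ndb-connectstring=mgmhost -e "5 STOP"        # remove data node 5
ndb_mgm --ndb-connectstring=mgmhost -e "ALL STOP"      # shut down all data nodes
for host in datanode2 datanode3 datanode4 datanode5; do
  ssh "$host" ndbd --ndb-connectstring=mgmhost         # restart each data node
done
ndb_mgm --ndb-connectstring=mgmhost -e "START BACKUP"  # this step failed
```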
[28 Oct 14:54]
MySQL Verification Team
I am failing to reproduce this. Are you sure you are using 8.0.40? Your description states 8.0.35.
[30 Oct 9:36]
CunDi Fang
I'm sorry for the confusion. Just to be clear, I am using 8.0.40; the version information is stated in the error log. Maybe it really is a version issue? I did confirm that this error condition exists. Do I need to try to reproduce it again? Though I think there may not be any new information in the log.
[30 Oct 9:56]
MySQL Verification Team
If you can set up a fresh cluster and reproduce the issue that would be awesome, but please record "everything" you do to reproduce the problem.