Bug #117879 Signal 8 received; Floating point exception
Submitted: 3 Apr 21:05    Modified: 16 Apr 10:06
Reporter: Shaun Hirst    Email Updates:
Status: Verified    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 8.4.2    OS: Ubuntu
Assigned to:    CPU Architecture: Any

[3 Apr 21:05] Shaun Hirst
Description:
2025-02-28 07:14:12 [ndbd] INFO     -- Transporter 2 to node 3 disconnected in state: 0
2025-02-28 07:14:12 [ndbd] INFO     -- findNeighbours from: 5587 old (left: 3 right: 3) new (65535 65535)
2025-02-28 07:14:12 [ndbd] ALERT    -- Network partitioning - arbitration required
2025-02-28 07:14:12 [ndbd] INFO     -- President restarts arbitration thread [state=7]
2025-02-28 07:14:12 [ndbd] ALERT    -- Arbitration won - positive reply from node 1
2025-02-28 07:14:12 [ndbd] INFO     -- NR Status: node=3,OLD=Initial state,NEW=Node failed, fail handling ongoing
2025-02-28 07:14:12 [ndbd] INFO     -- Master takeover started from 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Started failure handling for node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Starting take over of node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2025-02-28 07:14:12 [ndbd] INFO     -- start_resend(1,
2025-02-28 07:14:12 [ndbd] INFO     -- empty bucket (7189111/13 7189111/12) -> active
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: GCP completion 7189111/13 waiting for node failure handling (1) to complete. Seizing record for GCP.
2025-02-28 07:14:12 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart ongoing
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 3 waiting for NF_TAKEOVER.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Completed take over of failed node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 3 complete.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Completing GCP 7189111/13 on node failure takeover completion.
2025-02-28 07:14:12 [ndbd] INFO     -- Started arbitrator node 1 [ticket=9bf65594a472dc87]
2025-02-28 07:14:13 [ndbd] INFO     -- NR Status: node=3,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2025-02-28 07:14:13 [ndbd] INFO     -- Node 3 has completed node fail handling
2025-02-28 07:14:25 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart finished
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
2025-02-28 07:14:42 [ndbd] INFO     -- Received signal 8. Running error handler.
Base address/slide: 0x56248e7c2000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x56248e7c2000 ...
atos -s 0x56248e7c2000 ...
stack_bottom = 0 thread_stack 0x0
 #0 0x7ff02aabc51f <unknown>
 #1 0x56248ea9d02b _ZN5Dbspj13scanFrag_sendEP6Signal3PtrINS_7RequestEES2_INS_8TreeNodeEE
 #2 0x56248ea91d81 _ZN5Dbspj16execSCAN_NEXTREQEP6Signal
 #3 0x56248ed94b8f <unknown>
 #4 0x56248ed059d4 _ZN13FastScheduler5doJobEj
 #5 0x56248ed1f036 _ZN12ThreadConfig13ipControlLoopEP9NdbThread
 #6 0x56248e8f360b _Z8ndbd_runbiPKciS0_bbbjiimS0_i
 #7 0x56248e8f3dd4 _Z9real_mainiPPc
 #8 0x56248e8f5477 _Z9angel_runPKcRK6VectorI10BaseStringES0_iS0_bbbiiS0_i
 #9 0x56248e8f4159 _Z9real_mainiPPc
 #10 0x56248e8e0881 main
 #11 0x7ff02aaa3d8f <unknown>
 #12 0x7ff02aaa3e3f __libc_start_main
 #13 0x56248e8e9d24 _start
 #14 0xffffffffffffffff <unknown>
2025-02-28 07:14:42 [ndbd] INFO     -- Signal 8 received; Floating point exception
2025-02-28 07:14:42 [ndbd] INFO     -- ./storage/ndb/src/kernel/ndbd.cpp
2025-02-28 07:14:42 [ndbd] INFO     -- Error handler signal shutting down system
2025-02-28 07:14:42 [ndbd] INFO     -- Error handler shutdown completed - exiting
2025-02-28 07:14:43 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
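
For reference, a minimal sketch of the address adjustment the trace describes (subtracting the reported base/slide before symbolizing). The binary path and the use of addr2line are assumptions; llvm-symbolizer --adjust-vma or atos -s can be used instead, as noted above.

#!/usr/bin/env python3
# Sketch: subtract the reported base address ("slide") from each stack frame
# address, then ask addr2line to resolve function/file for the adjusted offsets.
# NDBD_BINARY is an assumption; point it at the unstripped ndbd that crashed.
import subprocess

BASE = 0x56248e7c2000            # "Base address/slide" from the trace above
NDBD_BINARY = "/usr/sbin/ndbd"   # assumption: adjust to your install path

frames = [
    0x56248ea9d02b,   # Dbspj::scanFrag_send
    0x56248ea91d81,   # Dbspj::execSCAN_NEXTREQ
    0x56248ed059d4,   # FastScheduler::doJob
]

adjusted = [hex(addr - BASE) for addr in frames]
out = subprocess.run(["addr2line", "-f", "-C", "-e", NDBD_BINARY] + adjusted,
                     capture_output=True, text=True)
print(out.stdout)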

How to repeat:
This happened on our production node, which runs a number of databases; it occurred after we moved some additional databases onto the cluster.

We have tried to recreate it on our acceptance system, but so far with no luck.

We plan to start the migration on production again soon, but more slowly.

If there is anything we can provide to ensure this is not going to be an issue going forward, please let us know.

Suggested fix:
In order to get it back up and running, all we did was revert the services using the database back to their old setups; all the data that was migrated stayed in place.
[4 Apr 8:00] Martin Hosking
Some extra context info:
The issue started after we migrated several large DBs from our current standard MySQL server to this cluster. These DBs handle thousands of requests per minute, and everything worked fine for several hours before the NDBD failed.
During the incident we restarted our NDBD nodes several times and the issue kept recurring. It only resolved when we repointed the services that were using the migrated DBs back to the old MySQL instance and then restarted the NDBD nodes in a more staggered fashion, i.e. we started one, waited until it reached the point where it was waiting for the other node, and then restarted the other node (we only have 2 nodes).
[4 Apr 9:51] MySQL Verification Team
Hi,

I am having issues reproducing this (I reproduced the original Bug #91028). Can you please upload the resulting archive from ndb_error_reporter, or just collect all log files from all nodes manually and upload them to the report?
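
If it helps, a minimal sketch of gathering the node logs manually (the DataDir path and file name patterns are assumptions; adjust them to what is configured in your config.ini):

#!/usr/bin/env python3
# Sketch: bundle the NDB node and cluster log files into one archive for upload.
# DATADIR is an assumption; use the DataDir configured for your data/management nodes.
import glob
import tarfile

DATADIR = "/var/lib/mysql-cluster"   # assumption
patterns = ["ndb_*_out.log", "ndb_*_error.log", "ndb_*_trace.log*",
            "ndb_*_cluster.log"]

with tarfile.open("ndb_logs.tar.gz", "w:gz") as tar:
    for pattern in patterns:
        for path in glob.glob(f"{DATADIR}/{pattern}"):
            tar.add(path)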

Thanks
[16 Apr 10:06] Shaun Hirst
Hi,

I see the ticket has moved to Verified; I assume that means you have been able to reproduce the issue?

Are you able to share what causes the issue and any workaround that would allow us to complete our migration? I assume the issue will come back if we migrate again with no action taken.

If no workaround is possible, do you have a rough idea when this might get fixed: are we talking weeks, months, or years?
[16 Apr 16:27] MySQL Verification Team
Hi,

No, I was not able to reproduce this, but you provided enough data for further analysis by the ndbcluster team.