MySQL Bugs: #91028: NDB data node fails and crashes with SIGFPE, floating point exception

Bug #91028	NDB data node fails and crashes with SIGFPE, floating point exception
Submitted:	25 May 2018 14:26	Modified:	3 Apr 15:50
Reporter:	Andrew Tikhonov	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.4.13	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	NDB MySQL SIGFPE floating point exception provisioning

Description:
SIGFPE, floating pointer exception during high provisioning load.

Where problem occurs:

./storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
Dblqh::IOTracker::tick

Traceback:

storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#0  0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1  0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2  0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3  0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4  executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5  FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6  0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7  0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
    initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8  0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9  0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
    connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()

How to repeat:
Run 

gdb ndbd
(gdb) set follow-fork-mode child
(gdb) handle SIGFPE stop nopass
(gdb) run
(gdb) tb

Start provisioning traffic to MySQL cluster. One of the data nodes will fail.

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffff7fed740 (LWP 31730)]
0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
26950   /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp: No such file or directory.
Missing separate debuginfos, use: debuginfo-install MySQL-Cluster-server-advanced-7.4.13-1.el7.x86_64

(gdb) bt
#0  0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1  0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2  0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3  0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4  executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5  FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6  0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7  0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
    initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8  0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9  0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
    connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()

Data node logs

2018-05-25 15:50:00 [ndbd] INFO     -- Received signal 8. Running error handler.
2018-05-25 15:50:00 [ndbd] INFO     -- Signal 8 received; Floating point exception
2018-05-25 15:50:00 [ndbd] INFO     -- /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp
2018-05-25 15:50:00 [ndbd] INFO     -- Error handler signal restarting system
2018-05-25 15:50:00 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-05-25 15:50:00 [ndbd] ALERT    -- Node 3: Forced node shutdown completed, restarting. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-05-25 15:50:00 [ndbd] INFO     -- Ndb has terminated (pid 32121) restarting
2018-05-25 15:50:00 [ndbd] INFO     -- Angel reconnected to '172.26.221.19:1186'
2018-05-25 15:50:03 [ndbd] INFO     -- Angel reallocated nodeid: 3
2018-05-25 15:50:03 [ndbd] INFO     -- Angel pid: 31729 started child: 959
2018-05-25 15:50:03 [ndbd] INFO     -- Normal start of data node using checkpoint and log info if existing
2018-05-25 15:50:03 [ndbd] INFO     -- Configuration fetched from '172.26.221.19:1186', generation: 1
2018-05-25 15:50:03 [ndbd] INFO     -- Changing directory to '/var/lib/mysql/cluster_data'
NDBMT: non-mt
2018-05-25 15:50:03 [ndbd] INFO     -- NDB Cluster -- DB node 3
2018-05-25 15:50:03 [ndbd] INFO     -- mysql-5.6.34 ndb-7.4.13 --
2018-05-25 15:50:03 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2018-05-25 15:50:03 [ndbd] INFO     -- Memory Allocation for global memory pools Starting
2018-05-25 15:50:03 [ndbd] INFO     -- numa_set_interleave_mask(numa_all_nodes) : OK
2018-05-25 15:50:03 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 10780Mb initial: 10908Mb
2018-05-25 15:50:03 [ndbd] INFO     -- Touch Memory Starting, 8320 pages, page size = 32768
2018-05-25 15:50:03 [ndbd] INFO     -- Touch Memory Completed
2018-05-25 15:50:03 [ndbd] INFO     -- Lock Memory Starting, 8320 pages, page size = 32768

Hi Andrew,

Can you reproduce this on-demand? You just have a high load and this happens?

all best
Bogdan

Dear Bogdan, yes. It doesn't happen always. Judging by the code that produces the exception it should not happen always. It tends to crash more often when records are fairly big, try including a varchar into a record or text and fill varchar up to capacity.

I created a file to provision 100K records with each record close to ~4K - it crashes every time.

The traceback indicates maxlag is 0.

0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)

Check in the `tick` method, places where the division by maxlag happens.

    if (lag > maxlag)
    {
      /**
       * We did have lag last second...
       *   increase m_lag_cnt and check if it has reached maxlag_cnt
       */
      Uint32 tmp = m_lag_cnt;
      m_lag_cnt += (lag / maxlag);
      if (tmp < maxlag_cnt && m_lag_cnt >= maxlag_cnt)
      {
        retVal = -1; // start aborting transactions
      }
    }

Regards,
Andrew

Do you need me to resubmit the bug with more detailed "how to repeat" instructions ? Do you have everything ?

Regards,
Andrew

Here's a script to repeat.

---- repeat.sh ----
FILE=bug.sql

echo "creating.."

echo "DROP TABLE IF EXISTS bug_test;" > $FILE
echo "CREATE TABLE bug_test (id1 bigint(20) NOT NULL,id2 bigint(20) DEFAULT NULL,id3 bigint(20) DEFAULT NULL,payload1 varchar(4096) DEFAULT NULL, payload2 varchar(4096) DEFAULT NULL, PRIMARY KEY (id1), UNIQUE KEY bug_UNIQUE2 (id2), UNIQUE KEY bug_UNIQUE3 (id3), INDEX id1_idx (id1), INDEX id2_idx (id2) USING BTREE, INDEX id3_idx (id2) USING BTREE) ENGINE=ndbcluster DEFAULT CHARSET=latin1;" >> $FILE;

for x in {1..100000}; do

echo "insert into bug_test (id1, id2, id3, payload1, payload2) values (100000000000"$x", 100000000000"$x", 100000000000"$x", '<FILL IN ~4K OF TEXT>',
'<FILL IN ~4K OF TEXT>');" >> $FILE;

done

echo "provisioning.."
time mysql --database=db < $FILE

Hi,
Thanks for additional data and script to reproduce. I'll re-test everything and update bug report when I'm done.

Thanks
Bogdan

(no need to resubmit anything)

Hi,

did manage to reproduce this so verifying it now.

thanks for your efforts!
all best
Bogdan

Fixed in NDB 7.4.20, 7.5.10, 7.6.5, and 8.0.13. Closed.

we are also see this error messag, this started happening after we migrated more databases onto the cluster. let me know if we can provide more info

We are suing 8.4.2

2025-02-28 07:14:12 [ndbd] INFO     -- Transporter 2 to node 3 disconnected in state: 0
2025-02-28 07:14:12 [ndbd] INFO     -- findNeighbours from: 5587 old (left: 3 right: 3) new (65535 65535)
2025-02-28 07:14:12 [ndbd] ALERT    -- Network partitioning - arbitration required
2025-02-28 07:14:12 [ndbd] INFO     -- President restarts arbitration thread [state=7]
2025-02-28 07:14:12 [ndbd] ALERT    -- Arbitration won - positive reply from node 1
2025-02-28 07:14:12 [ndbd] INFO     -- NR Status: node=3,OLD=Initial state,NEW=Node failed, fail handling ongoing
2025-02-28 07:14:12 [ndbd] INFO     -- Master takeover started from 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Started failure handling for node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Starting take over of node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION.
2025-02-28 07:14:12 [ndbd] INFO     -- start_resend(1,
2025-02-28 07:14:12 [ndbd] INFO     -- empty bucket (7189111/13 7189111/12) -> active
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: GCP completion 7189111/13 waiting for node failure handling (1) to complete. Seizing record for GCP.
2025-02-28 07:14:12 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart ongoing
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 3 waiting for NF_TAKEOVER.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Completed take over of failed node 3
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 3 complete.
2025-02-28 07:14:12 [ndbd] INFO     -- DBTC 0: Completing GCP 7189111/13 on node failure takeover completion.
2025-02-28 07:14:12 [ndbd] INFO     -- Started arbitrator node 1 [ticket=9bf65594a472dc87]
2025-02-28 07:14:13 [ndbd] INFO     -- NR Status: node=3,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete
2025-02-28 07:14:13 [ndbd] INFO     -- Node 3 has completed node fail handling
2025-02-28 07:14:25 [ndbd] INFO     -- Adjusting disk write speed bounds due to : Node restart finished
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
2025-02-28 07:14:42 [ndbd] INFO     -- Received signal 8. Running error handler.
Base address/slide: 0x56248e7c2000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x56248e7c2000 ...
atos -s 0x56248e7c2000 ...
stack_bottom = 0 thread_stack 0x0
 #0 0x7ff02aabc51f <unknown>
 #1 0x56248ea9d02b _ZN5Dbspj13scanFrag_sendEP6Signal3PtrINS_7RequestEES2_INS_8TreeNodeEE
 #2 0x56248ea91d81 _ZN5Dbspj16execSCAN_NEXTREQEP6Signal
 #3 0x56248ed94b8f <unknown>
 #4 0x56248ed059d4 _ZN13FastScheduler5doJobEj
 #5 0x56248ed1f036 _ZN12ThreadConfig13ipControlLoopEP9NdbThread
 #6 0x56248e8f360b _Z8ndbd_runbiPKciS0_bbbjiimS0_i
 #7 0x56248e8f3dd4 _Z9real_mainiPPc
 #8 0x56248e8f5477 _Z9angel_runPKcRK6VectorI10BaseStringES0_iS0_bbbiiS0_i
 #9 0x56248e8f4159 _Z9real_mainiPPc
 #10 0x56248e8e0881 main
 #11 0x7ff02aaa3d8f <unknown>
 #12 0x7ff02aaa3e3f __libc_start_main
 #13 0x56248e8e9d24 _start
 #14 0xffffffffffffffff <unknown>
2025-02-28 07:14:42 [ndbd] INFO     -- Signal 8 received; Floating point exception
2025-02-28 07:14:42 [ndbd] INFO     -- ./storage/ndb/src/kernel/ndbd.cpp
2025-02-28 07:14:42 [ndbd] INFO     -- Error handler signal shutting down system
2025-02-28 07:14:42 [ndbd] INFO     -- Error handler shutdown completed - exiting
2025-02-28 07:14:43 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error mess
age, please report a bug). Temporary error, restart node'.

Posted by developer:
 
Please raise a new bug. Looks like the same Unix error code but from a different code path.

new bug created https://bugs.mysql.com/bug.php?id=117879