Bug #91028 | NDB data node fails and crashes with SIGFPE, floating point exception | ||
---|---|---|---|
Submitted: | 25 May 2018 14:26 | Modified: | 3 Apr 15:50 |
Reporter: | Andrew Tikhonov | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 7.4.13 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | NDB MySQL SIGFPE floating point exception provisioning |
[25 May 2018 14:26]
Andrew Tikhonov
[25 May 2018 16:06]
Andrew Tikhonov
Data node logs 2018-05-25 15:50:00 [ndbd] INFO -- Received signal 8. Running error handler. 2018-05-25 15:50:00 [ndbd] INFO -- Signal 8 received; Floating point exception 2018-05-25 15:50:00 [ndbd] INFO -- /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp 2018-05-25 15:50:00 [ndbd] INFO -- Error handler signal restarting system 2018-05-25 15:50:00 [ndbd] INFO -- Error handler shutdown completed - exiting 2018-05-25 15:50:00 [ndbd] ALERT -- Node 3: Forced node shutdown completed, restarting. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'. 2018-05-25 15:50:00 [ndbd] INFO -- Ndb has terminated (pid 32121) restarting 2018-05-25 15:50:00 [ndbd] INFO -- Angel reconnected to '172.26.221.19:1186' 2018-05-25 15:50:03 [ndbd] INFO -- Angel reallocated nodeid: 3 2018-05-25 15:50:03 [ndbd] INFO -- Angel pid: 31729 started child: 959 2018-05-25 15:50:03 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing 2018-05-25 15:50:03 [ndbd] INFO -- Configuration fetched from '172.26.221.19:1186', generation: 1 2018-05-25 15:50:03 [ndbd] INFO -- Changing directory to '/var/lib/mysql/cluster_data' NDBMT: non-mt 2018-05-25 15:50:03 [ndbd] INFO -- NDB Cluster -- DB node 3 2018-05-25 15:50:03 [ndbd] INFO -- mysql-5.6.34 ndb-7.4.13 -- 2018-05-25 15:50:03 [ndbd] INFO -- WatchDog timer is set to 6000 ms 2018-05-25 15:50:03 [ndbd] INFO -- Memory Allocation for global memory pools Starting 2018-05-25 15:50:03 [ndbd] INFO -- numa_set_interleave_mask(numa_all_nodes) : OK 2018-05-25 15:50:03 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 10780Mb initial: 10908Mb 2018-05-25 15:50:03 [ndbd] INFO -- Touch Memory Starting, 8320 pages, page size = 32768 2018-05-25 15:50:03 [ndbd] INFO -- Touch Memory Completed 2018-05-25 15:50:03 [ndbd] INFO -- Lock Memory Starting, 8320 pages, page size = 32768
[28 May 2018 16:10]
MySQL Verification Team
Hi Andrew, Can you reproduce this on-demand? You just have a high load and this happens? all best Bogdan
[29 May 2018 9:43]
Andrew Tikhonov
Dear Bogdan, yes. It doesn't happen always. Judging by the code that produces the exception it should not happen always. It tends to crash more often when records are fairly big, try including a varchar into a record or text and fill varchar up to capacity. I created a file to provision 100K records with each record close to ~4K - it crashes every time. The traceback indicates maxlag is 0. 0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0) Check in the `tick` method, places where the division by maxlag happens. if (lag > maxlag) { /** * We did have lag last second... * increase m_lag_cnt and check if it has reached maxlag_cnt */ Uint32 tmp = m_lag_cnt; m_lag_cnt += (lag / maxlag); if (tmp < maxlag_cnt && m_lag_cnt >= maxlag_cnt) { retVal = -1; // start aborting transactions } } Regards, Andrew
[29 May 2018 12:05]
Andrew Tikhonov
Do you need me to resubmit the bug with more detailed "how to repeat" instructions ? Do you have everything ? Regards, Andrew
[29 May 2018 13:42]
Andrew Tikhonov
Here's a script to repeat. ---- repeat.sh ---- FILE=bug.sql echo "creating.." echo "DROP TABLE IF EXISTS bug_test;" > $FILE echo "CREATE TABLE bug_test (id1 bigint(20) NOT NULL,id2 bigint(20) DEFAULT NULL,id3 bigint(20) DEFAULT NULL,payload1 varchar(4096) DEFAULT NULL, payload2 varchar(4096) DEFAULT NULL, PRIMARY KEY (id1), UNIQUE KEY bug_UNIQUE2 (id2), UNIQUE KEY bug_UNIQUE3 (id3), INDEX id1_idx (id1), INDEX id2_idx (id2) USING BTREE, INDEX id3_idx (id2) USING BTREE) ENGINE=ndbcluster DEFAULT CHARSET=latin1;" >> $FILE; for x in {1..100000}; do echo "insert into bug_test (id1, id2, id3, payload1, payload2) values (100000000000"$x", 100000000000"$x", 100000000000"$x", '<FILL IN ~4K OF TEXT>', '<FILL IN ~4K OF TEXT>');" >> $FILE; done echo "provisioning.." time mysql --database=db < $FILE
[29 May 2018 17:54]
MySQL Verification Team
Hi, Thanks for additional data and script to reproduce. I'll re-test everything and update bug report when I'm done. Thanks Bogdan
[29 May 2018 17:55]
MySQL Verification Team
(no need to resubmit anything)
[27 Jun 2018 17:29]
MySQL Verification Team
Hi, did manage to reproduce this so verifying it now. thanks for your efforts! all best Bogdan
[31 Dec 2018 21:01]
Jon Stephens
Fixed in NDB 7.4.20, 7.5.10, 7.6.5, and 8.0.13. Closed.
[3 Mar 17:16]
Shaun Hirst
we are also see this error messag, this started happening after we migrated more databases onto the cluster. let me know if we can provide more info We are suing 8.4.2 2025-02-28 07:14:12 [ndbd] INFO -- Transporter 2 to node 3 disconnected in state: 0 2025-02-28 07:14:12 [ndbd] INFO -- findNeighbours from: 5587 old (left: 3 right: 3) new (65535 65535) 2025-02-28 07:14:12 [ndbd] ALERT -- Network partitioning - arbitration required 2025-02-28 07:14:12 [ndbd] INFO -- President restarts arbitration thread [state=7] 2025-02-28 07:14:12 [ndbd] ALERT -- Arbitration won - positive reply from node 1 2025-02-28 07:14:12 [ndbd] INFO -- NR Status: node=3,OLD=Initial state,NEW=Node failed, fail handling ongoing 2025-02-28 07:14:12 [ndbd] INFO -- Master takeover started from 3 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Started failure handling for node 3 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Starting take over of node 3 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Step NF_BLOCK_HANDLE completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_SCAN, NF_CHECK_TRANSACTION. 2025-02-28 07:14:12 [ndbd] INFO -- start_resend(1, 2025-02-28 07:14:12 [ndbd] INFO -- empty bucket (7189111/13 7189111/12) -> active 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Step NF_CHECK_SCAN completed, failure handling for node 3 waiting for NF_TAKEOVER, NF_CHECK_TRANSACTION. 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: GCP completion 7189111/13 waiting for node failure handling (1) to complete. Seizing record for GCP. 2025-02-28 07:14:12 [ndbd] INFO -- Adjusting disk write speed bounds due to : Node restart ongoing 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Step NF_CHECK_TRANSACTION completed, failure handling for node 3 waiting for NF_TAKEOVER. 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Completed take over of failed node 3 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Step NF_TAKEOVER completed, failure handling for node 3 complete. 2025-02-28 07:14:12 [ndbd] INFO -- DBTC 0: Completing GCP 7189111/13 on node failure takeover completion. 2025-02-28 07:14:12 [ndbd] INFO -- Started arbitrator node 1 [ticket=9bf65594a472dc87] 2025-02-28 07:14:13 [ndbd] INFO -- NR Status: node=3,OLD=Node failed, fail handling ongoing,NEW=Node failure handling complete 2025-02-28 07:14:13 [ndbd] INFO -- Node 3 has completed node fail handling 2025-02-28 07:14:25 [ndbd] INFO -- Adjusting disk write speed bounds due to : Node restart finished For help with below stacktrace consult: https://dev.mysql.com/doc/refman/en/using-stack-trace.html Also note that stack_bottom and thread_stack will always show up as zero. 2025-02-28 07:14:42 [ndbd] INFO -- Received signal 8. Running error handler. Base address/slide: 0x56248e7c2000 With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in stacktrace with the base address before passing them to tool. For tools that have options for slide use that, e.g.: llvm-symbolizer --adjust-vma=0x56248e7c2000 ... atos -s 0x56248e7c2000 ... stack_bottom = 0 thread_stack 0x0 #0 0x7ff02aabc51f <unknown> #1 0x56248ea9d02b _ZN5Dbspj13scanFrag_sendEP6Signal3PtrINS_7RequestEES2_INS_8TreeNodeEE #2 0x56248ea91d81 _ZN5Dbspj16execSCAN_NEXTREQEP6Signal #3 0x56248ed94b8f <unknown> #4 0x56248ed059d4 _ZN13FastScheduler5doJobEj #5 0x56248ed1f036 _ZN12ThreadConfig13ipControlLoopEP9NdbThread #6 0x56248e8f360b _Z8ndbd_runbiPKciS0_bbbjiimS0_i #7 0x56248e8f3dd4 _Z9real_mainiPPc #8 0x56248e8f5477 _Z9angel_runPKcRK6VectorI10BaseStringES0_iS0_bbbiiS0_i #9 0x56248e8f4159 _Z9real_mainiPPc #10 0x56248e8e0881 main #11 0x7ff02aaa3d8f <unknown> #12 0x7ff02aaa3e3f __libc_start_main #13 0x56248e8e9d24 _start #14 0xffffffffffffffff <unknown> 2025-02-28 07:14:42 [ndbd] INFO -- Signal 8 received; Floating point exception 2025-02-28 07:14:42 [ndbd] INFO -- ./storage/ndb/src/kernel/ndbd.cpp 2025-02-28 07:14:42 [ndbd] INFO -- Error handler signal shutting down system 2025-02-28 07:14:42 [ndbd] INFO -- Error handler shutdown completed - exiting 2025-02-28 07:14:43 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error mess age, please report a bug). Temporary error, restart node'.
[3 Apr 13:07]
MySQL Verification Team
Posted by developer: Please raise a new bug. Looks like the same Unix error code but from a different code path.
[3 Apr 21:07]
Shaun Hirst
new bug created https://bugs.mysql.com/bug.php?id=117879