Bug #91062 MySQL Cluster NDB node crashes with SIGFPE, floating point exception
Submitted: 29 May 2018 13:52 Modified: 11 Jun 2018 16:27
Reporter: Andrew Tikhonov Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:7.4.13 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: NDB MySQL SIGFPE floating point exception provisioning

[29 May 2018 13:52] Andrew Tikhonov
Description:
SIGFPE, floating point exception is generated during high provisioning load. Data Node crashes and, the following error is printed by the mysql:

ERROR 1297 (HY000) at line 103945: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

Logs of NDB data node:

2018-05-25 15:50:00 [ndbd] INFO     -- Received signal 8. Running error handler.
2018-05-25 15:50:00 [ndbd] INFO     -- Signal 8 received; Floating point exception
2018-05-25 15:50:00 [ndbd] INFO     -- /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp
2018-05-25 15:50:00 [ndbd] INFO     -- Error handler signal restarting system
2018-05-25 15:50:00 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-05-25 15:50:00 [ndbd] ALERT    -- Node 3: Forced node shutdown completed, restarting. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-05-25 15:50:00 [ndbd] INFO     -- Ndb has terminated (pid 32121) restarting
2018-05-25 15:50:00 [ndbd] INFO     -- Angel reconnected to '172.26.221.19:1186'
2018-05-25 15:50:03 [ndbd] INFO     -- Angel reallocated nodeid: 3
2018-05-25 15:50:03 [ndbd] INFO     -- Angel pid: 31729 started child: 959
2018-05-25 15:50:03 [ndbd] INFO     -- Normal start of data node using checkpoint and log info if existing
2018-05-25 15:50:03 [ndbd] INFO     -- Configuration fetched from '172.26.221.19:1186', generation: 1
2018-05-25 15:50:03 [ndbd] INFO     -- Changing directory to '/var/lib/mysql/cluster_data'
NDBMT: non-mt
2018-05-25 15:50:03 [ndbd] INFO     -- NDB Cluster -- DB node 3
2018-05-25 15:50:03 [ndbd] INFO     -- mysql-5.6.34 ndb-7.4.13 --
2018-05-25 15:50:03 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2018-05-25 15:50:03 [ndbd] INFO     -- Memory Allocation for global memory pools Starting
2018-05-25 15:50:03 [ndbd] INFO     -- numa_set_interleave_mask(numa_all_nodes) : OK
2018-05-25 15:50:03 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 10780Mb initial: 10908Mb
2018-05-25 15:50:03 [ndbd] INFO     -- Touch Memory Starting, 8320 pages, page size = 32768
2018-05-25 15:50:03 [ndbd] INFO     -- Touch Memory Completed
2018-05-25 15:50:03 [ndbd] INFO     -- Lock Memory Starting, 8320 pages, page size = 32768

Where problem occurs, investigation:

./storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
Dblqh::IOTracker::tick

Traceback:

storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#0  0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1  0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2  0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3  0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4  executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5  FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6  0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7  0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
    initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8  0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9  0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
    connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()

The traceback indicates maxlag is 0.

0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)

Check in the `tick` method, places where the division by maxlag happens.

    if (lag > maxlag)
    {
      /**
       * We did have lag last second...
       *   increase m_lag_cnt and check if it has reached maxlag_cnt
       */
      Uint32 tmp = m_lag_cnt;
      m_lag_cnt += (lag / maxlag);
      if (tmp < maxlag_cnt && m_lag_cnt >= maxlag_cnt)
      {
        retVal = -1; // start aborting transactions
      }
    }

How to repeat:
Script to repeat, N.B. replace the <FILL IN ~4K OF TEXT> placeholders with actual 4K of text:

---- repeat.sh ----
FILE=bug.sql

echo "creating.."

echo "DROP TABLE IF EXISTS bug_test;" > $FILE
echo "CREATE TABLE bug_test (id1 bigint(20) NOT NULL,id2 bigint(20) DEFAULT NULL,id3 bigint(20) DEFAULT NULL,payload1 varchar(4096) DEFAULT NULL, payload2 varchar(4096) DEFAULT NULL, PRIMARY KEY (id1), UNIQUE KEY bug_UNIQUE2 (id2), UNIQUE KEY bug_UNIQUE3 (id3), INDEX id1_idx (id1), INDEX id2_idx (id2) USING BTREE, INDEX id3_idx (id2) USING BTREE) ENGINE=ndbcluster DEFAULT CHARSET=latin1;" >> $FILE;

for x in {1..100000}; do

echo "insert into bug_test (id1, id2, id3, payload1, payload2) values (100000000000"$x", 100000000000"$x", 100000000000"$x", '<FILL IN ~4K OF TEXT>',
'<FILL IN ~4K OF TEXT>');" >> $FILE;

done

echo "provisioning.."
time mysql --database=db < $FILE

---- end of repeat.sh ----

Run data node in debug mode:

gdb ndbd
(gdb) set follow-fork-mode child
(gdb) handle SIGFPE stop nopass
(gdb) run
(gdb) tb

Start provisioning traffic to MySQL cluster. 
>./repeat.sh

One of the data nodes will fail and get to the gdb debug console. Use `tb` to print the traceback of the exception.

Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffff7fed740 (LWP 31730)]
0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
26950   /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp: No such file or directory.
Missing separate debuginfos, use: debuginfo-install MySQL-Cluster-server-advanced-7.4.13-1.el7.x86_64

(gdb) bt
#0  0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1  0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2  0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3  0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4  executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5  FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6  0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7  0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
    initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8  0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9  0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
    at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
    connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()

Suggested fix:
The `maxlag` should be checked for 0 before division.
[29 May 2018 15:40] Andrew Tikhonov
Reproducible with script:

---- reproduce.sh ----
FILE=bug.sql

echo "creating.."

echo "DROP TABLE IF EXISTS bug_test;" > $FILE
echo "
CREATE TABLE bug_test (
  id1 bigint(20) NOT NULL,
  id2 bigint(20) DEFAULT NULL,
  id3 varchar(255) NOT NULL DEFAULT '0',
  id4 VARCHAR(255) NOT NULL DEFAULT 'default',
  data1 VARCHAR(255) NOT NULL DEFAULT '',
  data2 VARCHAR(2048) NOT NULL DEFAULT '',
  data3 varchar(255) DEFAULT NULL,
  data4 varchar(4096) DEFAULT NULL,
  id5 bigint(20) DEFAULT NULL,
  id6 bigint(20) NOT NULL,
  data5 VARCHAR(24) NULL DEFAULT 'default',
  data6 varchar(255) DEFAULT NULL,
  time1 TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  time2 TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (id1),
  UNIQUE KEY bug_UNIQUE (id2, id3),
  INDEX id1_idx (id1),
  INDEX id2_idx (id2, id3) USING BTREE,
  INDEX id6_idx (id6) USING BTREE
) ENGINE=ndbcluster DEFAULT CHARSET=latin1;" >> $FILE

for x in {1..100000};do
echo "insert into bug_test (id1, id2, id3, id4, data1, data2, data3, data4, id5, id6, data5, data6, time1, time2) values (1223000000000"$x", 1223000000000"$x", '0', 'SDGHDFGSDFGDFGDFGDFG', '', 'SDFGDFGDFGDSFGDFGDFGDSFGSDGDSFGDFGDFGDFGDFGDSFG', '', '<REPLACE WITH 4K OF TEXT>', 1526066404, 300 ,'VALID', 'UNKNOWN', '2018-05-11 17:50:06', '2018-05-11 17:50:06');" >> $FILE

done

echo "provisioning.."
time mysql --database=db < $FILE
---- end of reproduce.sh ----
[11 Jun 2018 13:13] MySQL Verification Team
Hi,

What is your config, most importantly timebeteweenlocalcheckpoints and nooffragmentlogfiles? ... the only thing I'm getting running this script is 410 - redo log files overloaded.. 

btw the "4k string" can be replaced with  REPEAT('X', 4096) for simpler script.

I don't think this is a bug, I believe this is just misconfigured cluster. To properly configure/size your system you have to contact MySQL Support team.

all best
Bogdan
[11 Jun 2018 14:51] Andrew Tikhonov
Hi,

timebeteweenlocalcheckpoints = 6
nooffragmentlogfiles = 150

I think division by zero, even if it happens infrequently, is a bug.

In our case, we're not able to restore a database.

Regards,
Andrew
[11 Jun 2018 15:13] MySQL Verification Team
testing with newer version to see if it's maybe fixed but I Can't load the table with your configuration?!

[root@gedora mysql]# bin/mysql -uroot -pxxx test < bug.sql
mysql: [Warning] Using a password on the command line interface can be insecure.
ERROR 1297 (HY000) at line 7232: Got temporary error 1220 'REDO log files overloaded (increase FragmentLogFileSize)' from NDBCLUSTER
[11 Jun 2018 15:37] MySQL Verification Team
witn mysql-5.7.13 ndb-7.5.3 I can't reproduce your bug using your script
lemme 7.4.13 possibly there was a bug and we fixed it
[11 Jun 2018 16:27] MySQL Verification Team
hi,

7.4.12, 7.4.13 can't reproduce neither..

you can reproduce using your script every time?

[root@gedora mysql]# time  bin/mysql test < bug.sql

real    0m57.436s
user    0m1.202s
sys     0m1.005s
[root@gedora mysql]#