Description:
SIGFPE, floating point exception is generated during high provisioning load. Data Node crashes and, the following error is printed by the mysql:
ERROR 1297 (HY000) at line 103945: Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER
Logs of NDB data node:
2018-05-25 15:50:00 [ndbd] INFO -- Received signal 8. Running error handler.
2018-05-25 15:50:00 [ndbd] INFO -- Signal 8 received; Floating point exception
2018-05-25 15:50:00 [ndbd] INFO -- /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp
2018-05-25 15:50:00 [ndbd] INFO -- Error handler signal restarting system
2018-05-25 15:50:00 [ndbd] INFO -- Error handler shutdown completed - exiting
2018-05-25 15:50:00 [ndbd] ALERT -- Node 3: Forced node shutdown completed, restarting. Initiated by signal 8. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2018-05-25 15:50:00 [ndbd] INFO -- Ndb has terminated (pid 32121) restarting
2018-05-25 15:50:00 [ndbd] INFO -- Angel reconnected to '172.26.221.19:1186'
2018-05-25 15:50:03 [ndbd] INFO -- Angel reallocated nodeid: 3
2018-05-25 15:50:03 [ndbd] INFO -- Angel pid: 31729 started child: 959
2018-05-25 15:50:03 [ndbd] INFO -- Normal start of data node using checkpoint and log info if existing
2018-05-25 15:50:03 [ndbd] INFO -- Configuration fetched from '172.26.221.19:1186', generation: 1
2018-05-25 15:50:03 [ndbd] INFO -- Changing directory to '/var/lib/mysql/cluster_data'
NDBMT: non-mt
2018-05-25 15:50:03 [ndbd] INFO -- NDB Cluster -- DB node 3
2018-05-25 15:50:03 [ndbd] INFO -- mysql-5.6.34 ndb-7.4.13 --
2018-05-25 15:50:03 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2018-05-25 15:50:03 [ndbd] INFO -- Memory Allocation for global memory pools Starting
2018-05-25 15:50:03 [ndbd] INFO -- numa_set_interleave_mask(numa_all_nodes) : OK
2018-05-25 15:50:03 [ndbd] INFO -- Ndbd_mem_manager::init(1) min: 10780Mb initial: 10908Mb
2018-05-25 15:50:03 [ndbd] INFO -- Touch Memory Starting, 8320 pages, page size = 32768
2018-05-25 15:50:03 [ndbd] INFO -- Touch Memory Completed
2018-05-25 15:50:03 [ndbd] INFO -- Lock Memory Starting, 8320 pages, page size = 32768
Where problem occurs, investigation:
./storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
Dblqh::IOTracker::tick
Traceback:
storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#0 0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1 0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2 0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3 0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4 executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5 FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6 0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7 0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8 0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9 0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()
The traceback indicates maxlag is 0.
0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
Check in the `tick` method, places where the division by maxlag happens.
if (lag > maxlag)
{
/**
* We did have lag last second...
* increase m_lag_cnt and check if it has reached maxlag_cnt
*/
Uint32 tmp = m_lag_cnt;
m_lag_cnt += (lag / maxlag);
if (tmp < maxlag_cnt && m_lag_cnt >= maxlag_cnt)
{
retVal = -1; // start aborting transactions
}
}
How to repeat:
Script to repeat, N.B. replace the <FILL IN ~4K OF TEXT> placeholders with actual 4K of text:
---- repeat.sh ----
FILE=bug.sql
echo "creating.."
echo "DROP TABLE IF EXISTS bug_test;" > $FILE
echo "CREATE TABLE bug_test (id1 bigint(20) NOT NULL,id2 bigint(20) DEFAULT NULL,id3 bigint(20) DEFAULT NULL,payload1 varchar(4096) DEFAULT NULL, payload2 varchar(4096) DEFAULT NULL, PRIMARY KEY (id1), UNIQUE KEY bug_UNIQUE2 (id2), UNIQUE KEY bug_UNIQUE3 (id3), INDEX id1_idx (id1), INDEX id2_idx (id2) USING BTREE, INDEX id3_idx (id2) USING BTREE) ENGINE=ndbcluster DEFAULT CHARSET=latin1;" >> $FILE;
for x in {1..100000}; do
echo "insert into bug_test (id1, id2, id3, payload1, payload2) values (100000000000"$x", 100000000000"$x", 100000000000"$x", '<FILL IN ~4K OF TEXT>',
'<FILL IN ~4K OF TEXT>');" >> $FILE;
done
echo "provisioning.."
time mysql --database=db < $FILE
---- end of repeat.sh ----
Run data node in debug mode:
gdb ndbd
(gdb) set follow-fork-mode child
(gdb) handle SIGFPE stop nopass
(gdb) run
(gdb) tb
Start provisioning traffic to MySQL cluster.
>./repeat.sh
One of the data nodes will fail and get to the gdb debug console. Use `tb` to print the traceback of the exception.
Program received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7ffff7fed740 (LWP 31730)]
0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
26950 /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp: No such file or directory.
Missing separate debuginfos, use: debuginfo-install MySQL-Cluster-server-advanced-7.4.13-1.el7.x86_64
(gdb) bt
#0 0x00000000005412ab in Dblqh::IOTracker::tick (this=0x129af90, now=<optimized out>, maxlag=0, maxlag_cnt=0)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:26950
#1 0x000000000054140a in Dblqh::timer_handling (this=this@entry=0x104a230, signal=signal@entry=0xdf3960 <globalData+7168>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3087
#2 0x0000000000541561 in Dblqh::execTIME_SIGNAL (this=0x104a230, signal=0xdf3960 <globalData+7168>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:3072
#3 0x000000000071269d in executeFunction (f=<optimized out>, signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1477
#4 executeFunction_async (signal=<optimized out>, gsn=<optimized out>, this=0x104a230)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1444
#5 FastScheduler::doJob (this=0xdef1c0 <globalScheduler>, loopStartCount=<optimized out>)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/FastScheduler.cpp:154
#6 0x0000000000711555 in ThreadConfig::ipControlLoop (this=0xe5d580, pThis=0xe74c70)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/vm/ThreadConfig.cpp:276
#7 0x0000000000412cb4 in ndbd_run (foreground=<optimized out>, report_fd=<optimized out>, connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=0x0, no_start=<optimized out>, initial=false,
initialstart=false, allocated_nodeid=3, connect_retries=12, connect_delay=5)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/ndbd.cpp:1022
#8 0x0000000000411116 in real_main (argc=0, argv=0xe557b8, argv@entry=0xe435b0)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:204
#9 0x0000000000414cfb in spawn_process (progname=<optimized out>, args=...)
at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:432
#10 retry_spawn_process (progname=<optimized out>, args=...) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:451
#11 angel_run (progname=progname@entry=0x7fffffffe847 "/usr/sbin/ndbd", original_args=..., connect_str=<optimized out>, force_nodeid=<optimized out>, bind_address=<optimized out>, initial=false, no_start=false, daemon=true,
connnect_retries=12, connect_delay=5) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/angel.cpp:663
#12 0x0000000000411180 in real_main (argc=0, argv=0xe1e618) at /export/home/pb2/build/sb_0-20777380-1476347309.44/rpm/BUILD/mysql-cluster-com-7.4.13/mysql-cluster-com-7.4.13/storage/ndb/src/kernel/main.cpp:223
#13 0x00007ffff6dfab15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000410a15 in _start ()
Suggested fix:
The `maxlag` should be checked for 0 before division.