Bug #90606 Segmentation fault in ndbd.cpp
Submitted: 24 Apr 2018 14:03 Modified: 5 Aug 2018 20:19
Reporter: Daniël van Eeden (OCA) Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.6.4 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[24 Apr 2018 14:03] Daniël van Eeden
Description:
I was moving a big InnoDB table to NDB: "ALTER TABLE <table> TABLESPACE <ts> STORAGE DISK ENGINE NDBCLUSTER" This was running for a few days.

Then one node failed because of a hardware issue.

Then this happened on remaining nodes:
2018-04-24 09:15:33 [ndbd] INFO     -- Node 4 has completed node fail handling
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 18, tabLcpStatus: 3
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 21, tabLcpStatus: 3
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 19, tabLcpStatus: 3
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 20, tabLcpStatus: 3
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 17, tabLcpStatus: 3
<Free extend stuff>
2018-04-24 09:15:34 [ndbd] INFO     -- DROP_TAB_REQ: tab: 16, tabLcpStatus: 3
<More free extend stuff>
2018-04-24 09:15:34 [ndbd] INFO     -- Received signal 11. Running error handler.
2018-04-24 09:15:34 [ndbd] INFO     -- Signal 11 received; Segmentation fault
2018-04-24 09:15:34 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-28202170-1523555460.9/rpm/BUILD/mysql-cluster-gpl-7.6.4/mysql-cluster-gpl-7.6.4/storage/ndb/src/kernel/ndbd.cpp
2018-04-24 09:15:34 [ndbd] INFO     -- Error handler signal shutting down system
2018-04-24 09:15:34 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-04-24 09:15:52 [ndbd] ALERT    -- Node 6: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Tuesday 24 April 2018 - 09:15:34
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: /export/home2/pb2/build/sb_1-28202170-1523555460.9/rpm/BUILD/mysql-cluster-gpl-7.6.4/mysql-cluster-gpl-7.6.4/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 43967 thr: 7
Version: mysql-5.7.20 ndb-7.6.4
Trace file name: ndb_6_trace.log.14_t7
Trace file path: /var/lib/mysql-cluster/ndb_6_trace.log.14 [t1..t22]
***EOM***

Just starting the node again was working fine.

How to repeat:
Unknown
[26 Apr 2018 7:30] Mikael Ronström
The most likely cause of this segfault is that the variable m_dirty_pages contains
an illegal value.

To be able to track this easier if it happens again I have added more information about
code running in this case at crash time. I have also added a bunch of new asserts in
production code to ensure that we hit any anomalies in the code at the root rather than
at its usage.

Will also see if we can reproduce this bug by combining node restarts with drops of disk
data tables.
[30 Apr 2018 14:39] MySQL Verification Team
Changing to feedback for the comments done by Mikael.
[1 Jun 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[4 Jul 2018 10:46] Daniël van Eeden
The cluster failed because we lost network connectivity for a bit.
Then starting the cluster failed.
Not 100% sure this is the same bug, but it looks related.

Time: Wednesday 4 July 2018 - 12:34:04
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 10895 thr: 4
Version: mysql-5.7.22 ndb-7.6.6
Trace file name: ndb_28_trace.log.12_t4
Trace file path: /var/lib/mysql-cluster/ndb_28_trace.log.12 [t1..t17]
***EOM***

2018-07-04 12:34:04 [ndbd] INFO     -- LDM instance 3: Restored T288F301 LCP 103423 rows, 184485 row operations, 1291 millis, 142900 row operations/sec)
2018-07-04 12:34:04 [ndbd] INFO     -- Received signal 11. Running error handler.
2018-07-04 12:34:04 [ndbd] INFO     -- Signal 11 received; Segmentation fault
2018-07-04 12:34:04 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp
2018-07-04 12:34:04 [ndbd] INFO     -- Error handler signal shutting down system
2018-07-04 12:34:04 [ndbd] INFO     -- Error handler shutdown completed - aborting
2018-07-04 12:34:19 [ndbd] INFO     -- Child process terminated by signal 6
2018-07-04 12:34:19 [ndbd] ALERT    -- Node 28: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

(gdb) bt
#0  0x00007f9f22c03277 in raise () from /lib64/libc.so.6
#1  0x00007f9f22c04968 in abort () from /lib64/libc.so.6
#2  0x00000000004056b3 in childAbort (error_code=error_code@entry=6000, currentStartPhase=4, exit_code=-1)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp:615
#3  0x000000000041338d in NdbShutdown (error_code=error_code@entry=6000, type=NST_ErrorHandlerSignal, restartType=restartType@entry=NRT_Default)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp:1273
#4  0x00000000006d678c in ErrorReporter::handleError (messageID=messageID@entry=6000, problemData=problemData@entry=0x7f896914a350 "Signal 11 received; Segmentation fault", 
    objRef=objRef@entry=0x7a0ce8 "/export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp", nst=nst@entry=NST_ErrorHandlerSignal)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/error/ErrorReporter.cpp:280
#5  0x0000000000412ed5 in handler_error (signum=11) at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/ndbd.cpp:663
#6  <signal handler called>
#7  0x000000000043ea28 in Restore::prepare_parts_for_execution (this=this@entry=0x18b8f30, signal=signal@entry=0x7f896914b4c0, file_ptr=...)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/restore.cpp:2061
#8  0x000000000043ec7c in Restore::open_data_file (this=0x18b8f30, signal=0x7f896914b4c0, file_ptr=...)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/restore.cpp:2148
#9  0x000000000043f205 in Restore::execFSREMOVECONF (this=<optimized out>, signal=<optimized out>)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/restore.cpp:697
#10 0x0000000000774b60 in executeFunction (f=<optimized out>, signal=0x7f896914b4c0, gsn=<optimized out>, this=0x18b8f30)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1585
#11 executeFunction_async (signal=0x7f896914b4c0, gsn=<optimized out>, this=0x18b8f30)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1552
#12 execute_signals (max_signals=262080, sig=0x7f896914b4c0, r=0x7f9f21eb1220, h=0x7f9f21ea8100, q=0x7f9f21ea8000, selfptr=0x7f9f21ea7f80)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/vm/mt.cpp:5559
#13 run_job_buffers (selfptr=selfptr@entry=0x7f9f21ea7dc0, sig=sig@entry=0x7f896914b4c0, send_sum=@0x7f896914ab80: 0, flush_sum=@0x7f896914ab90: 0, pending_send=@0x7f896914ab70: true)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/vm/mt.cpp:5606
#14 0x0000000000776f6e in mt_job_thread_main (thr_arg=<optimized out>)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/vm/mt.cpp:6576
#15 0x000000000070a2ef in ndb_thread_wrapper (_ss=0xe86600)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/common/portlib/NdbThread.c:258
#16 0x00007f9f23bcde25 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f9f22ccbbad in clone () from /lib64/libc.so.6
[4 Jul 2018 14:04] Daniël van Eeden
(gdb) frame
#7  0x000000000043ea28 in Restore::prepare_parts_for_execution (this=this@entry=0x1878940, signal=signal@entry=0x7f2d94ad02c0, file_ptr=...)
    at /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/restore.cpp:2061
2061    in /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/restore.cpp
(gdb) info locals
j = 0
partPair = {
  startPart = 17486, 
  numParts = 16962
}
part_id = 17486
i = 2003
lcpCtlFilePtr = 0x1894130
(gdb) print *lcpCtlFilePtr
$26 = {
  fileHeader = {
    Magic = "NDBBCKUP", 
    BackupVersion = 460292, 
    SectionType = 1, 
    SectionLength = 9, 
    FileType = 6, 
    BackupId = 0, 
    BackupKey_0 = 0, 
    BackupKey_1 = 0, 
    ByteOrder = 305419896, 
    NdbVersion = 460294, 
    MySQLVersion = 329494
  }, 
  Checksum = 1313227893, 
  ValidFlag = 1, 
  TableId = 298, 
  FragmentId = 133, 
  CreateTableVersion = 1, 
  CreateGci = 574830, 
  MaxGciCompleted = 5578265, 
  MaxGciWritten = 5578266, 
  LcpId = 347481, 
  LocalLcpId = 0, 
  MaxPageCount = 1055, 
  MaxNumberDataFiles = 2064, 
  LastDataFileNumber = 921, 
  MaxPartPairs = 2048, 
  NumPartPairs = 2048, 
  RowCountLow = 1434713, 
  RowCountHigh = 0, 
  FutureUse = {0 <repeats 16 times>}, 
  partPairs = {{
      startPart = 1847, 
      numParts = 1
    }}
}
(gdb) print file_ptr.p->m_part_state[part_id]
Cannot access memory at address 0x7f2d583d8dc8
(gdb) print partPair.startPart
$27 = 17486
(gdb) print part_id
$28 = 17486
(gdb) print lcpCtlFilePtr->partPairs[i]
$29 = {
  startPart = 17486, 
  numParts = 16962
}
(gdb) print lcpCtlFilePtr->partPairs[i-1]
$30 = {
  startPart = 1801, 
  numParts = 1
}
(gdb) print file_ptr.p->m_part_state[1801]
$31 = Restore::File::PART_IGNORED
(gdb) print file_ptr.p->m_part_state[1802]
$32 = Restore::File::PART_ALL_CHANGES
[5 Jul 2018 20:19] MySQL Verification Team
Hi Daniel,

Mikael is lookin at this bug in the restore.cpp but this is not the same as the original issue reported. The original issue, if it happens again we'll hopefully have more data. Mikael will update you when he has more data about the restore.cpp one

all best
Bogdan
[6 Jul 2018 11:40] Mikael Ronström
A fix for the restore issue reported in this bug is now on its way into the next 7.6 release.
[6 Aug 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".