Description:
Data node crashes in Suma during what seems to be some form of scan
2015-07-13 15:06:11 [ndbd] INFO -- g:\ade\build\sb_0-15888083-1436783526.18\mysql-cluster-gpl-7.5.0\storage\ndb\src\kernel\blocks\suma\suma.cpp
2015-07-13 15:06:11 [ndbd] INFO -- SUMA (Line: 3156) 0x00000002
2015-07-13 15:06:11 [ndbd] INFO -- Error handler shutting down system
2015-07-13 15:06:11 [ndbd] INFO -- Error handler shutdown completed - exiting
Suma receives a SCAN_FRAGREF from Dblqh whose last word is the error code, 4110 (H'0000100e), which is TuxBoundInfo::InvalidAttrInfo.
--------------- Signal ----------------
r.bn: 257 "SUMA", r.proc: 2, r.sigId: 17622 gsn: 352 "SCAN_FRAGREF" prio: 1
s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 24326 length: 4 trace: 0 #sec: 0 fragInf: 0
H'00000000 H'00000000 H'10100200 H'0000100e
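For readers unfamiliar with the dump format, here is a minimal sketch of how the four words above can be read. The field order (senderData, transId1, transId2, errorCode) is an assumption inferred from this trace, where the third word matches transId2 of the SCAN_FRAGREQ and the last word is the error code. The struct and function names are hypothetical and are not the NDB kernel's own definitions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Error code named in this report: TuxBoundInfo::InvalidAttrInfo == 4110.
constexpr std::uint32_t kReportedErrorCode = 4110;

// Hypothetical view of the 4-word SCAN_FRAGREF payload.
// The field order is assumed from the trace, not taken from NDB headers.
struct ScanFragRefWords {
    std::uint32_t senderData;
    std::uint32_t transId1;
    std::uint32_t transId2;
    std::uint32_t errorCode;
};

// Map the raw signal words, in dump order, onto the assumed fields.
ScanFragRefWords decodeScanFragRef(const std::array<std::uint32_t, 4>& words) {
    return ScanFragRefWords{words[0], words[1], words[2], words[3]};
}

// True when the decoded signal carries the error code from this report.
bool isInvalidAttrInfo(const ScanFragRefWords& ref) {
    return ref.errorCode == kReportedErrorCode;
}
```

Feeding it the words from the dump (0x00000000, 0x00000000, 0x10100200, 0x0000100e) gives an errorCode of 0x100e, which is 4110 in decimal.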
This signal is sent from Dblqh after it has performed a "direct call" to c_tux->execTUX_BOUND_INFO(signal), which then runs the following code to set the error:
if (unlikely(offset != boundLen)) {
  jam();
  scan.m_errorCode = TuxBoundInfo::InvalidAttrInfo;
  req->errorCode = scan.m_errorCode;
  return;
}
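To illustrate what kind of condition the check quoted above guards against, here is a simplified sketch of a length-consistency check over a word buffer. It is not the DbtuxScan.cpp code. The entry layout (one header word giving the payload length in words, followed by the payload) and the function name are assumptions made only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Error code value given in this report for TuxBoundInfo::InvalidAttrInfo.
constexpr std::uint32_t kInvalidAttrInfo = 4110;

// Walk hypothetical length-prefixed entries and verify that they consume
// exactly the number of words received. Returns 0 on success, or
// kInvalidAttrInfo when the consumed offset does not match the length.
std::uint32_t checkBoundWords(const std::vector<std::uint32_t>& words) {
    const std::size_t boundLen = words.size();
    std::size_t offset = 0;
    while (offset < boundLen) {
        // Assumed header word: payload length in words.
        const std::size_t payloadLen = words[offset];
        offset += 1 + payloadLen;  // skip the header and its payload
    }
    if (offset != boundLen) {
        // An entry claimed more words than the buffer actually holds.
        return kInvalidAttrInfo;
    }
    return 0;
}
```

In this model, the buffer {2, 10, 20, 1, 30} passes, while {2, 10, 20, 3, 30} fails because its last entry claims three payload words and only one is present. Data that arrives truncated or in the wrong order would fail the same way.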
This code path can be seen in the signal trace:
---> signal
DblqhMain.cpp 11274 22253 22253 11456 11464 12791
DbtuxScan.cpp 00033
DbtuxGen.cpp 00400 00400 00402
DblqhMain.cpp 11617
DbtuxScan.cpp 00171 00204 00216 00231 00244 00216 00231 00244 00216
00265 <<<<<<<
DblqhMain.cpp 11759 11761 11774
DbtupStoredProcDef.cpp 00038 00061
DblqhMain.cpp 11801 13379
DbtuxScan.cpp 00340 00356 00443 00490
DbtuxSearch.cpp 00230 00248 00253 00256 00230 00248 00253 00302 00313
00302 00313 00302 00313 00348
DbtuxScan.cpp 00799 01086 00806 00633
DblqhMain.cpp 10376 12088
DbtuxScan.cpp 00340 00384 00387
DbtuxNode.cpp 00598 00602
DblqhMain.cpp 10431
DbtupStoredProcDef.cpp 00038 00084
DblqhMain.cpp 12606 13019 13052 07875 09026
DbtuxScan.cpp 00437
DblqhMain.cpp 03706
DbtupBuffer.cpp 00035
DblqhMain.cpp 03706
DbtupBuffer.cpp 00035
--------------- Signal ----------------
r.bn: 261/1 "PGMAN", r.proc: 2, r.sigId: 303993 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
H'00000000
--------------- Signal ----------------
r.bn: 247/1 "DBLQH", r.proc: 2, r.sigId: 303992 gsn: 353 "SCAN_FRAGREQ" prio: 1
s.bn: 257 "SUMA", s.proc: 2, s.sigId: 6044 length: 12 trace: 0 #sec: 2 fragInf: 0
senderData: 0x0
resultRef: 0x1010002
savePointId: 0
flags: hdr attrLen: 0 reorg: 0 corr: 0 stat: 0 ni: 0
tableId: 6
fragmentNo: 1
keyLen: 0
schemaVersion: 0x1
transId1: 0x0
transId2: 0x10100200
clientOpPtr: 0x0
batch_size_rows: 16
batch_size_bytes: 0
How to repeat:
Only reproducible on Windows.
Various test cases fail with a data node crash in the same place.
Only seen for 7.5.
There also seem to be a lot of "Ndb kernel thread 4 is stuck in: Job Handling elapsed=101" and "Watchdog: Warning overslept 251 ms, expected 100 ms." messages; perhaps these stalls are causing signal reordering?
Suggested fix:
.