Description:
A data node crashes in Suma during what appears to be some form of scan:
2015-07-13 15:06:11 [ndbd] INFO -- g:\ade\build\sb_0-15888083-1436783526.18\mysql-cluster-gpl-7.5.0\storage\ndb\src\kernel\blocks\suma\suma.cpp
2015-07-13 15:06:11 [ndbd] INFO -- SUMA (Line: 3156) 0x00000002
2015-07-13 15:06:11 [ndbd] INFO -- Error handler shutting down system
2015-07-13 15:06:11 [ndbd] INFO -- Error handler shutdown completed - exiting
Suma receives a SCAN_FRAGREF from Dblqh whose last word is the error code 4110 (H'0000100e), which is TuxBoundInfo::InvalidAttrInfo.
--------------- Signal ----------------
r.bn: 257 "SUMA", r.proc: 2, r.sigId: 17622 gsn: 352 "SCAN_FRAGREF" prio: 1
s.bn: 247/1 "DBLQH", s.proc: 2, s.sigId: 24326 length: 4 trace: 0 #sec: 0 fragInf: 0
H'00000000 H'00000000 H'10100200 H'0000100e
This signal is sent from Dblqh after it has made a "direct call" to c_tux->execTUX_BOUND_INFO(signal), which runs the following code to set the error:
  if (unlikely(offset != boundLen)) {
    jam();
    scan.m_errorCode = TuxBoundInfo::InvalidAttrInfo;
    req->errorCode = scan.m_errorCode;
    return;
  }
This can be seen in the signal trace (the relevant line is marked with <<<<<<<):
---> signal
DblqhMain.cpp 11274 22253 22253 11456 11464 12791
DbtuxScan.cpp 00033
DbtuxGen.cpp 00400 00400 00402
DblqhMain.cpp 11617
DbtuxScan.cpp 00171 00204 00216 00231 00244 00216 00231 00244 00216
00265 <<<<<<<
DblqhMain.cpp 11759 11761 11774
DbtupStoredProcDef.cpp 00038 00061
DblqhMain.cpp 11801 13379
DbtuxScan.cpp 00340 00356 00443 00490
DbtuxSearch.cpp 00230 00248 00253 00256 00230 00248 00253 00302 00313
00302 00313 00302 00313 00348
DbtuxScan.cpp 00799 01086 00806 00633
DblqhMain.cpp 10376 12088
DbtuxScan.cpp 00340 00384 00387
DbtuxNode.cpp 00598 00602
DblqhMain.cpp 10431
DbtupStoredProcDef.cpp 00038 00084
DblqhMain.cpp 12606 13019 13052 07875 09026
DbtuxScan.cpp 00437
DblqhMain.cpp 03706
DbtupBuffer.cpp 00035
DblqhMain.cpp 03706
DbtupBuffer.cpp 00035
--------------- Signal ----------------
r.bn: 261/1 "PGMAN", r.proc: 2, r.sigId: 303993 gsn: 761 "STOP_FOR_CRASH" prio: 0
s.bn: 0 "SYS", s.proc: 0, s.sigId: 0 length: 1 trace: 0 #sec: 0 fragInf: 0
H'00000000
--------------- Signal ----------------
r.bn: 247/1 "DBLQH", r.proc: 2, r.sigId: 303992 gsn: 353 "SCAN_FRAGREQ" prio: 1
s.bn: 257 "SUMA", s.proc: 2, s.sigId: 6044 length: 12 trace: 0 #sec: 2 fragInf: 0
senderData: 0x0
resultRef: 0x1010002
savePointId: 0
flags: hdr attrLen: 0 reorg: 0 corr: 0 stat: 0 ni: 0
tableId: 6
fragmentNo: 1
keyLen: 0
schemaVersion: 0x1
transId1: 0x0
transId2: 0x10100200
clientOpPtr: 0x0
batch_size_rows: 16
batch_size_bytes: 0
How to repeat:
Only reproducible on Windows.
Various test cases fail with a data node crash in the same place.
Only seen on 7.5.
The logs contain many "Ndb kernel thread 4 is stuck in: Job Handling elapsed=101" and "Watchdog: Warning overslept 251 ms, expected 100 ms." messages, which might be causing signal reordering?
Suggested fix:
.