Bug #90765 api node crashes under heavy connect disconnect
Submitted: 5 May 2018 21:23
Modified: 7 Jun 2018 14:35
Reporter: Gilad Odinak
Status: No Feedback
Category: MySQL Cluster: NDB API
Severity: S2 (Serious)
Version: 5.7.8
OS: CentOS
Assigned to: MySQL Verification Team
CPU Architecture: Any

[5 May 2018 21:23] Gilad Odinak
Description:
API node crashes (signal 11) in NdbOperationExec.cpp, line 1608 when

NdbOperation::insertKEYINFO_NdbRecord calls

theLastKEYINFO->setLength(NdbApiSignal::MaxSignalWords - keyInfoRemain);

because theLastKEYINFO == NULL

theLastKEYINFO is set by allocKeyInfo() above, at line 1591. That function returns -1 if it fails to do so, in which case insertKEYINFO_NdbRecord returns that value.

So it seems theLastKEYINFO is set to NULL by some other thread between these two points of execution of insertKEYINFO_NdbRecord

How to repeat:
Happens randomly when used in a large production system. I don't have a simple test case/code.

Suggested fix:
My simple workaround is to add

if (theLastKEYINFO == NULL) {
  // Race condition probably with NdbOperation::postExecuteRelease
  return -1;
}

just before the call to theLastKEYINFO->setLength(...).
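
The effect of that guard can be illustrated with a small standalone model. This is only a sketch, not the NDB API source: `Signal`, `KeyInfoModel`, and `finishKeyInfo` are hypothetical names, and `kMaxSignalWords = 25` is an assumed value standing in for NdbApiSignal::MaxSignalWords.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a signal object; not the real NdbApiSignal.
struct Signal {
    std::uint32_t length = 0;
    void setLength(std::uint32_t words) { length = words; }
};

// Assumed value, for illustration only.
constexpr std::uint32_t kMaxSignalWords = 25;

// Hypothetical model of the two pieces of state named in the report.
struct KeyInfoModel {
    Signal* lastKeyInfo = nullptr;    // models theLastKEYINFO
    std::uint32_t keyInfoRemain = 0;  // models keyInfoRemain (in words)

    // Models only the final step: record how many words were used.
    // Returns 0 on success, -1 if no signal object is linked.
    int finishKeyInfo() {
        if (lastKeyInfo == nullptr) {
            return -1;  // the workaround: fail instead of dereferencing NULL
        }
        assert(keyInfoRemain <= kMaxSignalWords);
        lastKeyInfo->setLength(kMaxSignalWords - keyInfoRemain);
        return 0;
    }
};
```

In this model, a caller that reaches the final step with no signal object linked gets -1 back instead of a segmentation fault. Whether -1 alone is enough for callers of the real function (for example, whether an error code should also be set) is not something the sketch can answer.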
[5 May 2018 22:52] Gilad Odinak
Actually, the problem seems to be that we never enter the while block above (which really is an if block), so a better fix is

if (byteSize < keyInfoRemain*4) {
    setErrorCodeAbort(4000);
    return -1;
}

while (byteSize > keyInfoRemain*4)
{
....
[6 May 2018 2:25] Gilad Odinak
The crash happens when, on entry to
NdbOperation::insertKEYINFO_NdbRecord(const char *value, Uint32 byteSize),

byteSize == 0, keyInfoRemain == 0, and theLastKEYINFO == NULL.

This happens once per 140K queries on one node, and once per 1.8M queries on another node.
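
To make the arithmetic for that entry state concrete, here is a minimal sketch that evaluates the two conditions quoted in this report. The function names `entersLoop`, `precheckRejects`, and `checkReportedEntryState` are hypothetical; only the comparisons themselves are taken from the comments above.

```cpp
#include <cassert>
#include <cstdint>

// Loop condition quoted in the report: true when the key bytes do not fit
// in the words remaining, so another signal object would be allocated.
constexpr bool entersLoop(std::uint32_t byteSize, std::uint32_t keyInfoRemain) {
    return byteSize > keyInfoRemain * 4;
}

// Pre-check proposed in the 22:52 comment, evaluated literally.
constexpr bool precheckRejects(std::uint32_t byteSize, std::uint32_t keyInfoRemain) {
    return byteSize < keyInfoRemain * 4;
}

// Evaluates both conditions for the reported entry state
// (byteSize == 0, keyInfoRemain == 0).
inline void checkReportedEntryState() {
    assert(!entersLoop(0, 0));       // 0 > 0 is false: the loop body is skipped
    assert(!precheckRejects(0, 0));  // 0 < 0 is false: the pre-check does not fire
}
```

With both values at zero the loop body is skipped, so allocKeyInfo() is not called from it and theLastKEYINFO keeps whatever value it had on entry. Read literally, the pre-check `byteSize < keyInfoRemain*4` is also false for this state, and it is true for a key that fits with room to spare (for example 4 bytes with 2 words remaining), so under that reading the explicit NULL test looks like the safer guard. This is an inference from the quoted snippets only, not from the full source.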
[7 May 2018 14:35] MySQL Verification Team
Hi,

I'm having trouble reproducing the problem. Do you have a reproducible test case we can use?

thanks
Bogdan
[8 Jun 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".