| Bug #72387 | Node x failed to seize, Cluster is getting really slow and throws errors | | |
|---|---|---|---|
| Submitted: | 18 Apr 2014 17:22 | Modified: | 10 May 2016 21:17 |
| Reporter: | Stefan Auweiler | Email Updates: | |
| Status: | No Feedback | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
| Version: | 7.1.23 | OS: | Solaris (Solaris 10 x86) |
| Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
| Tags: | Failed to Seize, node communication, Transaction Coordinator | | |
[20 Apr 2014 13:37]
Stefan Auweiler
System is finally broken... While running the error reporter, two nodes of the same node group failed and the cluster shut down. As it is the production system, I first had to take care of the cluster. I tried to restart it, but it always got stuck in start phase 4 (the last time in phase 3) for a long period of time... Finally I decided to do an initial restart, applied the schema and filled in the required data... As it is a session DB, all remaining sessions will pick up the necessary information again during their next updates; some tables are non-logging anyway. Unfortunately, I destroyed the log files, as I could not waste the time copying them around... I am not sure whether the error will still happen on a clean cluster, as it has only happened on this system at this customer... I hope you have an idea of what could have happened and how to prevent it in the future, as we are not happy to have to deal with an unstable system. As the source code seems to have this issue even in 7.2.6 (and above?), just upgrading seems to be a waste of time for this issue. Thanks, Stefan
[23 Apr 2014 1:36]
Stefan Auweiler
It is starting again. I was now able to collect the ndb_error_reporter files and have uploaded them to the FTP path. Do you need any additional information? Thank you very much. Regards, Stefan
[23 Apr 2014 3:24]
Stefan Auweiler
Our database is growing very fast. In the last 14 hours we've collected about 30 million rows (session data), which we cannot process fast enough because of the error. I just stopped half of the cluster to see whether it is a problem with the synchronous replication. It speeds things up at the cost of reliability... any idea? Best regards, Stefan
[23 Apr 2014 3:33]
Stefan Auweiler
CPU graphs per processor and configured thread
Attachment: CPU node 3.docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document, text), 195.67 KiB.
[23 Apr 2014 3:35]
Stefan Auweiler
From these graphs you can see which process is being used. I've filtered by the CPUs assigned during configuration.
[23 Apr 2014 3:46]
Stefan Auweiler
Additional information: during the periods when the database slows down, we face tons of MySQL errors "Got temporary error 291 'Out of scanfrag records in TC (increase MaxNoOfLocalScans)' from NDBCLUSTER", which has been stopped by disabling 5 of the 10 nodes... ...and during the last 50 minutes we processed 6 million rows of the collected data, so I'd think that most of the queries are fine. We are heavily using partitioning by key (Subscriber Number), and one of the session data tables is created without logging.
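As a side note on error 291: the message itself points at the MaxNoOfLocalScans data node parameter. A minimal sketch of how it could be raised in the cluster configuration, assuming a standard config.ini on the management node; the value below is purely illustrative, not a tuned recommendation:

# config.ini on the management node -- illustrative value only
[ndbd default]
# Per-data-node pool of local scan records; error 291 asks for this to be increased.
MaxNoOfLocalScans=8192
# The change requires a rolling restart of the data nodes to take effect.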
[10 Apr 2016 21:17]
MySQL Verification Team
Hi, apologies it took this long to get a response, but you can always contact MySQL Support and get 24/7 help. This is something I used to see back in the days on 7.1 and Solaris, but it was hardware and not software related. If my memory serves me correctly, in one instance it was a dying hard drive and in another it was faulty RAM on the RAID controller. Since there is no ndb_error_report available, I can't say for sure it's the same issue, so if you are still experiencing the same problem please send me the ndb_error_reporter output, and please also check with your system admin that all hardware tests pass with flying colors. The storage subsystem needs to be thoroughly checked. Kind regards, Bogdan Kecman
[11 May 2016 1:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".

Description:
We face a lot of errors in our cluster setup. Some systems are reporting errors similar to this message: "op: 247163 node: 12 failed to seize". Sometimes it appears only for a few minutes, and sometimes we can only overcome the situation by restarting nodes. During this situation the cluster is really slow and we cannot keep up processing the incoming data in time, which is a disaster. It only happens in our production environment. In the lab and test centers we haven't seen it even with considerably higher load, where we fire about 140% of the production peak load continuously.

We found the related lines in the code (DbtcMain.cpp), still present up to release 7.2.16:

void Dbtc::execTRIG_ATTRINFO(Signal* signal)
{
  jamEntry();
  TrigAttrInfo * const trigAttrInfo = (TrigAttrInfo *)signal->getDataPtr();
  Uint32 attrInfoLength = signal->getLength() - TrigAttrInfo::StaticLength;
  const Uint32 *src = trigAttrInfo->getData();
  FiredTriggerPtr firedTrigPtr;

  TcFiredTriggerData key;
  key.fireingOperation = trigAttrInfo->getConnectionPtr();
  key.nodeId = refToNode(signal->getSendersBlockRef());
  if(!c_firedTriggerHash.find(firedTrigPtr, key)){
    jam();
    /* TODO : Node failure handling (use sig-train assembly) */
    if(!c_firedTriggerHash.seize(firedTrigPtr)){
      jam();
      /**
       * Will be handled when FIRE_TRIG_ORD arrives
       */
      ndbout_c("op: %d node: %d failed to seize", key.fireingOperation, key.nodeId);
      return;
    }
    ndbrequire(firedTrigPtr.p->keyValues.getSize() == 0 &&
               firedTrigPtr.p->beforeValues.getSize() == 0 &&
               firedTrigPtr.p->afterValues.getSize() == 0);
    firedTrigPtr.p->nodeId = refToNode(signal->getSendersBlockRef());
    firedTrigPtr.p->fireingOperation = key.fireingOperation;
    firedTrigPtr.p->triggerId = trigAttrInfo->getTriggerId();
    c_firedTriggerHash.add(firedTrigPtr);
  }

Maybe this helps to speed up the analysis... I'll append the error reporter files soon.

How to repeat:
It happens on a regular basis.
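For context, the seize that fails in this path is taken from the transaction coordinator's pool of fired-trigger records. A minimal sketch of one possible mitigation, under the assumption that this pool is sized by the MaxNoOfFiredTriggers data node parameter (an inference from the parameter's documented purpose, not something confirmed in this report); the value is illustrative only:

# config.ini on the management node -- illustrative value, not a tuned recommendation
[ndbd default]
# Records for triggers fired internally (typically by unique hash indexes);
# exhausting this pool is one plausible way a seize in DBTC could fail.
MaxNoOfFiredTriggers=16000
# Apply with a rolling restart of the data nodes.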