MySQL Bugs: #11623: ndb process became unresponsive but didn't really die

Bug #11623	ndb process became unresponsive but didn't really die
Submitted:	28 Jun 2005 21:45	Modified:	1 Sep 2005 6:54
Reporter:	Patrick Chun
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	4.1.12	OS:	Linux (Linux 2.4.20-31.9)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
When we tried to access the cluster at around 10am today, our application returned this error message:

               Can't lock file (errno: 4009)

We also found the following in our error log file 'ndb_2_error.log':

      Date/Time: Monday 27 June 2005 - 05:03:18
      Type of error: error
      Message: System error
      Fault ID: 2303
      Problem data: Node 2 killed this node because GCP stop was detected
      Object of reference: NDBCNTR (Line: 193) 0x0000000a
      ProgramName: /usr/sbin/ndbd
      ProcessID: 3529
      TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
      Version 4.1.12

The ndb process apparently became unresponsive at 5:03am, but at around 10am it could still be seen with something like 'ps aux'.

This machine was running seemingly normally the day before; we were doing normal SQL SELECT/UPDATE/INSERT without any apparent problem.  Curiously, this machine was not in use at or around 5am on Monday -- the time of the crash; it was just sitting in our lab.

When our team got back to work in the morning, we tried to execute '/usr/bin/ndb_mgm -e shutdown', but this did not kill the ndbd process(es): 'ps aux' could still see them.

How to repeat:
The ndbd process seems stable most of the time.  The above symptom only happens once in a while.

Suggested fix:
This would not be the show-stopper it is now if ndbd could be changed so that it dies completely instead of hanging around in a totally unresponsive state. That way, once it has died, it would no longer show up in 'ps aux', and server scripts could be written to restart it.
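
Until such a change exists, a monitoring script could look at the node's error log instead of relying on 'ps aux' alone. The following is a minimal Python sketch, not part of ndbd or any MySQL tool. It assumes the error log keeps the "Key: value" layout quoted in the description; the function names and the SAMPLE_LOG constant are hypothetical.

```python
import re

# Assumption: each entry in ndb_<id>_error.log uses "Key: value" lines and
# begins with a "Date/Time:" line, as in the excerpt quoted in this report.
FIELD = re.compile(r"^\s*([A-Za-z/ ]+?):\s*(.*)$")

# Excerpt copied from the description above, used only to exercise the parser.
SAMPLE_LOG = """
      Date/Time: Monday 27 June 2005 - 05:03:18
      Type of error: error
      Message: System error
      Fault ID: 2303
      Problem data: Node 2 killed this node because GCP stop was detected
      Object of reference: NDBCNTR (Line: 193) 0x0000000a
"""


def parse_error_log(text):
    """Split error-log text into a list of dicts, one per logged entry."""
    entries = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # skip blank lines and lines without a "Key: value" shape
        key, value = match.group(1).strip(), match.group(2).strip()
        if key == "Date/Time" or not entries:
            entries.append({})  # a "Date/Time" line starts a new entry
        entries[-1][key] = value
    return entries


def has_gcp_stop(entries):
    """Return True if any entry reports the GCP stop condition."""
    return any("GCP stop" in e.get("Problem data", "") for e in entries)


if __name__ == "__main__":
    for entry in parse_error_log(SAMPLE_LOG):
        print(entry.get("Date/Time"), "|", entry.get("Problem data"))
```

A script built on this could alert, or kill and restart the leftover process, when has_gcp_stop() returns True while ndbd is still listed by 'ps aux'. Whether a forced restart is safe in that state is not established by this report.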

Have you tried "StopOnError: N"?
This way ndbd will automatically restart itself...
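
To make the suggestion concrete: StopOnError is a data node parameter set in the cluster's config.ini. The Python sketch below checks whether a configuration text disables it. The sample configuration, its placement under [NDBD DEFAULT], the list of spellings treated as "disabled", and the helper name are assumptions for illustration; none of them come from the reporter's setup, and how ndbd itself interprets the value is not verified here.

```python
import configparser

# Hypothetical config.ini fragment; only "StopOnError: N" comes from the comment above.
SAMPLE_CONFIG = """
[NDBD DEFAULT]
StopOnError: N
"""

# Assumption: these spellings all mean "disabled" for a boolean parameter.
FALSE_VALUES = {"n", "no", "0", "false"}


def restarts_on_error(config_text):
    """Return True if StopOnError is disabled in any [NDBD ...] section."""
    # strict=False tolerates repeated [NDBD] sections, which cluster configs often have.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(config_text)
    for section in parser.sections():
        if not section.lower().startswith("ndbd"):
            continue  # ignore management and SQL node sections
        value = parser.get(section, "StopOnError", fallback=None)
        if value is not None and value.strip().lower() in FALSE_VALUES:
            return True
    return False


if __name__ == "__main__":
    print(restarts_on_error(SAMPLE_CONFIG))
```

A check like this only confirms what the file says. It does not show that the setting would have helped with a node that hangs rather than exits.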

Thank you for the tip.  We have not tried modifying the 'StopOnError' parameter to 'N' for ndb so that it will restart after an error.  We will give this a try.  However, given that this error is not easily reproducible, it will be difficult to report back as to its effectiveness.

Anyhow, it would still be better to get to the bottom of why the ndb process becomes unresponsive.  We have the trace file, if needed.

Yours,
Patrick Chun

Could be duplicate of bug#9961?

Yes, it looks the same as the "GCP stop" error (Bug #9961) for Version 5.x clustering.  The line number is also the same (NDBCNTR (Line: 193)).  If Jonas Oreland could kindly fix this in a future 4.1.x release, it would be perfect!

The trace log file regarding this ndbd server crash

Attachment: ndb_2_trace.log.7.txt (text/plain), 165.15 KiB.

I think the conclusion was that this is a duplicate of 9961.
So I'm closing it...

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/29422