Bug #11623 ndb process became unresponsive but didn't really die
Submitted: 28 Jun 2005 21:45
Modified: 1 Sep 2005 6:54
Reporter: Patrick Chun
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 4.1.12
OS: Linux (Linux 2.4.20-31.9)
Assigned to: Jonas Oreland
CPU Architecture: Any

[28 Jun 2005 21:45] Patrick Chun
Description:
When we tried to access the cluster at around 10am today, our application reported the following error message:

               Can't lock file (errno: 4009)

We also found the following entry in our error log file 'ndb_2_error.log':

      Date/Time: Monday 27 June 2005 - 05:03:18
      Type of error: error
      Message: System error
      Fault ID: 2303
      Problem data: Node 2 killed this node because GCP stop was detected
      Object of reference: NDBCNTR (Line: 193) 0x0000000a
      ProgramName: /usr/sbin/ndbd
      ProcessID: 3529
      TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
      Version 4.1.12

The ndb process apparently became unresponsive at 5:03am, but at around 10am it was still listed by commands such as 'ps aux'.

This machine was running seemingly normally the day before; we were doing normal SQL SELECT/UPDATE/INSERT without any apparent problem.  Curiously, this machine was not in use at or around 5am on Monday -- the time of the crash; it was just sitting in our lab.

When our team got back to work in the morning, we tried to execute the command '/usr/bin/ndb_mgm -e shutdown'. It did not kill the ndbd process(es); 'ps aux' still showed them afterwards.

How to repeat:
The ndbd process seems stable most of the time.  The above symptom only happens once in a while.

Suggested fix:
This would not be the show-stopper it is now if ndbd were changed so that it dies completely instead of lingering in a totally unresponsive state. That way, once it has died, it would no longer show up in 'ps aux', and server scripts could be written to restart it.
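A minimal sketch of the kind of restart script described above, assuming ndbd has actually exited. The function names, the 'ps aux' column layout (command in the 11th column), and the restart command '/usr/sbin/ndbd' (taken from the ProgramName in the error log) are illustrative assumptions, not part of MySQL Cluster:

```python
import os
import subprocess


def is_process_listed(ps_output: str, program: str) -> bool:
    """Return True if `program` is the executable of any row in `ps aux` output."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Assumed layout: 10 columns (USER ... TIME) followed by the command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        executable = fields[10].split()[0]
        if os.path.basename(executable) == program:
            return True
    return False


def check_and_restart(program="ndbd",
                      restart_cmd=("/usr/sbin/ndbd",),  # assumed restart command
                      run=subprocess.run):
    """Start `program` again if it is no longer listed; return True if restarted."""
    ps = run(["ps", "aux"], capture_output=True, text=True, check=True)
    if is_process_listed(ps.stdout, program):
        return False  # still listed, so nothing is done (even if it is hung)
    run(list(restart_cmd), check=True)
    return True
```

Calling check_and_restart() periodically (for example from cron) would cover the case where ndbd has exited. It does not help with the symptom in this report, where the process is unresponsive but still listed.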
[29 Jun 2005 8:58] Jonas Oreland
Have you tried "StopOnError: N"?
This way ndbd will automatically restart itself...
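A minimal sketch of where this setting could go, assuming the cluster configuration file (config.ini) read by the management server and a [NDBD DEFAULT] section that applies to all data nodes. The section placement is an assumption and is not taken from this report:

```ini
[NDBD DEFAULT]
# N: ndbd restarts itself after an error instead of stopping
StopOnError: N
```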
[29 Jun 2005 19:37] Patrick Chun
Thank you for the tip.  We have not yet tried setting the 'StopOnError' parameter to 'N' so that ndbd restarts after an error.  We will give this a try.  However, given that this error is not easily reproducible, it will be difficult to report back on its effectiveness.

In any case, it would still be better to get to the bottom of why the ndb process becomes unresponsive.  We have the trace file, if needed.

Yours,
Patrick Chun
[2 Aug 2005 9:49] Martin Skold
Could this be a duplicate of Bug #9961?
[3 Aug 2005 5:35] Patrick Chun
Yes, it looks the same as the "GCP stop" error (Bug #9961) reported for Version 5.x clustering.  The line number is also the same (NDBCNTR (Line: 193)).  If Jonas Oreland could kindly fix this in a future 4.1.x release, that would be perfect!
[3 Aug 2005 17:29] Patrick Chun
The trace log file regarding this ndbd server crash

Attachment: ndb_2_trace.log.7.txt (text/plain), 165.15 KiB.

[1 Sep 2005 6:54] Jonas Oreland
I think the conclusion was that this is a duplicate of 9961.
So I'm closing it...
[7 Sep 2005 12:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/29422