Bug #11623 ndb process became unresponsive but didn't really die
Submitted: 28 Jun 2005 23:45 Modified: 1 Sep 2005 8:54
Reporter: Patrick Chun
Status: Closed
Category:Server: Cluster Severity:S1 (Critical)
Version:4.1.12 OS:Linux (Linux 2.4.20-31.9)
Assigned to: Jonas Oreland Target Version:

[28 Jun 2005 23:45] Patrick Chun
Description:
When we tried to access the cluster at around 10am today, our application reported the
following error message:

               Can't lock file (errno: 4009)

We also found the following entry in our error log file 'ndb_2_error.log':

      Date/Time: Monday 27 June 2005 - 05:03:18
      Type of error: error
      Message: System error
      Fault ID: 2303
      Problem data: Node 2 killed this node because GCP stop was detected
      Object of reference: NDBCNTR (Line: 193) 0x0000000a
      ProgramName: /usr/sbin/ndbd
      ProcessID: 3529
      TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
      Version 4.1.12
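
For anyone who wants to spot this condition from a script, here is a minimal sketch that
parses an entry like the one above and flags a "GCP stop". The function names are my own,
and it assumes the entry has the same "Field: value" layout as the excerpt; it is not an
official tool and other ndbd versions may format the log differently.

```python
# Sample entry copied from the error log excerpt in this report.
SAMPLE = """\
Date/Time: Monday 27 June 2005 - 05:03:18
Type of error: error
Message: System error
Fault ID: 2303
Problem data: Node 2 killed this node because GCP stop was detected
Object of reference: NDBCNTR (Line: 193) 0x0000000a
ProgramName: /usr/sbin/ndbd
ProcessID: 3529
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
Version 4.1.12
"""


def parse_error_entry(text):
    """Turn one error-log entry into a dict of field name -> value (hypothetical helper)."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ": " only, so values such as "(Line: 193)" stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            entry[key] = value
        elif line.startswith("Version "):
            # The version line has no colon in the excerpt.
            entry["Version"] = line[len("Version "):]
    return entry


def is_gcp_stop(entry):
    """True if the entry's problem data mentions a GCP stop."""
    return "GCP stop" in entry.get("Problem data", "")


if __name__ == "__main__":
    entry = parse_error_entry(SAMPLE)
    print(entry["Fault ID"], is_gcp_stop(entry))
```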

The ndb process apparently became unresponsive at 5:03am, but at around 10am it could
still be seen with something like 'ps aux'.

The machine had been running normally, as far as we could tell, the day before; we were
doing ordinary SQL SELECT/UPDATE/INSERT without any apparent problem.  Curiously, the
machine was not in use at or around 5am on Monday, the time of the crash; it was just
sitting idle in our lab.

When our team got back to work in the morning and tried to run
'/usr/bin/ndb_mgm -e shutdown', the command did not kill the ndbd process(es): 'ps aux'
could still see them.

How to repeat:
The ndbd process seems stable most of the time.  The above symptom only happens once in a
while.

Suggested fix:
This would not be the show-stopper it is now if ndbd were changed so that it dies
completely instead of hanging around in a totally unresponsive state. That way, once it
has died, it would no longer show up in 'ps aux', and server scripts could be written to
restart it.
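
To illustrate the kind of restart script meant here, a rough sketch follows. It only helps
once ndbd really exits; a process that hangs the way this one did would still pass the
check. The helper names are my own, the check is POSIX-only, and the restart command is a
placeholder (the path is taken from the ProgramName field in the error log; any options
your setup needs are not shown).

```python
import os
import subprocess


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def restart_if_dead(pid, restart_cmd, runner=subprocess.run):
    """Run restart_cmd if the process is gone; return True if a restart was attempted."""
    if pid_is_alive(pid):
        return False
    runner(restart_cmd, check=True)
    return True


if __name__ == "__main__":
    # PID from the error log excerpt; the command is a placeholder for your own setup.
    restart_if_dead(3529, ["/usr/sbin/ndbd"])
```

Something like this could be run from cron, with the PID read from wherever your
installation records it.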
[29 Jun 2005 10:58] Jonas Oreland
Have you tried setting "StopOnError: N"?
That way ndbd will automatically restart itself...
[29 Jun 2005 21:37] Patrick Chun
Thank you for the tip.  We have not yet tried setting the 'StopOnError' parameter to 'N'
so that ndbd restarts after an error.  We will give this a try.  However, given that
this error is not easily reproducible, it will be difficult to report back on its
effectiveness.

Anyhow, it would still be better to get to the bottom of why the ndb process becomes
unresponsive.  We have the trace file, if needed.

Yours,
Patrick Chun
[2 Aug 2005 11:49] Martin Skold
Could this be a duplicate of Bug #9961?
[3 Aug 2005 7:35] Patrick Chun
Yes, it looks the same as the "GCP stop" error (Bug #9961) for Version 5.x clustering.
The line number is also the same (NDBCNTR (Line: 193)).  If Jonas Oreland could kindly fix
it in a future 4.1.x release, it would be perfect!
[3 Aug 2005 19:29] Patrick Chun
The trace log file regarding this ndbd server crash

Attachment: ndb_2_trace.log.7.txt (text/plain), 165.15 KiB.

[1 Sep 2005 8:54] Jonas Oreland
I think the conclusion was that this is a duplicate of 9961.
So I'm closing it...
[7 Sep 2005 14:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/29422