| Bug #11623 | ndb process became unresponsive but didn't really die | ||
|---|---|---|---|
| Submitted: | 28 Jun 2005 21:45 | Modified: | 1 Sep 2005 6:54 | 
| Reporter: | Patrick Chun | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) | 
| Version: | 4.1.12 | OS: | Linux (Linux 2.4.20-31.9) | 
| Assigned to: | Jonas Oreland | CPU Architecture: | Any | 
   [29 Jun 2005 8:58]
   Jonas Oreland        
  Have you tried "StopOnError: N"? This way ndbd will automatically restart itself...
   [29 Jun 2005 19:37]
   Patrick Chun        
  Thank you for the tip. We have not yet tried setting the 'StopOnError' parameter to 'N' so that ndbd restarts after an error. We will give this a try. However, since this error is not easily reproducible, it will be difficult to report back on its effectiveness. Even so, it would still be better to get to the bottom of why the ndb process becomes unresponsive. We have the trace file, if needed. Yours, Patrick Chun
   [2 Aug 2005 9:49]
   Martin Skold        
  Could this be a duplicate of Bug #9961?
   [3 Aug 2005 5:35]
   Patrick Chun        
  Yes, it looks the same as the "GCP stop" error (Bug #9961) for Version 5.x clustering. The line number is also the same (NDBCNTR (Line: 193)). If Jonas Oreland could kindly fix this in a future 4.1.x release, it would be perfect!
   [3 Aug 2005 17:29]
   Patrick Chun        
  The trace log file regarding this ndbd server crash
Attachment: ndb_2_trace.log.7.txt (text/plain), 165.15 KiB.
   [1 Sep 2005 6:54]
   Jonas Oreland        
  I think the conclusion was that this is a duplicate of 9961. So I'm closing it...
   [7 Sep 2005 12:06]
   Bugs System        
  A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/internals/29422


Description:
When we tried to access the cluster at around 10am today, our application reported the error:

Can't lock file (errno: 4009)

We also found the following in our error log file 'ndb_2_error.log':

Date/Time: Monday 27 June 2005 - 05:03:18
Type of error: error
Message: System error
Fault ID: 2303
Problem data: Node 2 killed this node because GCP stop was detected
Object of reference: NDBCNTR (Line: 193) 0x0000000a
ProgramName: /usr/sbin/ndbd
ProcessID: 3529
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
Version 4.1.12

The ndb process apparently became unresponsive at 5:03am, but at around 10am it could still be seen with something like 'ps aux'. The machine had been running seemingly normally the day before; we were doing normal SQL SELECT/UPDATE/INSERT without any apparent problem. Curiously, the machine was not in use at or around 5am on Monday, the time of the crash; it was just sitting in our lab. When our team got back to work in the morning and tried to execute the command '/usr/bin/ndb_mgm -e shutdown', we found that it did not kill the ndbd process(es): 'ps aux' still showed them.

How to repeat:
The ndbd process seems stable most of the time. The above symptom only happens once in a while.

Suggested fix:
This would not be the show-stopper it is now if ndbd could be changed so that it dies completely instead of lingering in a totally unresponsive state. Once it has died, it would no longer show up in 'ps aux', and server scripts could be written to restart it.
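
As an illustration of the kind of server script mentioned in the suggested fix, here is a minimal Python sketch of a check for an ndbd process that has logged a fatal error but is still listed by the OS. It is not part of the original report. It assumes each field of the error-log entry sits on its own line, and the function names (`parse_error_entry`, `pid_is_alive`, `needs_forced_restart`) are hypothetical. What to do once a hung process is found (for example killing and restarting ndbd) is left to the operator.

```python
import os


def parse_error_entry(text):
    """Parse one error-log entry into a {label: value} dict.

    Assumes one "Label: value" pair per line; lines without a colon
    (such as the trailing "Version ..." line) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def needs_forced_restart(entry_text, alive=pid_is_alive):
    """True if the entry names a ProcessID that is still present.

    Caveat: the OS may have reused the PID for an unrelated process, so a
    real script should also confirm that the PID belongs to ndbd.
    """
    pid = parse_error_entry(entry_text).get("ProcessID")
    if pid is None or not pid.isdigit():
        return False
    return alive(int(pid))
```

A cron job could read the most recent entry from the error log, pass it to `needs_forced_restart`, and alert or restart the node when it returns True.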