Bug #47309 ndbd is not restarted after it has been stopped using the kill command
Submitted: 14 Sep 2009 12:46
Modified: 15 Oct 2009 15:25
Reporter: Andrew Morgan
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0
OS: Linux (Fedora 11)
Assigned to:
CPU Architecture: Any
Tags: angel, MySQL Cluster 7.0.6, ndbd

[14 Sep 2009 12:46] Andrew Morgan
Description:
According to the documentation, if an ndbd process dies then it should be automatically recreated by the angel process.

I tested this by sending signals with the kill command (kill -6, kill -9, and kill with no options). In all cases the process was not restarted, and the angel process was stopped as well.

It should be possible to verify that data nodes are restarted automatically by using the kill command.

Log files can be downloaded from http://thievesgarden.co.uk/clusterdb/Fedora11/data_node_not_restarting.zip

How to repeat:
Find the process ID for an ndbd using:

$ ps -e | grep ndb

There should be two processes for each data node. Choose the second process ID for one of the data nodes:

$ kill pid

Repeat with kill -6 and kill -9.

Verify that both processes for the data node no longer exist.
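
The steps above could be scripted roughly as follows. This is a minimal sketch, not part of the original report: the helper names list_pids and signal_and_check and the WAIT_SECS variable are hypothetical, and it assumes that pgrep is available and that the data node process is named ndbd. It only reports whether the signalled PID has disappeared; a restarted data node would show up under a new PID in the output of list_pids.

```shell
#!/bin/sh
# Sketch only: helper names and WAIT_SECS are assumptions for illustration.

list_pids() {
    # Print the PIDs of all processes whose name is exactly $1 (e.g. ndbd).
    pgrep -x "$1"
}

signal_and_check() {
    # $1 = PID to signal, $2 = signal name (TERM, ABRT or KILL,
    # i.e. the equivalents of kill, kill -6 and kill -9 on Linux).
    kill -s "$2" "$1" || return 2

    # Give the process a moment to exit before checking on it.
    sleep "${WAIT_SECS:-2}"

    # kill -0 sends no signal; it only tests whether the PID still exists.
    if kill -0 "$1" 2>/dev/null; then
        echo "pid $1 still running after SIG$2"
        return 1
    fi
    echo "pid $1 gone after SIG$2"
    return 0
}

# Example usage (PID 12345 is a placeholder):
#   list_pids ndbd
#   signal_and_check 12345 ABRT
#   list_pids ndbd    # a new PID here would indicate an angel restart
```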
[14 Sep 2009 12:54] Andrew Morgan
Log files (zipped)

Attachment: data_node_not_restarting.zip (application/x-zip-compressed, text), 48.24 KiB.

[14 Sep 2009 13:52] Hartmut Holzgraefe
Can't reproduce. A plain "kill" or "kill -15" (SIGTERM) shuts down the data node, including the angel process. Signals -6 (SIGABRT), -9 (SIGKILL) and -11 (SIGSEGV) shut down the node and initiate an angel restart:

  ndb_mgm> Node 3: Forced node shutdown completed, restarting. Initiated by 
  signal 6. Caused by error 6000: 'Error OS signal received(Internal error, 
  programming error or missing error message, please report a bug). Temporary 
  error, restart node'.

You don't seem to have submitted your config.ini file, but my educated guess would be that you have not set

  StopOnError=false

and so the default behavior kicks in, which is to stop a node on errors instead of restarting it?

( http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-par... )
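
A quick way to check whether a given config.ini sets this parameter might look like the sketch below. The function name stop_on_error_setting is hypothetical, and the sketch assumes the parameter is written in the name=value form shown above (typically in the [ndbd default] section). If nothing matches, the default described above applies.

```shell
#!/bin/sh
# Sketch only: the function name is an assumption for illustration.

stop_on_error_setting() {
    # $1 = path to the cluster config.ini
    # Print every StopOnError assignment found (case-insensitive),
    # or a note if the file does not set the parameter at all.
    grep -i '^[[:space:]]*StopOnError[[:space:]]*=' "$1" \
        || echo "StopOnError not set; the default applies"
}

# Example usage (the path is a placeholder):
#   stop_on_error_setting /path/to/config.ini
```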
[14 Sep 2009 15:40] Andrew Morgan
Hartmut,

 you were correct about my config - I've retested and it behaves as you'd described.

Thanks, Andrew.
[14 Oct 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".