Bug #18863 NDB node fails to restart, cluster stuck in state trying to restart it.
Submitted: 6 Apr 2006 18:31 Modified: 6 Jul 2006 11:13
Reporter: Ross McFarland
Status: Closed
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.0.18 OS:Linux (rhel3)
Assigned to: Tomas Ulin

[6 Apr 2006 18:31] Ross McFarland
Description:
My cluster had been up and running for several days when I decided to
test its failure tolerance by running "nodeid restart -n" on some nodes.
Everything went fine and worked perfectly.

I ran into problems when I started bringing the nodes back up
with "nodeid start". All of them except one came back
up. I got the following error from it:

- in mgm log and to mgm console ------------------------------------------------
2006-04-06 10:14:58 [MgmSrvr] ALERT    -- Node 24: Forced node
shutdown completed. Occured during startphase 1. Initiated by signal
0. Caused by error 6050: 'WatchDog terminate, internal error or
massive overload on the machine running this node(Internal error,
programming error or missing error

- in node 20's logfile ---------------------------------------------------------
Current byte-offset of file-pointer is: 568

Time: Thursday 6 April 2006 - 10:14:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error,
programming error or missing error message, please report a bug)
Error: 2341
Error data: Suma.hpp
Error object: SUMA (Line: 598) 0x00000002
Program: /path_to_mysql_bin_dir/bin/ndbd
Pid: 29113
Trace: ./ndb_20_trace.log.1
Version: Version 5.0.18
***EOM***
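
For anyone comparing several entries like the one above, here is a minimal Python sketch that splits one error log entry into its "Key: value" fields. It assumes the layout shown in this report (wrapped continuation lines, terminated by ***EOM***). The function name and the KNOWN_KEYS tuple are hypothetical and are not part of any MySQL tool.

```python
# Field names seen in the error log entry quoted above. Real files may
# contain others, so treat this tuple as an assumption.
KNOWN_KEYS = ("Time", "Status", "Message", "Error", "Error data",
              "Error object", "Program", "Pid", "Trace", "Version")


def parse_ndbd_error_entry(text):
    """Return a dict of the fields in one ndbd error log entry."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "***EOM***":
            # End-of-message marker: stop at the first entry.
            break
        key, sep, value = line.partition(": ")
        if sep and key in KNOWN_KEYS:
            fields[key] = value
            last_key = key
        elif last_key is not None:
            # Wrapped line: append it to the previous field.
            fields[last_key] += " " + line
    return fields
```

Lines that appear before the first known field, such as the byte-offset line, are ignored.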

Since the point at which I tried to bring this node back up, I've been
getting the following in the mgm log every few seconds from various
nodes:
2006-04-06 10:41:51 [MgmSrvr] WARNING  -- Node 22: Failure handling of
node 20 has not completed in 27 min. - state = 3
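
To gauge how widespread the repeated warning is, a small Python sketch like the following could summarize it from the mgm log. It assumes each warning sits on a single line in the actual log file (it is wrapped in this report) and that the layout matches the one sample above. The function name and the regular expression are hypothetical.

```python
import re

# Pattern for the repeated warning quoted above. The layout is assumed
# from the single sample line in this report.
STUCK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] WARNING\s+-- "
    r"Node (?P<reporter>\d+): Failure handling of node (?P<failed>\d+) "
    r"has not completed in (?P<minutes>\d+) min\. - state = (?P<state>\d+)"
)


def summarize_stuck_failures(lines):
    """Map each failed node id to its longest wait and reporting nodes."""
    summary = {}
    for line in lines:
        m = STUCK_RE.match(line.strip())
        if not m:
            continue
        failed = int(m.group("failed"))
        entry = summary.setdefault(
            failed, {"max_minutes": 0, "reporters": set()}
        )
        entry["max_minutes"] = max(entry["max_minutes"], int(m.group("minutes")))
        entry["reporters"].add(int(m.group("reporter")))
    return summary
```

Passing an open log file works as well as a list of strings, since the function only iterates over lines.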

I can't seem to stop these messages, and my only guess at this point
is that it would require a complete restart of the cluster (at least
the ndb nodes) to get them to stop. They're basically filling up my log files.
They even continue to appear after I've taken down node 20 entirely.

When I brought node 20 back up by hand, its status showed as:
ndb_mgm> 20 status
Node 20: starting (Phase 1) (Version 5.0.18)
and it continues to do so. I tried "20 stop", but that tells me that I
can't stop a node while it's starting or stopping.

It's not ideal that the node didn't come back up, but the real problem to me is that the cluster is stuck trying to bring it back up. I've been able to get that node to start fine since (no watchdog problems), but the cluster seems to be stuck in a previous attempt to rebuild it, so it never comes back into service.

How to repeat:
It's unclear what steps would be required to repeat this. There seem to be two problems. The first is that the node didn't come back up, which I can't help with reproducing. The second is that the cluster is stuck in a state trying to bring the down node back up to speed. If you can get an ndb node to take a really long time to come back up to speed, or make it disappear while coming back up to speed, you might be able to repeat this.
[6 Apr 2006 19:16] Jonas Oreland
Hi,

Please upload all trace/error logs + cluster log and config.ini

/Jonas
[6 Apr 2006 20:56] Ross McFarland
config.ini

Attachment: config.ini (application/octet-stream, text), 2.19 KiB.

[6 Apr 2006 20:59] Ross McFarland
ndbd error log

Attachment: ndb_20_error.log (application/octet-stream, text), 524 bytes.

[6 Apr 2006 21:01] Ross McFarland
mgm log

Attachment: ndb_cluster_mgm.log.gz (application/gzip, text), 10.83 KiB.

[6 Apr 2006 21:01] Ross McFarland
ndbd trace

Attachment: ndb_20_trace.log.1.gz (application/gzip, text), 16.58 KiB.

[6 Apr 2006 21:03] Ross McFarland
Files attached. I had to gzip two of them because the system wouldn't accept them otherwise.
[7 Apr 2006 12:15] Hartmut Holzgraefe
Another instance of Bug #16772 ?
[7 Apr 2006 12:19] Jonas Oreland
No, this is not #16772.
I checked tracefiles...
/Jonas
[12 May 2006 9:08] Valerii Kravchuk
Can you try to repeat your tests with a newer version, 5.0.21, and let us know the results?
[15 May 2006 20:31] Ross McFarland
The test is not reproducible. I have done the same process several times since and not seen the failure. I have had real network/system events cause the same type of problem: failure of one or more nodes to restart. Usually a full ndb node set restart cleans things up.

It is a reasonable amount of work for me to update the version I'm using, more than it should be. I will do so soon, both for this reason and for general reasons.
[5 Jul 2006 12:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8753
[6 Jul 2006 9:08] Tomas Ulin
pushed to 5.0.24
[6 Jul 2006 11:13] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://www.mysql.com/doc/en/Installing_source_tree.html

Documented bugfix in 5.0.24 changelog.