MySQL Bugs: #18863: NDB node fails to restart, cluster stuck in state trying to restart it.

Bug #18863	NDB node fails to restart, cluster stuck in state trying to restart it.
Submitted: 6 Apr 2006 18:31
Modified: 6 Jul 2006 11:13
Reporter: Ross McFarland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 5.0.18
OS: Linux (rhel3)
Assigned to: Tomas Ulin
CPU Architecture: Any

Description:
My cluster had been up and running for several days, and I went to
test its failure tolerance by running "nodeid restart -n" on some nodes.
Everything went fine and worked perfectly.

I ran into problems when I started bringing the nodes back up
with "nodeid start": all of them except one came back
up. I got the following error from it:

- in mgm log and to mgm console ------------------------------------------------
2006-04-06 10:14:58 [MgmSrvr] ALERT -- Node 24: Forced node
shutdown completed. Occured during startphase 1. Initiated by signal
0. Caused by error 6050: 'WatchDog terminate, internal error or
massive overload on the machine running this node(Internal error,
programming error or missing error

- in node 20's logfile ---------------------------------------------------------
Current byte-offset of file-pointer is: 568

Time: Thursday 6 April 2006 - 10:14:35
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error,
programming error or missing error message, please report a bug)
Error: 2341
Error data: Suma.hpp
Error object: SUMA (Line: 598) 0x00000002
Program: /path_to_mysql_bin_dir/bin/ndbd
Pid: 29113
Trace: ./ndb_20_trace.log.1
Version: Version 5.0.18
***EOM***
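
For anyone comparing several error logs of this kind, the record above can be turned into a dictionary with a few lines of Python. This is only a minimal sketch: the function name `parse_ndbd_error_record` and the `KNOWN_KEYS` list are hypothetical, and the key list is taken from the single record quoted in this report, so other versions may write additional fields.

```python
# Field names as they appear in the one error-log record quoted in this report.
# Assumption: other ndbd versions may add or rename fields.
KNOWN_KEYS = (
    "Time", "Status", "Message", "Error", "Error data",
    "Error object", "Program", "Pid", "Trace", "Version",
)


def parse_ndbd_error_record(text):
    """Parse one 'Key: value' record from an ndbd error log into a dict.

    Lines that do not start with a known key are treated as a wrapped
    continuation of the previous field. Parsing stops at the ***EOM*** marker.
    """
    record = {}
    last_key = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "***EOM***":
            break
        key, sep, value = line.partition(": ")
        if sep and key in KNOWN_KEYS:
            record[key] = value
            last_key = key
        elif last_key is not None:
            # Wrapped line, e.g. the second half of a long Message field.
            record[last_key] += " " + line
    return record
```

Calling it on the text of ndb_20_error.log would give `record["Error"]`, `record["Error object"]` and so on as strings, which makes it easier to see whether different nodes failed with the same error.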

Ever since I tried to bring this node back up, I've been
getting the following in the mgm log every few seconds from various
nodes:
2006-04-06 10:41:51 [MgmSrvr] WARNING -- Node 22: Failure handling of
node 20 has not completed in 27 min. - state = 3

I can't seem to stop these messages, and my only guess at this point
is that it would require a complete restart of the cluster (at least
the ndb nodes) to get them to stop. They're basically filling up my log files.
They even continue to come out after I've taken down node 20 entirely.
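
To get a feel for how many of these warnings there are and which nodes are reporting them, the mgm log can be summarized with a short script. This is a rough sketch, not an official tool: the function name `summarize_stuck_failure_handling` is made up for illustration, the pattern is modeled only on the one warning line quoted above, and it assumes each warning sits on a single line in the actual log file (the wrapping above comes from the report text).

```python
import re

# Pattern modeled on the single warning quoted in this report.
# Assumption: other MySQL Cluster versions may word the message differently.
WARNING_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] WARNING -- "
    r"Node (?P<reporter>\d+): Failure handling of node (?P<failed>\d+) "
    r"has not completed in (?P<minutes>\d+) min\. - state = (?P<state>\d+)"
)


def summarize_stuck_failure_handling(lines):
    """Group 'Failure handling ... has not completed' warnings by failed node.

    Returns a dict keyed by the failed node id. Each value records which
    nodes reported the warning, which state values were seen, the longest
    reported duration in minutes, and how many warnings matched.
    """
    summary = {}
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match is None:
            continue  # Not one of the warnings we are counting.
        failed = int(match.group("failed"))
        entry = summary.setdefault(
            failed,
            {"reporters": set(), "states": set(), "max_minutes": 0, "count": 0},
        )
        entry["reporters"].add(int(match.group("reporter")))
        entry["states"].add(int(match.group("state")))
        entry["max_minutes"] = max(entry["max_minutes"], int(match.group("minutes")))
        entry["count"] += 1
    return summary
```

It can be fed an open file object, for example `summarize_stuck_failure_handling(open(path))` where `path` points at the cluster log, since the function only needs an iterable of lines.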

When I brought node 20 back up by hand, its status showed as:
ndb_mgm> 20 status
Node 20: starting (Phase 1) (Version 5.0.18)
and it continues to do so. I tried "20 stop", but that tells me that I
can't stop a node while it's starting or stopping.

It's not ideal that the node didn't come back up, but the real problem to me is that the cluster is stuck trying to bring it back up. I've been able to get that node to come back up fine since (no watchdog problems), but the cluster seems to be stuck in a previous attempt to rebuild it, so it never comes back into service.

How to repeat:
It's unclear what steps would be required to repeat this. There seem to be two problems. The first is that the node didn't come back up, which I can't help with how to repeat. The second is that the cluster is stuck in a state trying to bring the down node back up to speed. If you can get an ndb node to take a really long time to come back up to speed, or make it disappear while coming back up to speed, you might be able to repeat this.

Hi,

Please upload all trace/error logs + cluster log and config.ini

/Jonas

config.ini

Attachment: config.ini (application/octet-stream, text), 2.19 KiB.

ndbd error log

Attachment: ndb_20_error.log (application/octet-stream, text), 524 bytes.

mgm log

Attachment: ndb_cluster_mgm.log.gz (application/gzip, text), 10.83 KiB.

ndbd trace

Attachment: ndb_20_trace.log.1.gz (application/gzip, text), 16.58 KiB.

Files attached. I had to gzip two of them because the system wouldn't accept them otherwise.

Another instance of Bug #16772 ?

No, this is not #16772.
I checked the trace files...
/Jonas

Can you try to repeat your tests with a newer version, 5.0.21, and let us know the results?

The test is not reproducible. I have done the same process several times since and not seen the failure. I have had real network/system events cause the same type of problem: failure of one or more nodes to restart. Usually a full restart of the ndb node set cleans things up.

It is a reasonable amount of work for me to update the version I'm using, more than it should be. I will do so soon, both for this reason and for general reasons.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8753

Pushed to 5.0.24.

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://www.mysql.com/doc/en/Installing_source_tree.html

Documented bugfix in 5.0.24 changelog.