MySQL Bugs: #26920: WatchDog termination, reported in error "please report a bug"

Bug #26920	WatchDog termination, reported in error "please report a bug"
Submitted:	7 Mar 2007 14:44	Modified:	27 May 2007 13:57
Reporter:	johnny slakva
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.0	OS:	Linux (RHEL3)
Assigned to:		CPU Architecture:	Any
Tags:	5.0.33, error message, shutdown

Description:
We have a cluster (5.0.33) with 2 data nodes. The data nodes had been running fine for more than a month, but today one of them went down with the following message in the error log:

Status: Temporary error, restart node 
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug) 
Error: 6050 
Error data: Polling for Receive 
Error object: WatchDog.cpp 
Program: /usr/sbin/ndbd 
Pid: 11871 
Trace: /var/lib/mysql-cluster/ndb_1_trace.log.1 
Version: Version 5.0.33 

There really was a high load on the server, but I don't think data-critical software should shut itself down under such conditions...

After I restarted that data node, it started OK.

A similar problem seems to have been reported in http://bugs.mysql.com/bug.php?id=21743.

How to repeat:
I'm not sure how this can be repeated, as it happened only once in a month of running. Probably a high load needs to be put on the data node server.

Suggested fix:
Either fix the software somehow, or document the conditions under which a node goes down, so we know what to avoid.

Johnny,

The error message should be improved.

As for the node going down, an ndbd process may terminate under high load. This is controlled by the watchdog timeout configuration setting.
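
For reference, the watchdog timeout mentioned here most likely corresponds to the data node parameter TimeBetweenWatchDogCheck in the cluster's config.ini. That mapping is an assumption, as the report does not name the parameter, and the value below is purely illustrative rather than a recommendation. A minimal sketch of setting it for all data nodes might look like this:

```ini
# config.ini on the management server (illustrative fragment only)
[ndbd default]
# Interval, in milliseconds, between watchdog checks of the main thread.
# 8000 is a made-up example value; check the documentation for your
# version before choosing one.
TimeBetweenWatchDogCheck=8000
```

A change like this normally takes effect only after the management server and the data nodes have been restarted. A longer interval only makes the node more tolerant of stalls; it does not remove the overload itself.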

Can you confirm that this is a high load situation?

Do you feel that this behavior needs better documentation?

BR,

Tomas

Thank you for the response.

I think there most probably was high load...

I think I've read about the watchdog timeout in the documentation, so if this was a watchdog termination, it would be good to change the error message to point to this specific condition.

Or, if this was something else, it would be good to make the message slightly clearer, so the user can tell what condition on the server triggered it and/or which configuration parameters are relevant.

This hasn't happened to us again since then, though.

johnny