Bug #26920 WatchDog termination, reported in error "please report a bug"
Submitted: 7 Mar 2007 14:44
Modified: 27 May 2007 13:57
Reporter: johnny slakva
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql-5.0
OS: Linux (RHEL3)
Assigned to:
CPU Architecture: Any
Tags: 5.0.33, error message, shutdown
Triage: Triaged: D4 (Minor)

[7 Mar 2007 14:44] johnny slakva
Description:
We have a cluster (5.0.33) with 2 data nodes. The data nodes had been running fine for more than a month, but today one of them went down with the following message in the error log:

Status: Temporary error, restart node 
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug) 
Error: 6050 
Error data: Polling for Receive 
Error object: WatchDog.cpp 
Program: /usr/sbin/ndbd 
Pid: 11871 
Trace: /var/lib/mysql-cluster/ndb_1_trace.log.1 
Version: Version 5.0.33 
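
For anyone triaging entries like this, the "Key: value" layout is easy to pick apart in a script. The following is a minimal Python sketch, assuming each field sits on its own line as in the entry above; the function name `parse_error_entry` is made up for illustration and is not part of any MySQL tool.

```python
def parse_error_entry(text):
    """Parse 'Key: value' lines from an ndbd error-log entry into a dict."""
    entry = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            entry[key.strip()] = value.strip()
    return entry


# The entry quoted in this report, copied as-is.
SAMPLE = """\
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Polling for Receive
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 11871
Trace: /var/lib/mysql-cluster/ndb_1_trace.log.1
Version: Version 5.0.33
"""

entry = parse_error_entry(SAMPLE)
print(entry["Error"], "-", entry["Error data"])
```

The "Error" and "Error data" fields are the ones that distinguish a watchdog termination from other shutdowns, so those are the ones printed here.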

The server really was under high load, but I don't think data-critical software should shut itself down under such conditions.

After I restarted that data node, it came back up fine.

A similar problem seems to have been reported in http://bugs.mysql.com/bug.php?id=21743.

How to repeat:
I'm not sure how to reproduce this, as it happened only once in a month of running. It would probably require putting a high load on the data node server.

Suggested fix:
Either fix the software, or document the conditions under which a node goes down, so we know what to avoid.
[14 May 2007 22:01] Tomas Ulin
Johnny,

The error message should be improved.

As for the node going down, an ndbd process may terminate under high load. This is controlled by the watchdog timeout config setting.

Can you confirm that this is a high load situation?

Do you feel that this behavior needs better documentation?

BR,

Tomas
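
The comment above does not name the setting. Assuming it refers to the data node parameter TimeBetweenWatchDogCheck (an interval in milliseconds, set in the cluster's config.ini), a quick way to see what a given config file specifies might look like the Python sketch below. The config fragment, the 6000 value, and the helper name `watchdog_interval_ms` are illustrative assumptions, not taken from the reporter's setup.

```python
import configparser

# Hypothetical config.ini fragment; the values are illustrative only.
CONFIG_INI = """\
[ndbd default]
NoOfReplicas = 2
TimeBetweenWatchDogCheck = 6000
"""


def watchdog_interval_ms(ini_text, section="ndbd default"):
    """Return TimeBetweenWatchDogCheck from the given section, or None if unset."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(ini_text)
    return parser.getint(section, "TimeBetweenWatchDogCheck", fallback=None)


print(watchdog_interval_ms(CONFIG_INI))
```

A result of None means the parameter is not set in that section, in which case the server's built-in default applies. Note that a real config.ini with several [ndbd] sections would need extra handling, since configparser rejects duplicate sections by default.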
[27 May 2007 13:57] johnny slakva
Thank you for the response.

I think there most probably was high load.

I think I've read about the watchdog timeout in the documentation, so if this was a watchdog termination, it would be good to change the error message to point to this specific condition.

If it was something else, it would be good to make the message slightly clearer, so the user knows what condition on the server triggered it and/or which configuration parameters are relevant.

This hasn't happened again for us since then, though.

johnny
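
To illustrate the kind of condition discussed in this thread, here is a generic watchdog pattern as a small Python sketch. It is not ndbd's actual code: the idea that a monitored thread bumps a progress counter, and the threshold of 3 missed checks, are assumptions chosen for illustration.

```python
def run_watchdog(progress_samples, max_missed_checks=3):
    """Return the index of the check at which the watchdog fires, or None.

    progress_samples holds the monitored thread's progress counter as seen
    at each periodic watchdog check. An unchanged value means the thread
    made no progress since the previous check.
    """
    missed = 0
    last = None
    for check, counter in enumerate(progress_samples):
        if counter == last:
            missed += 1
            if missed >= max_missed_checks:
                return check  # thread looks stuck: terminate the process
        else:
            missed = 0  # progress was made, so reset the count
        last = counter
    return None


healthy = [1, 2, 3, 4, 5, 6]      # counter advances at every check
overloaded = [1, 2, 2, 2, 2, 3]   # counter stalls for several checks

print(run_watchdog(healthy))
print(run_watchdog(overloaded))
```

In a scheme like this, a machine that is overloaded enough to starve the monitored thread is indistinguishable from a genuinely hung thread, which would explain why the error message mentions both an internal error and massive overload.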