Bug #14004 Configure number of missed heartbeats
Submitted: 13 Oct 2005 13:07
Modified: 17 Oct 2005 14:25
Reporter: Scott Tully
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S4 (Feature request)
Version: 4.1.14
OS:
Assigned to: Stewart Smith
CPU Architecture: Any
Triage: D5 (Feature request)

[13 Oct 2005 13:07] Scott Tully
Description:
It would be nice to be able to configure the number of missed heartbeats, not just the interval between them.

Instead of doing 

HeartbeatIntervalDbDb=5000

and having the node declared dead in 15 seconds after only 3 missed beats, I would like to do

MissedHeartbeatDbDb=10

With the default HeartbeatIntervalDbDb of 1.5 seconds, this would still give me 15 seconds, but with more confidence that the node really was unresponsive, since it would be declared dead only after 10 missed beats.
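
The arithmetic behind the two configurations can be checked with a short sketch. The function name `detection_time_ms` is made up for illustration, and `missed_beats` stands in for the proposed MissedHeartbeatDbDb setting, which does not exist in 4.1. The 3-beat threshold is the figure quoted in this report.

```python
def detection_time_ms(interval_ms: int, missed_beats: int) -> int:
    """Time until a silent node is declared dead, assuming it is declared
    dead after `missed_beats` consecutive missed heartbeats."""
    if interval_ms <= 0 or missed_beats <= 0:
        raise ValueError("interval and missed-beat count must be positive")
    return interval_ms * missed_beats


# Today: stretch the interval, keep the 3-beat threshold from the report.
current = detection_time_ms(interval_ms=5000, missed_beats=3)

# Proposed: keep the default 1.5 s interval, raise the threshold to 10.
proposed = detection_time_ms(interval_ms=1500, missed_beats=10)

print(current, proposed)  # 15000 15000
```

Both settings wait the same 15 seconds, but the second takes ten samples in that window instead of three, so a few heartbeats lost to congestion are less likely to add up to a false "dead" verdict.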

How to repeat:
Not a bug... this field should not be required for a feature request.
[14 Oct 2005 7:18] Stewart Smith
This is fixed in 5.0 by also using any received signal from a node as a heartbeat. i.e. as long as traffic is getting through, we're okay.

The problem with 4.1 is if there is lots of network congestion.

Are you using 4.1? Does the cluster have its own private network? Are you seeing problems during network congestion?

You shouldn't need this in 5.0 however.
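
As a rough illustration of the idea described above (treating any received signal as a heartbeat), here is a toy failure detector. It is only a sketch, not NDB's actual code: the class `HeartbeatMonitor`, its methods, and the `any_signal_counts` switch are all hypothetical names invented for this example.

```python
class HeartbeatMonitor:
    """Toy failure detector for a single peer node (not NDB's real code)."""

    def __init__(self, max_missed: int = 3, any_signal_counts: bool = True):
        self.max_missed = max_missed                # threshold quoted in this report
        self.any_signal_counts = any_signal_counts  # True models the 5.0 behaviour
        self.missed = 0
        self.heard_from_peer = False

    def on_signal(self, is_heartbeat: bool) -> None:
        """Record a signal received from the peer during the current interval."""
        if is_heartbeat or self.any_signal_counts:
            self.heard_from_peer = True

    def on_interval_elapsed(self) -> bool:
        """Call once per heartbeat interval; True means the peer is declared dead."""
        if self.heard_from_peer:
            self.missed = 0
        else:
            self.missed += 1
        self.heard_from_peer = False
        return self.missed >= self.max_missed


def survives_congestion(any_signal_counts: bool) -> bool:
    """Three intervals where heartbeats are lost but ordinary traffic arrives."""
    mon = HeartbeatMonitor(any_signal_counts=any_signal_counts)
    for _ in range(3):
        mon.on_signal(is_heartbeat=False)  # data traffic got through, heartbeat did not
        if mon.on_interval_elapsed():
            return False
    return True


print(survives_congestion(any_signal_counts=True))   # True
print(survives_congestion(any_signal_counts=False))  # False
```

With `any_signal_counts=False` the peer is declared dead after three lost heartbeats even though other traffic is still arriving, which is the 4.1 symptom discussed in this report.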
[14 Oct 2005 13:01] Scott Tully
Sorry, yes, I am running 4.1.14. Unfortunately I do not have my own data center, so I am limited to the network configuration that is available to me. I am not able to put the cluster on a private LAN or even a separate subnet... Direct connections are also not currently an option for me. I have 4 hosts with 8 data nodes in the DMZ (plus 3 APIs on separate hosts). 99% of the time everything is fine on the network, but like you said, during high traffic I notice a lot of missed heartbeats, sometimes resulting in a node being declared dead, only to resurrect itself a few seconds later.

5.0 sounds like it has a good solution in place to overcome this...
[14 Oct 2005 14:43] Scott Tully
See, now this kinda thing makes me mental. I just saw this logged... (the node IDs follow a naming convention I use to distinguish the host and group)

2005-10-14 10:32:30 [MgmSrvr] WARNING  -- Node 11: Node 33 missed heartbeat 2
2005-10-14 10:32:30 [MgmSrvr] WARNING  -- Node 11: Node 33 missed heartbeat 3

Both 11 and 33 are data nodes. Why would node 11 report missed heartbeats 2 and 3 for node 33 at the same time, with no interval between them? Maybe this belongs in a separate bug report?
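
For anyone wanting to check their own cluster log for the same pattern, a small script along these lines could flag it. This is a sketch that assumes the log lines look exactly like the two quoted above. The regular expression and the function name `same_second_misses` are made up for this example and are not part of any MySQL tool.

```python
import re
from collections import defaultdict

# Matches lines in the format quoted in this report (an assumption about the log layout).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] WARNING\s+-- "
    r"Node (?P<reporter>\d+): Node (?P<peer>\d+) missed heartbeat (?P<count>\d+)$"
)


def same_second_misses(lines):
    """Group missed-heartbeat warnings by (timestamp, reporting node, peer node)
    and keep only the groups logged more than once within the same second."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            key = (m["ts"], int(m["reporter"]), int(m["peer"]))
            groups[key].append(int(m["count"]))
    return {key: counts for key, counts in groups.items() if len(counts) > 1}


log = [
    "2005-10-14 10:32:30 [MgmSrvr] WARNING  -- Node 11: Node 33 missed heartbeat 2",
    "2005-10-14 10:32:30 [MgmSrvr] WARNING  -- Node 11: Node 33 missed heartbeat 3",
]
print(same_second_misses(log))
# {('2005-10-14 10:32:30', 11, 33): [2, 3]}
```

Note that the timestamps shown have one-second resolution, so the script can only say that both warnings were logged within the same second, not how far apart they actually were.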
[17 Oct 2005 14:25] Scott Tully
Will test with 5.0.x when it reaches GA.