Description:
The management node(s) are not really part of the cluster, and in particular not part of its heartbeat cycle. Once the cluster is running, the management node's role is limited to receiving log events and writing them to the appropriate log destination. It also acts as an arbitrator, but only when asked to by other nodes, never on its own initiative.
In the case of a complete simultaneous loss of all non-management nodes (or at least all data nodes?) nothing is logged, as there is no node remaining that could inform the management server about missed heartbeats etc. The management server will also not detect any TCP failures on its socket connections to the other nodes, as it only reads from these sockets, listening for incoming event information. After a power loss or hard network failure these sockets stay in the ESTABLISHED state forever on the management server side: it never tries to write to them, and without outgoing traffic the TCP stack can't distinguish between "connection just silent" and "connection dead" (unless SO_KEEPALIVE is used, but even then it would by default only kick in after 2 hours of idle time).
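To illustrate the SO_KEEPALIVE remark: the 2 hour delay is only a default, and some platforms allow shortening it per socket. The following is a minimal Python sketch, not taken from the management server code. The function name enable_fast_keepalive and the timer values are made up for illustration, and the TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT options are assumed to be available only on some platforms (e.g. Linux), so they are applied only where the socket module exposes them.

```python
import socket

def enable_fast_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Enable TCP keepalive and shorten its timers where the platform allows.

    Hypothetical helper: name and default values are illustrative only.
    Returns a dict of the options that were actually applied.
    """
    # Portable part: switch keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {"SO_KEEPALIVE": 1}

    # Platform-specific part: per-socket timer overrides.
    # These constants are not defined everywhere, so look them up defensively.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_fast_keepalive(s))
    s.close()
```

With values like these, a reader blocked on a dead connection would get an error after roughly idle_s + interval_s * probes seconds instead of hours, on platforms that honor the per-socket timers.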
So from looking at the management node's cluster log alone one cannot tell whether the cluster is (or was) operational at any given time. Monitoring a cluster thus requires active polling of the current cluster status in addition to cluster log monitoring, which makes monitoring integration unnecessarily hard (most monitoring solutions have default syslog/logfile parsing mechanisms, whereas anything requiring invocation of extra scripts becomes more complicated not only in implementation but also in documentation/review/approval etc. ...).
How to repeat:
Simplest test setup: run the management node on one machine and all other nodes on a second one, then cut power to the second machine. Nothing is logged in the management node's error log until you restart the second machine and its node processes.
Suggested fix:
Implement some sort of heartbeat or keepalive mechanism that lets management nodes detect node losses by themselves, e.g. send copies of the heartbeats to the management nodes and let them keep track of these, or let the management nodes send regular ping requests over all established node connections, which would easily detect "half-dead" sockets (see also bug #24793).
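The first variant (management node keeps track of heartbeats it is copied on) could look roughly like the Python sketch below. This is only an illustration of the bookkeeping, not a proposal for the actual implementation: the class name HeartbeatTracker, its methods, and the 5 second timeout are all hypothetical.

```python
import time

class HeartbeatTracker:
    """Tracks the last heartbeat per node and reports nodes that went silent.

    Hypothetical sketch: names and the timeout value are illustrative only.
    """

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}       # node id -> time of last heartbeat
        self.reported = set()     # nodes already reported as lost

    def record(self, node_id):
        """Call whenever a heartbeat (or any message) arrives from node_id."""
        self.last_seen[node_id] = self.clock()
        self.reported.discard(node_id)  # node is back, allow future reports

    def newly_lost(self):
        """Return nodes whose last heartbeat is older than the timeout.

        Each lost node is returned once, so the caller can write exactly
        one log entry per loss.
        """
        now = self.clock()
        lost = [
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s and node not in self.reported
        ]
        self.reported.update(lost)
        return sorted(lost)

if __name__ == "__main__":
    tracker = HeartbeatTracker(timeout_s=0.1)
    tracker.record(2)
    time.sleep(0.2)
    for node in tracker.newly_lost():
        print("node %d: heartbeat timeout, assuming node is lost" % node)
```

The management node would call newly_lost() periodically and write one cluster log entry per returned node, which is exactly the kind of line a logfile-based monitoring setup could pick up.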