Bug #43537 MEM agents not resolving hostname-to-IP on each connection
Submitted: 10 Mar 18:01
Modified: 4 Sep 17:50
Reporter: Shawn Green
Status: Verified
Category: Monitoring: Agent
Severity: S3 (Non-critical)
Version: 2.0
OS: Any (N/A)
Assigned to: Kay Roepke
Target Version:
Tags: windmill

[10 Mar 18:01] Shawn Green
Description:
If the IP address of the MEM server changes (for example, due to a VLAN switch), the
reporting agents keep using the old address and cannot reconnect to the server.

How to repeat:
1) Configure MEM server and at least one reporting agent. 

2) Remap the hostname of the MEM server to a new address. 

3) Observe the errors on the agents. 

4) Restart the agent to refresh the IP address of the MEM server. 

5) Observe return to normal operations.

Suggested fix:
Modify the Agent code to avoid any cached hostname resolutions.
[10 Mar 18:25] Kay Roepke
The agent doesn't resolve the URL to MEM itself; the entire URL is passed to libcurl, which
does the necessary steps.
Glancing at the libcurl docs, I can't see a way to force re-resolving it.
I'm curious: what's the TTL of their MEM DNS record? Might they be seeing their own TTL
setting here (and restarting MEM simply takes longer than the TTL)?
[10 Jun 21:01] Kay Roepke
We need an answer to the question in the comment above to determine the source of
the problem.
Thanks.
[15 Jun 17:32] Shawn Green
Here is the TTL information for memserver1 (the machine that the agents failed to follow
when it changed addresses due to a VLAN shift). The A record's TTL is 10800 seconds (3 hours):

[adminuser ~]$ dig any memserver1.site1.anon

; <<>> DiG 9.2.4 <<>> any memserver1.site1.anon
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 53006
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;memserver1.site1.anon. IN ANY

;; ANSWER SECTION:
memserver1.site1.anon. 10800 IN A xxx.yyy.102.101

;; AUTHORITY SECTION:
site1.anon. 10800 IN NS ns1.lhr1.activehotels.com.
site1.anon. 10800 IN NS ns2.lhr1.activehotels.com.

;; ADDITIONAL SECTION:
ns1.xxx.site2.anon. 10800 IN A xxx.yyy.102.200
ns2.xxx.site2.anon. 10800 IN A xxx.yyy.103.200

;; Query time: 0 msec
;; SERVER: xxx.yyy.102.200#53(xxx.yyy.102.200)
;; WHEN: Mon Jun 15 08:22:32 2009
;; MSG SIZE rcvd: 152

Restarting the agent allowed the new address to resolve properly, but in one case I know
of, this problem affected about 130 agents at the same time. If we could somehow get
libcurl to discard its cached DNS resolutions when we get a "failure to connect" message
and then try again, it would improve our ability to follow network changes like VLAN
remaps.
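
The behaviour requested above (drop the cached address on a connection failure, resolve again, retry) can be sketched independently of the agent. The following is only an illustration in Python, not the agent's code and not libcurl's API: `ResolverCache`, `connect_with_reresolve`, and the injected `resolve` and `connect` callables are hypothetical names introduced for this example. For what it is worth, libcurl does document a `CURLOPT_DNS_CACHE_TIMEOUT` option that controls how long resolved names are kept (0 disables the cache); whether that option, or connection reuse, explains the agent's behaviour is not established in this thread.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class ResolverCache:
    """Hypothetical hostname cache: honours a TTL and can be flushed on demand."""

    def __init__(self,
                 resolve: Callable[[str], str] = socket.gethostbyname,
                 ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._ttl = ttl
        self._clock = clock
        # host -> (address, expiry time on the injected clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, host: str) -> str:
        """Return the cached address, or resolve again if missing or expired."""
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now < entry[1]:
            return entry[0]
        address = self._resolve(host)
        self._entries[host] = (address, now + self._ttl)
        return address

    def invalidate(self, host: str) -> None:
        """Forget the cached address so the next lookup resolves afresh."""
        self._entries.pop(host, None)


def connect_with_reresolve(host: str,
                           cache: ResolverCache,
                           connect: Callable[[str], object],
                           attempts: int = 2) -> object:
    """Try to connect; on failure, drop the cached address and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        address = cache.lookup(host)
        try:
            return connect(address)
        except OSError as exc:  # includes refused, reset, and timed-out connections
            cache.invalidate(host)
            last_error = exc
    raise last_error
```

With this shape, an agent whose server moved would fail once against the stale address, flush the entry, and pick up the new address on the next attempt without needing a restart, even if the cache TTL (or the record's 10800-second TTL) has not yet expired.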