Bug #41068 Agent runs out of file descriptors, does not recover
Submitted: 27 Nov 2008 12:09 Modified: 27 Feb 2009 12:19
Reporter: Kay Roepke
Status: Closed
Category: Monitoring: Agent Severity: S1 (Critical)
Version: 2.0.0.7102 OS: Any
Assigned to: MC Brown Target Version: 2.0 GA maint release

[27 Nov 2008 12:09] Kay Roepke
Description:
In some circumstances the agent/proxy runs out of file descriptors, causing a range of
secondary failures.
It does not recover from that state.

Relevant part of the log file:

2008-11-27 11:11:00: (critical) last message repeated 2 times
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:30: (critical) network-socket.c.292: socket(127.0.0.1:3306) failed: Too many open files (24)
2008-11-27 11:11:30: (critical) proxy-plugin.c.1532: Cannot connect, all backends are down.
2008-11-27 11:20:22: (critical) last message repeated 4 times
2008-11-27 11:20:22: (critical) network-io.c:215: curl_easy_perform('https://user:password@merlin-dashboard:443/heartbeat') failed: SSL connection timeout (curl-error = 'Timeout was reached' (28), os-error = 'Connection refused' (111))

In this setup, iptables rules were used to divert traffic to the proxy instead of
directly to the backend, so that the application config did not have to change.
The rules were reset to the default (effectively no rules) just before the
agent/proxy failure.

[root@host etc]# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination        
LOG        tcp  -- !nagios-server  anywhere            tcp dpt:mysql LOG level notice prefix `Redirect incoming: '
REDIRECT   tcp  -- !nagios-server  anywhere            tcp dpt:mysql redir ports 4040
 
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination        
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination        
[root@host etc]# iptables -F -t nat
[root@host etc]# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination        
 
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination        
 
Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

How to repeat:
Currently unknown

Suggested fix:
N/A
[2 Dec 2008 18:20] Gary Whizin
1. Update the docs to explain how the user can raise the file descriptor limit at the OS level.
2. The agent should try to increase the limit at startup (as the MySQL server does)
  and add a message-level log entry whether or not it succeeds.
[17 Feb 2009 21:44] Diego Medina
Verified fixed on 2.0.5.7144

With the log level set to debug, I see:

(debug) chassis.c:1091: current RLIMIT_NOFILE = 256 (hard: 9223372036854775807)
(debug) chassis.c:1095: trying to set new RLIMIT_NOFILE = 8192 (hard: 9223372036854775807)
(debug) chassis.c:1103: set new RLIMIT_NOFILE = 8192 (hard: 9223372036854775807)
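
For readers unfamiliar with the mechanism behind the RLIMIT_NOFILE lines above, here is a minimal sketch in C of raising the soft file descriptor limit at startup with the POSIX getrlimit()/setrlimit() calls. It is not the agent's actual chassis.c code; the helper name raise_nofile_limit and the wording of the messages are assumptions made for illustration only.

```c
#include <stdio.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not taken from the agent source): try to raise the
 * soft RLIMIT_NOFILE to `wanted`, capped at the hard limit.
 * Returns 0 if the soft limit is at least the (capped) target afterwards,
 * -1 if getrlimit() or setrlimit() fails.
 */
static int raise_nofile_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit(RLIMIT_NOFILE)");
        return -1;
    }
    fprintf(stderr, "current RLIMIT_NOFILE = %llu (hard: %llu)\n",
            (unsigned long long)rl.rlim_cur,
            (unsigned long long)rl.rlim_max);

    /* Nothing to do if the soft limit is already high enough. */
    if (rl.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot exceed the hard limit, so cap the target. */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        wanted = rl.rlim_max;

    rl.rlim_cur = wanted;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        /* Log the failure too, so the outcome is visible either way. */
        perror("setrlimit(RLIMIT_NOFILE)");
        return -1;
    }
    fprintf(stderr, "set new RLIMIT_NOFILE = %llu (hard: %llu)\n",
            (unsigned long long)rl.rlim_cur,
            (unsigned long long)rl.rlim_max);
    return 0;
}
```

A program would call something like raise_nofile_limit(8192) once, early in startup, before opening sockets. Some platforms may reject a target that the reported hard limit appears to allow, which is why the failure path is logged instead of being treated as fatal.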
[27 Feb 2009 12:19] Tony Bedford
An entry was added to the 2.0.5 changelog:

In some circumstances the agent/proxy ran out of file descriptors, causing secondary
failures. It could not recover from that state. The relevant part of the log file is
shown here:

2008-11-27 11:11:00: (critical) last message repeated 2 times
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:411: sigar_cpu_info_list_get() failed
2008-11-27 11:11:00: (critical) job_collect_os.c:445: sigar_cpu_list_get() failed
2008-11-27 11:11:30: (critical) network-socket.c.292: socket(127.0.0.1:3306) failed: Too many open files (24)
2008-11-27 11:11:30: (critical) proxy-plugin.c.1532: Cannot connect, all backends are down.
2008-11-27 11:20:22: (critical) last message repeated 4 times
2008-11-27 11:20:22: (critical) network-io.c:215: curl_easy_perform('https://user:password@merlin-dashboard:443/heartbeat') failed: SSL connection timeout (curl-error = 'Timeout was reached' (28), os-error = 'Connection refused' (111))