| Bug #54204 | solaris sigar cpu detection can hang | ||
|---|---|---|---|
| Submitted: | 3 Jun 2010 13:19 | Modified: | 24 Jan 2011 19:39 |
| Reporter: | Andrew Dalgleish | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Enterprise Monitor: Agent | Severity: | S2 (Serious) |
| Version: | 2.2.0.1709 | OS: | Solaris |
| Assigned to: | Jan Kneschke | CPU Architecture: | Any |
[18 Jun 2010 7:15]
MySQL Verification Team
How to repeat: * configure a pset, pool, and non-global zone * login to the zone sudo zlogin -C testzone * check the CPU info kstat cpu_info * in the global zone, transfer CPUs in (or out) of the test zone's pset sudo poolcfg -dc 'transfer 1 from pset pset_default to testpset' * check the CPU info again - it's different kstat cpu_info There is a TOC-TOU window between the initial call to get the number of CPUs, and when we collect the cpu_info. I'm not even sure how MEM will handle the stats for a dynamic number of CPUs.
[6 Jul 2010 12:19]
Enterprise Tools JIRA Robot
Michael Schuster writes: increasing priority because of ESC.
[12 Jul 2010 15:37]
Enterprise Tools JIRA Robot
Jan Kneschke writes: sigar_get_kstats() is only called once directly, that's at sigar_open() time. Every other access goes through sigar_kstat_update() which calls kstat_update_chain() which updates kstat-chain and if it updated it, calls sigar_get_kstats() to update the internal cpu stats. * http://github.com/hyperic/sigar/blob/master/src/os/solaris/kstats.c#L23 (sigar_get_kstats()) * http://github.com/hyperic/sigar/blob/master/src/os/solaris/kstats.c#L96 (sigar_kstat_update()) Between the "kstat_update_chain()" and the "sysconf(..._ONLN)" in sigar_get_kstat() is a gap which may lead to a race-condition. The test in the loop: {noformat} + /* exit if the number of CPUs has reduced */ + if (i >= sysconf(_SC_NPROCESSORS_ONLN)) { + break; + } {noformat} should help to protect against a proc going offline in between, but it doesn't: 1) kstat_update_chain() called with 4 procs online 2) sysconf(..._ONLN) at the start of sigar_get_kstats() reports the 4 procs 3) new processor brought offline 4) sigar_get_kstats() enters the for-loop 5) sysconf(_SC_NPROCESSORS_ONLN) called again, reports 3 procs, exists early ... even if the kstat-chain contains 4 cpu entries If processors got online or offline inbetween: the kstat-chain is only updated if kstat_update_chain() is called. Taking into account that kstat_lookup() searches the whole kstat-chain until it finds something from head to end, it will be cheaper to just walk the list once to find included instances of cpu_info and afterwards extracting all their instances. + no need for sysconf + no race-conditions
[12 Jul 2010 16:00]
Enterprise Tools JIRA Robot
Jan Kneschke writes: added to hyperics forums: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=10457
[28 Jul 2010 15:29]
Enterprise Tools JIRA Robot
Jan Kneschke writes:
Added the kstats-visitor.[ch] to the sigar-repo:
------------------------------------------------------------
revno: 2544
committer: jan@mysql.com
branch nick: sigar
timestamp: Wed 2010-07-28 13:11:29 +0200
message:
added a visitor for the kstat-chain to walk the list only once
* fixes a race-condition between sysconf() and kstat
* should be faster for a large number of CPUs as we only walk it once
instead of 2 times per CPU
added:
src/os/solaris/kstats-visitor.c
src/os/solaris/kstats-visitor.h
modified:
src/os/solaris/Makefile.am
src/os/solaris/kstats.c
[29 Jul 2010 23:21]
Enterprise Tools JIRA Robot
Andy Bang writes: In build 2.2.3.1734.

Description: The sigar cpu detection code on Solaris can hang if the number of CPUs varies. How to repeat: A customer provided dtrace probes which confirmed an infinite loop. I'm still trying to create a repeatable test case, but I'm not confident. Suggested fix: Untested --- kstats.c.orig 2010-06-02 19:43:57.000000000 +1000 +++ kstats.c 2010-06-03 23:12:10.000000000 +1000 @@ -24,7 +24,7 @@ int sigar_get_kstats(sigar_t *sigar) { kstat_ctl_t *kc = sigar->kc; - unsigned int i, id, ncpu = sysconf(_SC_NPROCESSORS_CONF); + unsigned int i, id, ncpu = sysconf(_SC_NPROCESSORS_ONLN); int is_debug = SIGAR_LOG_IS_DEBUG(sigar); if (ncpu != sigar->ncpu) { @@ -67,6 +67,16 @@ for (i=0, id=0; i<ncpu; id++) { kstat_t *cpu_info, *cpu_stat; + /* exit if the number of CPUs has reduced */ + if (i >= sysconf(_SC_NPROCESSORS_ONLN)) { + break; + } + + /* exit if we have checked all the legal CPUID's */ + if (id >= sysconf(_SC_CPUID_MAX)) { + break; + } + if (!(cpu_info = kstat_lookup(kc, "cpu_info", id, NULL))) { continue; }