Bug #54204 | solaris sigar cpu detection can hang | ||
---|---|---|---|
Submitted: | 3 Jun 2010 13:19 | Modified: | 24 Jan 2011 19:39 |
Reporter: | Andrew Dalgleish | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Enterprise Monitor: Agent | Severity: | S2 (Serious) |
Version: | 2.2.0.1709 | OS: | Solaris |
Assigned to: | Jan Kneschke | CPU Architecture: | Any |
[3 Jun 2010 13:19]
Andrew Dalgleish
[18 Jun 2010 7:15]
MySQL Verification Team
How to repeat: * configure a pset, pool, and non-global zone * login to the zone sudo zlogin -C testzone * check the CPU info kstat cpu_info * in the global zone, transfer CPUs in (or out) of the test zone's pset sudo poolcfg -dc 'transfer 1 from pset pset_default to testpset' * check the CPU info again - it's different kstat cpu_info There is a TOC-TOU window between the initial call to get the number of CPUs, and when we collect the cpu_info. I'm not even sure how MEM will handle the stats for a dynamic number of CPUs.
[6 Jul 2010 12:19]
Enterprise Tools JIRA Robot
Michael Schuster writes: increasing priority because of ESC.
[12 Jul 2010 15:37]
Enterprise Tools JIRA Robot
Jan Kneschke writes: sigar_get_kstats() is only called once directly, that's at sigar_open() time. Every other access goes through sigar_kstat_update() which calls kstat_update_chain() which updates kstat-chain and if it updated it, calls sigar_get_kstats() to update the internal cpu stats. * http://github.com/hyperic/sigar/blob/master/src/os/solaris/kstats.c#L23 (sigar_get_kstats()) * http://github.com/hyperic/sigar/blob/master/src/os/solaris/kstats.c#L96 (sigar_kstat_update()) Between the "kstat_update_chain()" and the "sysconf(..._ONLN)" in sigar_get_kstat() is a gap which may lead to a race-condition. The test in the loop: {noformat} + /* exit if the number of CPUs has reduced */ + if (i >= sysconf(_SC_NPROCESSORS_ONLN)) { + break; + } {noformat} should help to protect against a proc going offline in between, but it doesn't: 1) kstat_update_chain() called with 4 procs online 2) sysconf(..._ONLN) at the start of sigar_get_kstats() reports the 4 procs 3) new processor brought offline 4) sigar_get_kstats() enters the for-loop 5) sysconf(_SC_NPROCESSORS_ONLN) called again, reports 3 procs, exists early ... even if the kstat-chain contains 4 cpu entries If processors got online or offline inbetween: the kstat-chain is only updated if kstat_update_chain() is called. Taking into account that kstat_lookup() searches the whole kstat-chain until it finds something from head to end, it will be cheaper to just walk the list once to find included instances of cpu_info and afterwards extracting all their instances. + no need for sysconf + no race-conditions
[12 Jul 2010 16:00]
Enterprise Tools JIRA Robot
Jan Kneschke writes: added to hyperics forums: http://forums.hyperic.com/jiveforums/thread.jspa?threadID=10457
[28 Jul 2010 15:29]
Enterprise Tools JIRA Robot
Jan Kneschke writes: Added the kstats-visitor.[ch] to the sigar-repo: ------------------------------------------------------------ revno: 2544 committer: jan@mysql.com branch nick: sigar timestamp: Wed 2010-07-28 13:11:29 +0200 message: added a visitor for the kstat-chain to walk the list only once * fixes a race-condition between sysconf() and kstat * should be faster for a large number of CPUs as we only walk it once instead of 2 times per CPU added: src/os/solaris/kstats-visitor.c src/os/solaris/kstats-visitor.h modified: src/os/solaris/Makefile.am src/os/solaris/kstats.c
[29 Jul 2010 23:21]
Enterprise Tools JIRA Robot
Andy Bang writes: In build 2.2.3.1734.