Bug #28200 Running test "sigar-test-all" on Solaris 9 hangs after instance 7
Submitted: 2 May 2007 13:54 Modified: 16 Apr 2009 16:59
Reporter: Kent Boortz
Status: Closed
Category: MySQL Enterprise Monitor: Agent Severity: S2 (Serious)
Version: svn rev 5345 OS: Any (Solaris 9)
Assigned to: Jan Kneschke CPU Architecture: Any

[2 May 2007 13:54] Kent Boortz
Description:
Running both 32 and 64 bit on Solaris 9

 LD_LIBRARY_PATH=$base/pcre/lib:$base/curl/lib sigar-test-all

will hang after

[7] 
  fs.dirname = /users
  fs.devname = production:/usersnfs
  fs.typename = remote
  fs.sys-type-name = nfs
  fs.type = 3
  fs.flags = 0x8b
  fsusage.total = 373057344
  fsusage.free = 51617840
  fsusage.used = 321439504
  fsusage.avail = 32667596
  fsusage.files = 47382528
  fsusage.disk_reads = 18446744073709551615
  fsusage.disk_writes = 18446744073709551615
  fsusage.disk_write_bytes = 18446744073709551615
  fsusage.disk_read_bytes = 18446744073709551615
  fsusage.disk_queue = 18446744073709551615
  fsusage.use_percent = 0.000000

How to repeat:
Run the test as described above.
[2 May 2007 14:03] Kent Boortz
When hitting Control-C in dbx, it stops at

 t@1 (l@1) signal INT (Interrupt) in _statvfs at 0xfee9f534
 0xfee9f534: _statvfs+0x0008:    bgeu     _statvfs+0x30  ! 0xfee9f55c
 Current function is sigar_file_system_usage_get (optimized)
  1597       if (statvfs(dirname, &buf) != 0) {

but a "cont" will actually terminate the run, with

 (dbx) cont
 agent/src/sigar-test-all.c.395: 
 sigar_file_system_usage_get(/nfstmp1) failed with: Interrupted system call (4) 
 agent/src/sigar-test-all.c.463 (unknown): 

So this might actually be more about the sigar call blocking
when there is a faulty NFS mount? If so, the question is whether
this is to be considered a bug or not. Even an "ls /nfstmp1" will
hang, but on the other hand we might consider the agent
to be the kind of daemon that should not hang on this
operation.

It could also be that we don't use this part of the SIGAR
library in the agent, just in the test executable.
[2 May 2007 14:15] Mark Leith
Oh, we will certainly want to be getting this kind of information within the agent as well, especially usage information such as size, used, and available, even on mounted disks, I should think.

We should somehow handle this nicely within the agent (as well as in sigar-test-all).

Cheers,

Mark
[11 May 2007 21:04] Jan Kneschke
The only way to "fix" this problem is to use the "soft" mount option in the NFS mount. Otherwise all syscalls to an NFS share will block indefinitely.

Quoting "man mount" on Linux:

       hard   The program accessing a file on a NFS mounted file system will hang when the server crashes. The process cannot be interrupted or killed unless you  also  specify
              intr.  When the NFS server is back online the program will continue undisturbed from where it was. This is probably what you want.
[11 May 2007 21:05] Jan Kneschke
Wait, 'intr' was specified in that case. All we need is a SIGALRM around the call to statvfs(). statvfs() will then return with EINTR and we know it was a timeout.
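
A minimal sketch of the SIGALRM idea, for illustration only. The helper name statvfs_with_timeout() and the choice to report a timeout as ETIMEDOUT are assumptions made for this example; they are not taken from the agent or from SIGAR. The handler is installed without SA_RESTART so that the blocked statvfs() fails with EINTR instead of being restarted.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Set by the handler so a timeout can be told apart from any other EINTR. */
static volatile sig_atomic_t alarm_fired = 0;

static void on_alarm(int signo) {
    (void)signo;
    alarm_fired = 1;
}

/*
 * Hypothetical helper: statvfs() guarded by an alarm.
 * Returns 0 on success, -1 on failure with errno set.
 * errno is ETIMEDOUT when the alarm interrupted the call.
 */
int statvfs_with_timeout(const char *path, struct statvfs *buf,
                         unsigned int seconds) {
    struct sigaction sa, old_sa;
    int rc, saved_errno;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: the call must fail with EINTR */

    if (sigaction(SIGALRM, &sa, &old_sa) != 0) {
        return -1;
    }

    alarm_fired = 0;
    alarm(seconds);             /* arm the timeout */
    rc = statvfs(path, buf);    /* may block on a dead NFS server */
    saved_errno = errno;
    alarm(0);                   /* disarm */
    sigaction(SIGALRM, &old_sa, NULL); /* restore the previous handler */

    if (rc != 0 && saved_errno == EINTR && alarm_fired) {
        errno = ETIMEDOUT;
    } else {
        errno = saved_errno;
    }
    return rc;
}
```

A caller could use statvfs_with_timeout(dirname, &buf, 5) in place of the plain statvfs(dirname, &buf) and skip the file system when errno is ETIMEDOUT. Note the limits of this sketch: it only helps when the mount is interruptible ('intr' or 'soft'), alarm() is process-wide so it does not suit multi-threaded code, and there is a small window in which the alarm can fire before statvfs() has actually blocked.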
[11 May 2007 23:26] Andy Bang
Add to smoke test???
[21 May 2007 17:22] Gary Whizin
Per jan: requires a sigar feature which is pretty new
[27 Aug 2007 18:32] Jan Kneschke
The NFS got remounted with the "intr" option and the problem is gone now.

A more complex detection of the problem will be implemented later.
[15 Nov 2007 20:20] Gary Whizin
2.0 disk space monitoring will make this worse.
[29 Oct 2008 20:00] Jan Kneschke
1068 jan@mysql.com	2008-10-29
     try to ping the file-systems before we use them (fixes #28200)

       * sigar_file_system_ping() verifies that a NFS mount is active 
modified:
 plugins/agent/job_collect_os.c
 plugins/agent/sigar-test-all.c
 plugins/agent/tests/unit/t_sigar.c

To verify it, set up NFS with "hard" and no "intr", mount it, and stop the NFS server.
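
The implementation of sigar_file_system_ping() is not shown in this report. Purely as an illustration of the "check before use" idea, the sketch below uses a different technique: it runs statvfs() in a child process and gives up waiting after a timeout, so the caller itself never blocks on a dead "hard" mount. The names probe_mount() and enum probe_result are hypothetical and exist only in this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical result codes for this example. */
enum probe_result { PROBE_OK, PROBE_FAILED, PROBE_TIMEOUT, PROBE_ERROR };

/*
 * Hypothetical helper: probe a mount point from a child process.
 * The parent polls for the child's exit and stops after timeout_ms.
 */
enum probe_result probe_mount(const char *path, int timeout_ms) {
    const int step_ms = 10;
    int waited_ms = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return PROBE_ERROR;
    }
    if (pid == 0) {
        /* Child: this call may block forever on a dead hard mount. */
        struct statvfs buf;
        _exit(statvfs(path, &buf) == 0 ? 0 : 1);
    }

    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid) {
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
                return PROBE_OK;
            }
            return PROBE_FAILED;
        }
        if (r < 0 && errno != EINTR) {
            return PROBE_ERROR;
        }
        if (waited_ms >= timeout_ms) {
            break;
        }

        /* Sleep one polling step before checking again. */
        struct timespec ts = { 0, step_ms * 1000000L };
        nanosleep(&ts, NULL);
        waited_ms += step_ms;
    }

    /* Timed out: try to kill the child, and reap it if it already exited. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, WNOHANG);
    return PROBE_TIMEOUT;
}
```

The trade-off of this approach is that a child stuck on a hard mount without "intr" cannot be killed until the server returns, so a stuck child may linger and later need reaping; the caller, however, can report PROBE_TIMEOUT and move on to the next file system.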
[16 Apr 2009 16:59] Carsten Segieth
There is no longer a 'sigar-test-all', so I've used (with 2.0.5.7153 and 2.1.0.1024) the now existing agent option "--agent-run-os-tests". The agent doesn't hang.

The problem with the size of /users, which results in

 sigar-test-all.c.464: sigar_file_system_usage_get(/users) failed with: Value too large for defined data type (79)

will be reported as a separate problem (32bit only).