MySQL Bugs: #28200: Running test "sigar-test-all" on Solaris 9 hangs after instance 7

Bug #28200	Running test "sigar-test-all" on Solaris 9 hangs after instance 7
Submitted:	2 May 2007 13:54
Modified:	16 Apr 2009 16:59
Reporter:	Kent Boortz
Status:	Closed
Category:	MySQL Enterprise Monitor: Agent
Severity:	S2 (Serious)
Version:	svn rev 5345
OS:	Any (Solaris 9)
Assigned to:	Jan Kneschke
CPU Architecture:	Any

Description:
On Solaris 9, running either the 32-bit or the 64-bit build of

 LD_LIBRARY_PATH=$base/pcre/lib:$base/curl/lib sigar-test-all

hangs after printing

[7] 
  fs.dirname = /users
  fs.devname = production:/usersnfs
  fs.typename = remote
  fs.sys-type-name = nfs
  fs.type = 3
  fs.flags = 0x8b
  fsusage.total = 373057344
  fsusage.free = 51617840
  fsusage.used = 321439504
  fsusage.avail = 32667596
  fsusage.files = 47382528
  fsusage.disk_reads = 18446744073709551615
  fsusage.disk_writes = 18446744073709551615
  fsusage.disk_write_bytes = 18446744073709551615
  fsusage.disk_read_bytes = 18446744073709551615
  fsusage.disk_queue = 18446744073709551615
  fsusage.use_percent = 0.000000

How to repeat:
Run the test as described above.

Hitting Control-C in dbx stops the process at

 t@1 (l@1) signal INT (Interrupt) in _statvfs at 0xfee9f534
 0xfee9f534: _statvfs+0x0008:    bgeu     _statvfs+0x30  ! 0xfee9f55c
 Current function is sigar_file_system_usage_get (optimized)
  1597       if (statvfs(dirname, &buf) != 0) {

but a "cont" will actually terminate the run, with

 (dbx) cont
 agent/src/sigar-test-all.c.395: 
 sigar_file_system_usage_get(/nfstmp1) failed with: Interrupted system call (4) 
 agent/src/sigar-test-all.c.463 (unknown): 

So this might really be about the SIGAR call blocking
when there is a faulty NFS mount. If so, the question is whether
this is to be considered a bug or not. Even an "ls /nfstmp1" will
hang, but on the other hand we might consider the agent
to be the kind of daemon that should not hang on this
operation.

It could also be that we don't use this part of the SIGAR
library in the agent, just in the test executable.

Oh, we will certainly want to gather this kind of information within the agent as well, especially usage information such as size, used and available space, even on mounted disks, I should think.

We should somehow handle this gracefully within the agent (as well as in sigar-test-all).

Cheers,

Mark

The only way to "fix" this problem is to use the "soft" mount option for the NFS mount. Otherwise all system calls on an NFS share will block indefinitely.

Quoting "man mount" on Linux:

       hard   The program accessing a file on a NFS mounted file system will hang when the server crashes. The process cannot be interrupted or killed unless you also specify intr.
              When the NFS server is back online the program will continue undisturbed from where it was. This is probably what you want.

Wait, 'intr' was specified in that case. All we need is a SIGALRM timer around the call to statvfs(). When the alarm fires, statvfs() will return with EINTR and we know it was a timeout.

Add to smoke test???

Per Jan: this requires a SIGAR feature which is pretty new.

The NFS got remounted with the "intr" option and the problem is gone now.

A more complex detection of the problem will be implemented later.

2.0 disk space monitoring will make this worse.

1068 jan@mysql.com	2008-10-29
     try to ping the file-systems before we use them (fixes #28200)

       * sigar_file_system_ping() verifies that a NFS mount is active 
modified:
 plugins/agent/job_collect_os.c
 plugins/agent/sigar-test-all.c
 plugins/agent/tests/unit/t_sigar.c

To verify it, set up NFS with "hard" and no "intr", mount it and stop the NFS server.

There is no longer a 'sigar-test-all', so I've used (with 2.0.5.7153 and 2.1.0.1024) the now existing agent option "--agent-run-os-tests". The agent doesn't hang.

The problem with the size of /users, which results in

 sigar-test-all.c.464: sigar_file_system_usage_get(/users) failed with: Value too large for defined data type (79)

will be reported as a separate problem (32-bit only).