Bug #28200 | Running test "sigar-test-all" on Solaris 9 hangs after instance 7 | ||
---|---|---|---|
Submitted: | 2 May 2007 13:54 | Modified: | 16 Apr 2009 16:59 |
Reporter: | Kent Boortz | | |
Status: | Closed | | |
Category: | MySQL Enterprise Monitor: Agent | Severity: | S2 (Serious) |
Version: | svn rev 5345 | OS: | Any (Solaris 9) |
Assigned to: | Jan Kneschke | CPU Architecture: | Any |
[2 May 2007 13:54]
Kent Boortz
[2 May 2007 14:03]
Kent Boortz
If hitting control-C in dbx it stops at

    t@1 (l@1) signal INT (Interrupt) in _statvfs at 0xfee9f534
    0xfee9f534: _statvfs+0x0008: bgeu _statvfs+0x30 ! 0xfee9f55c
    Current function is sigar_file_system_usage_get (optimized)
    1597 if (statvfs(dirname, &buf) != 0) {

but a "cont" will actually terminate the run, with

    (dbx) cont
    agent/src/sigar-test-all.c.395: sigar_file_system_usage_get(/nfstmp1) failed with: Interrupted system call (4)
    agent/src/sigar-test-all.c.463 (unknown):

So this might actually be more about the sigar call blocking when there is a faulty NFS mount? If so, the question is whether this is to be considered a bug or not. Even a "ls /nfstmp1" will hang, but on the other hand we might consider the agent to be the kind of daemon that should not hang on this operation. It could also be that we don't use this part of the SIGAR library in the agent, just in the test executable.
[2 May 2007 14:15]
Mark Leith
Oh, we will certainly want to be getting this kind of information within the agent as well. Especially usage information such as size, used, available, even on mounted disks as well I should think. We should somehow handle this nicely within the agent (as well as sigar-test-all).

Cheers, Mark
[11 May 2007 21:04]
Jan Kneschke
The only way to "fix" this problem is to use the "soft" mount option for the NFS mount. Otherwise all system calls on an NFS share will block indefinitely. Quoting "man mount" on Linux:

    hard   The program accessing a file on a NFS mounted file system will hang when the server crashes. The process cannot be interrupted or killed unless you also specify intr. When the NFS server is back online the program will continue undisturbed from where it was. This is probably what you want.
[11 May 2007 21:05]
Jan Kneschke
Wait, 'intr' was specified in that case. All we need is a SIGALRM around the call to statvfs(). When the alarm fires, statvfs() will return with EINTR and we know it was a timeout.
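The SIGALRM idea above could look roughly like the following. This is a minimal sketch, not the agent's or SIGAR's actual code: the helper name `statvfs_with_timeout`, the handler `on_alarm`, and the choice to report the timeout as `ETIMEDOUT` are all assumptions made for illustration. It relies on the mount being interruptible (the "intr" case); on a "hard" mount without "intr" the signal would not break the call, as the man page excerpt above says.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/statvfs.h>
#include <unistd.h>

/* Set by the handler so a timeout can be told apart from other signals. */
static volatile sig_atomic_t alarm_fired;

/* The handler only records that the alarm went off; its delivery is what
 * makes the blocked statvfs() return with EINTR. */
static void on_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;
}

/*
 * Hypothetical helper (not part of SIGAR): statvfs() guarded by alarm().
 * Returns 0 on success, -1 on failure with errno set. errno is ETIMEDOUT
 * when the call was interrupted by our own alarm. A timeout of 0 seconds
 * disables the guard, because alarm(0) schedules nothing.
 */
int statvfs_with_timeout(const char *path, struct statvfs *buf,
                         unsigned int seconds)
{
    struct sigaction sa, old_sa;
    int rc, saved_errno;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* no SA_RESTART: the syscall must fail with EINTR */
    if (sigaction(SIGALRM, &sa, &old_sa) != 0) {
        return -1;
    }

    alarm_fired = 0;
    alarm(seconds);
    rc = statvfs(path, buf);
    saved_errno = errno;
    alarm(0);                          /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old_sa, NULL); /* restore the previous handler */

    if (rc != 0 && saved_errno == EINTR && alarm_fired) {
        saved_errno = ETIMEDOUT;
    }
    errno = saved_errno;
    return rc;
}
```

A caller would treat `ETIMEDOUT` as "mount not responding" and skip that file system instead of hanging. Note that alarm() is process-wide, so this sketch only suits a single-threaded caller that does not use SIGALRM for anything else.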
[11 May 2007 23:26]
Andy Bang
Add to smoke test???
[21 May 2007 17:22]
Gary Whizin
Per Jan: this requires a SIGAR feature which is pretty new.
[27 Aug 2007 18:32]
Jan Kneschke
The NFS share was remounted with the "intr" option and the problem is gone now. A more complex detection of the problem will be implemented later.
[15 Nov 2007 20:20]
Gary Whizin
2.0 disk space monitoring will make this worse.
[29 Oct 2008 20:00]
Jan Kneschke
1068 jan@mysql.com 2008-10-29

try to ping the file-systems before we use them (fixes #28200)

* sigar_file_system_ping() verifies that a NFS mount is active

modified:

    plugins/agent/job_collect_os.c
    plugins/agent/sigar-test-all.c
    plugins/agent/tests/unit/t_sigar.c

To verify it, set up NFS with "hard" and no "intr", mount it and stop the NFS server.
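The "ping before use" pattern from this commit can be sketched as below. The commit's actual code is not shown in this report, so everything here is an assumption for illustration: the `fs_entry` struct, the callback types, and `collect_usage` are hypothetical, and the two `fake_*` functions only stand in for the real ping (sigar_file_system_ping()) and the real usage query (sigar_file_system_usage_get()).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified description of one mounted file system. */
typedef struct {
    const char *dir;  /* mount point, e.g. "/nfstmp1" */
    const char *type; /* file system type, e.g. "nfs" */
} fs_entry;

/* Both callbacks return 0 on success, non-zero on failure. */
typedef int (*fs_ping_fn)(const fs_entry *fs);
typedef int (*fs_usage_fn)(const fs_entry *fs);

/*
 * Query usage for every entry, but ping NFS mounts first and skip the
 * ones that do not answer, so the blocking usage call is never made on
 * a dead mount. Returns the number of entries successfully queried.
 */
size_t collect_usage(const fs_entry *list, size_t n,
                     fs_ping_fn ping, fs_usage_fn usage)
{
    size_t collected = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (strcmp(list[i].type, "nfs") == 0 && ping(&list[i]) != 0) {
            continue; /* server unreachable: do not touch this mount */
        }
        if (usage(&list[i]) == 0) {
            collected++;
        }
    }
    return collected;
}

/* Stand-in ping for demonstration: pretends "/nfstmp1" is down. */
int fake_ping(const fs_entry *fs)
{
    return strcmp(fs->dir, "/nfstmp1") == 0 ? -1 : 0;
}

/* Stand-in usage query for demonstration: always succeeds. */
int fake_usage(const fs_entry *fs)
{
    (void)fs;
    return 0;
}
```

With the stand-ins, a list containing a local file system and the dead "/nfstmp1" mount yields one collected entry, and the usage callback is never invoked for the NFS mount.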
[16 Apr 2009 16:59]
Carsten Segieth
There is no longer a 'sigar-test-all', so I've used (with 2.0.5.7153 and 2.1.0.1024) the now existing agent option "--agent-run-os-tests". The agent doesn't hang. The problem with the size of /users, which results in

    sigar-test-all.c.464: sigar_file_system_usage_get(/users) failed with: Value too large for defined data type (79)

will be reported as a separate problem (32bit only).