MySQL Bugs: #40400: Please make mtr print the amount of free disk space after every failure

Bug #40400	Please make mtr print the amount of free disk space after every failure
Submitted:	29 Oct 2008 17:41	Modified:	9 Jan 2015 14:34
Reporter:	Sven Sandberg
Status:	Won't fix
Category:	Tools: MTR / mysql-test-run	Severity:	S7 (Test Cases)
Version:	5.1	OS:	Any
Assigned to:	Bjørn Munch	CPU Architecture:	Any
Tags:	disk full, mtr

Description:
When the disk on a pushbuild host gets filled up, it usually generates a number of strange failures. Sometimes, but not always, there will be a message in one or more of the logs that the disk is full. This would be much faster to debug if mtr printed the amount of free disk space after every test failure.

See also BUG#40156. This is not a duplicate: BUG#40156 covers the case where the disk fills up during one of mysqltest's operations, whereas the present request would also help when the disk fills up during one of mysqld's operations.

How to repeat:
E.g. BUG#40133, BUG#40155. There are also many unreported cases.

Please give more details than mentioned within the
initial report.
1. Maybe: free disk space in general
2. Free space in the filesystems containing
   our vardir and tmpdir
   We can probably conclude that our problems
   are caused by a full filesystem.
3. Space consumed by our vardir and tmpdir
   We can probably conclude that our problems
   are caused by our MTR run
   - the last testcase was too greedy
   - the general setup (MTR options used)
     is unfortunate (See Bug #42442)
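
The numbers asked for in items 2 and 3 could be gathered roughly as follows. This is only a sketch in Python, not mtr code (mtr itself is a Perl script); the function names and the report format are made up for illustration, and the directory size is the sum of apparent file sizes rather than allocated disk blocks.

```python
import os
import shutil


def free_bytes(path):
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def dir_size_bytes(path):
    """Sum of apparent file sizes below `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def disk_report(vardir, tmpdir):
    """Return one report line per directory (hypothetical format)."""
    lines = []
    for label, path in (("vardir", vardir), ("tmpdir", tmpdir)):
        lines.append(
            "%s=%s free=%d used_by_dir=%d"
            % (label, path, free_bytes(path), dir_size_bytes(path))
        )
    return lines
```

A small free value next to a small used_by_dir value would point at the host, while a large used_by_dir value would point at the MTR run itself.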

While this might seem useful, I do not think this should or can be implemented in mtr. In general, software does not try to divine whether resource problems might be the cause of failures; it just reports the symptoms. Some arguments specific to mtr:

mtr is run on lots of different systems, and I doubt there is any (easy) portable way to implement this across all platforms. Also, it's difficult or impossible to determine the limit of acceptable free disk (or memory). This tool is also shipped as part of the product and used by some customers, and we don't know what kind of setup they have.

If numbers on free disk are to be printed whenever a test fails, I think this will be more "noise" than useful, since in well over 99% of the cases it has no relevance. It may even lead to confusion as users think it might be relevant. Also, the numbers reported by mtr *after* the server has crashed may be quite different from what the server experienced, especially when using --mem.

mtr is not an end-user application; it's a tool for engineers, who will often be able to see from the symptoms of unexpected failures that low memory and/or full disk is a likely cause, and who can also check afterwards that they are indeed low on mem/disk. This is especially true when running mtr manually. I can only speak for myself, but in the few cases I've had this problem on my desktop or a lab box, it's generally not been very difficult for me to guess that full disk/memory was the problem, from the way things failed.

For usage in the in-house PB2 environment where tests may be analyzed some time after they've been run, I think a much better solution is to have independent monitoring of disk/memory, with limits adjusted for each host.
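
As a rough illustration of monitoring with limits adjusted per host, here is a minimal Python sketch. The host name, the limit values and the function names are all hypothetical; nothing here describes an existing PB2 component.

```python
import shutil
import socket

# Hypothetical limits: minimum free bytes before a warning is raised.
DEFAULT_MIN_FREE = 1 * 1024**3
HOST_MIN_FREE = {
    "example-host-1": 5 * 1024**3,  # assumed host name, for illustration only
}


def min_free_for(host):
    """Limit for `host`, falling back to the default."""
    return HOST_MIN_FREE.get(host, DEFAULT_MIN_FREE)


def check_disk(path, host=None, free=None):
    """Return a warning string if free space is below the host's limit, else None.

    `host` and `free` default to the local host name and the current free
    space on the filesystem containing `path`.
    """
    if host is None:
        host = socket.gethostname()
    if free is None:
        free = shutil.disk_usage(path).free
    limit = min_free_for(host)
    if free < limit:
        return "%s: %s has %d bytes free, below limit %d" % (host, path, free, limit)
    return None
```

Run periodically from outside mtr, a check like this sees the disk state before a crash rather than after it.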

Bjorn, thank you for your feedback.

I'm all for your suggestion to use an external monitoring tool to track the amount of disk space. As long as the job gets done, it does not matter which tool does it. And as you mention, it may even be preferable to have an external tool so that the disk/memory usage before the crash can be detected.

So, can we add this feature to pushbuild? Basically, what we need is a per-host database of resource usage and a monitoring tool that fills the database. To make it usable, I think pushbuild can parse logs, find timestamps of test failures, match them with the database, and insert, say, the peak of memory and disk usage for the last 30 minutes. Maybe we would need to augment mysqltest.cc so that it prints timestamps of test failures in an easily parsable way. This should be easy.
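
The matching step could look something like the following Python sketch. It assumes the monitoring tool stores samples as (timestamp, disk bytes used, memory bytes used) tuples and that failure timestamps have already been parsed from the logs; the sample layout and the function name are assumptions, not an existing pushbuild interface.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def peak_before(samples, failure_time, window=WINDOW):
    """Peak disk and memory usage in the window leading up to a failure.

    `samples` is an iterable of (timestamp, disk_used, mem_used) tuples
    (assumed layout). Returns (peak_disk, peak_mem), or None if no sample
    falls inside [failure_time - window, failure_time].
    """
    start = failure_time - window
    in_window = [s for s in samples if start <= s[0] <= failure_time]
    if not in_window:
        return None
    return (max(s[1] for s in in_window), max(s[2] for s in in_window))
```

For example, with samples taken 40, 20 and 5 minutes before a failure, only the last two fall inside the default 30-minute window and contribute to the reported peaks.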

You are right that when mtr is used on a local system, it is often not difficult to figure out if you are out of disk space. But when developers are assigned to analyze pb failures, it can be harder. We have trees in all sorts of states, including some where a large number of test failures is normal. Not everyone is up to date with the state of every tree. This feature would allow us to classify host-related problems much faster.

So, should we change category to Pushbuild?