I had a production system go down this week - one minute it was running fine, the next, critical functionality had stopped working. Worse, it didn’t die because something broke - it went down by design, and did so without warning.
Turns out that the system stopped working because the file system on which it stored information reached 95% full.
Sounds reasonable, right? Except that there was still 38GiB of space available on that volume - enough free space to keep the system running for months.
I’m 100% positive that when the system was first put together, the 95% threshold on drive space was reasonable - but in these days of terabyte volumes, network attached storage and storage area networks, the approach leaves something to be desired.
If I were designing such a system myself, I’d make two key changes.
Firstly, have both a soft and a hard threshold.
When the free space on the volume reaches the soft threshold, continue working but start complaining about low space. This gives the system administrators time to react - to purchase or configure additional space. Don’t stop working until free space reaches the hard threshold.
Such a dual-threshold approach avoids the problem I encountered - a system that was working perfectly, and then stopped with no warning.
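As a rough illustration, here’s a minimal sketch of that check in Python. The path and the specific threshold values are assumptions for the sake of the example, not figures from the affected system.

```python
import shutil

SOFT_THRESHOLD_BYTES = 20 * 1024**3   # warn below 20 GiB free (assumed value)
HARD_THRESHOLD_BYTES = 2 * 1024**3    # stop below 2 GiB free (assumed value)

def check_free_space(path="/var/data"):
    """Return 'ok', 'warn' or 'stop' for the volume holding `path`."""
    free = shutil.disk_usage(path).free
    if free <= HARD_THRESHOLD_BYTES:
        return "stop"   # refuse new writes until space is freed or added
    if free <= SOFT_THRESHOLD_BYTES:
        return "warn"   # keep working, but alert the administrators
    return "ok"
```

The point is the two distinct outcomes: the system keeps running (and nagging) in the "warn" band, and only refuses to work at the much lower hard limit.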
Secondly, define the thresholds in time, not in percentages or bytes.
The amount of free space required depends both on the size of the drive and on the workload of the system. Instead of making the system administrators guess, the system should keep track of its storage use over time, giving it an estimate of its rate of growth.
Thresholds can then be set in terms of time: warn the system administrators when free space drops low enough that you expect the volume to fill within eight weeks.
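Here’s a sketch of how that estimate might be computed, assuming the system has been logging (timestamp, bytes used) samples as it runs; the eight-week window is the figure from above, and the sample data is hypothetical.

```python
from datetime import datetime, timedelta

def weeks_until_full(samples, free_bytes):
    """Estimate weeks until the volume fills, from (timestamp, used_bytes) samples.

    Returns None when usage is flat or shrinking, so no warning is needed.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_second = (used1 - used0) / (t1 - t0).total_seconds()
    if growth_per_second <= 0:
        return None
    return free_bytes / growth_per_second / timedelta(weeks=1).total_seconds()

def should_warn(samples, free_bytes, warn_weeks=8):
    """Warn when the volume is expected to fill within `warn_weeks`."""
    weeks = weeks_until_full(samples, free_bytes)
    return weeks is not None and weeks <= warn_weeks

# Example: roughly 1 GiB/day of growth with 38 GiB free gives about
# five and a half weeks of headroom, so the warning fires.
samples = [
    (datetime(2020, 1, 1), 500 * 1024**3),
    (datetime(2020, 1, 31), 530 * 1024**3),
]
print(should_warn(samples, free_bytes=38 * 1024**3))  # True
```

A more careful version would fit a trend over many samples rather than just the first and last, but even this crude estimate tells the administrators something a bare percentage never can: how long they have to act.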