I had a production system go down this week - one minute it was running fine, the next, critical functionality had stopped working. Worse, it didn’t die because something broke - it went down by design, and did so without warning.
Turns out that the system stopped working because the file system on which it stored information reached 95% full.
Sounds reasonable, right? Except that there was still 38GiB of space available on that volume - enough free space to keep the system running for months.
I’m 100% positive that when the system was first put together, the 95% threshold on drive space was reasonable - but in these days of terabyte volumes, network attached storage and storage area networks, the approach leaves something to be desired.
If I were designing such a system myself, I’d make two key changes.
Firstly, have both a soft and a hard threshold.
When the free space on the volume reaches the soft threshold, continue working but start complaining about low space. This gives the system administrators time to react - to purchase or configure additional space. Don’t stop working until free space reaches the hard threshold.
Such a dual-threshold approach avoids the problem I encountered - a system that was working perfectly, and then stopped with no warning.
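As a rough illustration, here’s a minimal sketch of that check in Python. The path and the specific threshold values are assumptions for the sake of the example, not figures from the affected system.

```python
import shutil

SOFT_THRESHOLD_BYTES = 20 * 1024**3   # warn below 20 GiB free (assumed value)
HARD_THRESHOLD_BYTES = 2 * 1024**3    # stop below 2 GiB free (assumed value)

def check_free_space(path="/var/data"):
    """Return 'ok', 'warn' or 'stop' for the volume holding `path`."""
    free = shutil.disk_usage(path).free
    if free <= HARD_THRESHOLD_BYTES:
        return "stop"   # refuse new writes until space is freed or added
    if free <= SOFT_THRESHOLD_BYTES:
        return "warn"   # keep working, but alert the administrators
    return "ok"
```

The point is the two distinct outcomes: the system keeps running (and nagging) in the "warn" band, and only refuses to work at the much lower hard limit.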
Secondly, define the thresholds in time, not in percentages or bytes.
The amount of free space required depends both on the size of the drive and on the workload of the system. Instead of making the system administrators guess, the system should keep track of its storage use over time, giving it an estimate of its rate of growth.
Thresholds can then be set in terms of time: warn the system administrators when free space drops low enough that you expect the volume to fill within eight weeks.
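Here’s a sketch of how that estimate might be computed, assuming the system has been logging (timestamp, bytes used) samples as it runs; the eight-week window is the figure from above, and the sample data is hypothetical.

```python
from datetime import datetime, timedelta

def weeks_until_full(samples, free_bytes):
    """Estimate weeks until the volume fills, from (timestamp, used_bytes) samples.

    Returns None when usage is flat or shrinking, so no warning is needed.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    growth_per_second = (used1 - used0) / (t1 - t0).total_seconds()
    if growth_per_second <= 0:
        return None
    return free_bytes / growth_per_second / timedelta(weeks=1).total_seconds()

def should_warn(samples, free_bytes, warn_weeks=8):
    """Warn when the volume is expected to fill within `warn_weeks`."""
    weeks = weeks_until_full(samples, free_bytes)
    return weeks is not None and weeks <= warn_weeks

# Example: roughly 1 GiB/day of growth with 38 GiB free gives about
# five and a half weeks of headroom, so the warning fires.
samples = [
    (datetime(2020, 1, 1), 500 * 1024**3),
    (datetime(2020, 1, 31), 530 * 1024**3),
]
print(should_warn(samples, free_bytes=38 * 1024**3))  # True
```

A more careful version would fit a trend over many samples rather than just the first and last, but even this crude estimate tells the administrators something a bare percentage never can: how long they have to act.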