The premise that systems should ‘fail fast’ is pretty well established - the idea has its own Wikipedia page, and any number of books treat it as a fundamental principle. For example, Release It! from the Pragmatic Bookshelf makes several references to Fail Fast in Chapter 5: Stability Patterns.
I first encountered the idea of Fail Fast (though it wasn’t called that then) way back in my university days while completing my Computer Science degree.
It is frustrating, therefore, to come across software that seems to completely ignore the premise.
At the moment, I’m spending way too much time working with Product X.
Product X is a tool that is supposed to help with a certain data transportation issue, connecting two existing systems together so that the value of each is increased.
Unfortunately, Product X doesn’t ‘Fail Fast’ when something goes wrong. Rather, it seems to follow a ‘dogged determination’ philosophy - whenever something goes wrong, do your best to keep running regardless of the consequences.
For example, today I found that one of the database views lacked the correct permissions, and no data was being transferred. No error had been logged by the system - I noticed the problem by chance. Worse, when I brought up the configuration tool to check details, the system made no comment at all about the inaccessible database view. Worst of all, the system appeared to have lost all of the mapping configuration.
Fortunately, once visibility of the database view was restored, I was able to recover the mapping details without working from scratch.
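To make the contrast concrete, here is a rough sketch of the kind of check I would have liked to see: probe every view the transfer depends on before moving any data, and stop at the first one that can't be read. I have no idea how Product X is built internally, so everything here is an assumption for illustration: SQLite stands in for the real database, and the names verify_views and StartupCheckError are hypothetical.

```python
import sqlite3


class StartupCheckError(RuntimeError):
    """Raised when a required database object cannot be read."""


def verify_views(conn, required_views):
    """Fail fast: probe each required view before any data is moved.

    `conn` is a sqlite3 connection used as a stand-in for the real database;
    `required_views` is a list of trusted view names from configuration.
    """
    for view in required_views:
        try:
            # A trivial read is enough to surface a missing or unreadable view.
            conn.execute(f'SELECT 1 FROM "{view}" LIMIT 1')
        except sqlite3.Error as exc:
            # Stop immediately and name the object that caused the problem.
            raise StartupCheckError(
                f"cannot read required view {view!r}: {exc}"
            ) from exc
```

Calling something like this at startup, and again when the configuration tool opens, would turn a silent non-transfer into an immediate error that names the view at fault.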
Based on this experience, I’d like to suggest that systems should not just Fail Fast, they should Fail Loudly.
A critical error in a production system should never be kept quiet, waiting for chance discovery.
There must be a thousand ways to get attention, from the subtle to the ridiculous. Use them. Don’t wait for someone to wander by with their attention on another task - create a log file, write details into the event log, broadcast an event, bring up a stomping great error dialog, post a blog entry, start a siren, set off the sprinklers, do anything that’s necessary to get some attention and get the problem solved.
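As a minimal sketch of what Fail Loudly might look like in code, the example below reports a critical error through several channels at once and then stops, rather than carrying on regardless. It uses Python's standard logging module; the function names build_loud_logger and fail_loudly, the logger name, and the idea of passing in "notifier" callables (a pager, an email sender, a siren) are my own assumptions, not anything taken from a real product.

```python
import logging
import sys


def build_loud_logger(log_path):
    """Create a logger that writes to both stderr and a log file."""
    logger = logging.getLogger("loud")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    logger.addHandler(logging.StreamHandler(sys.stderr))
    logger.addHandler(logging.FileHandler(log_path))
    return logger


def fail_loudly(logger, message, notifiers=()):
    """Report a critical error on every channel, then stop the process."""
    logger.critical(message)
    for notify in notifiers:
        try:
            notify(message)
        except Exception:
            # One broken channel must not silence the remaining ones.
            logger.exception("notifier %r failed", notify)
    # Exit with a non-zero status so supervisors and scripts notice too.
    raise SystemExit(1)
```

The design choice that matters most here is that a failure in one channel is itself logged and does not prevent the others from firing. A dead pager shouldn't be the reason nobody hears about a dead database view.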