Here’s a puzzle for you all …

Assume you’re working with a complex system that’s highly important to the business, one that is mission critical for some of the staff, who depend on the system to do their jobs every day.

Further, let’s assume the system has just been restarted after a major meltdown and you’re investigating the cause of that failure.

When you think you’ve identified the sequence of actions that caused the system to go down, when you have a plausible theory of the crash that matches known information, what do you do next?

  1. Document your theory and actively solicit feedback from other knowledgeable parties?

  2. Run a controlled test of your theory in a non-critical (non-production!) environment to see what happens?

  3. Try it out on the production system to see if the meltdown recurs?

If you happen to think that option #3 is acceptable, think about what happens if your theory is correct and the production system goes down again. (Bonus points if you can guess which of these was inflicted on a critical system at work today.)

Having a production system go down for reasons outside of your control can be both catastrophic and expensive, with costs in time, direct expenses, and lost revenue.

If that downtime occurs for reasons under your control - if you in fact actively caused the failure - then the consequences should also be “career limiting.”

For most systems, test environments are set up and maintained for exactly this reason - to provide a safe place where failure is less significant, less costly, and less consequential.

No matter how good you think you are - in fact, regardless of how good you actually are - playing fast and loose with a production system is not OK.
