I hope you’ve been considering the puzzle from my last post about how much effort you should put into fixing a simple problem.

For the sake of today’s discussion, let’s assume you’re working on an in-house system for your organisation, one that fulfills an important institutional need. You can actually go and talk to all of your users as they’re located in the same building as you. Even better, there’s a fairly flat support hierarchy - if someone calls the Help Desk and the problem isn’t obvious, the kind folks on the Help Desk will pass the issue directly to you to resolve.

In this context, what is the cost of the recurring bug?

The cost of the fix

It never seems that applying the fix takes very long. You run the scripts, check the result, and give the end user a call to let them know it’s resolved.

But one day you’re asked to document the fix (so that someone else can do it if you’re not around), so you take the time to actually write down the steps required.

  • Contact the Help Desk to enable production access
  • Remote into the production system and open the logs
  • Search the logs for the expected error message and note the details needed for your scripts
  • Open the appropriate tool and connect to the production data store
  • Load the fix scripts and modify them with the details noted earlier
  • Test the scripts by running them with a rollback (or equivalent) to ensure they make exactly the correct change
  • Run the scripts with a commit (or equivalent) to make the change
  • Recycle the application role to ensure the faulty information isn’t lurking in a cache somewhere
  • Call the affected user and let them know everything is fixed

Even though you’re practiced, it turns out this quick fix actually takes you around 15 minutes each time. Subjectively you thought it was just a couple of minutes, so this surprises you.
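The "test with a rollback, then run with a commit" steps are the ones most worth getting right, so here is a minimal sketch of that pattern using Python's built-in sqlite3 module. Everything specific in it is an assumption made for illustration: the `accounts` table, the `locked` column, and the `apply_fix` helper are hypothetical stand-ins, not the real fix scripts or data store.

```python
import sqlite3


def apply_fix(conn, user_id, commit=False):
    """Run the (hypothetical) fix; roll back unless commit=True and exactly one row changed."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE accounts SET locked = 0 WHERE user_id = ? AND locked = 1",
        (user_id,),
    )
    changed = cur.rowcount
    if commit and changed == 1:
        conn.commit()    # keep the change
    else:
        conn.rollback()  # dry run, or an unexpected row count: undo everything
    return changed


def is_locked(conn, user_id):
    row = conn.execute(
        "SELECT locked FROM accounts WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] == 1


# Stand-in for the production data store (in-memory, illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, locked INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 1), (2, 0)])
conn.commit()

dry_run_changed = apply_fix(conn, user_id=1)            # test with a rollback
locked_after_dry_run = is_locked(conn, 1)
real_changed = apply_fix(conn, user_id=1, commit=True)  # run with a commit
locked_after_commit = is_locked(conn, 1)
```

The dry run reports how many rows would change without keeping the change, which is the check you want before committing against production.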

The cost of interruption

You can’t predict when production issues occur, so this issue is always an interruption that takes you away from another task.

Research shows that it takes an average of 23 minutes to get back to where you were after being interrupted.

You know that sometimes you’re able to get back on task really quickly, in just a few minutes. But you also know that when you’re interrupted in the middle of debugging some really complicated code, it can take you a long time to get back to where you were.

The cost to the end user

When this problem happens, it’s pretty serious for the affected user. They’re unable to use the system at all until it’s fixed.

For many of your users, this system is their key workday tool - they spend most of the day working with it and they can’t achieve their goals for the day when it goes down.

When the problem occurs, they have to contact the Help Desk. The folks at the Help Desk have to work out what sort of problem it is. Once they’ve recognised it, they need to contact you - and you run your scripts to fix things up.

It turns out that there’s usually a delay of 15-20 minutes between the time the problem happens and when you’re notified. After that, it’s another 15 minutes for you to fix the issue - and for all that time, the end user is sitting idle.

The cost of being away from your desk

You’re not at your desk all of the time. We all have formal meetings, informal conversations, coffee breaks, biological considerations, and other issues going on that take us away from our desks during the day.

In most organisations, you also have some variation in working hours. The early birds arrive at work early every morning, drinking their freshly squeezed organic juice. By late afternoon they’re skipping jauntily out of the door after a full day of work. The night owls zombie-shuffle in after 9 am, extra large coffees in hand, and work through into the early evening.

When the Help Desk tries to find you, and you’re not at your desk, how much longer does it take before they make contact? How much longer does it take before you get back to your desk to apply the fix?

The cost of repetition

Remember that our hypothetical production issue happens around once a week - that’s around fifty times per year. This magnifies the cost of each occurrence, especially considering that you only need to fix the problem once.

Adding it all up

When you consider the time taken to fix the issue (15 minutes) and the cost of distractions (23 minutes), each occurrence costs you 38 minutes. At once a week, you find you’re spending around 33 hours a year fixing the glitch.

Your Help Desk are spending up to 20 minutes triaging each issue before passing it on to you - that’s up to 17 hours per year.

And your end users find that each time the bug happens it carves nearly 35 minutes out of their day (up to 20 minutes waiting for you to be notified, plus 15 minutes for the fix) - totalling around 30 hours per year.

Now we see that the cost of our quick fix runs around 80 hours per year … that’s twice the time investment of a proper fix (and we haven’t yet accounted for the costs of being away from your desk).
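If you want to check the sums or plug in your own numbers, here is a small sketch of the arithmetic. The per-occurrence minutes come straight from this post; the one assumption is that "around once a week" means 52 occurrences a year, which is what reproduces the rounded totals above. The variable and function names are just for illustration.

```python
# Per-occurrence figures from the post, in minutes.
FIX_MINUTES = 15
REFOCUS_MINUTES = 23
TRIAGE_MINUTES = 20  # upper end of the 15-20 minute delay
USER_IDLE_MINUTES = TRIAGE_MINUTES + FIX_MINUTES

# Assumption: "around once a week" taken as 52 occurrences per year.
OCCURRENCES_PER_YEAR = 52


def annual_hours(minutes_per_occurrence, occurrences=OCCURRENCES_PER_YEAR):
    """Convert a per-occurrence cost in minutes into hours per year."""
    return minutes_per_occurrence * occurrences / 60


developer_hours = annual_hours(FIX_MINUTES + REFOCUS_MINUTES)
help_desk_hours = annual_hours(TRIAGE_MINUTES)
end_user_hours = annual_hours(USER_IDLE_MINUTES)
total_hours = developer_hours + help_desk_hours + end_user_hours

for label, hours in [
    ("You", developer_hours),
    ("Help Desk", help_desk_hours),
    ("End users", end_user_hours),
    ("Total", total_hours),
]:
    print(f"{label:<10} {hours:5.1f} hours per year")
```

Changing `OCCURRENCES_PER_YEAR` or any of the minute values shows how quickly the annual cost moves with even small changes in frequency or delay.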

