In our last discussion we talked about the cost of leaving a minor bug in place, versus the cost of fixing it properly. We discovered that the fix might involve half the time investment of leaving the bug in place.
Unfortunately, it’s so much worse than that.
Instead of assuming the system is an internal one with a limited set of users who are all in the same building, let’s consider the case when your users are spread all around the world.
Perhaps you’re working on some type of software-as-a-service offering. Perhaps it’s online accounting, error reporting or process automation. Your users are spread all around the planet in a variety of time zones. Your help desk runs 24x7 in order to support all of your users.
In this context, what’s the cost of our little recurring bug?
All of the costs we detailed for the in-house system still apply, but there are new factors to consider as well.
The cost of a distributed user base
With your users spread around the planet in every time zone, the chances of the problem occurring while you’re in the office, let alone actually at your desk, fall dramatically.
Depending on where your user base is concentrated, you might get lucky. If you’re based in Texas and most of your users are in the United States, it’s pretty likely that calls will come in when you’re in the office. But if you’re based in New Zealand and most of your users are in Europe, you’ll most likely be in bed, asleep, when calls come in.
If we split the difference and assume that half of your calls come in when you’re not in the office, how are you going to handle that?
You might decide that the issue is urgent and worth being woken for. If so, how long does it take you to wake, turn on the laptop, establish the right remote connection, run your scripts, shut everything down and go back to sleep? Can you do this alone, or do you need a second party to approve your production access?
If it takes you an additional 45 minutes to handle issues that occur in the middle of the night, that’s another 20 hours per year of your time, not counting the disruption to the rest of your family.
Alternatively, you might decide that the issue is not urgent enough to be woken for - so your end users have to wait until you’re available in the morning to address the issue. Depending on the time zone difference between you and your users, this might increase their downtime by 20 hours. Add in the public holidays in each locale, and the delay might be even longer. Somehow I doubt users in South America will particularly care that you took the day off for ANZAC Day - all they will know is that it took you over 36 hours to fix their problem.
If we assume an average impact of around 8 hours for each occurrence, happening 25 times a year, this increases the end user impact by around 200 hours per year.
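To make the arithmetic above easy to check, here is a minimal sketch in Python. It assumes the bug recurs about once a week (rounded to 50 times a year) and that half of those calls arrive out of hours; the function and variable names are purely illustrative. The two results are alternatives, not a sum: either you get woken and spend the extra time yourself, or you sleep through and your users wait.

```python
# Assumptions taken from the discussion above:
# - the bug recurs "around once a week", rounded here to 50 times a year
# - half of the calls arrive when you're not in the office
OCCURRENCES_PER_YEAR = 50
AFTER_HOURS_SHARE = 0.5


def after_hours_costs(extra_minutes_if_woken=45, user_wait_hours=8):
    """Return (after-hours calls, your extra hours, extra user hours) per year."""
    calls = OCCURRENCES_PER_YEAR * AFTER_HOURS_SHARE
    # Option 1: you get woken and spend extra time on each night-time call.
    your_extra_hours = calls * extra_minutes_if_woken / 60
    # Option 2: you sleep through and users wait until you're available.
    user_extra_hours = calls * user_wait_hours
    return calls, your_extra_hours, user_extra_hours


calls, yours, users = after_hours_costs()
print(f"After-hours calls per year: {calls:.0f}")
print(f"If you get woken: about {yours:.1f} extra hours of your time")
print(f"If you wait until morning: about {users:.0f} extra hours of user downtime")
```

With the default figures this gives 25 after-hours calls, a little under 19 hours of your time (the "another 20 hours" above, rounded), or around 200 hours of additional user downtime. Plug in your own numbers to see how quickly either option adds up.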
The cost of commercial users
When people are paying for your service, they’ll have expectations around uptime and responsiveness. These expectations may or may not be realistic or reasonable - but they’ll have them nonetheless.
If the service goes down, even if for just one user, what’s the probability that the failure triggers their change to one of your competitors?
Of course, the answer to this is unknowable. If the customer has had a great experience with your company, they love your service, and this is the first problem they’ve ever had, then you’ll probably be fine. But what if their experience with your company has been poor, they’ve grown to hate you, and this is just another in an ongoing series of problems? What’s the cost of them taking their business elsewhere? Are we talking about a subscription of $9.95 per month for one disgruntled customer, or $49.95 per user per month for thousands of users?
The cost of growth
When system load is higher, more things go wrong.
In a SaaS environment, we’re hoping to grow our subscriber base - and this means it’s very likely our defect will start happening much more often.
Doubling our user base in the course of a year would mean, all things being equal, that our problem would be happening twice a week, increasing the burden on your time by up to 20 hours in the first year.
However, if the underlying issue is a race condition, deadlock, or anything else that occurs more frequently with higher load, our little problem might start happening 10 times as often (or more) with double the use. This could easily occupy you for an additional 100 hours or more …
Don’t mistake my point here - having scale issues because your business is growing so fast is a wonderful problem to have, despite the frustration.
The cost of a bus
If you’re the only person who can quickly address the issue, what’s the effect on your business if you’re hit by a bicycle, motorbike, car, minivan, bus, truck, tram, or train? Without warning, you’re spending several weeks in the hospital - or perhaps never returning to work at all.
Don’t consider that very likely?
What about the likelihood of a close family member suddenly taking seriously ill? Or the possibility of winning a large sum of money? Or, maybe, someone makes you the job offer of a lifetime and suddenly you’re relocating halfway around the planet.
If you are the only person who can fix the issue, you are a risk to the business.
Adding it all up
When your users are spread out around the world, your “around once a week” fix starts to cost your users dozens, if not hundreds, of hours - and the support calls start to happen at all times of the day and night.
The growth of a successful business means that the problem starts happening more often, burning up even more of your time - and if you’re the only person who can fix it, the business is at risk.
The cost of our quick fix now runs to at least 300 hours in the first year alone (more in later years due to growth) - more than seven times the time investment of a proper fix (and we still haven’t accounted for the costs of being away from your desk, being woken up in the middle of the night, or the effect on your family of those late night calls).
So what’s the correct trade off again?