One of the most useful precepts that I use to guide my development is this: when something goes wrong, make sure you fix it twice. This is especially important when the problem affects a production environment, but it’s also relevant for staging, testing, and development environments.

I can’t remember when I first learnt this lesson - but I do vividly remember applying it to good effect while working at my first full-time job, way back in the mid ’90s, so it was clearly some wisdom I picked up a very long time ago.

In conversations, including some at the recent Code Camp Wellington, I’ve run into a bunch of people who have never heard of this approach. In fact, some are surprised by the idea, but most are convinced once we talk through the motivation.

The first fix is to mitigate the immediate problem, restoring things to a normal working state as quickly as practical. The second fix is to prevent the problem from ever happening again. Sometimes there are multiple possible fixes, and sometimes this takes considerable effort.

An example

Here’s an example, based on my own experience from a few years ago, to illustrate how fix it twice can be applied. Note that I’ve structured this using the well-known 5 whys technique made famous by Toyota.

Problem: Our newly upgraded production application server is returning an error when someone tries to compile a time series.

Why? The production database doesn’t have the structure expected by the application server.

Why? Some of the SQL scripts required to upgrade the database were missed during the upgrade process for this release.

Why? There were more than a dozen scripts required for this release and the DBA inadvertently missed some of them.

Why? The scripts were all listed in the release notes, but two of the database scripts had similar names; the DBA mistook one for the other and skipped the scripts in between.

Why? Database scripts were named for the change they made to the database and not for the sequence in which they should be applied.

Now that we understand the problem, we can come up with the fixes required to make things better.

Mitigation

We manually apply the missing database changes to the production database to bring the structure up to date and then test that all the production functionality is working properly.

Immediate Prevention

To prevent the problem from happening again, we made a number of changes to our process. Starting with the very next release, we adopted a database script naming convention and modified our release notes.

  • We used a three-digit sequence number as a prefix for our database script filenames to control the order in which they should be applied to the database. For example, instead of add-index-to-output-time-series.sql the script now had to be called 123-add-index-to-output-time-series.sql.

  • At the suggestion of the DBA, the list of database scripts in the release documentation was changed to use checkboxes instead of bullet points, and the instructions amended to include checking off each script as it was run.
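To sketch why the numeric prefix matters: with a fixed-width prefix, plain lexicographic order (the order a shell glob expands in) matches the intended application order, so even the simplest loop applies scripts correctly. The loop and the first two filenames below are illustrative, not our actual tooling.

```shell
#!/bin/sh
# Illustrative only: a three-digit prefix makes lexicographic order
# match the intended application order.
mkdir -p /tmp/release-scripts && cd /tmp/release-scripts
touch 121-add-output-table.sql \
      122-add-input-table.sql \
      123-add-index-to-output-time-series.sql

# A glob expands in sorted order, so this visits scripts in sequence.
for script in *.sql; do
    echo "applying $script"   # real tooling would execute the SQL here
done
```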

In the medium term (within six months), we changed the packaging of SQL scripts entirely.

Instead of supplying the DBA with a directory full of scripts and asking for them to be run one at a time, we bundled all the scripts using FluentMigrator - creating an executable that would upgrade the database by applying exactly the required scripts. We further reduced friction by wrapping this exe with a PowerShell script that supplied all the correct command line parameters.

No longer was the DBA required to work through a manual (and therefore error-prone) process. Upgrading the database for a new release became a single step: Run this PowerShell script.
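The core idea that makes a single-step upgrade safe is tracking which scripts have already been applied and running only the rest. Here’s an illustrative shell sketch of that idea (not the actual FluentMigrator implementation; the directory and script names are hypothetical):

```shell
#!/bin/sh
# Illustrative sketch of migration tracking: record each applied script
# in a log and skip it on subsequent runs, so re-running is harmless.
DB_DIR=/tmp/demo-db
mkdir -p "$DB_DIR/scripts"
APPLIED_LOG="$DB_DIR/applied.log"
touch "$APPLIED_LOG"

# Hypothetical scripts for this release.
touch "$DB_DIR/scripts/001-create-tables.sql" \
      "$DB_DIR/scripts/002-add-indexes.sql"

for script in "$DB_DIR"/scripts/*.sql; do
    name=$(basename "$script")
    if grep -qx "$name" "$APPLIED_LOG"; then
        echo "skipping $name (already applied)"
    else
        echo "applying $name"      # real tooling would execute the SQL here
        echo "$name" >> "$APPLIED_LOG"
    fi
done
```

Running the sketch a second time applies nothing new - which is exactly the property that lets the same one-step upgrade be run against every environment.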

We also used the same migrations when upgrading the various development, testing, and staging environments. By the time we were deploying into the production environment, the upgrade had been thoroughly tested.

Conclusions

Was this a lot of work? Yes, it was.

But we never again had a production outage the morning after an upgrade caused by a missing database update script.

Even better, this work opened the way forward to complete automation of deployments using Octopus Deploy, but that’s a tale for another time.
