When developing systems that must wait for key things to happen (often, things that are outside their direct control), we need to consider how long we’re willing to wait for an answer.

It’s tempting - and easy - to hard code these wait times into the system, deciding that we’ll always wait that long for it to happen. I’ve done this myself.

However, doing this introduces subtle instabilities and problems. We can - and should - try to do better.

Here are some examples.

Waiting for a service to be available

Context: Writing a scripted test that needs to start up a service, wait for the service to finish initializing, and then run a series of tests to ensure the service is behaving properly.

The easy approach would be to hard code a delay - say, 30 seconds - after starting the service, to give it time to start up.

Pro Very easy to code.
Pro Usually works.
Con Every test has to wait for 30 seconds.
Con Sometimes doesn't work.

Result: Unstable tests that sometimes fail because the service hasn’t yet finished initializing.

We could fix this by changing the delay to a longer value, say 60 seconds.

Pro Very simple change.
Pro Works almost all of the time.
Con Every test now has to wait for 60 seconds.
Con Does still fail sometimes.

While this improves reliability, we don’t reach 100%; our tests are still flaky. Worse, we’ve slowed down our entire test suite to do it.

A better fix is to find a way to monitor the service to see if it is ready for use. One way that works for many services (if you’re running them on the current machine) is to check whether their associated network port is accepting connections.

Pro Only waits as long as required.
Pro Most tests can start in just a few seconds.
Pro Can wait much longer for the P99 case.
Pro Better diagnostics when things go wrong.
Pro No longer have unstable tests.
Con Harder code to write.

Note that the code is harder to write once but gives benefits over and over again. It’s worth the effort.
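
As a rough illustration of the port check, here is a minimal Python sketch. The function name `wait_for_port`, the 120 second ceiling, and the half second retry interval are all my own assumptions for the example, not values from any particular test suite. It retries a TCP connection until one succeeds, and reports how long it waited.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0,
                  interval: float = 0.5) -> float:
    """Block until host:port accepts a TCP connection; return seconds waited.

    Raises TimeoutError if the port is still not accepting connections
    once the deadline has passed.
    """
    start = time.monotonic()
    deadline = start + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError as err:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} not accepting connections "
                    f"after {timeout:g}s"
                ) from err
            time.sleep(interval)
```

A test would call something like `wait_for_port("localhost", 8080)` straight after launching the service. Keep in mind that an open port only shows the service is listening; if yours opens its port before it has finished initializing, a health endpoint or similar check would be a better signal.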

Waiting for something to be created

Context: Your component needs to wait for something to be created by someone else. Perhaps you’re waiting on a dynamic database table to be created; perhaps a new queue in the cloud; perhaps a message channel in your service bus.

The easy approach is to work out how long it typically takes for that item to be created by the other component of your system, and to go to sleep for that long.

Pro Really easy to set up - just guess a value and test in production to see if it's good enough.
Con Relies on the other system taking a predictable length of time.
Con Not reliable if things go slow (say, due to storage throttling).

Instead, try testing for the existence of the item itself (the queue, say), waiting only until it exists.

Pro Don't need to predict the runtime of the other component.
Pro Can wait much longer if required (say if the other component is running slow).
Pro Faster startup time in the normal case.
Con More complex code to write and maintain.

Once again, note that the code is harder to write once but gives benefits over and over again. There’s a pattern here.
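
One way to sketch this in Python is a small polling helper. Everything here is an assumption made for illustration: the name `wait_until`, the default limits, and the idea that you can supply an `exists` callable that asks the real system whether the table, queue, or channel is there yet.

```python
import time
from typing import Callable


def wait_until(exists: Callable[[], bool], timeout: float = 300.0,
               initial_interval: float = 0.2,
               max_interval: float = 5.0) -> float:
    """Poll exists() until it returns True; return seconds waited.

    Raises TimeoutError if the item still does not exist after `timeout`.
    """
    start = time.monotonic()
    interval = initial_interval
    while not exists():
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            raise TimeoutError(f"item still missing after {elapsed:.1f}s")
        # Back off gradually so a slow creator is not hammered with checks,
        # but never sleep past the overall deadline.
        time.sleep(min(interval, timeout - elapsed))
        interval = min(interval * 2, max_interval)
    return time.monotonic() - start
```

You would call it as `wait_until(lambda: queue_exists("orders"))`, where `queue_exists` is a hypothetical wrapper around whatever your queue provider offers. The generous timeout costs nothing in the normal case, because the loop returns as soon as the item appears.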

Ensuring a process completes

Context: Creating a watchdog to ensure that a separate process doesn’t run for too long. Perhaps you want to detect a runaway process for telemetry purposes, perhaps you need to protect your system against a rogue actor.

You could set an absolute time limit for the process - say 15 minutes.

Pro Easy to code.
Con Takes up to 15 minutes to detect a problem.

As an alternative, actively monitor the process to see what it’s doing - you can then tune your tolerances to only admit behaviour that’s acceptable.

Here are some possible rules:

  • Must output to stdout at least once per minute.
  • Must not accumulate more than 10 minutes of total CPU time.
  • Must not run for more than 30 minutes wall clock time.

Pro Much faster detection of stalled processes without constraining processes that are making progress.
Pro Tighter constraints limit any negative impact on the rest of the system.
Pro Better diagnostics when things are killed by the watchdog (you can report on what rule was violated).
Con Requires more of the process being monitored.
Con Might constrain legitimate use.
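
To make the idea concrete, here is a minimal Python sketch of a watchdog that enforces two of those rules: the stdout silence limit and the wall clock limit. The function name `run_with_watchdog` and its parameters are assumptions of mine. I've left out the CPU time rule because, as far as I know, the Python standard library has no portable way to read the CPU time of a child that is still running; a library such as psutil could fill that gap.

```python
import subprocess
import threading
import time


def run_with_watchdog(cmd, max_silence=60.0, max_wall=1800.0, poll=0.1):
    """Run cmd, killing it if it breaks a rule.

    Returns (exit_code, violated_rule), where violated_rule is None
    when the process finished on its own.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    start = time.monotonic()
    last_output = [start]  # mutable cell shared with the reader thread

    def pump():
        # Each line on stdout counts as a sign of progress.
        for _ in proc.stdout:
            last_output[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    violated = None
    while proc.poll() is None:
        now = time.monotonic()
        if now - last_output[0] > max_silence:
            violated = f"no stdout for more than {max_silence:g}s"
        elif now - start > max_wall:
            violated = f"ran for more than {max_wall:g}s wall clock"
        if violated:
            proc.kill()
            break
        time.sleep(poll)

    exit_code = proc.wait()
    reader.join(timeout=1.0)
    return exit_code, violated
```

Because the function returns the rule that was violated, the caller can log exactly why a process was killed, which is where the better diagnostics come from. A process that buffers its stdout may look silent even while it is working, so the process being monitored may need to flush its output regularly.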

Conclusions

Hardcoded time frames are often a mistake because they prevent the system from adapting to the environment in which it is running. Dynamic constraints give your applications an opportunity to self-tune, to report better diagnostics, and to achieve greater reliability.
