When developing systems that must wait for key things to happen (often, things that are outside their direct control), we need to consider how long we’re willing to wait for an answer.
It’s tempting - and easy - to hard code these wait times into the system, deciding that we’ll always wait a fixed length of time for it to happen. I’ve done this myself.
However, doing this introduces subtle instabilities and problems. We can - and should - try to do better.
Here are some examples.
Example: waiting for a service to be available
Context: Writing a scripted test that needs to start up a service, wait for the service to finish initializing, and then run a series of tests to ensure the service is behaving properly.
The easy approach would be to hard code a delay - say, 30 seconds - after starting the service, to give it time to start up.
| Pro | Very easy to code. |
| --- | --- |
| Pro | Usually works. |
| Con | Every test has to wait for 30 seconds. |
| Con | Sometimes doesn't work. |
Result: Unstable tests that sometimes fail because the service hasn’t yet finished initializing.
We could fix this by changing the delay to a longer value, say 60 seconds.
| Pro | Very simple change. |
| --- | --- |
| Pro | Works almost all of the time. |
| Con | Every test now has to wait for 60 seconds. |
| Con | Still fails sometimes. |
While this improves reliability, we don’t reach 100%; our tests are still flaky. Worse, we’ve slowed down the entire test suite to do it.
A better fix is to find a way to monitor the service to see whether it is ready for use. One approach that works for many services (if you’re running them on the current machine) is to check whether their associated network port is accepting connections.
| Pro | Only waits as long as required. |
| --- | --- |
| Pro | Most tests can start in just a few seconds. |
| Pro | Can wait much longer for the P99 case. |
| Pro | Better diagnostics when things go wrong. |
| Pro | No longer have unstable tests. |
| Con | Harder code to write. |
Note that the harder code only has to be written once, but delivers benefits over and over again. It’s worth the effort.
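As an illustration, here’s a minimal Python sketch of that kind of readiness check. The host, port, timeout, and polling interval are placeholder values you’d tune for your own service; the technique is simply to keep trying to open a connection until it succeeds or a generous deadline expires.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 0.5) -> None:
    """Poll until a TCP port accepts connections, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # the service is accepting connections; tests can start now
        except OSError as error:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} not accepting connections after {timeout}s"
                ) from error
            time.sleep(interval)


# Usage in a test fixture: start the service, then wait only as long as needed.
# wait_for_port("localhost", 8080)
```

In the common case the test starts within a second or two; in the slow case it can wait far longer than any hard coded delay you’d be willing to impose on every run.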
Waiting for something to be created
Context: Your component needs to wait for something to be created by someone else. Perhaps you’re waiting on a dynamic database table to be created; perhaps a new queue in the cloud; perhaps a message channel in your service bus.
The easy approach is to work out how long it typically takes for that item to be created by the other component of your system, and to go to sleep for that long.
| Pro | Really easy to set up - just guess a value and test in production to see if it's good enough. |
| --- | --- |
| Con | Relies on the other system taking a predictable length of time. |
| Con | Not reliable if things go slow (say, due to Storage throttling). |
Instead, try testing for the existence of the queue, waiting only until it exists.
| Pro | Don't need to predict the runtime of the other component. |
| --- | --- |
| Pro | Can wait much longer if required (say, if the other component is running slow). |
| Pro | Faster startup time in the normal case. |
| Con | More complex code to write and maintain. |
Once again, the harder code only has to be written once, but delivers benefits over and over again. There’s a pattern here.
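Here’s a small Python sketch of the pattern. The existence check itself (`queue_exists` below) is a hypothetical callable standing in for whatever your platform offers - a management API call, a metadata query, and so on; the polling loop around it is the part that stays the same.

```python
import time
from typing import Callable


def wait_until_exists(
    check: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 2.0,
) -> None:
    """Poll `check()` until it returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Resource still missing after {timeout}s")
        time.sleep(interval)


# Usage: `queue_exists` is a hypothetical function that asks your queueing
# service whether the named queue has been created yet.
# wait_until_exists(lambda: queue_exists("orders"), timeout=600)
```

In the normal case the wait ends as soon as the resource appears; in the slow case the loop simply keeps polling instead of giving up after a guessed delay.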
Ensuring a process completes
Context: Creating a watchdog to ensure that a separate process doesn’t run for too long. Perhaps you want to detect a runaway process for telemetry purposes; perhaps you need to protect your system against a rogue actor.
You could set an absolute time limit for the process - say 15 minutes.
| Pro | Easy to code. |
| --- | --- |
| Con | Takes up to 15 minutes to detect a problem. |
As an alternative, actively monitor the process to see what it’s doing - you can then tune your tolerances to only admit behaviour that’s acceptable.
Here are some possible rules:
- Must output to stdout at least once per minute.
- Must not accumulate more than 10 minutes of total CPU time.
- Must not run for more than 30 minutes wall clock time.
| Pro | Much faster detection of stalled processes without constraining processes that are making progress. |
| --- | --- |
| Pro | Tighter constraints limit any negative impact on the rest of the system. |
| Pro | Better diagnostics when things are killed by the watchdog (you can report on what rule was violated). |
| Con | Requires more of the process being monitored. |
| Con | Might constrain legitimate use. |
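As a sketch of how those rules might be checked, here’s a Python example using the third-party `psutil` library. The tolerances match the rules above but are otherwise assumptions, as is the idea that the watchdog tracks the time of the child’s most recent stdout output while reading its pipe.

```python
import time

import psutil  # third-party: pip install psutil

# Tolerances taken from the rules above; tune them for your own workload.
MAX_SILENCE_SECONDS = 60          # must write to stdout at least once a minute
MAX_CPU_SECONDS = 10 * 60         # no more than 10 minutes of total CPU time
MAX_WALL_CLOCK_SECONDS = 30 * 60  # no more than 30 minutes of wall clock time


def check_process(pid: int, last_stdout_time: float) -> list[str]:
    """Return the rules the process has violated; an empty list means it's healthy.

    `last_stdout_time` is the epoch time of the most recent stdout output,
    which this watchdog is assumed to record as it reads the child's pipe.
    """
    proc = psutil.Process(pid)
    now = time.time()
    violations = []

    if now - last_stdout_time > MAX_SILENCE_SECONDS:
        violations.append("no stdout output in the last minute")

    cpu = proc.cpu_times()
    if cpu.user + cpu.system > MAX_CPU_SECONDS:
        violations.append("more than 10 minutes of CPU time used")

    if now - proc.create_time() > MAX_WALL_CLOCK_SECONDS:
        violations.append("running for more than 30 minutes of wall clock time")

    return violations  # include these in diagnostics before terminating the process
```

Because each violation names the rule that was broken, the watchdog can report exactly why a process was killed instead of just saying it ran out of time.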
Conclusions
Hardcoded time frames are often a mistake because they prevent the system from adapting to the environment in which it is running. Dynamic constraints give your applications an opportunity to self-tune, to report better diagnostics, and to achieve greater reliability.