When developing systems that must wait for key things to happen (often, things that are outside their direct control), we need to consider how long we’re willing to wait for an answer.
It’s tempting - and easy - to hard code these wait times into the system, deciding that we’ll always wait a fixed length of time for it to happen. I’ve done this myself.
However, doing this introduces subtle instabilities and problems. We can - and should - try to do better.
Here are some examples.
Example: waiting for a service to be available
Context: Writing a scripted test that needs to start up a service, wait for the service to finish initializing, and then run a series of tests to ensure the service is behaving properly.
The easy approach would be to hard code a delay - say, 30 seconds - after starting the service, to give it time to start up.
| Pro | Very easy to code. |
| --- | --- |
| Pro | Usually works. |
| Con | Every test has to wait for 30 seconds. |
| Con | Sometimes doesn't work. |
Result: Unstable tests that sometimes fail because the service hasn’t yet finished initializing.
We could fix this by changing the delay to a longer value, say 60 seconds.
| Pro | Very simple change. |
| --- | --- |
| Pro | Works almost all of the time. |
| Con | Every test now has to wait for 60 seconds. |
| Con | Still fails sometimes. |
While this improves reliability, we don’t reach 100%; our tests are still flaky. Worse, we’ve slowed down the entire test suite to do it.
A better fix is to find a way to monitor the service to see whether it is ready for use. One approach that works for many services (if you’re running them on the current machine) is to check whether their associated network port is accepting connections.
| Pro | Only waits as long as required. |
| --- | --- |
| Pro | Most tests can start in just a few seconds. |
| Pro | Can wait much longer for the P99 case. |
| Pro | Better diagnostics when things go wrong. |
| Pro | No longer have unstable tests. |
| Con | Harder code to write. |
Note that the harder code only has to be written once, but delivers benefits over and over again. It’s worth the effort.
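As an illustration, here’s a minimal Python sketch of that kind of readiness check. The host, port, timeout, and polling interval are placeholder values you’d tune for your own service; the technique is simply to keep trying to open a connection until it succeeds or a generous deadline expires.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 0.5) -> None:
    """Poll until a TCP port accepts connections, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # the service is accepting connections; tests can start now
        except OSError as error:
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{host}:{port} not accepting connections after {timeout}s"
                ) from error
            time.sleep(interval)


# Usage in a test fixture: start the service, then wait only as long as needed.
# wait_for_port("localhost", 8080)
```

In the common case the test starts within a second or two; in the slow case it can wait far longer than any hard coded delay you’d be willing to impose on every run.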
Waiting for something to be created
Context: Your component needs to wait for something to be created by someone else. Perhaps you’re waiting on a dynamic database table to be created; perhaps a new queue in the cloud; perhaps a message channel in your service bus.
The easy approach is to work out how long it typically takes for that item to be created by the other component of your system, and to go to sleep for that long.
| Pro | Really easy to set up - just guess a value and test in production to see if it's good enough. |
| --- | --- |
| Con | Relies on the other system taking a predictable length of time. |
| Con | Not reliable if things go slow (say, due to Storage throttling). |
Instead, try testing for the existence of the queue, waiting only until it exists.
| Pro | Don't need to predict the runtime of the other component. |
| --- | --- |
| Pro | Can wait much longer if required (say, if the other component is running slow). |
| Pro | Faster startup time in the normal case. |
| Con | More complex code to write and maintain. |
Once again, the harder code only has to be written once, but delivers benefits over and over again. There’s a pattern here.
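Here’s a small Python sketch of the pattern. The existence check itself (`queue_exists` below) is a hypothetical callable standing in for whatever your platform offers - a management API call, a metadata query, and so on; the polling loop around it is the part that stays the same.

```python
import time
from typing import Callable


def wait_until_exists(
    check: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 2.0,
) -> None:
    """Poll `check()` until it returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Resource still missing after {timeout}s")
        time.sleep(interval)


# Usage: `queue_exists` is a hypothetical function that asks your queueing
# service whether the named queue has been created yet.
# wait_until_exists(lambda: queue_exists("orders"), timeout=600)
```

In the normal case the wait ends as soon as the resource appears; in the slow case the loop simply keeps polling instead of giving up after a guessed delay.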
Ensuring a process completes
Context: Creating a watchdog to ensure that a separate process doesn’t run for too long. Perhaps you want to detect a runaway process for telemetry purposes; perhaps you need to protect your system against a rogue actor.
You could set an absolute time limit for the process - say 15 minutes.
| Pro | Easy to code. |
| --- | --- |
| Con | Takes up to 15 minutes to detect a problem. |
As an alternative, actively monitor the process to see what it’s doing - you can then tune your tolerances to only admit behaviour that’s acceptable.
Here are some possible rules:
- Must output to stdout at least once per minute.
- Must not accumulate more than 10 minutes of total CPU time.
- Must not run for more than 30 minutes wall clock time.
| Pro | Much faster detection of stalled processes without constraining processes that are making progress. |
| --- | --- |
| Pro | Tighter constraints limit any negative impact on the rest of the system. |
| Pro | Better diagnostics when things are killed by the watchdog (you can report on what rule was violated). |
| Con | Requires more of the process being monitored. |
| Con | Might constrain legitimate use. |
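As a sketch of how those rules might be checked, here’s a Python example using the third-party `psutil` library. The tolerances match the rules above but are otherwise assumptions, as is the idea that the watchdog tracks the time of the child’s most recent stdout output while reading its pipe.

```python
import time

import psutil  # third-party: pip install psutil

# Tolerances taken from the rules above; tune them for your own workload.
MAX_SILENCE_SECONDS = 60          # must write to stdout at least once a minute
MAX_CPU_SECONDS = 10 * 60         # no more than 10 minutes of total CPU time
MAX_WALL_CLOCK_SECONDS = 30 * 60  # no more than 30 minutes of wall clock time


def check_process(pid: int, last_stdout_time: float) -> list[str]:
    """Return the rules the process has violated; an empty list means it's healthy.

    `last_stdout_time` is the epoch time of the most recent stdout output,
    which this watchdog is assumed to record as it reads the child's pipe.
    """
    proc = psutil.Process(pid)
    now = time.time()
    violations = []

    if now - last_stdout_time > MAX_SILENCE_SECONDS:
        violations.append("no stdout output in the last minute")

    cpu = proc.cpu_times()
    if cpu.user + cpu.system > MAX_CPU_SECONDS:
        violations.append("more than 10 minutes of CPU time used")

    if now - proc.create_time() > MAX_WALL_CLOCK_SECONDS:
        violations.append("running for more than 30 minutes of wall clock time")

    return violations  # include these in diagnostics before terminating the process
```

Because each violation names the rule that was broken, the watchdog can report exactly why a process was killed instead of just saying it ran out of time.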
Conclusions
Hardcoded time frames are often a mistake because they prevent the system from adapting to the environment in which it is running. Dynamic constraints give your applications an opportunity to self-tune, to report better diagnostics, and to achieve greater reliability.