It wasn’t very long after my prior update on caching that a friend informed me that the code still had a race condition. Yes, the same friend who persuaded me to update it last time. Worse, during a Skype call, we identified code that would outright fail for an independent reason.
Here’s the code, for your consideration. Do you see the problem? I didn’t see it until it was pointed out to me.
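The original listing isn’t reproduced here, but the shape of the problem looks something like this sketch. This is a reconstruction, not the original code: apart from `RenderSpeechAsync()` and `_pendingRenders`, the names are my assumptions, and I’ve used `byte[]` as the cached value for brevity even though the original returned a `Stream` (a choice the post revisits below).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical reconstruction of the race-prone cache -- not the original
// listing. Names other than RenderSpeechAsync and _pendingRenders are assumed.
public class SpeechCache
{
    private readonly object _padlock = new object();
    private readonly Dictionary<string, Task<byte[]>> _pendingRenders
        = new Dictionary<string, Task<byte[]>>();
    private readonly Func<string, Task<byte[]>> _render;

    public SpeechCache(Func<string, Task<byte[]>> render) => _render = render;

    public async Task<byte[]> RenderSpeechAsync(string text)
    {
        Task<byte[]> pending;
        lock (_padlock)
        {
            // The critical section only *reads*; it changes no state.
            _pendingRenders.TryGetValue(text, out pending);
        }

        if (pending != null)
        {
            return await pending;
        }

        // RACE: two callers can both reach this point for the same text,
        // because nothing was recorded inside the lock above.
        var task = _render(text);
        lock (_padlock)
        {
            _pendingRenders[text] = task;
        }
        return await task;
    }
}
```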
What’s the goal of this critical section?
At one level, the goal is to ensure the dictionaries (among them
_pendingRenders) are not concurrently accessed; reading from a dictionary while it is being updated is a recipe for some very odd side effects.
However, if this were the only reason to protect the code, I would switch to using a concurrent dictionary, remove the locks completely from this method, and move on. There is another reason why this code needs a lock.
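To make the point concrete, here is a sketch of what that lock-free version might look like (field and helper names are my assumptions). It illustrates why a concurrent dictionary alone isn’t enough: `GetOrAdd` makes the dictionary safe to touch from multiple threads, but it may invoke its value factory more than once under contention, so two renders of the same text can still start.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch (assumed names): a ConcurrentDictionary makes the dictionary itself
// safe without explicit locks -- but that is all it guarantees.
public class ConcurrentSpeechCache
{
    private readonly ConcurrentDictionary<string, Task<byte[]>> _pendingRenders
        = new ConcurrentDictionary<string, Task<byte[]>>();
    private readonly Func<string, Task<byte[]>> _render;

    public ConcurrentSpeechCache(Func<string, Task<byte[]>> render)
        => _render = render;

    public Task<byte[]> RenderSpeechAsync(string text)
    {
        // GetOrAdd may call the factory more than once under contention;
        // only one Task wins a slot in the dictionary, but the losing
        // factory calls have already kicked off extra renders. Thread-safe
        // access alone does not give us only-once rendering.
        return _pendingRenders.GetOrAdd(text, t => _render(t));
    }
}
```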
We expect one of three outcomes when the
RenderSpeechAsync() method of our cache runs:
- The text has already been rendered into speech, so we can return it at once from our cache;
- Rendering into speech has already started, so we await completion of that request; or
- We need to render the text into speech.
It might seem that our code correctly fulfils the above requirements. Indeed, when I reviewed the code before writing the previous post, I thought it did.
But think about why we introduced this cache in the first place:
The purpose of the cache is to ensure that we only make a single request to render any given piece of text into speech.
And here is the problem: Nothing in that critical section ensures that we will call our wrapped speech renderer only once for each piece of text. Nothing in that critical section changes any state, so we can run through it many times, with each execution giving the same results.
To fix our race condition, we need to modify state within the lock so that only one of our callers will do the rendering. Our first caller should trigger a call to our wrapped speech renderer. Any later calls for the same text should wait on the first.
The fix is to add the required
TaskCompletionSource into the cache before the first lock is released, as follows:
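The corrected pattern, as a sketch under assumed names (the fields, the `_render` delegate, and the helper method are mine, not the original listing): the `TaskCompletionSource` goes into `_pendingRenders` inside the lock, so only the first caller for a given text triggers rendering, and every later caller waits on the same task.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch of the fix (names are assumptions, not the original listing).
public class FixedSpeechCache
{
    private readonly object _padlock = new object();
    private readonly Dictionary<string, byte[]> _renderedSpeech
        = new Dictionary<string, byte[]>();
    private readonly Dictionary<string, Task<byte[]>> _pendingRenders
        = new Dictionary<string, Task<byte[]>>();
    private readonly Func<string, Task<byte[]>> _render;

    public FixedSpeechCache(Func<string, Task<byte[]>> render) => _render = render;

    public Task<byte[]> RenderSpeechAsync(string text)
    {
        TaskCompletionSource<byte[]> tcs = null;
        Task<byte[]> pending;
        lock (_padlock)
        {
            if (_renderedSpeech.TryGetValue(text, out var done))
                return Task.FromResult(done);        // already rendered

            if (!_pendingRenders.TryGetValue(text, out pending))
            {
                // State change *inside* the lock: later callers will find
                // this entry and wait on it instead of rendering again.
                tcs = new TaskCompletionSource<byte[]>();
                pending = tcs.Task;
                _pendingRenders[text] = pending;
            }
        }

        if (tcs != null)
            _ = CompleteRenderAsync(text, tcs);      // first caller only

        return pending;                              // everyone shares one task
    }

    private async Task CompleteRenderAsync(string text, TaskCompletionSource<byte[]> tcs)
    {
        try
        {
            var result = await _render(text);
            lock (_padlock)
            {
                _renderedSpeech[text] = result;
                _pendingRenders.Remove(text);
            }
            tcs.SetResult(result);
        }
        catch (Exception ex)
        {
            tcs.SetException(ex);
        }
    }
}
```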
There’s a design issue with
IRenderSpeechService as well. Returning a
Stream from the cache is a bad idea. Streams are stateful objects, with a current read position. If two parallel code paths try to read the same stream at the same time, things will go badly. Fixing this is much easier - we change to using
byte[] for storage instead of
Stream. Each caller can process the bytes independently. We run a small risk by not copying the array before returning it – this allows consumers to mutate the array, but the cost of copying the data would be high.
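The consumer side of that change looks something like this sketch (the byte values are placeholder data): each caller wraps the shared array in its own MemoryStream, so every reader gets a private read position.

```csharp
using System;
using System.IO;

// Sketch: bytes as returned by the cache (placeholder data for illustration).
byte[] speech = new byte[] { 82, 73, 70, 70 };

// Each consumer wraps the shared array in its own MemoryStream, so read
// positions never interfere. writable: false prevents mutation *through the
// stream*, though the underlying array is still shared; copying it first
// would be the safe-but-costly alternative.
int first;
using (var stream = new MemoryStream(speech, writable: false))
{
    first = stream.ReadByte();   // this position is private to this consumer
}
```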
What’s the lesson here?
If you have a critical section to ensure that something only happens once, you need to make sure that the required state change occurs within the section.
Sounds simple, doesn’t it? Yet it’s a lesson that I’ve just had to relearn.