If you’ve been playing around with the maintenance screen and the speech integration that we completed last week, you may have noticed a distinct lag between pressing the play button and hearing the speech.
The lag is caused by the necessary round trip to Azure Cognitive Services (henceforth ACS) to convert the text to speech. In my testing (against a service instance located relatively close to me, in Australia), rendering could take as long as 3.7 seconds.
This isn’t fast enough for interactive use.
It’s worth pointing out that I’m not being critical of ACS here. On top of the actual time taken by ACS to create the speech fragment, we’re also dealing with the round trip time between my laptop and the service itself. Given that New Zealand internet use seems to set a new record every few days, due to everyone working and playing from home, the performance I’m seeing is pretty good.
The cliché in software development is that almost every problem can be solved by introducing another layer of indirection - unless your problem is too many layers of indirection.
Let’s introduce some caching - not only will this give us faster access to any particular phrase the second time we need it, we’ll also reduce our calls to ACS by not rendering the same phrase multiple times.
Our first step is to move the call to ACS into a private method that simply returns a stream of binary data containing the required speech:
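A sketch of what that private method might look like, using the Microsoft.CognitiveServices.Speech SDK. The class, field, and method names here are illustrative assumptions, not the project’s actual code, and the subscription key and region are placeholders:

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

public partial class SpeechService
{
    // Placeholder credentials - substitute your own key and region
    private readonly SpeechConfig _speechConfig =
        SpeechConfig.FromSubscription("<subscription-key>", "<region>");

    private async Task<MemoryStream> RenderSpeechAsync(string content)
    {
        // Send the output to a stream instead of the default speakers
        using var audioStream = AudioOutputStream.CreatePullStream();
        using var audioConfig = AudioConfig.FromStreamOutput(audioStream);
        using var synthesizer = new SpeechSynthesizer(_speechConfig, audioConfig);

        var result = await synthesizer.SpeakTextAsync(content);
        if (result.Reason != ResultReason.SynthesizingAudioCompleted)
        {
            return null; // rendering failed
        }

        // The rendered audio is available directly on the result;
        // capture it into a memory stream for caching
        return new MemoryStream(result.AudioData);
    }
}
```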
This differs from the approach taken previously in a few significant ways.
We use a different AudioConfig to ensure we don’t send the output to the speakers but instead get back the audio data for later reuse. Oddly, we don’t need to actually use audioStream, as the data we want is returned directly to us in result; we capture the audio data for the speech and write it into a memory stream for caching.
Around this new method, we add a simple asynchronous cache:
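A minimal sketch of such a cache, assuming a ConcurrentDictionary keyed by the phrase itself (the field and method names are illustrative; RenderSpeechAsync is the private rendering method described above):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public partial class SpeechService
{
    // Rendered speech, keyed by the phrase that produced it.
    // ConcurrentDictionary keeps lookups safe under concurrent calls.
    private readonly ConcurrentDictionary<string, MemoryStream> _speechCache
        = new ConcurrentDictionary<string, MemoryStream>();

    private async Task<MemoryStream> RenderSpeechCachedAsync(string content)
    {
        // Cache hit: return the previously rendered audio immediately
        if (_speechCache.TryGetValue(content, out var existing))
        {
            return existing;
        }

        // Cache miss: render the speech via ACS
        var rendered = await RenderSpeechAsync(content);
        if (rendered != null)
        {
            // Only cache successful renders
            _speechCache.TryAdd(content, rendered);
        }

        return rendered;
    }
}
```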
This is pretty straightforward caching code - if the phrase is already in the cache, we return it immediately. If not, we render the speech and, if rendering succeeded, add the result to the cache.
Now we can rewrite the core SayAsync() method to use the cache:
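A sketch of the rewritten method, assuming the cached-rendering helper above and playback via System.Media.SoundPlayer (which handles the RIFF/WAV data ACS returns by default):

```csharp
using System.IO;
using System.Media;
using System.Threading.Tasks;

public partial class SpeechService
{
    public async Task SayAsync(string content)
    {
        var speech = await RenderSpeechCachedAsync(content);
        if (speech == null)
        {
            return; // rendering failed; nothing to play
        }

        // Rewind before playback, since the same cached stream
        // is reused for every repeat of this phrase
        speech.Seek(0, SeekOrigin.Begin);
        using var player = new SoundPlayer(speech);
        player.PlaySync();
    }
}
```

Note that the cached stream is deliberately not disposed here; it stays alive in the cache for the next playback.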
To make this work, we needed to upgrade the project to .NET Core 3.1, as the SoundPlayer class (long part of the .NET Framework) only became available to .NET Core projects in the 3.x releases, allowing us to play audio.