Recently, a friend of mine noticed some code in a book that was, shall we say, considerably sub-optimal. It’s worth taking a look to see how easily both its performance and its readability can be improved.

Here’s the original code:

[Benchmark]
public IDictionary<string,int> Original()
{
    var counter = new Dictionary<string, int>();
    foreach(var word in WordListLoader.Words)
    {
        bool counted = false;
        foreach (var key in counter.Keys)
        {
            if (key == word)
            {
                counter[key]++;
                counted = true;
                break;
            }
        }

        if (!counted)
        {
            counter.Add(word, 0);
        }
    }

    return counter;
}

As an aside, I’m not going to call out the author of this code by name, nor mention the actual book it came from. Doing so would be unfair, both because the author would have no way to respond and because I haven’t read the book myself (which means there may be other factors at play that justify the author’s choices). The point of this post is to see what we can learn.

What does this code do?

Given a sequence of words, the code counts how many times each word occurs.

For testing, I took a copy of World War Z, extracted the text, and ran the word count over it using BenchmarkDotNet.

Method            Mean         Error      StdDev
Original          1,806.73 ms  35.070 ms  51.406 ms
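None of the text-loading or benchmark scaffolding appears in the snippets in this post, so here’s a rough sketch of what it might look like. The file name, the word-splitting rules, and the class names here are illustrative placeholders rather than the exact code behind the numbers above:

using System;
using System.Collections.Generic;
using System.IO;
using BenchmarkDotNet.Running;

// Sketch only: load the novel's text once, lower-case it, and split it into words.
public static class WordListLoader
{
    public static readonly IReadOnlyList<string> Words = Load("world-war-z.txt");

    private static IReadOnlyList<string> Load(string path)
        => File.ReadAllText(path)
            .ToLowerInvariant()
            .Split(
                new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' },
                StringSplitOptions.RemoveEmptyEntries);
}

public static class Program
{
    // Lets BenchmarkDotNet discover and run the [Benchmark] classes in this assembly,
    // including the methods shown in this post.
    public static void Main(string[] args)
        => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}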

For what it’s worth, here are the 16 most common words in the book:

Word    Count
the     6758
to      3292
i       2588
of      2548
a       2540
and     2475
was     2131
that    1727
it      1576
in      1408
we      1196
they    1005
you     998
were    959
had     940
for     904
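That table is easy to produce from any of the counters shown below with a little LINQ. A minimal sketch - the TopWords helper is mine, not part of the benchmarked code:

// Requires System.Collections.Generic and System.Linq.
// Given one of the word-count dictionaries built below, return the most
// frequent words in descending order of count.
public static IEnumerable<KeyValuePair<string, int>> TopWords(
    IDictionary<string, int> counter, int count = 16)
    => counter
        .OrderByDescending(pair => pair.Value)
        .Take(count);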

Taking around 1,800 ms to count every word in the novel might be fast enough for some purposes. But the original code is not very efficient - we can do better.

The inner loop through Keys is entirely unnecessary - Dictionary<TKey, TValue> already provides ContainsKey(), with a hash-based lookup that’s far more efficient than the simple linear search used above. (In fact, the entire implementation is substantially more sophisticated than a simple list of tuples.)

Using ContainsKey() we can rewrite the code to this:

[Benchmark]
public IDictionary<string, int> UsingContainsKey()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        if (counter.ContainsKey(word))
        {
            counter[word]++;
        }
        else
        {
            counter.Add(word, 1);
        }
    }

    return counter;
}

Observe how much shorter this code is than the original.

Running the benchmark again, we can see that it’s tremendously faster:

Method            Mean         Error      StdDev
Original          1,806.73 ms  35.070 ms  51.406 ms
UsingContainsKey  15.10 ms     0.250 ms   0.234 ms

In fact, it’s very nearly 120 times faster.

We can do better still on the performance front by leveraging the way TryGetValue() sets its out parameter to the default value (zero, for an int) when the key isn’t found. This lets us rewrite the code without any explicit branching:

[Benchmark]
public IDictionary<string, int> UsingTryGetValue()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.TryGetValue(word, out var count);
        counter[word] = count + 1;
    }

    return counter;
}

This is marginally faster, at the cost of being a bit clever:

Method            Mean         Error      StdDev
Original          1,806.73 ms  35.070 ms  51.406 ms
UsingContainsKey  15.10 ms     0.250 ms   0.234 ms
UsingTryGetValue  13.01 ms     0.257 ms   0.286 ms

When talking performance, however, it’s always important to ask: how fast is good enough? The time a developer spends writing the code - and the time their colleagues spend maintaining it - are also factors.

For example, we can rewrite this method using LINQ extension methods and make it far easier to understand and maintain:

[Benchmark]
public IDictionary<string, int> UsingLinq()
{
    var counter = WordListLoader.Words
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());

    return counter;
}

This isn’t as fast as the other methods we’ve seen - GroupBy builds up an intermediate grouping for every distinct word before counting - though it’s still roughly 70 times faster than the original. Arguably it’s easier to read; the trade-off may be appropriate if the performance is good enough.

Method            Mean         Error      StdDev
Original          1,806.73 ms  35.070 ms  51.406 ms
UsingContainsKey  15.10 ms     0.250 ms   0.234 ms
UsingTryGetValue  13.01 ms     0.257 ms   0.286 ms
UsingLinq         25.51 ms     0.493 ms   0.782 ms

Conclusion

What lessons can we learn from this?

Always measure performance. I had a strong hunch that the original code was inefficient because of the unnecessary inner loop, but it wasn’t until I had benchmarked both approaches that I knew exactly how much faster it could be.

Leverage your class library. Writing more code to do things the hard way might seem easier at first, but it’s unlikely to pay off in the long run. Your platform’s class library will very likely have more efficient and more reliable solutions than anything you’d write yourself.

Be wary of blindly copying code. When you copy code - whether it’s out of a book, from a blog like this one, or from Stack Overflow - it’s vital to understand what that code is doing. Critical thinking is an essential part of every developer’s toolkit.

Addendum

A friend suggested also benchmarking ConcurrentDictionary<TKey, TValue>:

[Benchmark]
public IDictionary<string, int> UsingConcurrentDictionary()
{
    var counter = new ConcurrentDictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.AddOrUpdate(word, 1, (_, v) => v + 1);
    }

    return counter;
}

The lambda expression captures nothing, so it’s allocated only once and the loop stays relatively lightweight on heap allocation. The performance isn’t far off the regular dictionary approaches, suggesting the cost of uncontended locking inside ConcurrentDictionary is modest.

Method                     Mean         Error      StdDev
Original                   1,806.73 ms  35.070 ms  51.406 ms
UsingContainsKey           15.10 ms     0.250 ms   0.234 ms
UsingTryGetValue           13.01 ms     0.257 ms   0.286 ms
UsingLinq                  25.51 ms     0.493 ms   0.782 ms
UsingConcurrentDictionary  21.18 ms     0.386 ms   0.361 ms
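Of course, the real reason to reach for ConcurrentDictionary is when several threads update it at once. As a sketch only - this variant isn’t one of the benchmarks above, and whether it actually wins depends on how the work is partitioned - the same counting could be parallelized like this:

public IDictionary<string, int> UsingConcurrentDictionaryInParallel()
{
    // Requires System.Collections.Concurrent and System.Threading.Tasks.
    var counter = new ConcurrentDictionary<string, int>();

    // AddOrUpdate makes each per-key update atomic, so the worker threads
    // need no additional locking of their own.
    Parallel.ForEach(
        WordListLoader.Words,
        word => counter.AddOrUpdate(word, 1, (_, v) => v + 1));

    return counter;
}

For a single-threaded loop over a list of words that’s already in memory, though, the plain Dictionary<TKey, TValue> approaches above remain the simplest and fastest options.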
