Recently a friend of mine noticed some code in a book that was, shall we say, considerably sub-optimal. It’s worth looking at the code to see how both its performance and its readability can be easily improved.

Here’s the original code:

[Benchmark]
public IDictionary<string,int> Original()
{
    var counter = new Dictionary<string, int>();
    foreach(var word in WordListLoader.Words)
    {
        bool counted = false;
        foreach (var key in counter.Keys)
        {
            if (key == word)
            {
                counter[key]++;
                counted = true;
                break;
            }
        }

        if (!counted)
        {
            counter.Add(word, 1);
        }
    }

    return counter;
}

As an aside, I’m not going to call out the author of this code by name, nor mention the actual book it came from. Doing so would be unfair, both because the author would have no way to respond, and because I haven’t read the book myself (which means there may be other factors at play that justify the author’s choices). The point of this post is to see what we can learn.

What does this code do?

Given a sequence of words, the code counts how many times each word occurs.

For testing, I took a copy of World War Z, extracted the text, and ran each implementation over it using BenchmarkDotNet.
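The post never shows how WordListLoader is implemented, so here is a minimal sketch of what it might look like - the file name and the normalization rules (whitespace splitting, trimming punctuation, lower-casing) are my assumptions, not the author's:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class WordListLoader
{
    // Hypothetical file name; the post doesn't say how the text was stored.
    private const string SourceFile = "world-war-z.txt";

    // Split on whitespace, strip surrounding punctuation, and lower-case,
    // so that "The" and "the," count as the same word.
    public static IReadOnlyList<string> SplitWords(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim('.', ',', '!', '?', ';', ':', '"', '\'')
                          .ToLowerInvariant())
            .Where(w => w.Length > 0)
            .ToList();

    public static IReadOnlyList<string> Words { get; } =
        SplitWords(File.ReadAllText(SourceFile));
}
```

Loading the word list once, outside the benchmarked methods, matters: otherwise every benchmark would be dominated by file I/O rather than the dictionary work we want to measure.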

Method      Mean          Error       StdDev
Original    1,806.73 ms   35.070 ms   51.406 ms

For what it’s worth, here are the 16 most common words in the book:

Word    Count
the      6758
to       3292
i        2588
of       2548
a        2540
and      2475
was      2131
that     1727
it       1576
in       1408
we       1196
they     1005
you       998
were      959
had       940
for       904

Taking 1,800 ms to do a word count of an entire novel might be fast enough for some purposes. But the original code is far from efficient - we can do better.

The inner loop through Keys is entirely unnecessary - the class Dictionary&lt;TKey,TValue&gt; includes ContainsKey(), a hash-based lookup that’s far more efficient than the simple linear search used above. (In fact, the entire implementation is substantially more sophisticated than a simple list of tuples.)

Using ContainsKey() we can rewrite the code to this:

public IDictionary<string, int> UsingContainsKey()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        if (counter.ContainsKey(word))
        {
            counter[word]++;
        }
        else
        {
            counter.Add(word, 1);
        }
    }

    return counter;
}

Observe how much shorter this code is than the original.

Running the benchmark again, we can see that it’s tremendously faster:

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms

In fact, it’s very nearly 120 times faster.

We can do better on the performance front, however, leveraging the way that TryGetValue() sets its out parameter to the default value (zero, for int) when the key isn’t found. This lets us rewrite the loop body without any explicit branching:

public IDictionary<string, int> UsingTryGetValue()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.TryGetValue(word, out var count);
        counter[word] = count + 1;
    }

    return counter;
}

This is marginally faster, at the cost of being a bit clever:

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue       13.01 ms    0.257 ms    0.286 ms
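Not benchmarked here, but worth knowing: on .NET 6 and later, CollectionsMarshal.GetValueRefOrAddDefault() can go a step further than the TryGetValue() version, which still hashes each word twice (once to read, once to write back). This sketch is my own addition, not one of the post's measured variants:

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class WordCount
{
    public static IDictionary<string, int> UsingGetValueRefOrAddDefault(
        IEnumerable<string> words)
    {
        var counter = new Dictionary<string, int>();
        foreach (var word in words)
        {
            // One lookup per word: returns a ref to the existing slot,
            // or adds a default(int) slot and returns a ref to that.
            ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
                counter, word, out _);
            count++;
        }

        return counter;
    }
}
```

The trade-off is the same as before, only sharper: another small speed-up in exchange for code that is more obscure still.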

When talking performance, however, it’s always important to ask: how fast is good enough? The time taken by a developer writing the code - and by their colleagues maintaining it - is also a factor.

For example, we can rewrite this method using LINQ extension methods, making it far easier to understand and maintain:

public IDictionary<string, int> UsingLinq()
{
    var counter = WordListLoader.Words
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());

    return counter;
}

This isn’t as fast as the hand-written loops above, though it’s still around 70 times faster than the original. Arguably it’s easier to read; the trade-off may be appropriate if the performance is good enough.

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue       13.01 ms    0.257 ms    0.286 ms
UsingLinq              25.51 ms    0.493 ms    0.782 ms

Conclusion

What lessons can we learn from this?

Always measure performance. I had a strong hunch that the code was inefficient due to the unnecessary inner loop, but it wasn’t until I had benchmarked both approaches that I knew exactly how much faster it could be.

Leverage your class library. Doing things the hard way by writing more code might seem easier at first, but it’s unlikely to pay off in the long run. It’s highly likely that your platform’s class library will have more efficient and more reliable solutions than what you might write yourself.

Be wary of blindly copying code. When you copy code, whether it’s out of a book, from a blog like this one, or from StackOverflow, it’s essential to understand what the code is doing. Critical thinking is a vital part of every developer’s toolkit.

Addendum

A friend suggested also benchmarking ConcurrentDictionary&lt;TKey,TValue&gt; (from System.Collections.Concurrent):

public IDictionary<string, int> UsingConcurrentDictionary()
{
    var counter = new ConcurrentDictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.AddOrUpdate(word, 1, (_, v) => v + 1);
    }

    return counter;
}

The lambda expression captures no variables, so the compiler caches a single delegate instance - this approach is relatively light on heap allocation. The performance is very nearly as good as a regular dictionary, showing that the cost of uncontended synchronization is negligible.

Method                     Mean          Error       StdDev
Original                   1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey              15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue              13.01 ms    0.257 ms    0.286 ms
UsingLinq                     25.51 ms    0.493 ms    0.782 ms
UsingConcurrentDictionary     21.18 ms    0.386 ms    0.361 ms
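Of course, the single-threaded loop above doesn’t exploit ConcurrentDictionary’s main strength. Where it would pay off is a parallel count - a variant I haven’t benchmarked, sketched here for illustration:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelWordCount
{
    public static IDictionary<string, int> Count(IReadOnlyList<string> words)
    {
        var counter = new ConcurrentDictionary<string, int>();

        // AddOrUpdate is atomic per key, so concurrent increments from
        // multiple threads don't lose counts. (The update delegate may be
        // retried under contention, which is fine: it's a pure function.)
        Parallel.ForEach(words, word =>
            counter.AddOrUpdate(word, 1, (_, v) => v + 1));

        return counter;
    }
}
```

Whether this actually beats the single-threaded versions would depend on the word count and the contention on hot keys like “the” - another case where you’d want to measure rather than guess.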
