Recently a friend of mine noticed some code in a book that was, shall we say, considerably sub-optimal. It’s worth looking at the code to see how both its performance and its readability can be easily improved.

Here’s the original code:

[Benchmark]
public IDictionary<string,int> Original()
{
    var counter = new Dictionary<string, int>();
    foreach(var word in WordListLoader.Words)
    {
        bool counted = false;
        foreach (var key in counter.Keys)
        {
            if (key == word)
            {
                counter[key]++;
                counted = true;
                break;
            }
        }

        if (!counted)
        {
            counter.Add(word, 1);
        }
    }

    return counter;
}

As an aside, I’m not going to call out the author of this code by name, nor mention the actual book it came from. Doing so would be unfair, both because the author would have no way to respond, and because I haven’t read the book myself (which means there may be other factors at play that justify the author’s choices). The point of this post is to see what we can learn.

What does this code do?

Given a sequence of words, the code counts how many times each word occurs.

For testing, I took a copy of World War Z, extracted the text, and ran each implementation over it using BenchmarkDotNet.
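The post never shows how WordListLoader is implemented, so here is a minimal sketch of what it might look like - the file name and the normalization rules (whitespace splitting, trimming punctuation, lower-casing) are my assumptions, not the author's:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class WordListLoader
{
    // Hypothetical file name; the post doesn't say how the text was stored.
    private const string SourceFile = "world-war-z.txt";

    // Split on whitespace, strip surrounding punctuation, and lower-case,
    // so that "The" and "the," count as the same word.
    public static IReadOnlyList<string> SplitWords(string text) =>
        text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim('.', ',', '!', '?', ';', ':', '"', '\'')
                          .ToLowerInvariant())
            .Where(w => w.Length > 0)
            .ToList();

    public static IReadOnlyList<string> Words { get; } =
        SplitWords(File.ReadAllText(SourceFile));
}
```

Loading the word list once, outside the benchmarked methods, matters: otherwise every benchmark would be dominated by file I/O rather than the dictionary work we want to measure.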

Method      Mean          Error       StdDev
Original    1,806.73 ms   35.070 ms   51.406 ms

For what it’s worth, here are the 16 most common words in the book:

Word    Count
the      6758
to       3292
i        2588
of       2548
a        2540
and      2475
was      2131
that     1727
it       1576
in       1408
we       1196
they     1005
you       998
were      959
had       940
for       904

Taking 1,800 ms to do a word count of an entire novel might be fast enough for some purposes. But the original code is far from efficient - we can do better.

The inner loop through Keys is entirely unnecessary - the class Dictionary&lt;TKey,TValue&gt; includes ContainsKey(), a hash-based lookup that’s far more efficient than the simple linear search used above. (In fact, the entire implementation is substantially more sophisticated than a simple list of tuples.)

Using ContainsKey() we can rewrite the code to this:

public IDictionary<string, int> UsingContainsKey()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        if (counter.ContainsKey(word))
        {
            counter[word]++;
        }
        else
        {
            counter.Add(word, 1);
        }
    }

    return counter;
}

Observe how much shorter this code is than the original.

Running the benchmark again, we can see that it’s tremendously faster:

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms

In fact, it’s very nearly 120 times faster.

We can do better on the performance front, however, leveraging the way that TryGetValue() sets its out parameter to the default value (zero, for int) when the key isn’t found. This lets us rewrite the loop body without any explicit branching:

public IDictionary<string, int> UsingTryGetValue()
{
    var counter = new Dictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.TryGetValue(word, out var count);
        counter[word] = count + 1;
    }

    return counter;
}

This is marginally faster, at the cost of being a bit clever:

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue       13.01 ms    0.257 ms    0.286 ms
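Not benchmarked here, but worth knowing: on .NET 6 and later, CollectionsMarshal.GetValueRefOrAddDefault() can go a step further than the TryGetValue() version, which still hashes each word twice (once to read, once to write back). This sketch is my own addition, not one of the post's measured variants:

```csharp
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class WordCount
{
    public static IDictionary<string, int> UsingGetValueRefOrAddDefault(
        IEnumerable<string> words)
    {
        var counter = new Dictionary<string, int>();
        foreach (var word in words)
        {
            // One lookup per word: returns a ref to the existing slot,
            // or adds a default(int) slot and returns a ref to that.
            ref var count = ref CollectionsMarshal.GetValueRefOrAddDefault(
                counter, word, out _);
            count++;
        }

        return counter;
    }
}
```

The trade-off is the same as before, only sharper: another small speed-up in exchange for code that is more obscure still.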

When talking performance, however, it’s always important to ask: how fast is good enough? The time taken by a developer writing the code - and by their colleagues maintaining it - is also a factor.

For example, we can rewrite this method using LINQ extension methods, making it far easier to understand and maintain:

public IDictionary<string, int> UsingLinq()
{
    var counter = WordListLoader.Words
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());

    return counter;
}

This isn’t as fast as the hand-written loops above, though it’s still around 70 times faster than the original. Arguably it’s easier to read; the trade-off may be appropriate if the performance is good enough.

Method              Mean          Error       StdDev
Original            1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey       15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue       13.01 ms    0.257 ms    0.286 ms
UsingLinq              25.51 ms    0.493 ms    0.782 ms

Conclusion

What lessons can we learn from this?

Always measure performance. I had a strong hunch that the code was inefficient due to the unnecessary inner loop, but it wasn’t until I had benchmarked both approaches that I knew exactly how much faster it could be.

Leverage your class library. Doing things the hard way by writing more code might seem easier at first, but it’s unlikely to pay off in the long run. It’s highly likely that your platform’s class library will have more efficient and more reliable solutions than what you might write yourself.

Be wary of blindly copying code. When you copy code, whether it’s out of a book, from a blog like this one, or from StackOverflow, it’s essential to understand what the code is doing. Critical thinking is a vital part of every developer’s toolkit.

Addendum

A friend suggested also benchmarking ConcurrentDictionary&lt;TKey,TValue&gt; (from System.Collections.Concurrent):

public IDictionary<string, int> UsingConcurrentDictionary()
{
    var counter = new ConcurrentDictionary<string, int>();
    foreach (var word in WordListLoader.Words)
    {
        counter.AddOrUpdate(word, 1, (_, v) => v + 1);
    }

    return counter;
}

The lambda expression captures no variables, so the compiler caches a single delegate instance - this approach is relatively light on heap allocation. The performance is very nearly as good as a regular dictionary, showing that the cost of uncontended synchronization is negligible.

Method                     Mean          Error       StdDev
Original                   1,806.73 ms   35.070 ms   51.406 ms
UsingContainsKey              15.10 ms    0.250 ms    0.234 ms
UsingTryGetValue              13.01 ms    0.257 ms    0.286 ms
UsingLinq                     25.51 ms    0.493 ms    0.782 ms
UsingConcurrentDictionary     21.18 ms    0.386 ms    0.361 ms
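Of course, the single-threaded loop above doesn’t exploit ConcurrentDictionary’s main strength. Where it would pay off is a parallel count - a variant I haven’t benchmarked, sketched here for illustration:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelWordCount
{
    public static IDictionary<string, int> Count(IReadOnlyList<string> words)
    {
        var counter = new ConcurrentDictionary<string, int>();

        // AddOrUpdate is atomic per key, so concurrent increments from
        // multiple threads don't lose counts. (The update delegate may be
        // retried under contention, which is fine: it's a pure function.)
        Parallel.ForEach(words, word =>
            counter.AddOrUpdate(word, 1, (_, v) => v + 1));

        return counter;
    }
}
```

Whether this actually beats the single-threaded versions would depend on the word count and the contention on hot keys like “the” - another case where you’d want to measure rather than guess.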
