Everything's broken but we can fix it

If you haven’t read it already, go and read Scott Hanselman’s post “Everything’s broken and nobody’s upset”. Do it now; I’ll wait for you.

Like many of you, my initial reaction to Scott’s post was to nod my head in understanding, with examples of my own frustrations hovering at the edges of my awareness.

I suspect that the reason we aren’t all upset about it is a kind of Stockholm syndrome - we’re all so used to being badly treated by our software that it’s accepted as normal. More than that, we believe that it’s inevitable that things are going to fail in maddening ways and that there’s nothing we can do about it.

If you believe that, stop reading here. If you think we can do better, please keep reading.

Back in July 2012, Oren Eini posted “Unprofessional Code, take II” and a follow-up post, “Unprofessional code, take II - Answer”.

The key point that Oren is making with these posts is that black box code is bad, because there’s no way to see what’s happening inside. Troubleshooting an errant black box can be next to impossible without source code and a debugger.

I suspect that Oren is onto a pretty important principle here. If you have a (human) colleague who does something incomprehensible, you can ask them to explain why they did what they did. You might not like the answer you get, but at least you can ask. We should be able to ask questions of our software colleagues in similar ways.

Consider the list of problems that Scott reported in his original blog post and how things would change if the software could explain its decision making process.

iTunes would explain what it found on Scott’s phone that ended up in the “Other” category, and how that 3GB of data breaks down.
The Windows Indexing service would list the documents it had been trying to index and detail the problems it encountered that caused indexing to run continuously.
The iPhone contacts application would explain why it thinks all these contacts are different people in spite of Scott previously telling it otherwise.
Outlook would identify what is causing it to fail while shutting down, allowing the problem to be isolated and remedied.
The iCloud photo stream would detail what images it found eligible for inclusion and would explain why only 734 photos are shown instead of the expected 1000.
When a message stalls in the Outlook outbox, you’d receive an explanation about why it wasn’t sent immediately and a log of all the attempts that have been made and why they failed.
Gmail would explain the cause of any slowdown and give the user enough information to attempt to remedy the situation.
Microsoft Word would identify what’s wrong with a document, especially if it doesn’t recognise it as a Word document at all.

You’ll note that the examples I’ve given are much more than simple logging - most system log files are unintelligible and next to useless, either because they log way too much, or because they log the wrong things.

What I’m seeking is to empower our systems to answer questions, allowing us to interact with them in new ways, particularly by asking for explanations of behaviour.
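To make the idea a little more concrete, here is a minimal sketch in Python of a component that records why it made each decision, so that it can answer questions afterwards instead of just dumping a log. Everything in it is hypothetical: the `Decision` and `DecisionLog` names, the `select_for_stream` function, and the selection rules (a 20 MB size cap and a fixed limit) are invented purely for illustration and say nothing about how iCloud or any other real product works.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    subject: str  # the item the decision is about
    outcome: str  # what the component decided to do with it
    reason: str   # human-readable explanation of why


@dataclass
class DecisionLog:
    """Records decisions as they are made so they can be queried later."""
    decisions: list = field(default_factory=list)

    def record(self, subject, outcome, reason):
        self.decisions.append(Decision(subject, outcome, reason))

    def explain(self, subject):
        """Answer: 'why did you treat this item the way you did?'"""
        return [d for d in self.decisions if d.subject == subject]

    def summarise(self, outcome):
        """Answer: 'why did items end up with this outcome?' (reason -> count)."""
        counts = {}
        for d in self.decisions:
            if d.outcome == outcome:
                counts[d.reason] = counts.get(d.reason, 0) + 1
        return counts


def select_for_stream(photos, log, limit=1000):
    """Pick photos for a stream using made-up rules, recording each decision."""
    selected = []
    for name, size_mb in photos:
        if size_mb > 20:
            log.record(name, "excluded", "larger than 20 MB")
        elif len(selected) >= limit:
            log.record(name, "excluded", f"stream already holds {limit} photos")
        else:
            selected.append(name)
            log.record(name, "included", "meets all rules")
    return selected


log = DecisionLog()
photos = [("beach.jpg", 4), ("raw_scan.tif", 85), ("cat.jpg", 3)]
stream = select_for_stream(photos, log, limit=1)

# Ask about one photo, then ask for a breakdown of every exclusion.
print(log.explain("cat.jpg")[0].reason)
print(log.summarise("excluded"))
```

The difference from ordinary logging is that every entry is tied to a subject and an outcome, so the questions "why is this photo missing?" and "why are there fewer photos than I expected?" each have a direct answer. A real system would need far more than this, but the shape of the interaction is the point.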

Now, is it easy to instrument our systems to give this level of intelligent reporting? Certainly not - if it was easy, we’d be doing this already. However, the alternative is to keep living with badly behaved software that doesn’t (can’t!) tell us what’s going wrong.

I think we need to try.

Achieving this is going to require a lot of work, as well as a couple of algorithmic breakthroughs. We’re fortunate that in the coming world of cloud computing, the additional horsepower required to make pervasive system introspection a reality will be just a websocket away.
