There seems to be a lot of hatred for XML. It’s not hard to find blog posts and articles where the author rants about the deficiencies and inefficiencies of XML and promotes the beauty of JSON, YAML, or something else. Is this level of vitriol really deserved?

When I talk to people who dislike XML, they’re quick to point out examples where it’s been used in spectacularly poor fashion. These aren’t strawman arguments but genuine situations where the files are no more useful than a proprietary, undocumented binary format. Given that one of the promises of XML was effective data interchange, this is a shame.

However, should poor use of XML, even if widespread, be sufficient for us to abandon it completely? After all, it is the very flexibility (the extensibility) of XML that allowed it to be misused in the first place.

Element Normal Form

There’s a certain group of developers who look at the pedigree of XML and conclude that its only proper use is to mark up text with elements. They see XML as a specialization of the Standard Generalized Markup Language (SGML) and a cousin of the Hypertext Markup Language (HTML), and conclude that attributes are to be avoided at all costs, except (perhaps) for the occasional internal identifier.

This results in XML documents that look like this:

<person id="966">
  <fullName>John Doe</fullName>
  <knownAs>John</knownAs>
  <familyName>Doe</familyName>
  <birth>1966-03-31</birth>
  <addresses>
    <address>
        <street>1313 Mockingbird Lane</street>
        <city>Mockingbird Heights</city>
        <from>1966-03-31</from>
        <until>1999-12-31</until>
    </address>
    <address>
        <street>1600 Pennsylvania Avenue</street>
        <city>Washington DC</city>
        <from>2000-01-01</from>
        <until>2003-12-31</until>
    </address>
    <address>
        <street>1 Skid Row</street>
        <city>Hicksville</city>
        <from>2004-01-01</from>
    </address>
  </addresses>
</person>

There’s a lot to dislike about this. It’s verbose, repetitive, and inefficient. Of the 690 characters in the file, fully 516 are dedicated to syntax and whitespace; only 174 characters of data (or 25%) would differ from one person to another.
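If you want to check a figure like this against your own files, here’s a minimal Python sketch. It assumes that “data” means attribute values plus trimmed element text, and that everything else (tags, attribute names, quotes, indentation) counts as syntax. That is one reasonable reading of the split rather than a definitive rule, and the name `data_share` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def data_share(xml_text: str) -> tuple:
    """Return (data_chars, total_chars) for an XML document.

    Assumption: "data" is attribute values plus non-whitespace text;
    tags, attribute names, quotes and indentation are all syntax.
    """
    root = ET.fromstring(xml_text)
    data_chars = 0
    for element in root.iter():
        # Attribute values count as data; attribute names do not.
        data_chars += sum(len(value) for value in element.attrib.values())
        # Element text counts once surrounding whitespace is trimmed.
        if element.text and element.text.strip():
            data_chars += len(element.text.strip())
        # Text following a closing tag (mixed content) is data too.
        if element.tail and element.tail.strip():
            data_chars += len(element.tail.strip())
    return data_chars, len(xml_text)


if __name__ == "__main__":
    sample = '<person id="966"><knownAs>John</knownAs></person>'
    data, total = data_share(sample)
    print(f"{data} of {total} characters are data ({data / total:.0%})")
```

Totals depend on how the file is indented and which line endings it uses, so expect small differences if you reformat a file before measuring it.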

It doesn’t have to be this way.

By focussing on where XML came from, supporters of element normal form have lost sight of what it is: a well-defined serialization format for hierarchical data.

Here’s the same exact information, serialized in a better way, making use of attributes:

<person id="966"
  fullName="John Doe"
  knownAs="John"
  familyName="Doe"
  birth="1966-03-31">
  <address street="1313 Mockingbird Lane"
    city="Mockingbird Heights"
    from="1966-03-31"
    until="1999-12-31" />
  <address street="1600 Pennsylvania Avenue"
    city="Washington DC"
    from="2000-01-01"
    until="2003-12-31" />
  <address street="1 Skid Row"
    city="Hicksville"
    from="2004-01-01" />
</person>

We’ve also lost the unnecessary wrapper around the multiple addresses.
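To show that the two forms really do carry the same information, here’s a rough Python sketch that folds an element-normal-form document into the attribute style. The function name `to_attribute_form` and its rules are assumptions for illustration: a child holding only text becomes an attribute, and an attribute-less wrapper whose children all share one tag (like `addresses`) is dropped in favour of its children. It’s a heuristic, not a general-purpose converter; repeated text-only children with the same name, for instance, would overwrite one another.

```python
import xml.etree.ElementTree as ET


def to_attribute_form(source: ET.Element) -> ET.Element:
    """Return a copy of `source` with text-only children folded into attributes."""
    result = ET.Element(source.tag, dict(source.attrib))
    own_text = (source.text or "").strip()
    if own_text:
        # Keep text that cannot be folded away by the parent.
        result.text = own_text
    for child in source:
        converted = to_attribute_form(child)
        if len(child) == 0 and not child.attrib:
            # <city>Hicksville</city> becomes city="Hicksville" on the parent.
            result.set(child.tag, (child.text or "").strip())
        elif (
            len(converted) > 0
            and not converted.attrib
            and converted.text is None
            and len({item.tag for item in converted}) == 1
        ):
            # A pure wrapper such as <addresses>: hoist its children.
            result.extend(list(converted))
        else:
            result.append(converted)
    return result


if __name__ == "__main__":
    verbose = ET.fromstring(
        "<person id='966'><knownAs>John</knownAs>"
        "<addresses><address><city>Hicksville</city></address></addresses>"
        "</person>"
    )
    print(ET.tostring(to_attribute_form(verbose), encoding="unicode"))
```

Going the other way is just as mechanical, which is the point: the choice between elements and attributes here is about presentation, not about what the file can express.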

This file has 440 characters, of which 266 are dedicated to syntax and whitespace. The same 174 characters of content are now 40% of the file.

For what it’s worth, this compares most favourably with 530 characters of JSON (where the same 174 characters of data would comprise just 32% of the file) and with a 423-character YAML file (where the data would be 41% of the file).
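The JSON and YAML files behind those numbers aren’t reproduced here, so the following is only a sketch of how you might build a comparable JSON version yourself. The `addresses` key, the two-space indent, and the function name `person_to_json` are all assumptions, and a different layout will give a different character count from the 530 quoted above.

```python
import json
import xml.etree.ElementTree as ET


def person_to_json(xml_text: str, indent: int = 2) -> str:
    """Serialize an attribute-style <person> document as indented JSON."""
    root = ET.fromstring(xml_text)
    # Attributes on <person> become top-level keys.
    record = dict(root.attrib)
    # JSON has no repeated keys, so the addresses need a named list
    # (the key name "addresses" is an assumption).
    record["addresses"] = [
        dict(address.attrib) for address in root.findall("address")
    ]
    return json.dumps(record, indent=indent)


if __name__ == "__main__":
    sample = (
        '<person id="966" knownAs="John">'
        '<address city="Hicksville" from="2004-01-01" />'
        "</person>"
    )
    as_json = person_to_json(sample)
    print(as_json)
    print(len(sample), "characters of XML;", len(as_json), "characters of JSON")
```

Note that JSON needs the list wrapper that the XML version was able to drop, and it quotes every key as well as every value, which is where much of its extra length comes from.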

XML data serialization doesn’t have to be repetitive and inefficient; it can be as good as more recent formats such as JSON and YAML. Who knew?
