Whether I am writing my own program or choosing between existing solutions, one aspect of the decision-making process which always weighs heavily on my mind is that of the input and output data formats.

I have recently been spending a lot of my work days converting data from a proprietary tool's export format into another tool's input format. This has involved a lot of XML diving, a lot more swearing, and a non-trivial amount of pain. This drove home to me once more that the format of a tool's input and output data is such a critical part of software tooling that it must weigh as heavily as, or perhaps even more heavily than, the software's functionality.

As Tanenbaum tells us, the great thing about standards is that there's so many of them to choose from. XKCD tells us how that comes about. Data formats are many and varied, with specifications ranging from something as vague as "plain text" to something as complex as the structure of data stored in custom database formats.

If you find yourself writing software which requires a brand new data format then, while I might caution you to examine carefully if it really does need a new format, you should ensure that you document the format carefully and precisely. Ideally give your format specification to a third party and get them to implement a reader and writer for your format, so that they can check that you've not missed anything. Tests and normative implementations can help prop up such an endeavour admirably.
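As a rough illustration of the kind of test that can prop up a format specification, here is a minimal round-trip sketch in Python. The format itself (sorted "key=value" lines, each ending in a newline) and the names write_config, read_config and round_trips are invented purely for this example; the point is only that a check of this shape forces you to decide things a loosely worded spec might skip, such as whether a value may contain "=".

```python
def write_config(settings):
    """Serialise a dict of str -> str as sorted 'key=value' lines.

    Hypothetical format rules: keys may not contain '=' or a newline,
    values may not contain a newline, and every line ends with a newline.
    """
    lines = []
    for key in sorted(settings):
        value = settings[key]
        if "=" in key or "\n" in key or "\n" in value:
            raise ValueError(f"cannot represent entry {key!r}")
        lines.append(f"{key}={value}\n")
    return "".join(lines)


def read_config(text):
    """Parse text produced by write_config back into a dict."""
    if text and not text.endswith("\n"):
        raise ValueError("missing final newline")
    settings = {}
    # Split on "\n" only; the last element is the empty string after the
    # final newline (or the whole of an empty document), so drop it.
    for line in text.split("\n")[:-1]:
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        settings[key] = value
    return settings


def round_trips(settings):
    """True if writing and then reading gives back the same data."""
    return read_config(write_config(settings)) == settings
```

A third party working only from your written specification should be able to produce a reader and writer that pass the same round-trip checks against your sample files; where they cannot, the specification has probably missed something.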

Be sceptical of data formats which have "implementation specific" areas or "vendor specific extension" space, because this is where everyone will put the most important and useful data. Do not put such beasts into your format design. If you worry that you've made your design too limiting, deal with that after you have implemented your use-cases for the data format. Don't be afraid to version the format and extend it later, but always ensure that any given version of the data format is well understood, and document what it means to be presented with data in a format version you do not normally process.
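To make the versioning point a little more concrete, here is a small Python sketch of a reader which checks a version field before touching anything else. Everything about the format is an assumption made up for illustration: the JSON layout with "version" and "records" keys, the SUPPORTED_VERSIONS tuple, the "unit" field, and the choice to refuse unknown versions rather than guess. Your own format might reasonably document a different behaviour; what matters is that it documents one.

```python
import json

# Hypothetical: the format versions this reader is documented to understand.
SUPPORTED_VERSIONS = (1, 2)


class UnsupportedVersionError(ValueError):
    """Raised when the data declares a format version we do not process."""


def load_records(text):
    """Parse a document shaped like {"version": N, "records": [...]}."""
    doc = json.loads(text)
    if not isinstance(doc, dict):
        raise ValueError("top-level value must be an object")

    version = doc.get("version")
    if version not in SUPPORTED_VERSIONS:
        # Assumed documented behaviour for unknown versions: refuse loudly.
        raise UnsupportedVersionError(
            f"unsupported format version: {version!r}"
        )

    records = doc["records"]
    if version == 1:
        # Assumption for this example: version 1 had no "unit" field and
        # always meant millimetres; version 2 made the unit explicit.
        records = [{**record, "unit": "mm"} for record in records]
    return records
```

Because each version is handled explicitly, extending the format later means adding a version number and a branch, not loosening what the existing versions mean.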

Phew.


Given all that, I exhort you to consider carefully how your projects manage their input and output data, and to keep these things uppermost in your mind when you are choosing between different solutions to a problem at hand. Your homework is, as you may have grown to anticipate by now, to look at your existing projects and check that their input and output data formats are well documented where appropriate.