Comment your data

Introduction


I am a big believer in embedding comments in data files.

This has always served me well. I'm hoping this short essay will convince you to do the same.

Kinds of things you could write as comments

Here is a short list of things that comments can do:

  • Describe the data file layout and structure
  • Describe why the structure is what it is
  • Describe legal values
  • Describe what changes are allowed
  • Describe the history of changes
  • Describe what program uses this data file
  • Describe assumptions about the data file
  • Include the checksum or digital signature for the file
  • Separate out clumps of records for readability
  • Comment out records with problems
  • Describe what version of the data schema is in use
  • Include a URL to the descriptive wiki page
  • Include a copyright or license
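The checksum item raises an obvious question: how can a file contain its own checksum? One possible answer, sketched below in Python, is to hash only the non-comment lines. The `# sha256:` prefix and the function names are conventions made up for this illustration, not part of any standard.

```python
import hashlib

CHECKSUM_PREFIX = "# sha256:"  # hypothetical comment convention, not a standard


def data_digest(lines):
    """Return the SHA-256 hex digest of the non-comment lines only."""
    digest = hashlib.sha256()
    for line in lines:
        if line.startswith("#"):
            continue  # comments (including the checksum itself) are not hashed
        # Normalise line endings so the digest survives CRLF/LF conversion.
        digest.update(line.rstrip("\r\n").encode("utf-8") + b"\n")
    return digest.hexdigest()


def verify(lines):
    """Compare the digest recorded in a '# sha256:' comment with the data."""
    lines = list(lines)
    recorded = None
    for line in lines:
        if line.startswith(CHECKSUM_PREFIX):
            recorded = line[len(CHECKSUM_PREFIX):].strip()
    return recorded is not None and recorded == data_digest(lines)
```

Because comments are excluded from the hash, you can keep adding explanatory comments without invalidating the checksum. The flip side is that the checksum protects only the records, not the comments.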

In many ways, this is similar to how Javadoc comments are embedded in an application's source. The idea is to keep the documentation near the data, just as Javadoc keeps the docs near the code.

Formats which allow comments


By the way, the line between configuration and data is often very grey. So, just as you comment your data, comment your configuration files.
OK, let's look at an example.

Here's a text file:

a|1234|b|2016-10-09
b|1234|E|2016-10-09
C|1234|F|2016-10-09
D|1234|b|2016-10-09

Just by looking at it, can you tell what the fields are? I hear some of you (really, I can read minds) saying "just look at the source code". Well, maybe you don't have the source. Maybe you don't know how to program in Befunge or Piet, the original language of the program. I know I don't. Maybe you're a tester and you don't have a copy of the source code handy. Also, if the field names were recorded in the file, a smart data loader could use them to validate the input. But then again, if you need that much structure, you'd probably be better off using JSON, HJSON, or YAML. Or even XML.

There are data formats which include commenting ability, such as:

  • YAML
  • HJSON
  • your custom data file

But some do not (I'm looking at you, JSON).  And this is too bad.


Back to the example. Wouldn't it be better if you could read a comment in the file?

# SourceVariable|InitialValue|TargetVariable|DateofLastChange
a|1234|b|2016-10-09
b|1234|E|2016-10-09
C|1234|F|2016-10-09
D|1234|b|2016-10-09
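A header comment like this can serve the program as well as the reader. Here is a minimal Python sketch of a loader that takes its field names from the first `#` line and checks every record against them. The function name `load_records` and the rule that the first comment line is the field list are assumptions made for this illustration.

```python
def load_records(lines, sep="|"):
    """Yield one dict per record, keyed by the names in the first '#' line."""
    fields = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            if fields is None:
                # Assumption: the first comment line is the field list.
                fields = [name.strip() for name in line.lstrip("#").split(sep)]
            continue  # every other comment is for humans only
        values = line.split(sep)
        if fields is None:
            raise ValueError(f"line {number}: record found before a header comment")
        if len(values) != len(fields):
            raise ValueError(
                f"line {number}: expected {len(fields)} fields, got {len(values)}"
            )
        yield dict(zip(fields, values))
```

Feeding it the four records above yields dictionaries such as `{"SourceVariable": "a", "InitialValue": "1234", ...}`, and a record with a missing or extra field is rejected with its line number.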

And, given that you might be using this file during testing, you might need to take some records out for a while. Commenting out that data AND providing a reason why is a good idea.

This particular format is trivial to implement, by the way. You could pass the file through grep, as follows:

$ grep -v "^#" filename | yourapp

I mention this because popular Unix and Windows shell languages don't usually handle complex data formats, but flat files are pretty much universal.
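If the program can't sit behind a pipe, the same filter is only a few lines inside the application. Here is a rough Python equivalent of the grep command; the names `strip_comments` and `read_data` are made up for this sketch.

```python
def strip_comments(lines, marker="#"):
    """Yield only the lines that do not start with the comment marker."""
    for line in lines:
        if not line.startswith(marker):
            yield line


def read_data(path):
    """Return the non-comment lines of a flat file as a list of strings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in strip_comments(handle)]
```

Like the grep version, this treats a line as a comment only when the `#` is its very first character, so a `#` inside a field value is left alone.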
  
Now, I will caution you that excessive commenting can make the data file unreadable (to humans; the computer doesn't care!). Use your own judgment.

Also, it is possible to go overboard (as with any other human endeavor). For example, you can certainly use comments for metadata. But after a while, you might find it easier just to include the metadata as data in the data file, rather than as a comment. The comment is a good starting point, perhaps while you're still working out the schema.

One excellent example I recently discovered is the Bro Project (http://www.bro.org).

Their data files use comments and you can see why. Here's an example:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   conn
#open   2016-12-09-16-34-18
#fields ts      uid     id.orig_h       id.orig_p       id.resp_h       id.resp_p       proto   service duration        orig_bytes      resp_bytes      conn_state      local_orig      local_resp      missed_bytes    history orig
#types  time    string  addr    port    addr    port    enum    string  interval        count   count   string  bool    bool    count   string  count   count   count   count   set[string]
1481319198.452775       CjglN4nG286gKt96d       192.168.1.166   50553   239.255.255.250 1900    udp     -       0.519700        2108    0       S0      T       F       0       D       6       2276    0       0       (empty)

I've truncated quite a lot. But you can see how the header comments declare the separator, the field names, and the field types, which makes the file easy to both read and parse.
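To show how little code a format like this needs, here is a minimal Python sketch that reads the `#` header lines and uses them to name the columns. It is not the Bro Project's own parser. It assumes the real file has tab characters where the sample above shows runs of spaces (that is what `#separator \x09` declares), and the function name `parse_bro_log` is made up for this example.

```python
def parse_bro_log(lines):
    """Return (metadata, records) for a Bro-style log with '#' header lines."""
    sep = "\t"  # assumed default until a '#separator' line says otherwise
    meta = {}
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#separator"):
            # This one header is space-delimited and its value is escaped,
            # e.g. the four characters \x09 stand for a tab.
            escaped = line.split(None, 1)[1]
            sep = escaped.encode("ascii").decode("unicode_escape")
            meta["separator"] = sep
        elif line.startswith("#"):
            key, *values = line[1:].split(sep)
            # 'fields' and 'types' are lists; other headers hold a single value.
            meta[key] = values if key in ("fields", "types") else sep.join(values)
        else:
            values = line.split(sep)
            # zip() stops at the shorter list, so extra values are dropped
            # when the field list is incomplete.
            records.append(dict(zip(meta.get("fields", []), values)))
    return meta, records
```

Every value comes back as a string. A fuller loader could use the `#types` line to convert each column, and the `#unset_field` and `#empty_field` markers to recognise missing values.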

