Comment your data

Introduction


I am a big believer in embedding comments in data files.

This has always served me well. I'm hoping this short essay will convince you to do the same.

Kinds of things you could write as comments

Here is a short list of things that comments can do:

  • Describe the data file layout and structure
  • Describe why the structure is what it is
  • Describe legal values
  • Describe what changes are allowed
  • Describe the history of changes
  • Describe what program uses this data file
  • Describe assumptions about the data file
  • Include the checksum or digital signature for the file
  • Separate out clumps of records for readability
  • Comment out records with problems
  • Describe what version of the data schema is in use
  • Include a URL to the descriptive wiki page
  • Include a copyright or license

In many ways, this is similar to how Javadoc comments are embedded in an application's source code. The idea is to keep the documentation near the data, just as Javadoc keeps the docs near the code.
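As an illustration of one item from the list, here is a minimal sketch of a checksum carried in a comment. The `# sha256:` comment convention and the `verify_checksum` name are assumptions made up for this example, not an established format; the hash covers only the non-comment lines.

```python
import hashlib

def verify_checksum(text):
    """Return True if the '# sha256:' comment matches the hash of the data lines."""
    expected = None
    data_lines = []
    for line in text.splitlines():
        if line.startswith("# sha256:"):
            # Hypothetical convention: the digest follows the first colon.
            expected = line.split(":", 1)[1].strip()
        elif not line.startswith("#"):
            data_lines.append(line)
    digest = hashlib.sha256("\n".join(data_lines).encode("utf-8")).hexdigest()
    return expected == digest

# Build a small file whose comment carries the digest of its own records.
data = "a|1234|b|2016-10-09\nb|1234|E|2016-10-09"
digest = hashlib.sha256(data.encode("utf-8")).hexdigest()
text = f"# sha256: {digest}\n{data}\n"

print(verify_checksum(text))                          # True
print(verify_checksum(text.replace("1234", "9999")))  # False
```

Because comment lines are excluded from the hash, you can add or edit other comments without invalidating the checksum.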

Formats which allow comments


By the way, the line between configuration and data is often very grey. So, just as you comment your data, comment your configuration files.

OK, let's look at an example:

Here's a text file:

a|1234|b|2016-10-09
b|1234|E|2016-10-09
C|1234|F|2016-10-09
D|1234|b|2016-10-09

Just by looking at it, can you tell what the fields are? I hear some of you (really, I can read minds) saying "just look at the source code". Well, maybe you don't have the source. Maybe you don't know how to program in Befunge or Piet, the original language of the program. I know I don't. Maybe you're a tester and you don't have a copy of the source code handy. Also, if the field names were recorded in the file, a smart data loader could use them to validate the input. But then again, if you want that, you'd probably be better off using JSON, HJSON, or YAML. Or even XML.

There are data formats which include commenting ability, such as:

  • YAML
  • HJSON
  • your custom data file

But some do not (I'm looking at you, JSON), and that is too bad.


Wouldn't it be better if you could read a comment in the file?

# SourceVariable|InitialValue|TargetVariable|DateofLastChange
a|1234|b|2016-10-09
b|1234|E|2016-10-09
C|1234|F|2016-10-09
D|1234|b|2016-10-09
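A loader that takes its field names from the comment and uses them to validate the input might look something like the sketch below. It assumes the first `#` line names the fields, that fields are separated by `|`, and that validation only means checking the field count; `load_records` is a name invented for this example.

```python
import io

def load_records(lines, sep="|"):
    """Read separated records, taking field names from the first '#' comment."""
    field_names = None
    records = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            # Assumption: the first comment line names the fields.
            if field_names is None:
                field_names = [name.strip() for name in line[1:].split(sep)]
            continue
        if field_names is None:
            raise ValueError(f"line {line_no}: data before the field-name comment")
        values = line.split(sep)
        if len(values) != len(field_names):
            raise ValueError(
                f"line {line_no}: expected {len(field_names)} fields, got {len(values)}"
            )
        records.append(dict(zip(field_names, values)))
    return records

sample = """# SourceVariable|InitialValue|TargetVariable|DateofLastChange
a|1234|b|2016-10-09
b|1234|E|2016-10-09
"""
records = load_records(io.StringIO(sample))
print(records[0]["TargetVariable"])  # b
```

A record with too few or too many fields raises a `ValueError` that names the offending line, which is about the cheapest validation a header comment can buy you.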

And, given that you might be using this file during testing, you might need to remove data temporarily. Commenting out that data AND providing a reason why is a good idea.

This particular format is trivial to implement, by the way. You could pass the file through grep, as follows:

$ grep -v "^#" filename | yourapp
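If the program reads the file directly instead of sitting behind a pipe, the same filter is only a few lines. Here is a rough Python equivalent; the `data_lines` name and the reason given for the commented-out record are invented for illustration.

```python
def data_lines(lines):
    """Yield only the records, dropping any line that starts with '#'."""
    for line in lines:
        if not line.startswith("#"):
            yield line.rstrip("\n")

sample = [
    "# SourceVariable|InitialValue|TargetVariable|DateofLastChange\n",
    "a|1234|b|2016-10-09\n",
    "# Disabled for this test run: target E is not defined yet (hypothetical reason)\n",
    "# b|1234|E|2016-10-09\n",
    "C|1234|F|2016-10-09\n",
]
kept = list(data_lines(sample))
print(kept)  # ['a|1234|b|2016-10-09', 'C|1234|F|2016-10-09']
```

Like the grep version, this only treats a line as a comment when `#` is its very first character.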

I mention this because popular Unix and Windows shell languages don't usually handle complex data formats, but flat files are pretty much universal.
  
Now, I will caution you that excessive commenting can make the data file unreadable (to humans; the computer doesn't care!). Use your own judgment.

Also, it is possible to go overboard (as with any other human endeavor). For example, you can certainly use comments for metadata. But after a while, you might find it easier just to include the metadata as data in the data file, rather than as a comment. The comment is a good starting point, perhaps while you're still working out the schema.

One excellent example I recently discovered is the Bro Project (http://www.bro.org).

Their data files use comments and you can see why. Here's an example:

#separator \x09
#set_separator  ,
#empty_field    (empty)
#unset_field    -
#path   conn
#open   2016-12-09-16-34-18
#fields ts      uid     id.orig_h       id.orig_p       id.resp_h       id.resp_p       proto   service duration        orig_bytes      resp_bytes      conn_state      local_orig      local_resp      missed_bytes    history orig
#types  time    string  addr    port    addr    port    enum    string  interval        count   count   string  bool    bool    count   string  count   count   count   count   set[string]
1481319198.452775       CjglN4nG286gKt96d       192.168.1.166   50553   239.255.255.250 1900    udp     -       0.519700        2108    0       S0      T       F       0       D       6       2276    0       0       (empty)

I've truncated quite a lot. But you can see how this format makes it easy to read and parse this file.
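To show how little code such a self-describing file needs, here is a minimal sketch of a reader for this style of header. It is not the Bro project's own parser. It assumes the `#separator` line is space-delimited and gives the separator as an escape such as `\x09`, and that every other header line uses that separator. The sample uses just three of the columns shown above.

```python
def parse_bro_log(lines):
    """Return (meta, rows) from a log whose '#' header lines describe its layout."""
    sep = "\t"  # default until a #separator line says otherwise
    meta = {}
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("#separator"):
            # The value is written as an escape sequence; decode it to a real character.
            token = line.split(" ", 1)[1].strip()
            sep = token.encode("ascii").decode("unicode_escape")
        elif line.startswith("#"):
            # Other header lines: a key followed by separator-delimited values.
            key, *values = line[1:].split(sep)
            meta[key] = values
        else:
            # Data line: pair each value with its name from the #fields header.
            rows.append(dict(zip(meta.get("fields", []), line.split(sep))))
    return meta, rows

sample = "\n".join([
    r"#separator \x09",
    "#unset_field\t-",
    "#path\tconn",
    "#fields\tts\tuid\tproto",
    "#types\ttime\tstring\tenum",
    "1481319198.452775\tCjglN4nG286gKt96d\tudp",
])
meta, rows = parse_bro_log(sample.splitlines())
print(rows[0]["proto"])  # udp
```

The `#types` line ends up in `meta["types"]`, so a stricter loader could go on to convert or validate each column. This sketch keeps every value as a string.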

