Do you suffer from untidy data? Help may be at hand…
Among the many ways not to live a life are spending time on irksome tasks that could be avoided, paying the price of careless mistakes, and working late to make up for inefficiencies while still meeting a deadline. If this sounds like your time spent data wrangling, then your problem might be untidy data. Fret not, for help may be at hand, thanks to the work of Hadley Wickham, the brains behind some neat R packages and now part of the RStudio team.
I’m not going to repeat what Hadley has already written, but I’m spreading the word to the Learning Analytics community because I think Hadley’s conceptualisation of Tidy Data is neat and actionable. (There is an R package to help out too, but the principles of Tidy Data are not R-specific, so don’t be put off if you are not an R user.)
The best place to start is probably Hadley’s article in the Journal of Statistical Software (Vol. 59, Issue 10, September 2014), the abstract of which reads:
A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.
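To make the “each variable is a column, each observation is a row” idea concrete, here is a minimal sketch in R using tidyr’s gather() (the function available in early tidyr versions). The data frame and its column names are my own invented illustration, not an example taken from the paper:

```r
library(tidyr)  # provides gather() for reshaping wide data into long form

# An "untidy" table: one variable (the treatment) is hidden in the
# column headers rather than stored as values in its own column.
messy <- data.frame(
  name        = c("John", "Mary"),
  treatment_a = c(NA, 4),
  treatment_b = c(2, 10)
)

# Tidy it: each row becomes one observation (person x treatment),
# with 'treatment' and 'result' as explicit variable columns.
tidy <- gather(messy, key = "treatment", value = "result",
               treatment_a, treatment_b)
print(tidy)
```

After tidying, every column is a variable and every row an observation, which is exactly the shape that modelling and plotting functions expect.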
The paper concludes with a nice worked example demonstrating that tidy data turns a pre-processing pipeline into a simple and intelligible affair. Other useful resources:
- The R package “tidyr” (only v0.1 at present, but by no means shabby), which is quite minimal and targeted at tidying up untidy data.
- A code-centric version of the JSS paper, which also illustrates the use of dplyr, an extensive library for common data manipulation tasks that offers a more easily usable interface (“a fast, consistent tool”) than base R provides.
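As a sketch of how tidy data and dplyr combine in practice, a typical pipeline chains verbs such as filter(), group_by() and summarise(). The dataset below (student login counts per week) is invented for illustration and is not drawn from the paper:

```r
library(dplyr)  # verbs for common data manipulation tasks

# Invented tidy dataset: one row per (student, week) observation.
activity <- data.frame(
  student = rep(c("A", "B"), each = 3),
  week    = rep(1:3, times = 2),
  logins  = c(5, 7, 6, 2, 0, 1)
)

# Because the data is tidy, each processing step is a single,
# readable verb applied to the whole table.
summary_tbl <- activity %>%
  filter(week >= 2) %>%                     # keep later weeks only
  group_by(student) %>%                     # one group per student
  summarise(mean_logins = mean(logins))     # one summary row per group
print(summary_tbl)
```

Each verb takes a tidy data frame in and hands a tidy data frame on, which is what makes pipelines like this composable.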
One of my take-homes is that, in the long run, it is worth spending a little time writing scripts and building tools that follow the conventions of Tidy Data: for our own sanity when moving between software tools or returning to data at a later point, and for the benefit of other people who might use it (unknown people at some unknown time). Several of Hadley’s examples show that a table well suited to presentation in a publication might be untidy for processing, which illustrates a general point about the value of discriminating between presentation and semantics. That is something to bear in mind when practising reproducible research (see comments in my previous post on FOAS).