Author Archives: Adam Cooper

About Adam Cooper

Adam works for Cetis, the Centre for Educational Technology and Interoperability Standards, at the University of Bolton, UK. He rather enjoys data wrangling and hacking about with R. He is currently a member of the UK Government Open Standards Board, and a member of the Information Standards Board for Education, Skills and Children’s Services. He is a strong advocate of open standards and open system architecture. Adam is leading the work package on interoperability and data sharing.

Tidy Data!

Do you suffer from untidy data? Help may be at hand…

Among the many ways not to live a life is spending time on irksome tasks that could be avoided, paying the price of careless mistakes, or working late to make up for inefficiencies and still meet a deadline. If this sounds like your experience of data wrangling, then your problem might be untidy data. Fret not, for help may be at hand, thanks to the work of Hadley Wickham, the brains behind some neat R packages and now part of the RStudio team.

I’m not going to repeat what Hadley has already written, but I’m spreading the word into the Learning Analytics community because I think Hadley’s conceptualisation of Tidy Data is neat and actionable (and there is an R package to help out, but the principles of Tidy Data are not R-specific, so don’t be put off if you are not an R user).

The best place to start is probably Hadley’s article in the Journal of Statistical Software Vol 59, Issue 10, published September 2014, the abstract of which says:

A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.

This paper concludes with a nice worked example to demonstrate that tidy data makes a pre-processing pipeline into a simple and intelligible affair. Other useful resources:

  • The R package “tidyr” (only v0.1 at present, but by no means shabby), which is quite minimal and targeted at tidying up untidy data (see the sketch after this list).
  • A code-centric version of the JSS paper, which also illustrates use of dplyr, an extensive library for common data manipulation tasks that adds more easily usable features (“a fast, consistent tool”) than R Core provides.
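To make that concrete, here is a minimal sketch (the data and column names are invented for illustration) of the kind of tidying tidyr supports, followed by the sort of dplyr summary that the tidy shape makes trivial:

    library(tidyr)
    library(dplyr)

    # A messy grade-book: one column per quiz reads well on the page,
    # but "quiz1" and "quiz2" are really values of a single variable.
    grades <- data.frame(
      student = c("alice", "bob", "carol"),
      quiz1   = c(78, 62, 91),
      quiz2   = c(85, 70, 88)
    )

    # gather() reshapes to one row per observation (student x assessment),
    # i.e. the Tidy Data layout: each variable a column, each observation a row.
    tidy_grades <- grades %>%
      gather(assessment, score, quiz1, quiz2)

    # With tidy data, the dplyr verbs compose naturally:
    tidy_grades %>%
      group_by(assessment) %>%
      summarise(mean_score = mean(score))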

One of my take-homes is that, in the long run, it is worth spending a little time writing scripts and building tools that follow the conventions of Tidy Data: for our own sanity when moving between software tools or returning to data at a later point, and for the benefit of other people who might use it (unknown people at some unknown time). Several of Hadley’s examples show that a tabular layout which suits publication may be untidy for processing, which illustrates a general point about the advantage in discriminating between presentation and semantics. That is something to bear in mind when practising reproducible research (see comments in my previous post on FOAS).
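Continuing the invented grade-book sketch above, the presentation/semantics distinction cuts both ways: spread(), the inverse of gather(), regenerates the wide, publication-friendly table from the tidy one, so the presentational form never has to be maintained by hand:

    # From the tidy (semantic) form back to a wide (presentation) table:
    tidy_grades %>%
      spread(assessment, score)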

Foundation for Open Access Statistics

FOAS, the Foundation for Open Access Statistics, brings together the promotion of three ideas that I think are really important for Learning Analytics (“LA” hereafter). They are:

  • Open Source Software (“free as in freedom”),
  • Open Access publishing, which is increasingly being required by research funders, and
  • Reproducible research.

R is a prime example of Open Source working well, with a genuinely awesome (and I am not prone to use that word) collection of well-maintained packages resting on an excellently maintained core, properly managed through CRAN. OK, that was probably stating the obvious to most readers, but I think it is worth reflecting on what a good open source licence brings to the table: the virtuous, and unconstrained, cycle of share-use-improve.

I suppose that one of the motivations for FOAS was the sustainability of the Journal of Statistical Software (JSS). I have found JSS to be an excellent resource over the years since I discovered R, and I really like the way the articles about R packages give both the theoretical background (or a discussion of algorithms) and worked examples with code snippets. Not all articles are R-specific, and even those that are usually give a good grounding in the principles. On the Open Access front, the Journal of Learning Analytics has been an Open Access publication from its inception.

The final bullet, “reproducible research”, remains a minority sport, but one that is sure to grow. The practice of reproducible research is generally still a challenge, but the idea is simple: that other people can repeat the research that is published. It baffles me that, for numerical and computational studies, this is not already the norm; repeatability seems like the essence of good scholarship, and computational studies rather lend themselves to it.

So… why should you care about reproducible research?

  1. If you practise LA but are not a publishing researcher, the techniques of reproducible research make it easier to repeat tasks and to reduce careless errors. They also make it easier to keep a useful archive (a bit more than just a source code repository).
  2. If you are a researcher, reproducible research techniques are likely to improve quality and consistency and make paper revisions a breeze. I have a feeling that it will soon make you a lot more citable.
  3. If you are interested in adopting the results of LA research, doesn’t due diligence demand that you check the reproducibility of the work, and test how well it generalises to your context?

FOAS isn’t just an evangelist for reproducible research; it is a spiritual home for several projects that enable it. I’m particularly fond of RStudio and knitr (the latest release of RStudio has improved pandoc support), and I am also intrigued by OpenCPU. I haven’t tried OpenCPU yet, but it looks like a useful and well-implemented Linux package that wraps an HTTP API around R. It is currently a post-doctoral research project, so a bit of a risk for production use, but it is professionally organised and a good candidate for transfer to a long-term home.
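If you have not seen knitr in action, the gist is easy to convey. Here is a minimal sketch of a reproducible document (the file name, title, and numbers are invented): prose and the code that computes the reported figures live together in one R Markdown file, so re-knitting the file re-runs the analysis.

    ---
    title: "Quiz score summary"
    output: html_document
    ---

    The mean score below is computed when the document is knitted,
    so the reported figure cannot drift out of step with the data.

    ```{r mean-score}
    scores <- c(78, 62, 91, 85, 70, 88)  # invented data
    mean(scores)
    ```

Knitting the file in RStudio (or with rmarkdown::render("report.Rmd") at the console) regenerates the whole report from source, which is the heart of the reproducible-research workflow.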

Why not join FOAS and “promote free software, open access publishing, and reproducible research in statistics”?