On the Question of Validity in Learning Analytics


The question of whether or not something works is a basic one to ask when investing time and money in changing practice and in using new technologies. For learning analytics, the moral dimension adds a certain imperative, although there is much that we do by tradition in teaching and learning in spite of questions about efficacy.

I believe the move to large-scale adoption of learning analytics, with the attendant rise in institution-level decisions, should motivate us to spend some time thinking about how concepts such as validity and reliability apply in this practical setting. The motivation is twofold: large-scale adoption has “upped the stakes”, and non-experts are now involved in decision-making. This article is a brief look at some of the issues with where we are now, and some of the potential pitfalls going forwards.

There is, of course, a great deal of literature on the topics of validity (and reliability), and various disciplines have their own conceptualisations, sometimes with multiple kinds of validity. The Wikipedia disambiguation page for “validity” illustrates the variety, and the disambiguation page for “validation” adds further to it.

For the purpose of this article I would like to avoid choosing one of these technical definitions, because it is important to preserve some variety; the Oxford English Dictionary definition will be assumed: “the quality of being logically or factually sound; soundness or cogency”. Before looking at some issues, it might be helpful to first clarify the distinction between “reliability” and “validity” in statistical parlance (see the diagram below).

Distinguishing between reliability and validity in statistics.



Technical terminology may mislead

The distinction between reliability and validity in statistics leads us straight to the issue that terms may be used with very specific technical meanings that are not appreciated by a community of non-experts, who might be making decisions about what kind of learning analytics approach to adopt. This is particularly likely where terms with every-day meaning are used, but even when technical terms are used without everyday counterparts, non-expert users will often employ them without recognising their misunderstanding.

Getting validity (and reliability) universally on the agenda

Taking a fairly hard-edged view of “validation”, as applied to predictive models, a good start would be to see this being universally adopted, following established best practice in statistics and data mining. The educational data mining research community is very hot on this topic but the wider community of contributors to learning analytics scholarship is not always so focused. More than this, it should be on the agenda of the non-researcher to ask the question about the results and the method, and to understand whether “85% correct classification judged by stratified 10-fold cross-validation” is an appropriate answer, or not.
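To make that question concrete, the following is a minimal sketch of what “judged by stratified 10-fold cross-validation” means in practice, using Python with scikit-learn. The dataset is synthetic and the logistic regression model is chosen purely for illustration; a real study would substitute its own features and classifier.

```python
# Sketch: estimating classification accuracy via stratified 10-fold
# cross-validation, the kind of figure behind claims like "85% correct
# classification". Synthetic data; the model choice is illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# A made-up binary classification problem standing in for e.g.
# "will this student complete the course?"
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified folds preserve the class balance (e.g. completers vs
# non-completers) within every train/test split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"Mean accuracy over 10 folds: {scores.mean():.2f}")
```

The key point for the non-researcher is that the quoted accuracy comes from data the model was not trained on, averaged over ten different splits, rather than from the training data itself.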

Predictive accuracy is not enough

When predictive models are being described, it is common to hear statements such as “our model predicted student completion with 75% accuracy.” Even assuming that this accuracy was obtained using best-practice methods, it glosses over two kinds of fact that we should seek in any prediction (or, more generally, “classification”) but which are too often neglected:

  • How does that figure compare to a random selection? If 80% completed then the prediction is little better than picking coloured balls from a bag (68% predicted correctly). The “kappa statistic” gives a measure that takes account of predictive performance relative to chance, but it doesn’t have such an intuitive feel.
  • Of the incorrect predictions, how many were false positives and how many were false negatives? How much we value making each kind of mistake will depend on social values and what we do with the prediction. What is a sensible burden of proof when death is the penalty?
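Both points can be worked through with a few lines of arithmetic. The sketch below uses the figures from the text (80% completion, 75% model accuracy); the confusion-matrix counts are hypothetical numbers chosen to be consistent with those figures.

```python
# Sketch: why raw accuracy needs context. Figures follow the example in
# the text: 80% of students complete, and a model is 75% accurate.

# 1. Baseline: guessing "complete" in proportion to the 80/20 split
#    ("picking coloured balls from a bag") is right 68% of the time.
p_complete = 0.80
chance_accuracy = p_complete**2 + (1 - p_complete)**2
print(f"chance accuracy: {chance_accuracy:.2f}")  # 0.68

# 2. A confusion matrix (hypothetical counts per 100 students) splits the
#    errors into the two kinds: false negatives (completers we missed) and
#    false positives (non-completers wrongly predicted to complete).
tp, fn = 70, 10   # completers: predicted correctly / missed
tn, fp = 5, 15    # non-completers: predicted correctly / falsely flagged
n = tp + fn + tn + fp
accuracy = (tp + tn) / n
print(f"accuracy: {accuracy:.2f}")  # 0.75, but hides the error split

# 3. Cohen's kappa: agreement beyond chance, using both sets of marginals.
p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
kappa = (accuracy - p_e) / (1 - p_e)
print(f"kappa: {kappa:.2f}")  # ~0.14: far less impressive than "75%"
```

Note that with these counts the model makes 15 false positives and 10 false negatives; whether that trade-off is acceptable depends entirely on what is done with the prediction.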

Widening the conception of validity beyond the technical

One of the issues faced in learning analytics is that the paradigm and language of statistics and data mining could dominate the conceptualisation of validity. The focus on experiment and objective research that is present in most of the technical uses of “validity” should have a counterpart in the situation of learning analytics in practice.

This counterpart has an epistemological flavour, in part, and requires us to ask whether a community of practice would recognise something as being a fact. It includes the idea that professional practice utilising learning analytics would have to recognise an analytics-based statement as being relevant. An extreme example of the significance of social context to what is considered valid (fact) is the difference in beliefs between religious and non-religious communities.

For learning analytics, it is entirely possible to have some kind of prediction that is highly statistically significant and scores highly on all objective measures of performance, but which is still irrelevant to practice, or which produces predictions that it would be unethical to use (or to share with the subjects), and so on.

Did it really work?

So, let’s say all of the above challenges have been met. Did the adoption of a given learning analytics process or tool make a desirable difference?

This is a tough one. The difficulty in answering such questions in an educational setting is considerable, and has led to the recent development of new research methodologies such as Design-based Research, which is gaining adoption in educational technology circles (see, for example, Anderson and Shattuck in Educational Researcher vol. 41, no. 1).

It is not realistic to expect robust evidence in all cases, but in the absence of robust evidence we also need to be aware of proxies such as “based on the system used by [celebrated university X]”.

We also need to be aware of the pitfall of the quest for simplicity in determining whether something worked; an excessive focus on objective benefits neglects much of value. Benefits may be intangible, indirect, found other than where expected, or out of line with political or business rhetoric. As the well worn aphorism has it, “not everything that counts can be counted, and not everything that can be counted counts.” It does not follow, for example, that improved attainment is either a necessary or sufficient guide to whether a learning analytics system is a sound (valid) proposition.

Does it generalise? (external validity)

As we try to move from locally-contextualised research and pilots towards broadly adoptable patterns, it becomes essential to know the extent to which an idea will translate. In particular, we should be interested to know which of the attributes of the original context are significant, in order to estimate its transferability to other contexts.

This thought opens up a number of possibilities:

  • It will sometimes be useful to make separate statements about validity or fitness for purpose for a method and for the statistical models it might produce. For example, is the predictive model transferable, or the method by which it was discovered?
  • It may be that learning analytics initiatives that are founded on some theorisation about cause and effect, and which somehow test that theorisation, are more easily judged in other contexts.
  • As the saying goes, “one swallow does not a summer make” (Aristotle, but still in use!), so we should gather evidence (assess validity and share the assessment) as an initially-successful initiative migrates to other establishments and as time passes.

Transparency is desirable but challenging

The above points have been leading us in the direction of the need to share data about the effect of learning analytics initiatives. The reality of answering questions about what is effective is non-trivial, and the conclusions are likely to be hedged with multiple provisos, open to doubt, requiring revision, etc.

To some extent, the practices of good scholarship can address this issue. How easy is this for vendors of solutions (by which I mean packaged solutions rather than general-purpose analytics tools)? It certainly implies a humility not often attributed to the sales person.

Even within the current conventions of scholarship, we face the difficulty that the data used in a study of effectiveness is rarely available for others to analyse, whether to ask different questions, make different assumptions, or perform meta-analysis. This is the realm of reproducible research (see, for example, reproducibleresearch.net), and it is subject to numerous challenges at all levels, from ethics and business sensitivities down to the practicalities of knowing what someone else’s data really means and the additional effort required to make data available to others. The practice of reproducible research is still a challenge in general, but these issues take on an extra dimension when we consider “live” learning analytics initiatives in use at scale, in educational establishments competing for funds and reputation.

To address this issue will require some careful thought to imagine solutions that side-step the immovable challenges.

Conclusion… so what?

In conclusion, I suggest that we (a wide community including research, innovation, and adoption) should engage in a discourse, in the context of learning analytics, around:

  • What do we mean by validity?
  • How can we practically assess validity, particularly in ways that are comparable?
  • How should we communicate these assessments to be meaningful for, and perceived as relevant by, non-experts, and how should we develop a wider literacy to this end?

This post is a personal view, incomplete and lacking academic rigour, but my bottom line is that learning analytics undertaken without validity being accounted for would be ethically questionable, and I think we are not yet where we need to get to… what do you think?

Target image is “Reliability and validity” by Nevit Dilmen. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:Reliability_and_validity.svg


About Author

Adam works for Cetis, the Centre for Educational Technology and Interoperability Standards, at the University of Bolton, UK. He rather enjoys data wrangling and hacking about with R. He is currently a member of the UK Government Open Standards Board, and a member of the Information Standards Board for Education, Skills and Children’s Services. He is a strong advocate of open standards and open system architecture. Adam is leading the work package on interoperability and data sharing.


  1. Pingback: On the Question of Validity in Learning Analytics | Adam Cooper

  2. [copied from the cross-posting to my personal blog, where Mike commented]

    Good points here, Adam. We need to have more dialog about measuring the impact of analytics. That is, did it move the needle? As someone who recently changed from a practitioner to a vendor, I’ve seen this from multiple sides. I wouldn’t say that analytics without validity is ethically questionable. I just think it’s REALLY hard to undertake a project that proves efficacy. It takes many months to do and it involves a lot of setup to frame the test correctly. Since these tend to be big obstacles, you find folks relying on a measure like accuracy or using the output from one small pilot as their evidence. It’s not unethical. Just because something doesn’t stand up to a rigorous challenge, that doesn’t mean it’s bad. It just means you need to be extra careful.

    On the topic of validity, I have executed a fairly rigorous experiment in the past (http://bluecanarydata.com/evidence). I’m also launching one with a new client this week. If all goes well and if the client allows, I’ll share the outcomes in 6 months (yes…it’s going to take that long to see if we’ve successfully moved the needle).


    Mike Sharkey
    Blue Canary

    • Mike –
      thanks for the comment. My remark “learning analytics undertaken without validity being accounted for would be ethically questionable” was a little bit of comment-baiting…

      I certainly acknowledge that we need to allow experimentation and some degree of risk-taking, but I think we also need to take a nuanced view of what is effective, sometimes recognising that “hard facts” about effectiveness may be misleading or irrelevant, or be in need of revision. Sometimes “moving the needle” may be too stringent a requirement, or an inappropriate means of addressing change in a complex system; sometimes the qualitative effects of a LA initiative may justify it.

      So, a less provocative formulation would be: “learning analytics undertaken at scale without considering what kind of evidence is appropriate, and what evidence actually exists, would be ethically questionable.”

      Cheers, Adam

  3. Pingback: This Week in Learning Analytics (October 18 – October 24, 2014) | Timothy D. Harfield

    • Timothy –
      I think I am paying the price of having written a somewhat compressed piece about a complex topic…

      I did not intend to convey the idea that conflation of concerns is desirable; indeed, the first listed point in the conclusion section could be unpacked to include the idea that we (a community with expertise in different disciplines, but also a community that would benefit from drawing more scholars into it) need to better articulate the distinctions within a wider interpretation of “validity” than exists in particular disciplinary areas. This is not a “loosening of conceptual clarity” but a call for widening of conceptual view.

      It is precisely the risk of under-appreciating complexity that prompted me to write the article. While non-expert practitioners may need a set of expert advisors, I believe the reality is that they are unlikely to have access to them, or will simply not see the need to consult them. At present, it seems likely that these non-experts will make decisions based on seeing a visually-attractive dashboard, a quoted prediction rate, or a statement like “based on the system used by XXX”. We need to move this narrow/near view forwards, to widen the view, and yes, to raise awareness of the need to consult experts. In the process, we should be aware that specialised vocabularies can be a source of difficulty. The same applies across disciplines and vocational areas; not all teams involved in implementing learning analytics will be as diverse as would be ideal. There is, I think, a need to develop awareness of the many sides of “validity”, even within the community.

      So… yes, I’m all for conceptual sophistication, but also for “dumbing-down”. The way forward I see is to develop a more socialised conceptual map as a basis for working out how best to simplify the message.


  4. Pingback: Against “Dumbing Down”: A Response to Adam Cooper’s On the Question of Validity in Learning Analytics | Timothy D. Harfield

  5. Pingback: What is Learning Analytics and what can it ever do for me? | nauczanKi

  6. Charles Lang

    Thanks for bringing up this topic Adam, it is an important piece of the puzzle but seems to be one that stalls easily, maybe because of its complexity, maybe because of the vocabulary, or as Tim pointed out, because it is so context specific.

    One thing I would like to add, which Tim touched on, and which is at the core of the translation between those within and outside the field, is construct validity. This seems to be where the rubber hits the road because although people may not be used to thinking in well-defined probabilistic terms they are very efficient at abstracting from probabilities to causal constructs. To take a currently trending construct, grit, let’s say that a high “grit” score predicts high exam scores – among the general public (and among many researchers to be honest) this will quickly be converted to the idea that there is a thing called grit that people carry around in their head that *causes* them to do well on exams. This problem has long been the purview of psychology, and whether or not we can call any psychological construct causal remains up for debate. What is clear though is that psychology has done a pretty terrible job of translating these issues to the wider world – the urge to make the jump from probability to causal psychological agent seems too great.

    The question is, how can LA and EDM do a better job? My only thought is that we need to develop individual level “validity” metrics for our constructs so that students can gauge the extent to which a given model/construct explains their behavior over time. So I, as a student, can determine whether the idea of “grit” really explains my performance over time or whether I should look for an alternative. That may make things horribly complicated and lead to more uncertainty of course, but maybe that is preferable to the simplistic approach?
