When 87 percent of the U.S. population can be uniquely identified with a reasonable degree of certainty by a combination of ZIP code, date of birth, and gender, you realise how hard it is to de-identify the data exhaust of modern education. Anonymising the data does not give the Learning Analytics community what it is looking for. This is demonstrated in a recent report from MIT and Harvard, in which a data set from 16 MOOCs released as open data was compared with the original data that had not gone through the de-identification process.
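To make the re-identification risk concrete, here is a minimal sketch of how one might measure the share of records that are unique on a set of quasi-identifiers. The function name `unique_fraction` and the four toy records are invented for illustration; they do not come from the MOOC data set or from the 87 percent study.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Made-up toy records (assumption: not real data).
records = [
    {"zip": "02139", "dob": "1990-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1990-01-01", "gender": "F"},
    {"zip": "02139", "dob": "1985-05-05", "gender": "M"},
    {"zip": "02138", "dob": "1990-01-01", "gender": "F"},
]

print(unique_fraction(records, ["zip", "dob", "gender"]))  # 0.5
```

In this toy set, two of the four people share all three attributes, so only the other two are unique. The more attributes a data set carries per person, the closer this share tends to get to 100 percent.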
«We show that these and other de-identification procedures necessitate changes to data sets that threaten replication and extension of baseline analyses,» the MITx and HarvardX researchers claim. Their solution to the problem is to focus on «protecting privacy without anonymizing data by instead expanding policies that compel researchers to uphold the privacy of the subjects in open data sets».
In their paper «Privacy, Anonymity, and Big Data in the Social Sciences», Daries et al. show it is impossible to anonymize identifiable data without the possibility of affecting some future analysis in some way. Higher standards for de-identification can lead to lower-value de-identified data.
As an example, the original analysis of the MOOC data showed that approximately 5 percent of course registrants earned certificates. Some methods of de-identification cut that percentage in half.
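A toy sketch can show how such a shift might arise. It assumes one simple de-identification method: dropping every record whose quasi-identifier combination occurs fewer than k times (suppression, as used to reach k-anonymity). The field names, the 40 records, and k = 5 are all invented for illustration and are not taken from the paper's data.

```python
from collections import Counter


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times."""
    def key(r):
        return tuple(r[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


def certificate_rate(records):
    """Share of registrants who earned a certificate."""
    return sum(r["certified"] for r in records) / len(records)


# Made-up registrants (assumption): a large common group and a small rare one.
records = (
    [{"country": "US", "education": "Bachelor", "certified": i == 0} for i in range(38)]
    + [{"country": "IS", "education": "Doctorate", "certified": i == 0} for i in range(2)]
)

kept = suppress_small_groups(records, ["country", "education"], k=5)

print(f"original rate:      {certificate_rate(records):.3f}")  # 0.050
print(f"de-identified rate: {certificate_rate(kept):.3f}")     # 0.026
```

Here the rare group is suppressed entirely, and because it contained one of only two certificate earners, the measured rate drops from 5 percent to about 2.6 percent. The effect depends on whether the people who stand out are also the people who behave differently.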
The authors conclude that the scientific ideals of open data and the current regulatory requirements concerning anonymising data are incompatible. They think neither abandoning open data nor loosening student privacy protections is a wise option. Instead, they propose either a technological solution such as differential privacy, which separates analysis from possession of the data, or a policy-based solution that allows open access to possibly re-identifiable data while policing the uses of the data.
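For readers unfamiliar with differential privacy, the following is a minimal sketch of one common building block, the Laplace mechanism applied to a counting query. It is not the authors' implementation; the function names, the `epsilon` value, and the toy records are assumptions made for illustration. The analyst only ever sees the noisy answer, never the records themselves.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Answer a counting query with Laplace noise.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Made-up records (assumption): 5 of 100 registrants earned a certificate.
records = [{"certified": i < 5} for i in range(100)]

answer = private_count(records, lambda r: r["certified"], epsilon=1.0)
print(f"noisy certificate count: {answer:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. Each single answer is imprecise, but the noise has mean zero, so the answers are unbiased. Repeated queries on the same data consume privacy budget, which a real system would need to track.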
Learning Analytics raises serious issues related to privacy, transparency, trust, and control of data that the LA community needs to deal with. Leaving the data in closed containers under the sole control of a particular vendor or institution is no solution. To unleash the potential of LA we need open data. Therefore, there is no other option than to start discussing how to build systems of trust that give learners control of their data without giving up the improved learning support that comes from sharing data with others.
A digression to end this post: I was alerted to the paper by Daries et al. via Google Alert. The same day, P2PU (p2pu.org) organised a webinar on The Big Question: (Better) Learning Analytics (see YouTube recording). I did not know the presenters, and when the issue of ethics and LA came up towards the end of the webinar, I posted the reference to the paper discussed above in the chat. Guess who picked up the thread and gave a talk about this paper? One of the authors, Justin Reich of HarvardX. Have a look at the webinar recording, and see what is discussed at the P2PU Forum page!
Daries, J. P., Reich, J., Waldo, J., Young, E. M., Whittinghill, J., Seaton, D. T., et al. (2014). Privacy, Anonymity, and Big Data in the Social Sciences. Queue, 12(7), 30. doi:10.1145/2639988.2661641. Online at http://dl.acm.org/ft_gateway.cfm?id=2661641&type=html