Playful Reading of Learning Analytics Papers using ‘off the shelf’ R packages

Introduction

R is both a language and an environment used by statisticians, and one I have been tinkering with recently for text analysis. I am no expert in statistics by any means, but I still like playing with the software because it has a large collection of add-on packages. These packages can easily be imported into an R environment, giving the user access to techniques they may have come across in papers or on the web, which they can then try on their own data.

I’ve taken some of the popular packages and techniques currently doing the rounds, modified them slightly and applied them to the LAK dataset to see what sort of stuff comes out the other end. This post is a write-up of my playing and is geared towards testing how feasible it is to take off-the-shelf examples of data analysis, push our own dataset through them and get something interesting out the other side. As computer tinkerers know, it can be very difficult and time consuming to get the machine to do the things you want, particularly in an environment you are not much of an expert in. As such, this post will concentrate on ‘is it possible?’ rather than a critique of the techniques themselves; perhaps I will save that for a part two.

I’ve tried to put in enough code examples and videos so you can play along if you like, either with the LAK dataset itself or with your own dataset. These are ‘off the shelf’ examples with small tweaks, so I’ve also linked to the original sources, which you can follow up if you get stuck (or if you notice that I am doing it wrong). If you do want to play along, I recommend using RStudio.

Preparing LAK data set in R format

The basics of preparing RDF for R analysis are taken from Adam Cooper’s pre-processing script, which can be found on his GitHub.

The LAK dataset is available in RDF format to download here, or if you know your SPARQL you can query the endpoint like so: http://lak.linkededucation.org/request/lak-conference/sparql?query=[your sparql query].  The first thing I wanted to do with the dataset was get it into R in a sensible way so that I could try some of these R packages. There is an option to get the data in R format on the website, but the data you will get is the LAK 2012 dataset, which is two years out of date. Adam Cooper, who was involved in converting the LAK 2012 dataset, hosts his script to convert the RDF to R dataframes on his GitHub, but it only works with the 2012 dataset. I’ve updated the script to work with the 2014 dataset; to use it, simply download it here and place the LAK dataset in a folder called Data within the same directory. The script itself is quite straightforward, but the important bits are:

1) We import the rrdflibs and rrdf R packages, which we need to work with the LAK RDF dump. We then create a new RDF triple store object and load in the data. We can see how many triples there are with the summarize.rdf command:
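A minimal sketch of that step (assuming the dump sits at Data/LAK-DATASET-DUMP.rdf, as described above):

library(rrdflibs)
library(rrdf)

# load the RDF dump into an in-memory triple store
lak.rdf <- load.rdf("Data/LAK-DATASET-DUMP.rdf", format = "RDF/XML")

# report how many triples were loaded
summarize.rdf(lak.rdf)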

2) We can then write SPARQL queries, run them against the triple store object and return the results as dataframes. I have three queries, all of which are slightly modified versions of Adam’s queries (a sketch of the general pattern is shown after the three query descriptions below). I won’t go into detail on how to write SPARQL queries here, but there is a great guide here that I often refer to. I’m not the best at SPARQL, so let me know if there are any obvious mistakes:

Creating a dataframe of people, names, location and organisation:

Creating a dataframe of paper information and contents:

Creating a dataframe of authorship data:
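The queries themselves aren’t reproduced in this post, but the pattern is the same for all three: run a SELECT query with sparql.rdf and coerce the result into a dataframe. The sketch below uses the authorship query as an example; the predicate URI is a placeholder, and the real queries (with the correct prefixes and predicates) live in the script itself.

query <- "
  SELECT ?person ?paper
  WHERE { ?paper <http://example.org/placeholder#creator> ?person }
"
authorship <- as.data.frame(sparql.rdf(lak.rdf, query), stringsAsFactors = FALSE)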

You can now poke about in any of the dataframes; RStudio users can click the little table next to a dataframe in the Environment tab to view it. If you wish to write these objects to disk, you can do so using the save command (and load them back in with the load command):
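For example (the dataframe names are those created by the script; the file name is just an illustration):

save(papers, people, authorship, file = "lak2014.RData")
# ...and in a later session:
load("lak2014.RData")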

Now we have all the data in R we can pull out some basic statistics about the dataset: the number of triples in the RDF (using summarize.rdf again), the number of papers per year (by counting the rows of the ‘papers’ dataframe) and the mean number of words per paper:
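For 2008, say, those commands look roughly like this (the word count splits the content column on spaces, the same approach used in the earlier posts in this series):

summarize.rdf(lak.rdf)                   # number of triples in the store
nrow(papers[papers$year == 2008, ])      # number of papers published in 2008
# mean words per paper in 2008: total words that year / number of papers
length(unlist(lapply(papers[papers$year == 2008, ]$content,
                     function(x) strsplit(x, " ")[[1]]))) /
  nrow(papers[papers$year == 2008, ])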

Using these commands we can whip up some basic stats about the data set:

Year  Number of papers  Mean number of words
2008  31                2424
2009  32                2609
2010  61                1521
2011  84                2684
2012  101               3778
2013  153               3070
2014  124               2430

A side note: this seems to be quite a few papers fewer than the LAK data site claims there are. This could be one of two problems: either the process I am using to grab the contents of the papers is dodgy, or the content is missing from the dataset. Exploring the dataset, I do have the problem that the contents of the papers seem to sit behind different predicates. What I have been doing is using the SPARQL UNION keyword (thanks @wilm) to grab the contents of the paper, remove any empty papers, and then remove any duplicates with the following code:
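The query itself isn’t reproduced here; the sketch below shows its shape, with placeholder URIs standing in for the two content predicates (the real query also pulls out the year, title and so on), followed by the clean-up in R:

query <- "
  SELECT ?paper ?content
  WHERE {
    { ?paper <http://example.org/placeholder#body> ?content }
    UNION
    { ?paper <http://example.org/placeholder#content> ?content }
  }
"
papers <- as.data.frame(sparql.rdf(lak.rdf, query), stringsAsFactors = FALSE)
papers <- papers[papers$content != "", ]        # drop papers with empty content
papers <- papers[!duplicated(papers$paper), ]   # drop duplicate paper entries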

If you can spot a problem I would love an update; in the meantime I am going to contact LAK to explain my problem. Still, I think I have enough data to do an analysis, and I can always come back and update the post.

Ben Marwick has an interesting way of plotting the distribution of words per post by year, which I have slightly modified to work with our dataframe. The idea is that we make a dataframe where each row is a paper, with columns for the number of words and the year it was published, and then use ggplot2 to plot it.
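Roughly, reusing the dataframe construction from the earlier post in this series (Ben’s exact choice of geom may differ; a boxplot is one reasonable stand-in):

library(ggplot2)
frametoplot <- data.frame(
  count = unlist(lapply(papers$content, function(x) length(strsplit(x, " ")[[1]]))),
  year = papers$year, stringsAsFactors = FALSE)
ggplot(frametoplot, aes(x = factor(year), y = count)) +
  geom_boxplot() +
  xlab("year") + ylab("words per paper")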

results in:

[Figure: plot of the distribution of words per paper by year in the LAK 2014 dataset]

Topic Modelling

The way in which I approached Topic Modelling for the LAK dataset was taken from Ben Marwick’s reading of archeology blog posts.

Now that we have the data in a format we want, we can do some analysis. A popular machine learning technique at present is to find ‘topic models’ in a collection of documents. The idea is that we can use a statistical model to find abstract “topics” that appear in the documents, and then use that model to estimate how much each document contains of each of these topics. The most common example of topic modelling currently seems to be the Latent Dirichlet allocation (LDA) model.

If you’ve been following along with this post then you should already have all the text of the documents in the papers dataframe. There are a few examples you can follow to do this; there is an R package you can download and use, but the general consensus seems to be that the Java tool MALLET can do it much faster, and fortunately for us there is an R wrapper for that.

1. Add a new column to our papers dataframe

I started by adding a new column to the papers dataframe describing which conference the paper was from. I thought this might be useful for seeing whether certain topics appear more at particular conferences:
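My exact code isn’t shown; a rough sketch, assuming the paper identifiers returned by the SPARQL query mention the conference series (anything that isn’t EDM is lumped in with LAK here), would be:

# assumption: the 'paper' column holds URIs that mention the conference series
papers$conference <- ifelse(grepl("edm", papers$paper, ignore.case = TRUE),
                            "EDM", "LAK")
table(papers$conference, papers$year)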

2. Import the papers into Mallet format

First we need to load the mallet library, then create a dataframe that contains a unique id, the paper text and the year it was published; then we create a mallet instance out of it. Below you can see I have also loaded a file of common stop words, “en.txt”; you can create your own or borrow one from somewhere (a good one comes packaged with Mallet).
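Following the mallet package’s usual pattern (and Ben Marwick’s example), this step looks roughly like:

library(mallet)

documents <- data.frame(id = as.character(seq_len(nrow(papers))),
                        text = as.character(papers$content),
                        year = papers$year,
                        stringsAsFactors = FALSE)
# build a mallet instance list, dropping the stop words listed in en.txt
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")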

3. Set the topic model parameters.

The next step is to set the parameters; these are taken from Ben’s example. I ran this many times trying different parameters and I’m not sure there is a ‘best’ way to do it. I think the parameters will depend greatly on your dataset, the number of papers you have and the size of those papers. The important things you might want to change are the number of topics (n.topics <- 30) and the number of words per topic. This might take a few minutes to run, particularly if you have an old creaky machine like mine.
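Something along these lines, again adapted from the mallet package examples (the training settings are worth experimenting with):

n.topics <- 30
topic.model <- MalletLDA(num.topics = n.topics)
topic.model$loadDocuments(mallet.instances)
topic.model$setAlphaOptimization(20, 50)   # optimise hyperparameters every 20 iterations after 50 burn-in
topic.model$train(200)                     # number of sampling iterations

# word weights per topic and topic proportions per document
topic.words <- mallet.topic.words(topic.model, smoothed = TRUE, normalized = TRUE)
doc.topics <- mallet.doc.topics(topic.model, smoothed = TRUE, normalized = TRUE)

# label each topic with its five most probable words
# (the column may be called 'term' rather than 'words' in newer package versions)
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) {
  topics.labels[topic] <- paste(
    mallet.top.words(topic.model, topic.words[topic, ], num.top.words = 5)$words,
    collapse = " ")
}
topics.labels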

You should be left with topics and some information about how much of your documents relate to each topic. These are the 30 abstract topics in the LAK dataset that I got on my run, due to the generative nature of the model yours will be different:

 [1] “tree classifier classification classifiers tutor”
[2] “patterns students actions sequences pattern”
[3] “model models performance data features”
[4] “lak data topics papers dataset”
[5] “difficulty item time items system”
[6] “hints state hint states game”
[7] “student problem problems students tutor”
[8] “number figure results work data”
[9] “words text word writing document”
[10] “learning analytics learners social learner”
[11] “students data student learning behavior”
[12] “concept dialogue concepts knowledge act”
[13] “data student academic university institutions”
[14] “students student study average higher”
[15] “questions question answer code student”
[16] “students posts topic discussion words”
[17] “set algorithm method values section”
[18] “data items skills item cognitive”
[19] “network analysis social interaction group”
[20] “cluster clusters clustering k-means algorithm”
[21] “students scores learning features reading”
[22] “data teachers students time tool”
[23] “rst graph events dierent interaction”
[24] “students time facial task participants”
[25] “learning causal variables data condition”
[26] “learning analysis based research system”
[27] “rules data students attributes table”
[28] “students video assignment lecture time”
[29] “student skill knowledge parameters students”
[30] “users user learning datasets objects”

While I find poking around the output very interesting, I find it hard to visualise the results and would appreciate any advice. Ben’s solution was to compare the average proportions of each topic across all documents for each year using ggplot2. For our dataset it looks like this:

[Figure: average topic proportions per year across the LAK dataset]
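The plot above was produced roughly along these lines (a sketch, assuming the doc.topics matrix and topics.labels from the previous step, plus reshape2 for reshaping):

library(reshape2)
library(ggplot2)

topic.props <- data.frame(doc.topics)
names(topic.props) <- topics.labels
# average the topic proportions over all papers in each year
yearly <- aggregate(topic.props, by = list(year = documents$year), FUN = mean)
yearly.long <- melt(yearly, id.vars = "year",
                    variable.name = "topic", value.name = "proportion")

ggplot(yearly.long, aes(x = year, y = proportion)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ topic) +
  theme(strip.text = element_text(size = 6))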

But I think it’s very hard to follow; you can kind of see the shift away from the topic “users user learning datasets objects” over time and the shift towards “model models performance data features”.

Another thing we could do is a method I have seen on Sapping Attention, where we make a streamgraph of topics over time. Since the Educational Data Mining conference is the only conference that seems to have papers for all six years, I redid the topic model for that conference only and then plotted the top five topics, with the y-axis representing the percentage of those topics in the papers and the x-axis corresponding to the year the papers were published.

1. students learning student data features model
2. questions student words features students question
3. students data learning mining number student
4. model student data models skill knowledge
5. students student problem problems hints state

over time:

[Figure: streamgraph of the top five EDM topics over time]

Phrase Finding

Ben Schmidt runs the excellent blog Sapping Attention; in some of my favourite posts on the blog he mines the Google Ngram data to see if the language of TV period dramas matches the language of books from the same era. Ben doesn’t share much code, but from what I can gather his general approach to n-grams seems to be to use the tm package to do some pre-processing of the text and then use RWeka to create an n-gram tokenizer.

Since creating an R dataframe from the RDF we can access the text of the papers in papers$content. First we want to import the tm and RWeka libraries; I also import slam so I can use the rollup function. I also set mc.cores = 1 because of an issue with forking in the tm package on Mac OS X:
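That setup is just:

library(tm)
library(RWeka)
library(slam)
options(mc.cores = 1)  # work around forking issues with tm's parallel code on OS X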

The pre-processing of the text using tm is quite straightforward: we create a corpus and then remove stop words, numbers and punctuation. I then make sure all the text is lower case:
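Roughly (with current versions of tm the tolower step needs wrapping in content_transformer):

corpus <- Corpus(VectorSource(papers$content))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))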

We can then create a term document matrix from this corpus:
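For single words that is simply:

tdm <- TermDocumentMatrix(corpus)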

Using the RWeka package we can then create a tokenizer and a dataframe of the phrases arranged by the number of times they have been used. Below I am looking for bigrams, but you can change the number of words in the phrases by editing where I have put 2:
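A sketch of that step, using slam’s row_sums to total each phrase across all papers while keeping the term-document matrix sparse:

# tokenizer for two-word phrases; change min/max for longer n-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

phrase.counts <- sort(row_sums(tdm.bigram), decreasing = TRUE)
phrases <- data.frame(phrase = names(phrase.counts),
                      count = as.integer(phrase.counts),
                      stringsAsFactors = FALSE)
head(phrases, 20)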

I created a tokenizer for phrases of length 2, 3, 4 and 5 words and took out the phrases that would appear in generic papers. Many of the top phrases are ones you would expect to see in any academic paper, for example ‘et al’ or ‘for example’; I took these out and was left with the following top phrases in the LAK dataset:

Phrase  Occurrences
learning analytics 1097
data mining 832
knowledge tracing 437
educational data 400
student performance 353
social network 296
student model 278
educational data mining 272
intelligent tutoring systems 164
social network analysis 142
national science foundation 108
bayesian knowledge tracing 94
data mining techniques 85
sequential pattern mining 72
intelligent tutoring system 71
social learning analytics 60
item response theory 58
dynamic cognitive tracing 56
logistic regression model 56
predicting student performance 54
learning management systems 48
knowledge tracing model 47

I wondered how these changed over time in the dataset. To do this I created a loop that built a corpus for each year, checked how many times each phrase appeared in that corpus and wrote the results to a CSV. Loops are a bit controversial in the R community as they are very resource heavy, and I’m sure there is a better way to do this, but rearranging data frames and such can be quite fiddly, with too much room for error. The content for each year can be grabbed with papers[papers$year == 2008,]$content, so creating a corpus of text with papers from the LAK dataset published in 2008, and counting phrases in it, looks like this:
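A sketch of the loop (the list of phrases to track is illustrative, and BigramTokenizer is the tokenizer defined earlier):

years <- sort(unique(papers$year))
phrases.to.track <- c("learning analytics", "data mining", "knowledge tracing")

counts <- data.frame(year = years)
for (p in phrases.to.track) counts[[p]] <- 0

for (i in seq_along(years)) {
  # build a corpus from just this year's papers
  year.corpus <- Corpus(VectorSource(papers[papers$year == years[i], ]$content))
  year.corpus <- tm_map(year.corpus, content_transformer(tolower))
  year.corpus <- tm_map(year.corpus, removePunctuation)
  year.tdm <- TermDocumentMatrix(year.corpus,
                                 control = list(tokenize = BigramTokenizer))
  for (p in phrases.to.track) {
    if (p %in% Terms(year.tdm)) counts[i, p] <- sum(as.matrix(year.tdm[p, ]))
  }
}
write.csv(counts, "phrases_by_year.csv", row.names = FALSE)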

[Figure: counts of the top phrases in the LAK dataset over time]

Creating a Network Graph

There are lots of ways to import data into R for social network analysis; I guess each has its own advantages and might be easier or harder depending on the format of your data. My favourite resource on this can be found here.

We can do simple network and graph analysis in R using the igraph package, and honestly, while I’m not much of a fan of network diagrams that don’t really explain themselves, I think there is something in being able to provide the data in a format that people can play with in tools such as Gephi. This is the way I do it; it seems to be the simplest way, but I lose some data along the way:

1.  Identify a dataframe that has a list of nodes, and items that might connect these nodes

In the authorship dataframe we have a list of authors and the papers that they have worked on. You can list these in R with the command:
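For example, to see the first few rows:

head(authorship[, c("person", "paper")])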

[Figure: names and the papers they worked on]

2. Create an adjacency matrix

First we create a matrix of names against papers, counting the number of times each person appears on each paper; multiplying that matrix by its transpose then gives a matrix of the number of times authors appear on the same paper together:
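In code (the same lines appear again in the Gephi post further down):

M <- as.matrix(table(authorship[, c("person", "paper")]))  # person x paper counts
Mrow <- M %*% t(M)                                         # person x person co-authorship counts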

3. Create an igraph object

We can create an igraph object out of the adjacency matrix. In this example I also collapse repeated co-authorships into edge weights and remove loops:
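Again mirroring the igraph calls from the Gephi post:

library(igraph)
iMrow <- graph.adjacency(Mrow, mode = "undirected")
E(iMrow)$weight <- count.multiple(iMrow)  # weight edges by the number of shared papers
iMrow <- simplify(iMrow)                  # collapse multiple edges and drop self-loops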

4. Write to a Gephi file.
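The write step is a single igraph call:

write.graph(iMrow, file = "graph.graphml", format = "graphml")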

The file can then be imported into Gephi by going to File -> Open. There are lots of examples of using Gephi on the web, so I am not going to mess with the file much. I like this example of using Gephi, but I recommend getting an understanding of some of the measurements elsewhere; if your library has it, skip to the last chapter of Webb’s 3rd edition of Statistical Pattern Recognition, which is brilliant and really easy to understand. You can save in other formats by replacing the graphml in format="graphml" with pajek, gml, dot, etc.

While this is a very quick and easy method there are some drawbacks: the only attribute the nodes and edges keep is their label, so other information is lost along the way.

ToDo

1. Explore other ways of doing SNA in R

2. Better ways to explain and explore topic models

Final versions of the scripts

Will be on github once I’ve cleaned them up!

The places I borrowed from

Adam Cooper’s GitHub
Ben Marwick’s GitHub
Ben Schmidt’s Sapping Attention

Tidy Data!

Do you suffer from untidy data? Help may be at hand….

Among the many ways not to live a life is spending time on irksome tasks that could be avoided, paying the price of careless mistakes, or working late to make up for inefficiencies but still meet a deadline. If this sounds like your time spent data wrangling, then your problem might be untidy data. Fret not, for help may be at hand, thanks to the work of Hadley Wickham, the brains behind some neat R packages and now part of the RStudio team.

I’m not going to repeat what Hadley has already written, but I’m spreading the word into the Learning Analytics community because I think Hadley’s conceptualisation of Tidy Data is neat and actionable (and there is an R package to help out, but the principles of Tidy Data are not R-specific, so don’t be put off if you are not an R user).

The best place to start is probably Hadley’s article in the Journal of Statistical Software Vol 59, Issue 10, published September 2014, the abstract of which says:

A huge amount of effort is spent cleaning data to get it ready for analysis, but there has been little research on how to make data cleaning as easy and effective as possible. This paper tackles a small, but important, component of data cleaning: data tidying. Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores.

This paper concludes with a nice worked example to demonstrate that tidy data makes a pre-processing pipeline into a simple and intelligible affair. Other useful resources:

  • The R package “tidyr” (only v0.1 at present, but by no means shabby), which is quite minimal and targeted at tidying up untidy data; a minimal example of what it does is sketched below this list.
  • A code-centric version of the JSS paper, which also illustrates use of dplyr, an extensive library for common data manipulation tasks that adds more easily-usable features (“a fast, consistent tool”) than the R Core provides.
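As a flavour of the idea, here is a tiny sketch using tidyr’s gather: a wide, presentation-style table with one column per year is reshaped so that each row is a single observation. The numbers are made up purely for illustration.

library(tidyr)

# untidy, presentation-style table: one column per year (values are invented)
papers.wide <- data.frame(conference = c("LAK", "EDM"),
                          "2012" = c(40, 61), "2013" = c(70, 83),
                          check.names = FALSE)
# gather the year columns into key/value pairs: one row per (conference, year)
papers.tidy <- gather(papers.wide, year, papers, -conference)
papers.tidy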

One of my take-homes is that in the long run it is worth spending a little bit of time to write scripts and build tools that follow the conventions of Tidy Data, both for our own sanity in moving between software tools or returning to data at a later point, and for the benefit of other people who might use it (unknown people at some unknown time). Several of Hadley’s examples show that what might be a suitable tabular presentation for publication can be untidy for processing, which illustrates a general point about the advantage of discriminating between presentation and semantics. That is something to bear in mind when practising reproducible research (see comments in my previous post on FOAS).

Foundation for Open Access Statistics

FOAS, the Foundation for Open Access Statistics, brings together the promotion of three ideas that I think are really important for Learning Analytics (“LA” hereafter). They are:

  • Open Source Software (“free as in freedom”),
  • Open Access publishing, which is increasingly being required by research funders, and
  • Reproducible research.

R is a prime example of Open Source working well, with a genuinely awesome (and I am not prone to use that word) collection of well-maintained packages resting on an excellently-maintained core and properly managed through CRAN. OK, that was probably stating the obvious to most readers, but I think it is worth reflecting on what a good open source licence brings to the table: the virtuous, and unconstrained, cycle of share-use-improve.

I suppose that one of the motivations for FOAS was the sustainability  of the Journal of Statistical Software (JSS). I have found JSS to be an excellent resource over the years since I discovered R, and I really like the way the articles about R packages give both the theoretical background (or discussion of algorithms) and worked examples with code snippets. Not all articles are R-specific, and even those that are usually give a good grounding in the principles. The Journal of Learning Analytics has been an Open Access publication from its inception.

The final bullet, “reproducible research”, remains a minority sport, but one that is sure to grow. The practice of reproducible research is generally still a challenge but the idea is simple: that other people can repeat the research that is published. It baffles me that, for numerical and computational studies, this is not already the norm; repeatability seems like the essence of good scholarship, and computational studies rather lend themselves to it.

So… why should you care about reproducible research?

  1. If you practice LA but are not a publishing researcher, the techniques of reproducible research make it easier to repeat tasks, and to reduce careless errors. They also make it easier to keep a useful archive (a bit more than just a source code repository).
  2. If you are a researcher, reproducible research techniques are likely to improve quality and consistency and make paper revisions a breeze. I have a feeling that it will soon make you a lot more citable.
  3. If you are interested in adopting the results of LA research, doesn’t due diligence demand that you check the reproducibility of the work, and test how well it generalises to your context?

FOAS isn’t just an evangelist for reproducible research, but a spiritual home for several projects that enable it. I’m particularly fond of RStudio and knitr (the latest release of RStudio has improved pandoc support), but I am also intrigued by OpenCPU. I haven’t tried OpenCPU yet, but it looks like a useful and well-implemented Linux package to wrap an HTTP API around R. It is currently a post-doctoral research project, so a bit of a risk for production use, but professionally organised and a good candidate for transfer to a long-term home.

Why not join FOAS and “promote free software, open access publishing, and reproducible research in statistics”?

Playing with LAK Data 3: Easy Gephi Graphs

Recently I’ve started getting my head around the LAK dataset. This has been a challenge for me, not just because I’m getting to grips with the data but also because I’m getting my head around techniques and process. I thought that the process and techniques might be as interesting and useful to people as the data, so I decided to post my journey in a series of blog posts. You can catch all these posts together using the lak-dataset tag.

Getting the latest LAK dataset into R was quite easy using a script that took the LAK RDF dump, ran a few SPARQL queries and stored the results in dataframes. Scanning over dataframes in R, I quite often find myself spotting relationships between columns and wondering how easy it would be to plot a network graph based on the relationship I’ve spotted. I’ve found a few ways that I think are worth blogging somewhere, if only for my own reference. I’m not sure these are the best ways and would appreciate feedback.

Sometimes I have two columns that hold different types of data; for example, after importing the LAK dataset using my modified version of Adam’s script I have an authorship dataframe that has both a paper and a person column. In his original script, Adam already has an example of creating and plotting a network object from such a dataframe:

library("network")
net <- network(authorship[, c("person", "paper")])
plot(net, vertex.cex = 0.6, arrowhead.cex = 0.5)

I like to be able to export a graphml file because there are tools such as Gephi for manipulating graph data that are more user friendly than R. I can’t see a way to do this using the network package, but I can using the igraph package. I’m not sure if I’m missing something, but it seems to be a common issue as a package called intergraph exists that helps convert data objects between the two. Using the following I am able to export from two columns to a graphml file ready to import into Gephi or similar software:

library("igraph")
library("network")
library("intergraph")
net <- network(authorship[, c("person", "paper")])
graph <- asIgraph(net)
plot(net, vertex.cex = 0.6, arrowhead.cex = 0.5)
write.graph(graph, file = "graph.graphml", format = "graphml")

This method creates nodes out of both the person and paper columns, which to me makes the graph confusing; I’d like people as nodes and writing a paper together as the edges that tie them together (though I guess there are situations where you might want both). To create a graph with just one column of the data as nodes, I first create an adjacency matrix that contains information on all the people and how many times they have ‘met’ by writing papers together, then create an igraph object, set any parameters I want, and write it out as a graphml file.

M <- as.matrix(table(authorship[, c("person", "paper")]))
Mrow <- M %*% t(M)    # number of times each pair of people appear on the same paper
#Mcol <- t(M) %*% M   # the paper-by-paper equivalent
write.csv(Mrow, "test.csv")
iMrow <- graph.adjacency(Mrow, mode = "undirected")
E(iMrow)$weight <- count.multiple(iMrow)
iMrow <- simplify(iMrow)
write.graph(iMrow, file = "graph.graphml", format = "graphml")

This generated something in Gephi that looked like the network graph below. I’m not a fan of sharing Gephi-generated pictures, because anything important that you learn about the network really comes from playing with it and seems to be lost as soon as you press the export button, but these steps should be helpful in turning datasets into graphml files to play with yourself.

[Figure: the co-authorship network rendered in Gephi]

Playing with LAK Data 2: Dataset woes

Recently I’ve started getting my head around the LAK dataset. This has been a challenge for me, not just because I’m getting to grips with the data but also because I’m getting my head around techniques and process. I thought that the process and techniques might be as interesting and useful to people as the data, so I decided to post my journey in a series of blog posts. You can catch all these posts together using the lak-dataset tag.

I used a script that I took from Adam Cooper and modified to work with the latest LAK dataset to import the LAK RDF data dump into R. I loaded the script into RStudio and pressed the magic Source button, which runs the script for you. I ended up with three dataframes in my environment, with data on papers, people and authorship. If you have done the same in RStudio you can click the little table next to their names in the Environment tab to explore them. The papers dataframe in particular is pretty big because it contains the text of all the papers, which makes it pretty hard to work out what you are looking at; you can view the column names to get an idea of what is in them using colnames(<dataframe_name>).

I wanted to play with some topic modelling techniques I have my eye on, but before I got stuck in I thought it might be worth the few minutes it takes to get to grips with the text I would be playing with. From the papers dataframe we can work out some simple facts about the dataset. In this dataframe each row is a paper, with a column holding its content. There are 462 papers, which we can find with nrow(papers); there is also a year column, so we can count the number of papers per year using nrow(papers[papers$year == <year>,]). This gives me a little bit of information on when the papers were published, which I think is always good to have at the back of your mind when exploring a collection of publications.

Year Publications
2008 31
2009 32
2010 64
2011 84
2012 104
2013 147

The next thing I want to do is work out what kind of size the papers are, for which I’ll need to work with the content column. I had some problems with this initially, but managed to get around them by coercing the data in the column to a character type with papers$content <- as.character(papers$content). The total number of words in all papers can be found using:

length(unlist(lapply(papers$content, function(x) strsplit(x, " ")[[1]])))

There are 1,259,664 words in all the papers combined! We can grab the content for a single year with papers[papers$year == <year>,]$content, so we can find the mean length of the papers per year using something like:

length(unlist(lapply(papers[papers$year == 2008, ]$content, function(i) strsplit(i, " ")[[1]]))) / nrow(papers[papers$year == 2008, ])

So I can update my table with the mean number of words per article in each year:

Year  Publications  Mean number of words
2008  31            2070
2009  32            2051
2010  64            1187
2011  84            2673
2012  104           3663
2013  147           3050

Before getting started with my topic modelling I decided to plot this data to get a feel for how long the papers are. I created a dataframe of all the papers with their year and number of words, and plotted it as such:

frametoplot <- data.frame(
  count = unlist(lapply(papers$content, function(x) length(strsplit(x, " ")[[1]]))),
  year = papers$year, stringsAsFactors = FALSE)
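The plotting call itself isn’t shown in the post; a ggplot2 boxplot along these lines gives a similar picture:

library(ggplot2)
ggplot(frametoplot, aes(x = factor(year), y = count)) +
  geom_boxplot() +
  xlab("year") + ylab("words per paper")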

[Figure: plot of words per paper by year]

Hang on! This doesn’t look right! According to my plot, lots of papers have 0 words in them, and a quick look at frametoplot[frametoplot$count == 0, ] shows me there are 37 entries in the dataset that don’t have any text! I emailed the LAK dataset people, who got back to me saying that these corrections will be made in the next dataset release. Topic modelling on hold... I had better poke other parts of the dataset for now.

Playing with LAK data 1: Getting the latest data in to R

Recently I’ve started getting my head around the LAK dataset. This has been a challenge for me, not just because I’m getting to grips with the data but also because I’m getting my head around techniques and process. I thought that the process and techniques might be as interesting and useful to people as the data, so I decided to post my journey in a series of blog posts. You can catch all these posts together using the lak-dataset tag.

For some work I want to do I need to get the latest LAK dataset into R. I was happy to see that one of the options on the LAK dataset page is to get the data in R format using a script written by Adam Cooper and hosted on the crunch server. The option still seemed to work by using the following lines in my R script:

con <- url('http://crunch.kmi.open.ac.uk/people/~acooper/data/LAK-Dataset.RData')
load(file = con)
close(con)

Having a poke around the dataset, I noticed that Adam’s code is missing some of the newer proceedings from 2013. While I wait for Adam to return from leave so I can pester him to update the hosted script, I spotted that his script exists on GitHub and it was easy enough to fork and edit so that it works with the latest dataset. It’s quite easy to follow and might be useful as an example for other RDF datasets people wish to get into R.

1. Download this forked example containing changes to Adam’s code to work with the latest dataset.
2. Download the latest LAK RDF dump, extract it and put the LAK-DATASET-DUMP.rdf in your R working directory.
3. Run the script. I like to load the LAKRDF2RData script in RStudio and run through it line by line to work out what is going on.

The code seems very straightforward if you want to change the SPARQL queries and use it with other data sets. If you get tripped up on the process somewhere I recorded myself doing it:

LACE Tech Focus

The LACE Tech Focus blog is a space for members of the LACE project to share experiments and interesting things they are doing. The space is somewhat of a playground for people doing technical work to share ideas or simply leave notes that may be useful for people doing similar things. Dirty code posts, failed experiments and the testing of new languages, datasets or algorithms are all welcome here, even if they are only new to you.

If you would like an account to post on this please contact David Sherlock: ds10 [at] bolton.ac.uk