arc12 edited this page Jul 3, 2012 · 31 revisions

This repository contains several bits of R code to undertake text mining to look for emergent trends and "weak signals".

The subject of the work is "technology enhanced learning" (aka "educational technology", "e-learning", ...) but the method is general.

The text that is being "mined" is drawn from conference abstracts and community blog posts. In this wiki, the general term "document" refers to either an abstract or a blog posting.

Code

Code: Rising and Falling Terms

This is currently the only realisation of the ideas described in "Weak Signals and Text Mining II - Text Mining Background and Application Ideas". An elementary statistical test identifies the rising and falling terms, and is complemented by auxiliary measures of novelty, subjectivity and author centrality. For details, see the pages: Technical details of the rising and falling terms method and Technical details of auxiliary measure calculation. An interpreted walk-through of results, which also serves as a qualitative evaluation of the method and indicates where care is needed in interpreting the results, is {to be written}.
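As a rough illustration of what an elementary test for a rising or falling term could look like, the sketch below compares the proportion of documents containing a term in a past period with the proportion in a recent period. It is an assumption made for illustration, not the method used by this repository (see the technical details page for that): the function name `rising_test`, its arguments and the choice of a two-sample test of proportions (`prop.test` from base R) are all hypothetical.

```r
# Hypothetical sketch: is a term rising or falling between two periods?
# hits_*  = number of documents containing the term in that period
# n_*     = total number of documents in that period
# The test used here (two-sample test of equal proportions) is an
# assumption for illustration, not necessarily the repository's test.
rising_test <- function(hits_past, n_past, hits_recent, n_recent, alpha = 0.05) {
  res <- prop.test(x = c(hits_recent, hits_past), n = c(n_recent, n_past))
  p_past <- hits_past / n_past
  p_recent <- hits_recent / n_recent
  if (res$p.value >= alpha) {
    direction <- "no significant change"
  } else if (p_recent > p_past) {
    direction <- "rising"
  } else {
    direction <- "falling"
  }
  list(p_past = p_past, p_recent = p_recent,
       p_value = res$p.value, direction = direction)
}

# Made-up counts: the term appears in 10 of 1000 past documents
# and in 40 of 1000 recent documents.
result <- rising_test(hits_past = 10, n_past = 1000,
                      hits_recent = 40, n_recent = 1000)
print(result$direction)
```

A real analysis would run such a test over every candidate term, so some correction for multiple comparisons would normally be needed before treating any single term as a signal.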

Code: History Visualiser

History Visualiser is a small program to create a web page (and Google Gadget) containing a Google "Motion Chart" to show how arbitrary sets of terms vary over time. For each term it can show the term frequency, the number of documents the term occurs in, and the values of auxiliary measures: positive and negative sentiment and subjectivity. See the technical details of the History Visualiser.
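The two basic quantities mentioned above, term frequency and the number of documents a term occurs in, can be tabulated per year with base R alone. The following is a minimal sketch under stated assumptions: the toy `docs` data frame, its column names (`year`, `text`) and the helper `term_history` are hypothetical and are not taken from the History Visualiser code, and the tokenisation is deliberately crude.

```r
# Hypothetical toy corpus: one row per document (abstract or blog post).
docs <- data.frame(
  year = c(2009, 2009, 2010, 2010),
  text = c("mobile learning and mobile devices",
           "open content",
           "mobile apps for learning",
           "learning analytics"),
  stringsAsFactors = FALSE
)

# Per-year term frequency and document frequency for a single term.
term_history <- function(docs, term) {
  # Crude tokenisation: lower-case, split on anything that is not a letter
  tokens <- strsplit(tolower(docs$text), "[^a-z]+")
  # Number of occurrences of the term in each document
  counts <- vapply(tokens, function(tk) sum(tk == term), numeric(1))
  data.frame(
    year = sort(unique(docs$year)),
    term_freq = as.vector(tapply(counts, docs$year, sum)),     # total occurrences
    doc_freq = as.vector(tapply(counts > 0, docs$year, sum))   # documents containing the term
  )
}

history <- term_history(docs, "mobile")
print(history)
```

A table of this shape (one row per term and time step) is the kind of input a motion-style chart needs, with the auxiliary measures added as further columns.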

Code: Abstract Acquisition Scripts

This is largely Perl code to process XML data from the DBLP computer science bibliography, fetch and extract conference abstracts from the publisher's site, and format the whole into a CSV file for use by the other programs.

Code: Compair

Compair compares pairs of conferences by inspection of the full text of papers presented in a given year. It focusses on dominant or gross differences between the terms used in the two sets of papers, using the same statistical test as "Rising and Falling Terms". It produces two visualisations of the differences: a plot focussing on frequency and significance, and a graph showing term co-occurrence (created using Gephi). See the technical details of Compair.

Output and Results

Several forms of output are created, generally using the "Brew" package for R to create HTML/JavaScript around the images and data produced by the main programs. Output/results are generally available from the Text Mining Weak Signals Output Repository and committed to the gh-pages branch such that they are accessible as normal web pages.

Output: Rising and Falling Terms

Output created using "Rising and Falling Terms" is currently a report formatted as an HTML page:

Output: History Visualiser

Output from "History Visualiser" is available as both a plain HTML page and as a Google Gadget (in each case, the core is a Google Motion Chart). The sets of terms that have been processed match the results of "Rising and Falling Terms" (see the Technical details of the rising and falling terms method for an explanation of "Rising", "Falling" and "Established"):

Output: Compair

HTML file plus CSV, Gephi format and PDF downloads for a comparison of the 2011 ICALT and ICCE conferences

Acknowledgements

This work was undertaken as part of the TEL-Map Project; TEL-Map is a support and coordination action within EC IST FP7 Technology Enhanced Learning.

Many thanks to contributors to R core and packages. The whole lot is thrilling!