arc12 edited this page Dec 13, 2011 · 2 revisions

This repository contains several bits of R code to undertake text mining to look for emergent trends and "weak signals".

The subject of the work is "technology enhanced learning" (aka "educational technology", "e-learning", ...) but the method is general.

The text that is being "mined" is drawn from conference abstracts and community blog posts. In this wiki, the general term "document" refers to either an abstract or a blog posting.

    1. Code: Rising and Falling Terms
This is currently the only realisation of the ideas described in "Weak Signals and Text Mining II - Text Mining Background and Application Ideas". An elementary statistical test is complemented by the calculation of auxiliary measures of novelty, subjectivity and author centrality. For details, see the pages: Technical details of the rising and falling terms method and Technical details of auxiliary measure calculation. An interpreted walk-through of results, which also serves as a qualitative evaluation of the method and indicates where care is needed in interpreting the results, is {to be written}.
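As a rough illustration of what an "elementary statistical test" for rising terms might look like, the sketch below compares the share of documents containing each term in a past set and a recent set. It is written in Python rather than the repository's R, and everything in it is an assumption made for illustration: the two-proportion z-test, the whitespace tokenisation, and the names `doc_frequency`, `rising_terms` and `z_threshold`. The test actually used is described on the technical details page.

```python
import math
from collections import Counter


def doc_frequency(docs):
    """Count, for each term, the number of documents it appears in."""
    counts = Counter()
    for text in docs:
        # set() so that a term is counted once per document
        counts.update(set(text.lower().split()))
    return counts


def rising_terms(past_docs, recent_docs, z_threshold=1.96):
    """Return (term, z) pairs whose document share rose between the two sets.

    Hypothetical method: a one-sided two-proportion z-test on the fraction
    of documents containing the term.
    """
    n_past, n_recent = len(past_docs), len(recent_docs)
    past = doc_frequency(past_docs)
    recent = doc_frequency(recent_docs)
    results = []
    for term, k_recent in recent.items():
        k_past = past.get(term, 0)
        pooled = (k_past + k_recent) / (n_past + n_recent)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n_past + 1 / n_recent))
        if se == 0:
            continue  # term occurs in every document, so there is nothing to test
        z = (k_recent / n_recent - k_past / n_past) / se
        if z > z_threshold:
            results.append((term, z))
    # Strongest risers first
    return sorted(results, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Made-up documents, for demonstration only
    past = ["learning objects"] * 20
    recent = ["mobile learning"] * 10 + ["learning objects"] * 10
    for term, z in rising_terms(past, recent):
        print(term, round(z, 2))
```

Falling terms could be found in the same way by swapping the two document sets. With small document sets the normal approximation behind the z-test is poor, which is one reason why results of this kind need careful interpretation.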
    1. Code: History Visualiser
History Visualiser is a small program to create a web page (and Google Gadget) containing a Google "Motion Chart" to show how arbitrary sets of terms vary over time. For each term, it can show the term frequency, the number of documents the term occurs in, and the values of auxiliary measures: positive and negative sentiment and subjectivity. For details, see the page: Technical details of the history visualiser.
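A Motion Chart expects rows of the form (entity, time, value, value, ...), so the data preparation amounts to tabulating each term per time period. The following Python sketch shows one way the term frequency and document count could be tabulated per year. It is not the repository's R code; the function name `term_history`, the `(year, text)` input shape and the whitespace tokenisation are all assumptions for illustration, and the sentiment and subjectivity measures are left out.

```python
from collections import Counter


def term_history(docs, terms):
    """Tabulate term frequency and document count per year.

    docs  : iterable of (year, text) pairs (hypothetical input shape)
    terms : list of terms to track
    Returns rows of (term, year, frequency, doc_count), one per term and year.
    """
    freq = Counter()       # total occurrences of a term in a year
    doc_count = Counter()  # number of documents containing the term in a year
    years = set()
    for year, text in docs:
        years.add(year)
        token_counts = Counter(text.lower().split())
        for term in terms:
            n = token_counts[term]
            if n:
                freq[(term, year)] += n
                doc_count[(term, year)] += 1
    return [
        (term, year, freq[(term, year)], doc_count[(term, year)])
        for term in terms
        for year in sorted(years)
    ]


if __name__ == "__main__":
    # Made-up documents, for demonstration only
    sample = [
        (2009, "mobile learning mobile"),
        (2009, "learning games"),
        (2010, "mobile games games"),
    ]
    for row in term_history(sample, ["mobile", "games"]):
        print(row)
```

Emitting a row for every term and year, including zero counts, keeps the series complete so that a chart can animate each term continuously across the time axis.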
    1. Code: Abstract Acquisition Scripts
This is largely Perl code that processes XML data from the DBLP computer science bibliography, fetches and extracts conference abstracts from the publisher site, and formats the whole into a CSV file for use by the other programs.
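The XML-to-CSV part of such a pipeline can be sketched as follows. This is a Python illustration rather than the repository's Perl, and it rests on several assumptions: the sample record is made up and only loosely follows the DBLP `inproceedings` layout, the CSV column names are hypothetical, and the abstract column is left empty because the fetch-and-extract step depends on each publisher's site and is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up record in a simplified DBLP-like layout (assumption)
SAMPLE_XML = """<dblp>
  <inproceedings key="conf/example/Author10">
    <author>A. Author</author>
    <author>B. Writer</author>
    <title>An Example Paper.</title>
    <year>2010</year>
    <booktitle>EC-TEL</booktitle>
    <ee>http://example.org/paper1</ee>
  </inproceedings>
</dblp>"""

# Hypothetical column layout for the output file
FIELDS = ["key", "authors", "title", "year", "booktitle", "url", "abstract"]


def dblp_to_rows(xml_text):
    """Turn each inproceedings record into a dict keyed by FIELDS."""
    root = ET.fromstring(xml_text)
    rows = []
    for paper in root.iter("inproceedings"):
        authors = "; ".join(a.text or "" for a in paper.findall("author"))
        rows.append({
            "key": paper.get("key", ""),
            "authors": authors,
            "title": paper.findtext("title", default=""),
            "year": paper.findtext("year", default=""),
            "booktitle": paper.findtext("booktitle", default=""),
            "url": paper.findtext("ee", default=""),
            "abstract": "",  # would be filled in by the fetch step (not shown)
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(dblp_to_rows(SAMPLE_XML)))
```

The full DBLP dump is far larger than this sample, so a real script would most likely need streaming parsing rather than loading the whole file into memory.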
  1. Output and Results
Several forms of output are created, generally using the "Brew" package for R to create HTML/JavaScript around the images and data produced by the main programs. Output and results are generally available from the Text Mining Weak Signals Output Repository, where they are committed to the gh-pages branch so that they are accessible as normal web pages.

Output created using "Rising and Falling Terms" is currently a report formatted as an HTML page:

Output from "History Visualiser" is available both as a plain HTML page and as a Google Gadget (in each case, the core is a Google MotionChart). The sets of terms that have been processed match the results of "Rising and Falling Terms" (see the Technical details of the rising and falling terms method for an explanation of "Rising", "Falling" and "Established"):
  • 2010 season conferences (ECTEL, ICALT, CAL, ICWL)
    • Rising Terms
    • Falling Terms
    • Established Terms
  1. Acknowledgements
This work was undertaken as part of the TEL-Map Project; TEL-Map is a support and coordination action within EC IST FP7 Technology Enhanced Learning.

Many thanks to contributors to R core and packages. The whole lot is thrilling!