Home
This repository contains several pieces of R code that use text mining to look for emergent trends and "weak signals".
- Background info on "weak signals" is on my blog @ Weak Signals and Text Mining I - An Introduction to Weak Signals
- An outline of relevant text mining techniques and some ideas for using them to look for weak signals is given in a follow-on posting @ Weak Signals and Text Mining II - Text Mining Background and Application Ideas
The subject of the work is "technology enhanced learning" (aka "educational technology", "e-learning", ...) but the method is general.
The text that is being "mined" is drawn from conference abstracts and community blog posts. In this wiki, the general term "document" refers to either an abstract or a blog posting.
"Rising and Falling Terms" is currently the only realisation of the ideas described in "Weak Signals and Text Mining II - Text Mining Background and Application Ideas". An elementary statistical test is complemented by the calculation of auxiliary measures of novelty, subjectivity and author centrality. For details, see the pages: Technical details of the rising and falling terms method and Technical details of auxiliary measure calculation. An interpreted walk-through of results also serves as a form of qualitative evaluation of the method and indicates where care is needed in interpreting the results {to be written}.
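As a rough illustration of what an "elementary statistical test" for a rising or falling term can look like, here is a minimal Python sketch. It is not the repository's R code, and the choice of a two-proportion z-test is an assumption made for the example; the test actually used is described on the technical details page. The function names and the counts in the usage note are hypothetical.

```python
import math


def rising_score(docs_with_term_old, n_old, docs_with_term_new, n_new):
    """Two-proportion z statistic for one term.

    Compares the fraction of documents containing the term in an older
    set with the fraction in a newer set. Positive values suggest the
    term is rising, negative values that it is falling.
    (Assumed test, for illustration only.)
    """
    p_old = docs_with_term_old / n_old
    p_new = docs_with_term_new / n_new
    # Pooled proportion under the null hypothesis of "no change".
    pooled = (docs_with_term_old + docs_with_term_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return 0.0  # term is in none or all of the documents
    return (p_new - p_old) / se


def one_sided_p(z):
    """Upper-tail p-value of a standard normal for the statistic z."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For example, a term found in 10 of 200 earlier documents and 30 of 200 recent ones gives `rising_score(10, 200, 30, 200)` of roughly 3.33, which `one_sided_p` maps to a p-value below 0.001. In practice the auxiliary measures (novelty, subjectivity, author centrality) would then be inspected for the terms that pass such a threshold.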
History Visualiser is a small program to create a web page (and Google Gadget) containing a Google "Motion Chart" to show how arbitrary sets of terms vary over time. It can show term frequency, the number of documents each term occurs in, and the values of auxiliary measures: positive and negative sentiment and subjectivity. See the technical details of the History Visualiser.
This is largely Perl code to process XML data from the DBLP computer science bibliography, fetch and extract conference abstracts from the publisher site, and format the whole into a CSV file for use by the other programs.
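The XML-to-CSV step can be pictured with the following Python sketch. It is only an illustration of the idea, not the Perl code in the repository: the sample record, the column names and the function name `records_to_csv` are all assumptions, and fetching abstracts from the publisher site is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A made-up record in the general shape of a DBLP <inproceedings> entry.
SAMPLE = """<dblp>
  <inproceedings key="conf/example/Doe10">
    <author>Jane Doe</author>
    <author>John Roe</author>
    <title>An Example Paper.</title>
    <year>2010</year>
    <booktitle>EXAMPLE</booktitle>
    <ee>https://example.org/paper1</ee>
  </inproceedings>
</dblp>"""


def records_to_csv(xml_text):
    """Flatten each <inproceedings> record into one CSV row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Column names are hypothetical, chosen for this example.
    writer.writerow(["key", "year", "booktitle", "title", "authors", "url"])
    for rec in root.iter("inproceedings"):
        authors = "; ".join(a.text or "" for a in rec.findall("author"))
        writer.writerow([
            rec.get("key"),
            rec.findtext("year", ""),
            rec.findtext("booktitle", ""),
            rec.findtext("title", ""),
            authors,
            rec.findtext("ee", ""),  # link to the publisher page
        ])
    return out.getvalue()


print(records_to_csv(SAMPLE))
```

The `ee` link is where a separate step would fetch the abstract text; that part depends on each publisher's page layout and is not sketched here.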
Compair compares pairs of conferences by inspection of the full text of papers presented in a given year. It focusses on dominant or gross differences between the terms used in the two sets of papers using the same statistical test as in "Rising and Falling Terms" to produce two visualisations of the differences: one plot focussing on frequency and significance and a graph showing term co-occurrence (created using Gephi). See the technical details of Compair.
Several forms of output are created, generally using the "Brew" package for R to create HTML/JavaScript around the images and data produced by the main programs. Output/results are generally available from the Text Mining Weak Signals Output Repository and committed to the gh-pages branch such that they are accessible as normal web pages.
Output created using "Rising and Falling Terms" is currently a report formatted as an HTML page:
- Report on 2010 season conferences (ECTEL, ICALT, CAL, ICWL)
Output from "History Visualiser" is available as both a plain HTML page and as a Google Gadget (in each case, the guts is a Google MotionChart).
For the following, the sets of terms that have been processed match the results of "Rising and Falling Terms" (see the Technical details of the rising and falling terms method for an explanation of "Rising", "Falling" and "Established"):
- History from 2005-2010 for conferences: ECTEL, ICALT, CAL and ICWL
- 2010 Rising Terms (as Gadget)
- 2010 Falling Terms (as Gadget)
- 2010 Established Terms (as Gadget)
The following were done to correlate with surveys of TEL (Technology Enhanced Learning) industry members by Fabrizio Giorgini (eXact Learning).
- History from 2006-2012 for conferences: ICALT, ICHL and ICWL + ECTEL and CAL to 2011 only.
- TEL Blog posts (approx 700 blogs and 7000 posts per year) from Jan 2009 to Mid Sept 2012
The terms and groups are:
- Cloud (Cloud, Virtualisation, Virtual, SaaS, PaaS)
- eBooks (eBook, eTextbook)
- Analytics (Analytics, Data)
- Gesture (Gesture-based, Gesture)
- Context Sensitive Services (Context, Context-sensitive, Context-aware, Context-enriched, Location, Location-based, Location-aware, Geospatial)
- Games (Game, Gamification, Game-based, Game-play)
- Mobile (Tablet, Smartphone, Mobile, Ubiquitous, Pervasive)
- Learning Platforms (LMS, VLE, LCMS, E-Portfolio, Platform)
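To make the idea of term groups concrete, here is a small Python sketch that counts, per year, how many documents mention at least one term from each group. It is an illustration under simplifying assumptions rather than the History Visualiser code: only three of the groups above are included, matching is on exact lower-cased tokens with no stemming, and the names `GROUPS` and `group_doc_counts` are hypothetical.

```python
import re
from collections import defaultdict

# A subset of the groups listed above, lower-cased for matching.
GROUPS = {
    "Cloud": ["cloud", "virtualisation", "virtual", "saas", "paas"],
    "eBooks": ["ebook", "etextbook"],
    "Games": ["game", "gamification", "game-based", "game-play"],
}

# Words made of letters, optionally joined by hyphens ("game-based").
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def group_doc_counts(docs):
    """Count documents per year that mention any term of each group.

    docs is an iterable of (year, text) pairs.
    Returns {group: {year: number_of_documents}}.
    """
    counts = {group: defaultdict(int) for group in GROUPS}
    for year, text in docs:
        tokens = set(TOKEN.findall(text.lower()))
        for group, terms in GROUPS.items():
            if tokens.intersection(terms):
                counts[group][year] += 1
    return {group: dict(per_year) for group, per_year in counts.items()}


docs = [
    (2009, "Cloud services and SaaS for schools"),
    (2010, "A game-based approach to eBook design"),
    (2010, "Virtual worlds as a game"),
]
print(group_doc_counts(docs))
```

Because there is no stemming in this sketch, a plural such as "games" would not match "game"; a real pipeline would normally normalise terms before counting.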
Previous Sets (different terms)
- History from 2006-2011 for conferences: ECTEL, ICALT, CAL, ICHL and ICWL
- Fabrizio's Terms grouped by theme (with new group/drilldown)
- Fabrizio's Basic Terms
- Fabrizio's Additional Terms
- TEL Blog posts (approx 700 blogs and 7000 posts per year) from Jan 2009 to June 2012
- [Fabrizio's Terms](http://arc12.github.com/Text-Mining-Weak-Signals-Output/History%20Visualiser/TEL%20Blogs%20700%2020090101-20120630/Groups.html) grouped by theme (with new group/drilldown)
- [Fabrizio's Basic Terms](http://arc12.github.com/Text-Mining-Weak-Signals-Output/History%20Visualiser/TEL%20Blogs%20700%2020090101-20120630/FG.Basic.html)
- [Fabrizio's Additional Terms](http://arc12.github.com/Text-Mining-Weak-Signals-Output/History%20Visualiser/TEL%20Blogs%20700%2020090101-20120630/FG.Additional.html)
HTML file plus CSV, Gephi format and PDF downloads for a comparison of the:
This work was undertaken as part of the TEL-Map Project; TEL-Map is a support and coordination action within EC IST FP7 Technology Enhanced Learning.
Many thanks to contributors to R core and packages. The whole lot is thrilling!