
Technical details of auxiliary measure calculation


A Note About Entities: Whereas the primary measures of frequency and document occurrence - and the statistical significance of their variation - directly refer to terms (i.e. stemmed words), the auxiliary measures refer to terms only indirectly; they directly refer to other entities. Each of the following sections identifies the relevant entities and defines the measure in relation to that entity before explaining how the measure is calculated for a term. The output from the calculations/programs makes use of auxiliary measures associated with different kinds of entity; hence the actual calculation used to create a given value in the results differs according to the entity.

Novelty and Standardised Novelty

Novelty is a property of a document, d, and is defined in relation to a set of documents, D.

Novelty, N, of document d is defined as: N_d=1-max(S_d), where max(S_d) is the maximum value of the cosine similarity measure, S, between the document in question, d, and all others in the set D.

Similarity, S, is calculated using vectors containing the binary occurrence of stemmed terms, i.e. two documents have a similarity of 1 if they contain an identical set of terms (ignoring differences in how many times each term appears) and a similarity of 0 if they share no common terms. Stop words are removed, as are terms found in more than 50% of documents.

This measure works poorly when documents are short but the corpus contains many distinct terms. For a typical novelty calculation against a corpus of 2500 conference abstracts, around 6000 terms are in use but abstracts typically contain only a few hundred words. Hence S is <<1 and seemingly large values of N are found even for ordinary documents; in this example the distribution of N for a set of test documents is skewed strongly towards 1. This effect is sometimes known as the "curse of dimensionality".
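
As a rough illustration of the novelty calculation described above (a sketch, not the project's actual code), the following computes N for each document from a precomputed binary term-document matrix:

```python
# Illustrative sketch: novelty from cosine similarity over binary term vectors.
import numpy as np

def novelty(term_doc_binary):
    """term_doc_binary: array of shape (n_docs, n_terms) with 0/1 entries
    (stop words and terms in >50% of documents assumed already removed).
    Returns N_d = 1 - max over the other documents of cosine similarity."""
    X = np.asarray(term_doc_binary, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    norms[norms == 0] = 1.0                      # guard against empty documents
    sim = (X @ X.T) / np.outer(norms, norms)     # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity from the max
    return 1.0 - sim.max(axis=1)
```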

Since "novelty" is a relative concept and in order to partially compensate for the effects of the "curse of dimensionality" - although reliability for the highly skewed case remains inevitably questionable - a "Standardised Novelty", Std(N), is calculated. Std(N_d)=(N_d-median(N_D))/(1-median(N_D)), where median(N_D) is the median of N for all documents in D. Std(N) is 0 for documents with a value of N at the median and is 1 for a document containing no terms in common with any other in the set (i.e. S_d=0).

This measure is only ever used in relation to the document entity since term novelty is clearly best determined by other means.

Sentiment

At present, positive and negative sentiment and a compound measure called "subjectivity" are calculated.

A simple approach to calculating a sentiment score is used, based on the Harvard General Inquirer lexicon; this marks each of a long list of words as being associated with a range of sentiments or with a concept such as "politics". The General Inquirer lexicon distinguishes between cases where the same word is used with different meanings but, since my approach is to treat document content as a "bag of words", this information is not available; the most frequent usage as indicated by the General Inquirer is adopted.

The score calculated for a given sentiment (positive or negative) is the fraction of the words used in a document that are listed under the relevant category in the Harvard General Inquirer lexicon. Stop words are removed prior to scoring sentiment but stemming is not employed since the lexicon contains full words.

Subjectivity is simply the sum of the positive and negative sentiment scores. A high Subjectivity is assumed to indicate the author's active interest rather than passive observation; active interest may correlate with weak signals.
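
To illustrate the scoring just described, the sketch below computes positive and negative scores and their sum for a single document; gi_positive, gi_negative and stop_words are placeholder sets standing in for the actual General Inquirer categories and stop-word list:

```python
# Illustrative sketch: document sentiment as the fraction of non-stop words
# listed under a General Inquirer category, plus subjectivity as their sum.
def sentiment_scores(text, gi_positive, gi_negative, stop_words):
    words = [w for w in text.lower().split() if w not in stop_words]
    if not words:
        return {"positive": 0.0, "negative": 0.0, "subjectivity": 0.0}
    pos = sum(w in gi_positive for w in words) / len(words)
    neg = sum(w in gi_negative for w in words) / len(words)
    return {"positive": pos, "negative": neg, "subjectivity": pos + neg}
```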

Sentiment is indirectly associated with terms according to the sentiment scores of the documents that contain the term in question; the mean score among these documents is calculated. The same approach is used for term subjectivity, i.e. it is the mean of the document subjectivity scores rather than simply the sum of the positive and negative term-sentiment scores.
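
Continuing the sketch, the term-level aggregation might look like the following, where doc_terms holds the stemmed term set of each document and doc_scores the per-document dictionaries from the previous snippet:

```python
# Illustrative sketch: a term's score is the mean of the document-level scores
# over the documents that contain the (stemmed) term.
def term_sentiment(term, doc_terms, doc_scores, key="subjectivity"):
    values = [s[key] for terms, s in zip(doc_terms, doc_scores) if term in terms]
    return sum(values) / len(values) if values else 0.0
```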

Author Centrality

Betweenness centrality measures for conference paper authors were calculated by DBIS at RWTH Aachen University as described in the TELMap project deliverable D4.3. These calculations consider the co-authorship network, an author's position in which is taken to be a surrogate for the influence of the ideas they hold. A (weak) signal observed in a document whose author has high betweenness centrality is assumed to be more likely to be a sign of change because of that author's influence.

At present, the use of author betweenness centrality is limited to conference papers. Since many documents (papers) have more than one author and since any terms of interest are distributed among documents, there are two stages of indirection to be considered; betweenness centrality measures associated with both document and term entities are indirect.

The author betweenness centrality, B_a, is calculated from the co-authorship network. NB: the network is dynamic and the values of B_a depend on the corpus of work studied (see D4.3 above).
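
For illustration only (the values used in practice were computed by DBIS at RWTH Aachen as noted above), author betweenness centrality could be derived from a co-authorship network along these lines, here using the networkx library:

```python
# Illustrative sketch: build a co-authorship graph and compute B_a per author.
import itertools
import networkx as nx

def author_betweenness(papers):
    """papers: iterable of author lists, one list per paper."""
    G = nx.Graph()
    for authors in papers:
        G.add_nodes_from(authors)
        G.add_edges_from(itertools.combinations(authors, 2))  # co-author edges
    return nx.betweenness_centrality(G)   # {author: B_a}
```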

Document betweenness-influence, B_d, is defined as max(B_a) over the document's authors. This is probably an underestimate but avoids the complexity of merging the networks of each co-author (a simple sum would double-count the shared parts of the network).

Rather than compute a single measure of term betweenness-influence, the values of B_d for the documents containing the term in question are enumerated; where a summary is required, the relationship between author centrality and the influence of the ideas expressed in those documents is taken as the mean of these betweenness values.
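
A sketch of the document-level and term-level steps just described, reusing the B_a dictionary from the previous snippet:

```python
# Illustrative sketch: document betweenness-influence and its term-level use.
def document_betweenness(authors, B_a):
    """B_d = max of B_a over the document's authors (0 if none are known)."""
    return max((B_a.get(a, 0.0) for a in authors), default=0.0)

def term_betweenness_values(term, doc_terms, doc_B):
    """Enumerate B_d for documents containing the term; the mean of these
    values serves as a summary of the associated authors' influence."""
    values = [b for terms, b in zip(doc_terms, doc_B) if term in terms]
    return values, (sum(values) / len(values) if values else 0.0)
```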
