Topic models discover 'topics' that occur in a collection of documents. For example, each sentence below is annotated with its inferred topic mixture:
"Statistics may be dull, but it has its moments." → 67% topic A, 33% topic B.
"Please laugh." → 50% topic B, 50% topic C.
"Laughing is good." → 100% topic C.
Towards topic interpretation
Each topic defines a different probability mass function over the same set of words (i.e., the vocabulary).
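As a minimal sketch with made-up numbers, two topics over a three-word vocabulary can be stored as rows of a matrix, where each row is a pmf summing to 1:

```r
# Two toy topics over the same three-word vocabulary (illustrative numbers only):
phi <- rbind(
  topicA = c(0.70, 0.10, 0.20),
  topicB = c(0.20, 0.60, 0.20)
)
colnames(phi) <- c("statistics", "laugh", "moments")
rowSums(phi)  # each topic's probabilities sum to 1
```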
Problem: topics are not easily interpretable on their own, and the vocabulary is often very large. Which words should we focus on?
Typically, one produces a ranked list of words deemed important for understanding a given topic; but how should we measure importance?
Measure 1: \(p(w_i|z_j)\) – the probability of word \(w_i\) given topic \(z_j\).
Drawback: common words tend to appear near the top of such lists for multiple topics, making it hard to differentiate topics.
Measure 2: \(\text{lift} = \frac{p(w_i|z_j)}{p(w_i)}\), where \(p(w_i)\) is the overall probability of word \(w_i\) in the corpus.
Drawback: rare words tend to be ranked too highly.
We believe that a compromise between these two measures can aid topic interpretation: \[
\text{relevance} = \lambda \cdot p(w_i|z_j) + (1 - \lambda) \cdot \text{lift}
\] where \(0 \le \lambda \le 1\) controls how much weight is placed on topic-specific probability versus lift.
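To see how the three rankings relate, here is a hypothetical helper (a sketch of the formula above, not code from any package); `relevance`, `phi_j`, and `p_w` are illustrative names:

```r
# Rank one topic's words by relevance (a sketch, not LDAvis internals).
# phi_j: named vector of p(w|z_j) for one topic; p_w: marginal p(w).
relevance <- function(phi_j, p_w, lambda = 0.5) {
  lift <- phi_j / p_w                           # Measure 2
  rel  <- lambda * phi_j + (1 - lambda) * lift  # the compromise defined above
  sort(rel, decreasing = TRUE)                  # ranked word list for this topic
}

p_w <- colMeans(phi)  # marginal p(w), assuming equally weighted topics (toy assumption)
relevance(phi["topicA", ], p_w, lambda = 0.5)
# lambda = 1 recovers the p(w|z_j) ranking; lambda = 0 recovers the lift ranking.
```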
User study
Our user study suggested that \(\lambda \approx 0.6\) tends to produce the most interpretable word rankings.
A few remarks
We anticipate this 'optimal' value of \(\lambda\) will vary for different datasets.
For this reason, it is useful to have an interactive tool that can quickly re-rank words as \(\lambda\) varies.
The R package LDAvis makes it easy to create interactive visualizations that aid topic interpretation.
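A minimal usage sketch follows; `createJSON()` and `serVis()` are the package's two main functions, while the input objects (`phi`, `theta`, `doc.length`, `vocab`, `term.frequency`) are placeholders for quantities extracted from your own fitted topic model:

```r
library(LDAvis)

# Inputs from a fitted topic model (placeholder names):
#   phi            - topics x terms matrix of p(w|z); rows sum to 1
#   theta          - documents x topics matrix of topic proportions; rows sum to 1
#   doc.length     - number of tokens in each document
#   vocab          - character vector of terms
#   term.frequency - corpus-wide frequency of each term
json <- createJSON(
  phi            = phi,
  theta          = theta,
  doc.length     = doc.length,
  vocab          = vocab,
  term.frequency = term.frequency
)

# Open the interactive visualization, which includes a slider for lambda
serVis(json)
```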