Interactive & Dynamic Statistical Web Graphics

Carson Sievert

December 17th, 2015

Bio

  • B.A. in Math & Economics, Saint John’s University (06-10)
  • Data Analyst, College Enrollment & Financial Aid Consulting Firm (10-11)
  • M.S. in Statistics, Iowa State University (11-13)
    • Thesis: Tools for Collecting and Analyzing MLB PITCHf/x Data (pitchRx).
  • Research Intern, Statistics Research Department at AT&T Labs (Summer '13)
    • Worked with Dr. Kenny Shirley on LDAvis and LDAtools.
  • Took PhD courses and passed written qualifying exam (13-14)
  • Student, Google Summer of Code (Summer '14)
    • Began work on animint
  • Teaching Assistant, Iowa State University (11-15)
  • Mentor, Google Summer of Code (Summer '15)
  • Software Developer, plotly (Summer '15 - Present)
  • Research Assistant, Monash University (Sept '15 - Present)

Proposal Overview

  • The importance of interface design
  • Interfaces for working with web content
  • Interfaces for acquiring data on the web
  • Dynamic interactive statistical web graphics
    • Why interactive?
    • Indirect versus direct manipulation
    • Linked views and pipelines
    • Web graphics
    • Translating R graphics to the web
    • R interfaces for interactive web graphics

Motivation

  • Why interactive & dynamic graphics? They help us:
    • Find high-dimensional, abstract relationships in data that may otherwise go unnoticed
    • Diagnose models by plotting them in the data space (Wickham, Cook, & Hofmann 2015)
    • Explore and understand complicated statistical model fit(s)
    • Communicate/share our work with others in a compelling way
  • Why web-based?
    • simple to share, portable (web browser)
    • encourages composability
    • guide your audience by providing links to interesting selections/states

An Example: LDA Topic models

  • Topic models are a collection of statistical models with the common goal of finding hidden structure in a collection of text documents.
  • Basic example: Given a document discussing 'sports', you're more likely to see the word 'baseball' in that document compared to a document discussing 'music' (and vice-versa for 'guitar').
  • Documents usually don't have a single clear "topic", but we can develop models with latent random variables to "discover topics".
  • Latent Dirichlet Allocation (LDA) is a topic model which allows documents to be mixtures of topics (Blei, Ng, & Jordan, 2003).

The Generative Model

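  1. Choose the number of topics $K$. Let $V$ be the number of unique words (the vocabulary).
  2. For each document $d \in \{1, \dots, D\}$, draw $\Theta_d \sim \text{Dir}(\alpha)$, where $\Theta_d = (\theta_1, \dots, \theta_K)$, each $\theta_i > 0$, and $\sum_{i=1}^{K} \theta_i = 1$.
  3. For each topic $k \in \{1, \dots, K\}$, draw $\Phi_k \sim \text{Dir}(\beta)$, where $\Phi_k = (\phi_1, \dots, \phi_V)$, each $\phi_i > 0$, and $\sum_{i=1}^{V} \phi_i = 1$.
  4. Let $N_d$ be the number of words in document $d$ and $n \in \{1, \dots, N_d\}$. For each word $w_{d,n}$:
    • Draw a (latent) topic, $z_{d,n} \sim \text{Mult}(1, \Theta_d)$
    • Draw a word given the topic, $w_{d,n} \sim \text{Mult}(1, \Phi_{z_{d,n}})$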
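
The four steps above can be simulated directly. The following is a minimal base-R sketch, not code from any of the packages mentioned in this talk. The sizes (K, V, D), the hyperparameter values, the Poisson document lengths, and the helper `rdirichlet()` are all assumptions made for illustration.

```r
# Minimal sketch of the LDA generative process in base R.
# All sizes and hyperparameters below are illustrative assumptions.
set.seed(1)

K <- 3        # number of topics
V <- 8        # vocabulary size
D <- 5        # number of documents
alpha <- 0.5  # symmetric Dirichlet parameter for document-topic proportions
beta  <- 0.5  # symmetric Dirichlet parameter for topic-word distributions

# Hypothetical helper: draw n vectors from a symmetric Dirichlet
# by normalizing independent gamma variates.
rdirichlet <- function(n, size, conc) {
  g <- matrix(rgamma(n * size, shape = conc), nrow = n, ncol = size)
  g / rowSums(g)
}

theta <- rdirichlet(D, K, alpha)  # step 2: D x K, one row per document
phi   <- rdirichlet(K, V, beta)   # step 3: K x V, one row per topic

# Document lengths N_d (Poisson lengths are an assumption; +1 avoids empty docs).
N <- rpois(D, lambda = 20) + 1

# Step 4: for each word, draw a latent topic, then a word given that topic.
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, N[d], replace = TRUE, prob = theta[d, ])
  w <- vapply(z, function(k) sample.int(V, 1, prob = phi[k, ]), integer(1))
  data.frame(doc = d, topic = z, word = w)
})
corpus <- do.call(rbind, docs)
head(corpus)
```

Each row of `corpus` is one word token; in real data only the `doc` and `word` columns are observed, and `topic` is the latent variable the model has to recover.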

Model fitting

  • Griffiths & Steyvers (2004) derive a collapsed Gibbs sampler, which is implemented in the R packages LDAtools (Shirley & Sievert, 2013) and lda (Chang, 2015).
  • A wide array of fitting algorithms is available in topicmodels (Grün & Hornik, 2011) and mallet (Mimno, 2013).
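
To make the collapsed Gibbs sampler concrete, here is a minimal base-R sketch in the spirit of Griffiths & Steyvers (2004). It is not the implementation used by LDAtools or lda. The toy corpus, hyperparameter values, and iteration count are assumptions chosen only so the example runs quickly.

```r
# Minimal collapsed Gibbs sampler for LDA (illustrative sketch only).
set.seed(42)

# Toy corpus (assumed data): each document is a vector of word ids in 1..V.
docs <- list(c(1, 2, 1, 3, 2), c(4, 5, 4, 6, 5), c(1, 2, 4, 5, 3))
K <- 2; V <- 6
alpha <- 0.1; beta <- 0.01

# Random initial topic assignment for every token.
z <- lapply(docs, function(w) sample.int(K, length(w), replace = TRUE))

# Count matrices: tokens per (document, topic) and per (topic, word).
n_dk <- matrix(0, length(docs), K)
n_kw <- matrix(0, K, V)
for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
  k <- z[[d]][i]; w <- docs[[d]][i]
  n_dk[d, k] <- n_dk[d, k] + 1
  n_kw[k, w] <- n_kw[k, w] + 1
}

for (iter in 1:200) {
  for (d in seq_along(docs)) for (i in seq_along(docs[[d]])) {
    k <- z[[d]][i]; w <- docs[[d]][i]
    # Remove the current token from the counts.
    n_dk[d, k] <- n_dk[d, k] - 1
    n_kw[k, w] <- n_kw[k, w] - 1
    # Full conditional p(z = k | everything else), up to a constant.
    p <- (n_dk[d, ] + alpha) * (n_kw[, w] + beta) / (rowSums(n_kw) + V * beta)
    k <- sample.int(K, 1, prob = p)
    # Record the new assignment and restore the counts.
    z[[d]][i] <- k
    n_dk[d, k] <- n_dk[d, k] + 1
    n_kw[k, w] <- n_kw[k, w] + 1
  }
}

# Point estimates from the final state of the chain.
phi_hat   <- (n_kw + beta) / (rowSums(n_kw) + V * beta)    # K x V
theta_hat <- (n_dk + alpha) / (rowSums(n_dk) + K * alpha)  # D x K
round(phi_hat, 2)
```

The sampler is "collapsed" because the topic-word and document-topic distributions are integrated out, so only the token-level assignments `z` are sampled. The estimates are recovered from the counts afterwards.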

Model Output

  • In the digital humanities (& elsewhere), LDA is often used to "discover topics" in a large collection of text documents.
  • How are researchers supposed to interpret topics? We can't possibly examine each pmf.
  • "Overview first, then zoom & filter, then detail on demand" (Shneiderman, 1996)

Towards topic interpretation

  • Numerous interactive systems allow users to select a topic z, then list top ~30 words based on p(w|z) (Gardner et al., 2010; Chaney and Blei, 2012; Snyder et al., 2013).
  • But, words likely to occur overall are also likely to occur for a given topic!
  • Taddy (2011) proposed ranking terms by $\text{lift} = p(w|z) / p(w)$.
  • But if $p(w)$ is small, lift is large, so rare words can dominate the ranking!
  • Bischof and Airoldi (2012) propose a new model that directly estimates a term's average frequency and its exclusivity to a given topic.
  • Sievert & Shirley (2014) propose choosing $0 < \lambda < 1$ and ranking terms by $\text{relevance}(\lambda) = \lambda \cdot p(w|z) + (1 - \lambda) \cdot \text{lift}$.
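
The trade-off between the two rankings can be seen on a toy example. The sketch below uses made-up topic-word probabilities and equal topic weights (both assumptions), and it follows the unlogged form of relevance written above. Sievert & Shirley (2014) state relevance with logarithms of both terms. The function names `relevance()` and `rank_terms()` are hypothetical and are not part of LDAvis.

```r
# Hypothetical topic-word probabilities p(w|z): 2 topics x 5 terms.
vocab <- c("the", "game", "baseball", "guitar", "band")
phi <- rbind(
  sports = c(0.50, 0.25, 0.20, 0.01, 0.04),
  music  = c(0.50, 0.05, 0.01, 0.24, 0.20)
)
colnames(phi) <- vocab
topic_weight <- c(sports = 0.5, music = 0.5)  # assumed p(z)

# Marginal term probability p(w) = sum over z of p(w|z) * p(z).
p_w <- colSums(phi * topic_weight)

# lift = p(w|z) / p(w), computed column by column.
lift <- sweep(phi, 2, p_w, "/")

# Relevance as a weighted combination of p(w|z) and lift.
relevance <- function(lambda, topic) {
  lambda * phi[topic, ] + (1 - lambda) * lift[topic, ]
}

# Terms ordered from most to least relevant for a topic.
rank_terms <- function(lambda, topic) {
  names(sort(relevance(lambda, topic), decreasing = TRUE))
}

rank_terms(1, "sports")  # ranks by p(w|z): the common word "the" comes first
rank_terms(0, "sports")  # ranks by lift: "baseball" comes first
```

Values of λ between 0 and 1 interpolate between these two orderings.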

Choosing lambda

  • We expect the optimal value of λ to vary across data sets.
  • LDAvis allows users to interactively alter rankings by changing λ

Who is using it?

  • People who use LDA and want a tool for interpreting topics.
  • Combined LDAvis and pyLDAvis currently have 356 stars on GitHub (a measure of popularity).
  • I know a number of consultants, industry workers, and educators using it for exploration, presentation, and teaching. Here are a few videos.
  • Dr. Grant Arndt in the Department of Anthropology at Iowa State University (and his research assistant) are using it as a research aid.

Why is this important?

  • We're helping analysts to gain insight from sophisticated statistical models, communicate their results, and teach others.

Tools for interactive graphics

  • Producing interactive and dynamic web graphics from "scratch" is powerful and flexible, but very time-consuming.
  • How do we best enable statisticians to create their own interactive dynamic web graphics?
  • I've worked on two R packages in this direction: animint and plotly.
  • Both can translate ggplot2 (Wickham, 2009) graphics to a web-based format (SVG/canvas) with some basic interactivity.
  • ggplot2 is wildly successful thanks to its implementation of a "Grammar of Graphics" (Wilkinson, 1999).
  • animint extends this grammar in a novel direction to enable a constrained form of "linked views" (Hocking et al., 2015).

Basic ggplot2 example

library(ggplot2)
p <- qplot(data = iris, x = Sepal.Width, y = Sepal.Length, color = Species)
p

Interactive ggplot2 with plotly

library(plotly)
ggplotly(p)

[Figure: interactive plotly scatterplot of Sepal.Length (y-axis) against Sepal.Width (x-axis), colored by Species (setosa, versicolor, virginica).]

Interactive ggplot2 with animint

library(animint)
structure(list(plot = p), class = "animint")

[Figure: interactive animint scatterplot of Sepal.Length (y-axis) against Sepal.Width (x-axis), with a Species legend.]
