Interactive & Dynamic Statistical Web Graphics

Carson Sievert

December 17th, 2015

Bio

  • B.A. in Math & Economics, Saint John’s University (06-10)
  • Data Analyst, College Enrollment & Financial Aid Consulting Firm (10-11)
  • M.S. in Statistics, Iowa State University (11-13)
    • Thesis: Tools for Collecting and Analyzing MLB PITCHf/x Data (pitchRx).
  • Research Intern, Statistics Research Department at AT&T Labs (Summer '13)
    • Worked with Dr. Kenny Shirley on LDAvis and LDAtools.
  • Took PhD courses and passed written qualifying exam (13-14)
  • Student, Google Summer of Code (Summer '14)
    • Began work on animint
  • Teaching Assistant, Iowa State University (11-15)
  • Mentor, Google Summer of Code (Summer '15)
  • Software Developer, plotly (Summer '15 - Present)
  • Research Assistant, Monash University (Sept '15 - Present)

Proposal Overview

  • The importance of interface design
  • Interfaces for working with web content
  • Interfaces for acquiring data on the web
  • Dynamic interactive statistical web graphics
    • Why interactive?
    • Indirect versus direct manipulation
    • Linked views and pipelines
    • Web graphics
    • Translating R graphics to the web
    • R interfaces for interactive web graphics

Motivation

  • Why interactive & dynamic graphics? They help us:
    • Find high-dimensional, abstract relationships in data that may otherwise go unnoticed
    • Diagnose models by plotting them in the data space (Wickham, Cook, & Hofmann 2015)
    • Explore and understand complicated statistical model fit(s)
    • Communicate/share our work with others in a compelling way
  • Why web-based?
    • simple to share, portable (web browser)
    • encourages composability
    • guide your audience by providing links to interesting selections/states

An Example: LDA Topic models

  • Topic models are a collection of statistical models with the common goal of finding hidden structure in a collection of text documents.
  • Basic example: Given a document discussing 'sports', you're more likely to see the word 'baseball' in that document compared to a document discussing 'music' (and vice-versa for 'guitar').
  • Documents usually don't have a clear "topic", but we can develop models with latent random variables to "discover topics".
  • Latent Dirichlet Allocation (LDA) is a topic model which allows documents to be mixtures of topics (Blei, Ng, Jordan; 2003).

The Generative Model

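  1. Choose the number of topics $K$. Let $V$ be the number of unique words (the vocabulary).
  2. For each document $d \in \{1, \dots, D\}$, draw $\Theta_d \sim \mathrm{Dir}(\alpha)$, where $\Theta_d = (\theta_1, \dots, \theta_K) > 0$ and $\sum_{i=1}^{K} \theta_i = 1$.
  3. For each topic $k \in \{1, \dots, K\}$, draw $\Phi_k \sim \mathrm{Dir}(\beta)$, where $\Phi_k = (\phi_1, \dots, \phi_V) > 0$ and $\sum_{i=1}^{V} \phi_i = 1$.
  4. Let $N_d$ be the number of words in document $d$ and $n \in \{1, \dots, N_d\}$. For each word $w_{d,n}$:
    • Draw a (latent) topic, $z_{d,n} \sim \mathrm{Mult}(1, \Theta_d)$
    • Draw a word given the topic, $w_{d,n} \sim \mathrm{Mult}(1, \Phi_{z_{d,n}})$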
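
The four steps above can be simulated directly. The following is a minimal base R sketch, not code from any of the packages mentioned in this talk: the helper `rdirichlet1()`, the hyperparameter values, and the Poisson document lengths are all assumptions made for illustration.

```r
# Simulate a tiny corpus from the LDA generative model (base R only).
set.seed(1)

# Hypothetical helper: one Dirichlet draw, obtained by normalising gamma draws.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

K <- 3                  # step 1: number of topics
V <- 8                  # vocabulary size
D <- 5                  # number of documents
alpha <- rep(0.5, K)    # assumed prior on document-topic proportions
beta  <- rep(0.1, V)    # assumed prior on topic-word proportions
N <- rpois(D, 20) + 1   # assumed document lengths (not specified by the model)

# Step 2: one topic distribution per document (D x K matrix)
Theta <- t(replicate(D, rdirichlet1(alpha)))
# Step 3: one word distribution per topic (K x V matrix)
Phi <- t(replicate(K, rdirichlet1(beta)))

# Step 4: for every word position, draw a topic, then a word given that topic
corpus <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, N[d], replace = TRUE, prob = Theta[d, ])
  w <- vapply(z, function(k) sample.int(V, 1, prob = Phi[k, ]), integer(1))
  data.frame(doc = d, topic = z, word = w)
})
corpus <- do.call(rbind, corpus)
head(corpus)
```

In practice only the `doc` and `word` columns are observed; the `topic` column is the latent structure that model fitting tries to recover.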

Model fitting

  • Griffiths & Steyvers (2004) derive a collapsed Gibbs sampler. Implemented in R packages LDAtools (Shirley & Sievert, 2013) and lda (Chang, 2015).
  • Wide array of fitting algorithms available in topicmodels (Grun & Hornik, 2011) and mallet (Mimno, 2013).
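
To make the idea of a collapsed Gibbs sampler concrete, here is a small base R sketch of the textbook update with symmetric priors: each token's topic is resampled with probability proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta), where the counts exclude the current token. The function `lda_gibbs()` and the toy corpus are hypothetical and written for illustration; this is not the implementation used by LDAtools or lda, and it makes no attempt at speed or convergence checks.

```r
# Minimal collapsed Gibbs sampler for LDA (illustrative only).
# doc, word: integer vectors with the document and vocabulary index of each token.
lda_gibbs <- function(doc, word, K, alpha = 0.1, beta = 0.01, iters = 200) {
  D <- max(doc); V <- max(word); n <- length(word)
  z <- sample.int(K, n, replace = TRUE)   # random initial topic for every token
  ndk <- matrix(0, D, K)                  # tokens in document d assigned to topic k
  nkw <- matrix(0, K, V)                  # tokens of word w assigned to topic k
  for (i in seq_len(n)) {
    ndk[doc[i], z[i]] <- ndk[doc[i], z[i]] + 1
    nkw[z[i], word[i]] <- nkw[z[i], word[i]] + 1
  }
  nk <- rowSums(nkw)                      # total tokens assigned to each topic
  for (it in seq_len(iters)) {
    for (i in seq_len(n)) {
      d <- doc[i]; w <- word[i]; k <- z[i]
      # remove token i from the counts
      ndk[d, k] <- ndk[d, k] - 1; nkw[k, w] <- nkw[k, w] - 1; nk[k] <- nk[k] - 1
      # full conditional of z_i given all other topic assignments
      p <- (ndk[d, ] + alpha) * (nkw[, w] + beta) / (nk + V * beta)
      k <- sample.int(K, 1, prob = p)
      # add token i back under its newly sampled topic
      z[i] <- k
      ndk[d, k] <- ndk[d, k] + 1; nkw[k, w] <- nkw[k, w] + 1; nk[k] <- nk[k] + 1
    }
  }
  list(
    theta = (ndk + alpha) / rowSums(ndk + alpha),  # D x K document-topic estimates
    phi   = (nkw + beta) / rowSums(nkw + beta),    # K x V topic-word estimates
    z     = z
  )
}

# Toy corpus: documents 1-2 use words 1-5, documents 3-4 use words 6-10.
set.seed(42)
doc  <- rep(1:4, each = 30)
word <- c(sample(1:5, 60, replace = TRUE), sample(6:10, 60, replace = TRUE))
fit <- lda_gibbs(doc, word, K = 2, iters = 50)
round(fit$phi, 2)
```

The estimates are taken from the final state of the chain only, which keeps the sketch short; averaging over several post-burn-in samples would be the more careful choice.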

Model Output

  • In the digital humanities (& elsewhere), LDA is often used to "discover topics" in a large collection of text documents.
  • How are researchers supposed to interpret topics? We can't possibly examine each pmf.
  • "Overview first, then zoom & filter, then detail on demand" (Shneiderman, 1996)

Towards topic interpretation

  • Numerous interactive systems allow users to select a topic z, then list top ~30 words based on p(w|z) (Gardner et al., 2010; Chaney and Blei, 2012; Snyder et al., 2013).
  • But, words likely to occur overall are also likely to occur for a given topic!
  • Taddy (2011) proposed ranking terms by $\mathrm{lift} = p(w \mid z) / p(w)$
  • But if p(w) is small, lift is large!
  • Bischof and Airoldi (2012) propose a new model to directly estimate an average frequency and exclusivity to a given topic.
  • Sievert & Shirley (2014) propose choosing $0 < \lambda < 1$, for: $\mathrm{relevance}(\lambda) = \lambda \, p(w \mid z) + (1 - \lambda) \, \mathrm{lift}$
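
A small sketch of how the ranking changes with λ, using the relevance formula exactly as written above. The probabilities in `phi`, the topic weights, and the function name `relevance()` are made up for illustration and do not come from LDAvis; the endpoints λ = 1 and λ = 0 are included only to show the two extremes (ranking by p(w|z) alone and by lift alone).

```r
# Toy topic-word probabilities p(w | z): 2 topics (rows) by 4 words (columns).
phi <- rbind(
  c(0.50, 0.30, 0.15, 0.05),
  c(0.50, 0.05, 0.05, 0.40)
)
colnames(phi) <- c("the", "baseball", "pitcher", "guitar")

topic_weight <- c(0.5, 0.5)              # assumed marginal topic probabilities p(z)
p_w <- as.vector(topic_weight %*% phi)   # marginal word probabilities p(w)

# relevance(lambda) = lambda * p(w|z) + (1 - lambda) * lift
relevance <- function(phi, p_w, lambda) {
  lift <- sweep(phi, 2, p_w, "/")        # divide each column by that word's p(w)
  lambda * phi + (1 - lambda) * lift
}

# Rank the words within topic 1 for a few values of lambda.
for (lambda in c(1, 0.5, 0)) {
  r <- relevance(phi, p_w, lambda)[1, ]
  cat("lambda =", lambda, ":", names(sort(r, decreasing = TRUE)), "\n")
}
```

With λ = 1 the globally common word "the" tops topic 1; as λ decreases, words that are more exclusive to the topic move up.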

Choosing lambda

  • We expect the optimal value of λ to vary across data.
  • LDAvis allows users to interactively alter rankings by changing λ

Who is using it?

  • People who use LDA and want a tool for interpreting topics.
  • Combined LDAvis and pyLDAvis currently have 356 stars on GitHub (a measure of popularity).
  • I know a number of consultants, industry workers, and educators using it for exploration, presentation, and teaching. Here are a few videos.
  • Dr. Grant Arndt in the Department of Anthropology at Iowa State University (and his research assistant) are using it as a research aid.

Why is this important?

  • We're helping analysts to gain insight from sophisticated statistical models, communicate their results, and teach others.

Tools for interactive graphics

  • Producing interactive and dynamic web graphics from "scratch" is time-consuming, but powerful and flexible.
  • How do we best enable statisticians to create their own interactive dynamic web graphics?
  • I've worked on two R packages in this direction: animint and plotly.
  • Both can translate ggplot2 (Wickham, 2009) graphics to a web-based format (SVG/canvas) with some basic interactivity.
  • ggplot2 is wildly successful thanks to its implementation of a "Grammar of Graphics" (Wilkinson, 1999).
  • animint extends this grammar in a novel direction to enable a constrained form of "linked views" (Hocking et al., 2015).

Basic ggplot2 example

library(ggplot2)
p <- qplot(data = iris, x = Sepal.Width, y = Sepal.Length, color = Species)
p

Interactive ggplot2 with plotly

library(plotly)
ggplotly(p)

[Figure: interactive scatterplot of Sepal.Length versus Sepal.Width, colored by Species (setosa, versicolor, virginica)]

Interactive ggplot2 with animint

library(animint)
structure(list(plot = p), class = "animint")

[Figure: animint scatterplot of Sepal.Length versus Sepal.Width with a Species legend]

Translating ggplot2: A high-level view

Translating R graphics to the web

  • Pros:
    • Easy to use: builds on existing knowledge/code
    • Doesn't require a Web Server running special software
  • Cons:
    • Translation may depend on internals of other packages
    • To change something that's serialized, you need to re-run R code
    • Hard to extend, customize, and/or add (interactive) features
  • Although translation is pragmatic, we need a custom interface/language designed for interactive web graphics.

R Bindings to JavaScript Libraries

plotly's R interface

library(plotly)
plot_ly(economics, x = date, y = unemploy / pop)

[Figure: interactive plot of unemploy/pop versus date, 1970 to 2005]

Visual mappings as data attributes

p <- plot_ly(economics, x = date, y = unemploy / pop)
str(p) 
#> Classes ‘plotly’ and 'data.frame':   478 obs. of  6 variables:
#>  $ date    : Date, format: "1967-06-30" "1967-07-31" ...
#>  $ pce     : num  508 511 517 513 518 ...
#>  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
#>  $ psavert : num  9.8 9.8 9 9.8 9.7 9.4 9 9.5 8.9 9.6 ...
#>  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
#>  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
#>  - attr(*, "plotly_hash")= chr "f638d391dcf53809b8426325a842a091#8"

Rendering the data

str(plotly_build(p))
#> List of 2
#>  $ data  :List of 1
#>   ..$ :List of 5
#>   .. ..$ type   : chr "scatter"
#>   .. ..$ x      : Date[1:478], format: "1967-06-30" ...
#>   .. ..$ y      : num [1:478] 0.0148 0.0148 0.0149 0.0158 0.0154 ...
#>  $ layout:List of 2
#>   ..$ xaxis:List of 1
#>   .. ..$ title: chr "date"
#>   ..$ yaxis:List of 1
#>   .. ..$ title: chr "unemploy/pop"

Why represent plots as data?

  • Thinking about graphics as a mapping from data to visual space is powerful (e.g. ggplot2's grammar of graphics).
  • Allows us to implement a "data-plot-pipeline": a sequence of data manipulations and mappings.
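
The idea of keeping the data and deferring the visual mapping can be sketched in a few lines of base R. The helpers `as_plot()` and `build_plot()` below are hypothetical and exist only for this illustration; they mimic the shape of the `str()` output shown earlier but are not how plotly is implemented.

```r
# Hypothetical helpers (not part of plotly): store the visual mapping as an
# attribute of the data frame, and evaluate it only when the plot is built.
as_plot <- function(data, x, y) {
  attr(data, "mapping") <- list(x = substitute(x), y = substitute(y))
  class(data) <- c("plot_spec", class(data))
  data
}

build_plot <- function(p) {
  m <- attr(p, "mapping")
  list(
    # evaluate the unevaluated expressions in the context of the data
    data = list(list(type = "scatter", x = eval(m$x, p), y = eval(m$y, p))),
    layout = list(
      xaxis = list(title = deparse(m$x)),
      yaxis = list(title = deparse(m$y))
    )
  )
}

p <- as_plot(mtcars, x = wt, y = mpg / cyl)
is.data.frame(p)      # the "plot" is still ordinary data
str(build_plot(p))    # the mapping is resolved only at build time
```

Because the object remains a data frame until it is built, it can flow through further data manipulation steps, which is what makes a data-plot-pipeline possible.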

Data-plot-pipeline: An Example

%>% is known as a "pipeline operator"

# f(x, y) becomes x %>% f(y)
economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate)

[Figure: interactive plot of rate versus date]

Layering model fits

economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate, name = "raw") %>%
  loess(unemploy / pop ~ as.numeric(date), data = .) %>%
  broom::augment() %>%
  add_trace(y = .fitted, name = "smooth")

[Figure: rate versus date, with traces "raw" and "smooth"]

Adding annotations

economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate, name = "raw") %>%
  subset(rate == max(rate)) %>%
  layout(annotations = list(x = date, y = rate, text = "Peak", showarrow = TRUE),
         title = "The U.S. Unemployment Rate")

[Figure: "The U.S. Unemployment Rate", rate versus date, with the maximum annotated "Peak"]

Enabling coordinated, linked views

Linked views pipeline

  • Coordinated, linked views are an important feature of any interactive statistical graphics system (e.g., cranvas, ggobi, iplots, mondrian, MANET, etc).
  • In order to have linked views, we need a "data pipeline" (Buja et al., 1988; Wickham et al., 2010).
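
A rough base R sketch of what such a pipeline has to do: a shared selection sits between the data and every view, and a brush in one view triggers every view to re-derive what it displays. The names `make_pipeline()`, `add_view()`, and `brush()` are hypothetical, and the "views" here return plain R values instead of drawing anything; this is not how animint, plotly, or any of the systems listed above implement linking.

```r
# Hypothetical linked-views pipeline built from closures.
make_pipeline <- function(data) {
  selected <- rep(FALSE, nrow(data))   # shared selection state
  views <- list()                      # registered render functions
  list(
    add_view = function(name, render) {
      views[[name]] <<- render
      invisible(NULL)
    },
    brush = function(rows) {
      # update the shared state, then propagate it to every registered view
      selected <<- seq_len(nrow(data)) %in% rows
      lapply(views, function(render) render(data, selected))
    }
  )
}

lv <- make_pipeline(iris)
# View 1: point colours for a scatterplot. View 2: bar heights per species.
lv$add_view("scatter", function(d, sel) ifelse(sel, "red", "grey"))
lv$add_view("bars", function(d, sel) table(d$Species[sel]))

# Brushing the long-petaled flowers updates both views at once.
out <- lv$brush(which(iris$Petal.Length > 6))
out$bars
```

The design choice worth noting is that views never talk to each other directly; they only read the shared selection, so adding another linked view does not require changing the existing ones.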

An example: animint

Client-server model

Making user-events accessible in R

An Example: linked brushing

An Example: touring w/ linked brush

An Example: Passing plotly events to shiny

Timeline

  • December: Revise and resubmit book chapter on MLB Pitching Expertise and Evaluation for the Handbook of Statistical Methods for Design and Analysis in Sports, a volume that is planned to be one of the Chapman & Hall/CRC Handbooks of Modern Statistical Methods.
  • January: Revise and submit animint paper.
  • February: More support for linked views in plotly.
  • April: Write and submit curating data paper.
  • June: Write and submit interactive web graphics paper.
  • August: Thesis defense.

Thanks to my collaborators

  • LDAvis (Kenny Shirley)
  • animint (Toby Dylan Hocking, Susan VanderPlas, Kevin Ferris, and Tony Tsai)
  • plotly (Toby Dylan Hocking and the Plotly Team)

Thanks to Di for helping with the proposal

Thanks to Heike for her direction, guidance, and support