Glue is a project I've been working on to interactively visualize multidimensional datasets in Python. The goal of Glue is to make it trivially easy to identify features and trends in data, to inform followup analysis.
This notebook shows an example of using Glue to explore crime statistics collected by the FBI (see this notebook for the scraping code). Because Glue is an interactive tool, I've included a screencast showing the analysis in action. All of the plots in this notebook were made with Glue, and then exported to plotly (see the bottom of this page for details).
In [1]:
from plotly.tools import embed
from IPython.display import VimeoVideo, HTML
from glue import qglue
import pandas as pd
Glue is an application that sits on top of matplotlib, and lets you interactively build standard statistical graphics like scatter plots, histograms, and images. What sets it apart is that all of these plots are "brushable" -- you can select a region on any plot, and that region is used to define a data filter. These filters are automatically displayed across all plots, making it easy to isolate subtle features and put them in the context of the rest of the dataset.
Getting dataframes into Glue is pretty easy:
In [2]:
states = pd.read_csv('state_crime.csv')
qglue(states=states)
Out[2]:
This cell will load this dataframe into Glue and bring up the user interface. Here's a screencast showing what the subsequent exploration might look like:
In [3]:
VimeoVideo('97436621', width=700)
Out[3]:
Here's one of the simplest views of the dataset you can make: the murder rate (all rates in the dataset are annual rates per 100,000 people) as a function of time, for all states.
In [4]:
embed('ChrisBeaumont', 36)
There is an obvious set of outlier points with high murder rates -- what's going on there? Glue is really great at isolating outliers, and putting them in context. For example, we can select these points to highlight them, and look at another slice of the data -- Murder rate vs state.
In [5]:
embed('ChrisBeaumont', 37)
All of these points belong to a single "state" -- Washington, D.C. Now, D.C. is an outlier for one obvious reason -- it's a single urban area, and thus should really be compared to other cities. Still, this murder rate is remarkably high. Furthermore, it has an interesting time dependence -- the 90s were a terrible decade for crime in D.C., when it earned the nickname of the "Murder Capital of the United States."
It turns out there is an entire Wikipedia page about crime rates in D.C. The high murder rates were driven by the spread of crack cocaine, combined with an exodus of affluent residents out of the city and into the suburbs. Since the 90s, gentrification and economic projects have pushed murder rates back down.
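The brushing step above (select the high-murder-rate points, then ask which state they belong to) has a rough pandas equivalent: a boolean mask. This is only a sketch on a toy table; the `State`, `Year`, and `Murder` column names match the ones used later in this notebook, but the values and the threshold are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the states table; names and values are invented.
toy = pd.DataFrame({
    'State': ['State A', 'State A', 'State B', 'State B'],
    'Year': [1980, 1991, 1980, 1991],
    'Murder': [8.0, 7.0, 35.0, 80.0],
})

# "Brush" the outliers: keep rows above a hand-picked threshold.
threshold = 30.0
selected = toy[toy['Murder'] > threshold]

# Which states do the selected points belong to?
outlier_states = selected['State'].unique().tolist()
print(outlier_states)
```

The difference is that in Glue the threshold is chosen by eye, by dragging a box over the plot, and the resulting selection shows up in every other plot at once.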
Glue's basic workflow of brushing and inter-comparing several plots is surprisingly versatile. For example, one way to tease out the crime trends of each state over time is to color the first and last year of data. Here are some plots that do that, to examine the rate of rape in each state.
In [6]:
embed('ChrisBeaumont', 38)
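For comparison, the "color the first and last year" selection could be expressed in pandas by labeling each row. This is a minimal sketch, not what Glue does internally; the `Rate` column and the `epoch` label are hypothetical names, and the numbers are invented.

```python
import pandas as pd

# Toy table: two states, three years each. Values are invented.
toy = pd.DataFrame({
    'State': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [1960, 1985, 2010, 1960, 1985, 2010],
    'Rate': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

first_year, last_year = toy['Year'].min(), toy['Year'].max()

# Label each row the way the plots color it: first year, last year, or other.
toy['epoch'] = 'other'
toy.loc[toy['Year'] == first_year, 'epoch'] = 'first'
toy.loc[toy['Year'] == last_year, 'epoch'] = 'last'

# Only the endpoint rows, i.e. the colored points.
endpoints = toy[toy['epoch'] != 'other']
print(endpoints)
```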
The trends for murder and rape are quite a bit different -- notice that, while murder rates have slowly declined over the past 50 years, rates of sexual assault have increased.
Interpreting sexual assault statistics, it turns out, is tricky business. Because of rape's social stigma, it is one of the most underreported crimes. Furthermore, that stigma has decreased somewhat over time as society has become more feminist. Thus, the increase in these plots might be driven more by higher rates of reporting than by higher rates of rape itself. The National Crime Victimization Survey (which is based on surveys rather than reported crimes) reports that the victimization rate from rape has actually decreased by 85% since the 80s.
There is a large state-to-state variation in sexual assault rates in the dataset -- South Dakota, for example, shifted from having one of the lowest rates in 1960 to one of the highest rates today. I'm not sure what drove that trend (but it's very troubling).
In [7]:
embed('ChrisBeaumont', 39)
In [8]:
embed('ChrisBeaumont', 40)
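A shift in a state's standing, like the one described for South Dakota, can be quantified by ranking states within the first and last year and comparing the ranks. The sketch below assumes a long-format table with `State` and `Year` columns as in this dataset; the `Rate` column is a hypothetical stand-in for whichever crime column is of interest, and the values are invented.

```python
import pandas as pd

# Toy table: three states, first and last year only. Values are invented.
toy = pd.DataFrame({
    'State': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Year': [1960] * 3 + [2010] * 3,
    'Rate': [1.0, 2.0, 3.0, 9.0, 5.0, 4.0],
})

# One row per state, one column per year.
wide = toy.pivot(index='State', columns='Year', values='Rate')

# Rank states within each year (1 = highest rate). No ties in this toy data.
ranks = wide.rank(ascending=False).astype(int)

# Positive shift = the state moved toward the top of the ranking.
first_year, last_year = ranks.columns.min(), ranks.columns.max()
rank_shift = ranks[first_year] - ranks[last_year]
print(rank_shift.sort_values(ascending=False))
```

Here state `A` moves from the lowest rate to the highest, so it has the largest positive shift.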
Glue is designed to quickly build up intuition about multidimensional data, so that you can spend more time following up on interesting questions.
Glue makes it very easy to identify and isolate subtle and/or irregular features in datasets, by selecting and coloring subregions of plots. However, these features are just clues about the underlying story the data are telling. More precise followup analysis is always needed to quantify trends and assemble scientific hypotheses about the data.
Glue is not designed to perform this followup analysis. In fact, my opinion is that graphical interfaces are often the wrong approach here -- programming languages offer more precision for expressing specific computations, and are better suited to this task. For example, with Pandas we can obtain a precise measurement of the change in, say, the murder rate for each state over the past 50 years:
In [9]:
# First and last recorded murder rate for each state, in chronological order
murder_change = (states.sort_values('Year')
                       .groupby('State').Murder
                       .agg(['first', 'last']))
murder_change['change'] = murder_change['last'] - murder_change['first']
murder_change = murder_change.sort_values('change', ascending=False)
print('Largest Increases in Murder Rate (change per 100,000)')
murder_change.head(10)
Out[9]:
Glue isn't a replacement for writing code -- it's a tool that quickly gives you clues about what questions are interesting, and worth writing code for.
All the plots in this document were created with Glue, and then exported to plotly. Uploading plots from Glue to plotly is a 2-click affair:
File->Export->plotly
This will open a new browser window with your uploaded graph, which you can further tweak or share with the world. If you want to upload plots to your own plot.ly account, you can fill in your username and API key under File->Edit Settings.
In [11]:
#This makes everything pretty
def css_styling():
    # Read the stylesheet and close the file when done
    with open('custom.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()
Out[11]: