Presenting Data

Your job as a data scientist is to communicate. That usually means one of three things: conveying insight distilled from large amounts of data (models, statistics, EDA, narrative and editorialization); making data accessible and self-service (runnable queries, dashboards that update automatically); or communicating the performance and diagnostics of complex models and processes.

  • Why Visualize
    • Point Stats vs distributions
      • much more information in visual (or other senses) channel
    • Making Data Interpretable/Accessible (don't need/want to understand theory necessarily)
      • Dashboards (model/SQL API/interface)
        • streaming data
        • linked charts
      • Reproducible analysis/fact checking
      • Knowledge sharing
    • Narrative/editorialize
    • General Process
    • EDA vs Explanatory
  • What to communicate
    • Results of analysis
    • Insights found through EDA
    • Model performance/diagnostics (residuals, ROC plots, etc.)
      • builds trust in model
      • communicates uncertainty
  • How to communicate
    • Visual encodings
    • Chart Types
    • Maps
    • Data Narratives
      • Adding Context/Editorialization
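
To make the "point stats vs distributions" item concrete, here is a minimal sketch on synthetic data (not one of the notebook's datasets). Two samples are rescaled to share the same mean and standard deviation, yet their shapes differ sharply. The sample size, the mixture parameters, and the `standardize` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000


def standardize(x):
    """Rescale a sample to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


# One bell-shaped sample and one two-humped (bimodal) sample.
unimodal = standardize(rng.normal(0.0, 1.0, n))
bimodal = standardize(np.concatenate([
    rng.normal(-0.9, 0.44, n // 2),
    rng.normal(0.9, 0.44, n // 2),
]))

# The point statistics are indistinguishable...
print(round(unimodal.mean(), 3), round(unimodal.std(), 3))
print(round(bimodal.mean(), 3), round(bimodal.std(), 3))

# ...but the share of values near the centre is very different,
# which a histogram would show at a glance.
uni_center = np.mean(np.abs(unimodal) < 0.25)
bi_center = np.mean(np.abs(bimodal) < 0.25)
print(uni_center, bi_center)
```

Plotting either sample as a histogram makes the difference obvious, while the mean and standard deviation alone hide it.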

In [1]:
from bokeh.plotting import figure, show, output_notebook
output_notebook()



In [ ]:
# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# output to static HTML file
# output_file("lines.html", title="line plot example")

# create a new plot with a title and axis labels
p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')

# add a line renderer with legend and line thickness
p.line(x, y, legend="Temp.", line_width=15)

# show the results
show(p)

Grammar of Graphics

[A grammar is] “the fundamental principles or rules of an art or science” (OED Online 1989). A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics (Cox 1978). A grammar provides a strong foundation for understanding a diverse range of graphics. A grammar may also help guide us on what a well-formed or correct graphic looks like, but there will still be many grammatically correct but nonsensical graphics.

-- Wickham (A Layered Grammar of Graphics)

Components

Visual index (Courtesy of yHat)

  • Layers
    • data -> aesthetic mapping
    • statistical transforms
    • geometric objects (geom)
    • position
    • different "view" of the same data
  • Scales
  • Coordinate system
  • Faceting

Each component can be changed in relative isolation from the others.
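
As a rough illustration of how the components can be varied independently, the sketch below models a layer as a small record with data, an aesthetic mapping, a stat, and a geom. The `Layer` class and the `stat_bin`, `stat_identity`, and `build` helpers are hypothetical names invented for this example; they are not part of ggplot or Bokeh, and no actual drawing happens.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

import numpy as np


@dataclass(frozen=True)
class Layer:
    """Hypothetical layer record: one field per grammar component."""
    data: Dict[str, List[float]]   # column name -> values
    mapping: Dict[str, str]        # aesthetic -> column name
    stat: Callable                 # statistical transform
    geom: str                      # geometric object to draw


def stat_bin(values, bins=4):
    """Binning stat: count how many values fall in each bin."""
    counts, edges = np.histogram(values, bins=bins)
    return edges[:-1], counts


def stat_identity(values):
    """Identity stat: pair each value with its position."""
    return np.arange(len(values)), np.asarray(values)


def build(layer):
    """Resolve a layer into the numbers a renderer would draw."""
    x, y = layer.stat(layer.data[layer.mapping["x"]])
    return {"geom": layer.geom, "x": x, "y": y}


data = {"price": [1.0, 2.0, 2.5, 3.0, 7.0, 8.0]}
histogram = Layer(data, {"x": "price"}, stat_bin, "bar")

# Change one component at a time; the others are untouched.
freq_polygon = replace(histogram, geom="line")       # same stat, new geom
raw_values = replace(histogram, stat=stat_identity)  # same geom, new stat

for layer in (histogram, freq_polygon, raw_values):
    print(build(layer))
```

Swapping only the geom turns the histogram into a frequency polygon with identical counts, which is the kind of unexpected connection between graphics that the grammar is meant to expose.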


Histogram: binning stat + bar geom

ggplot(data = diamonds, mapping = aes(price)) +
    layer(geom = "bar", stat = "bin",
    mapping = aes(y = ..count..))

In [5]:
from bokeh.charts import Histogram
from bokeh.sampledata.autompg import autompg as df

# DataFrame.sort was removed from pandas; sort_values is the replacement
df.sort_values('cyl', inplace=True)

hist = Histogram(df, values='hp', title="HP Distribution", legend='top_right')

show(hist)


Out[5]:

<Bokeh Notebook handle for In[5]>


In [6]:
import numpy as np
from bokeh.models import HoverTool, BoxSelectTool

TOOLS = [BoxSelectTool(), HoverTool()]

# create our canvas
p1 = figure(title="HP Distribution", background_fill_color="#E8DDCB", tools=TOOLS)

# stat
hist, edges = np.histogram(df.hp, density=True, bins=50)

# geom
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="#036564", line_color="#033649")

show(p1)


Out[6]:

<Bokeh Notebook handle for In[6]>
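
A quick check of what `density=True` does in the cell above: NumPy rescales the bin heights so that the bar areas (height times width) sum to 1, which is why the y-axis no longer shows raw counts. The sketch below uses synthetic values rather than the autompg data; the gamma parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.gamma(shape=4.0, scale=25.0, size=400)  # synthetic, right-skewed

# Same bins, once as raw counts and once as a density.
counts, edges = np.histogram(values, bins=50)
density, _ = np.histogram(values, bins=50, density=True)
widths = np.diff(edges)

# The bar areas of the density histogram sum to 1.
total_area = np.sum(density * widths)
print(counts.sum(), total_area)

# A density height is the count divided by (n * bin width).
rebuilt = counts / (counts.sum() * widths)
```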


In [7]:
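# sort so the cylinder groups appear in order (sort_values replaces the removed DataFrame.sort)
df.sort_values('cyl', inplace=True)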

hist = Histogram(df, values='hp', color='cyl',
                 title="HP Distribution by Cylinder Count", legend='top_right')

show(hist)


Out[7]:

<Bokeh Notebook handle for In[7]>

Let's Use Some Real (Interesting) Data!


In [3]:
from bokeh.models import GeoJSONDataSource
from bokeh.plotting import figure
from bokeh.sampledata.sample_geojson import geojson


geo_source = GeoJSONDataSource(geojson=geojson)

p = figure()
p.circle(x='x', y='y', alpha=0.9, source=geo_source)
show(p)


Out[3]:

<Bokeh Notebook handle for In[3]>


In [5]:
import pandas as pd
# parsing dates costs extra time and compute, but we know we need these columns as datetimes
df = pd.read_csv('data/sf_listings.csv', parse_dates=['last_review'], infer_datetime_format=True)
df_reviews = pd.read_csv('data/reviews.csv', parse_dates=['date'], infer_datetime_format=True)

In [6]:
# index DataFrame on listing_id in order to join datasets
reindexed_df = df_reviews.set_index('listing_id')
reindexed_df.head()


Out[6]:
                  date
listing_id
1994427     2014-02-27
1994427     2015-10-07
1994427     2015-10-12
1994427     2015-10-17
1994427     2015-10-26

In [7]:
# remember the original id in a column to group on
df['listing_id'] = df['id']
df_listing = df.set_index('id')
df_listing.head()


Out[7]:
name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 listing_id
id
1162609 Lovely One Bedroom Apartment 6368122 Taylor NaN Seacliff 37.785217 -122.488655 Entire home/apt 350 4 8 2015-09-17 0.28 1 90 1162609
6032828 Historic Seacliff Home 30384615 Patricia NaN Seacliff 37.783658 -122.489398 Entire home/apt 300 1 0 NaT NaN 1 173 6032828
6938818 Best Secret in Town 36381578 Harris NaN Seacliff 37.781505 -122.504754 Private room 119 1 10 2015-10-08 2.33 1 341 6938818
8087607 Single Room Beautiful Beach Condo 3264449 Keith NaN Seacliff 37.775318 -122.511621 Private room 79 1 0 NaT NaN 1 40 8087607
4781448 3 Bd 2.5 Ba Full Flat Condo w Views 13112558 Pam NaN Seacliff 37.781797 -122.492492 Entire home/apt 695 2 1 2015-08-11 0.36 1 23 4781448

In [8]:
# join the listing information with the review information
review_timeseries = df_listing.join(reindexed_df)

print(review_timeseries.columns)
review_timeseries.head()


Index([u'name', u'host_id', u'host_name', u'neighbourhood_group',
       u'neighbourhood', u'latitude', u'longitude', u'room_type', u'price',
       u'minimum_nights', u'number_of_reviews', u'last_review',
       u'reviews_per_month', u'calculated_host_listings_count',
       u'availability_365', u'listing_id', u'date'],
      dtype='object')
Out[8]:
name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 listing_id date
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-07-23
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-08-03
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-09-27
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2009-11-05
958 Bright, Modern Garden Unit - 1BR/1B 1169 Holly NaN Western Addition 37.76931 -122.433856 Entire home/apt 170 2 38 2015-08-31 0.5 1 314 958 2010-02-13
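
The join above is index-on-index and one-to-many: each listing row is repeated once per matching review. A minimal sketch with made-up listings and reviews (the ids, neighbourhoods, and dates are invented for illustration):

```python
import pandas as pd

# Three hypothetical listings, indexed by their id.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "neighbourhood": ["Mission", "Seacliff", "Noe Valley"],
}).set_index("id")

# Three hypothetical reviews, indexed by the listing they belong to.
reviews = pd.DataFrame({
    "listing_id": [1, 1, 3],
    "date": pd.to_datetime(["2015-01-05", "2015-02-10", "2015-03-01"]),
}).set_index("listing_id")

# DataFrame.join aligns on the index and is a left join by default.
joined = listings.join(reviews)
print(joined)
```

Listing 1 appears twice (one row per review), and listing 2, which has no reviews, is kept with a missing date (NaT) because the join is a left join.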

In [9]:
# let's try a cross-tabulation (a pivot table of counts): reviews per date per neighbourhood
reviews_over_time = pd.crosstab(review_timeseries.date, review_timeseries.neighbourhood)
reviews_over_time.head()


Out[9]:
neighbourhood Bayview Bernal Heights Castro/Upper Market Chinatown Crocker Amazon Diamond Heights Downtown/Civic Center Excelsior Financial District Glen Park ... Presidio Presidio Heights Russian Hill Seacliff South of Market Treasure Island/YBI Twin Peaks Visitacion Valley West of Twin Peaks Western Addition
date
2009-03-29 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-05-03 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-05-23 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-06-12 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2009-07-15 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 37 columns
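
To see what `pd.crosstab` computes here, a minimal sketch on four invented review rows: it counts how often each (date, neighbourhood) pair occurs and fills absent pairs with 0.

```python
import pandas as pd

# Four hypothetical reviews: three on one day, one on the next.
reviews = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-01-05", "2015-01-05", "2015-01-05", "2015-01-06"]),
    "neighbourhood": ["Mission", "Mission", "Seacliff", "Mission"],
})

# Rows are dates, columns are neighbourhoods, cells are review counts.
counts = pd.crosstab(reviews["date"], reviews["neighbourhood"])
print(counts)
```

Only dates that occur in the data become rows, so days on which no listing was reviewed are absent from the table rather than recorded as all-zero rows.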


In [10]:
# smooth by resampling by month
reviews_over_time.resample('M').mean()[['Mission', 'South of Market', 'Noe Valley']].plot(figsize=(12,6))


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x11379d750>
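
The monthly smoothing can be illustrated on a tiny invented series. This sketch groups by calendar month with `to_period("M")` instead of calling `resample`, an assumption made so that it does not depend on how a given pandas version spells the month-end alias. For months that contain data, the averages match `resample('M').mean()`; the index labels differ (periods rather than month-end timestamps), and `resample` would also emit NaN rows for months with no data.

```python
import pandas as pd

# Hypothetical daily review counts for one neighbourhood.
daily = pd.Series(
    [1, 3, 2, 6],
    index=pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-02", "2015-02-25"]),
    name="Mission",
)

# Group the observations by calendar month and average each group.
monthly = daily.groupby(daily.index.to_period("M")).mean()
print(monthly)
```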

In [20]:
TOOLS = "pan,wheel_zoom,box_zoom,reset,save,hover"
d = reviews_over_time.resample('M').mean()

p = figure(x_axis_type="datetime", tools=TOOLS)

p.line(d.index, d['Mission'])
show(p)


Out[20]:

<Bokeh Notebook handle for In[20]>


In [23]:
import bokeh.charts as charts

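# title and y-axis label describe the plotted data: monthly means of daily review counts
line = charts.Line(d, y=['Mission', 'South of Market', 'Noe Valley'],
            color=['Mission', 'South of Market', 'Noe Valley'],
            title="Monthly Mean of Daily Reviews by Neighbourhood", ylabel='Reviews per day',
            legend=True, tools=TOOLS)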

show(line)


Out[23]:

<Bokeh Notebook handle for In[23]>


In [27]:
from bokeh.models.widgets import Select
from bokeh.io import show, vform

# offer one option per neighbourhood column and start on the first one
options = list(reviews_over_time.columns)
select = Select(title="Neighbourhood:", value=options[0], options=options)

show(vform(select))


Out[27]:

<Bokeh Notebook handle for In[27]>

To the Server!

neighborhood_line.py