Presenting Data

Your job as a data scientist is to communicate. Often you are communicating insight distilled from lots of data (models, stats, EDA, narrative/editorialization); making data accessible and self-service (runnable queries, dashboards that update automatically); or communicating the performance and diagnostics of complex models and processes.

Why Visualize
- Point Stats vs distributions
  - much more information in the visual (or other sensory) channel
- Making Data Interpretable/Accessible (don't need/want to understand theory necessarily)
  - Dashboards (model/SQL API/interface)
    - streaming data
    - linked charts
  - Reproducible analysis/fact checking
  - Knowledge sharing
- Narrative/editorialize
- General Process
- EDA vs Explanatory
What to communicate
- Results of analysis
- Insights found through EDA
- Models performance/diagnostics (residuals, ROC plots, etc.)
  - builds trust in model
  - communicates uncertainty
How to communicate
- Visual encodings
- Chart Types
- Maps
- Data Narratives
  - Adding Context/Editorialization
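
The "Point Stats vs distributions" bullet above can be made concrete with a small sketch. The two samples below are made up for illustration (they are not from the notebook's data). They share the same mean and standard deviation, yet one is bimodal and the other is tightly clustered with two outliers, which only becomes visible once the distribution is drawn.

```python
from collections import Counter
from statistics import mean, pstdev

# Two made-up samples, chosen so that their point statistics coincide.
bimodal = [2, 2, 2, 2, 8, 8, 8, 8]
outliers = [-1, 5, 5, 5, 5, 5, 5, 11]

for name, sample in [("bimodal", bimodal), ("outliers", outliers)]:
    # The point stats are identical for both samples...
    print(name, "mean:", mean(sample), "std:", pstdev(sample))
    # ...but a crude text histogram reveals two very different shapes.
    for value, count in sorted(Counter(sample).items()):
        print("  {:>3} {}".format(value, "#" * count))
```

Any summary that reports only the mean and standard deviation would treat these two samples as interchangeable.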



In [1]:

    
from bokeh.plotting import figure, show, output_notebook
output_notebook()

Sanity Check

Does it work? http://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html#userguide-quickstart



In [ ]:

    
# prepare some data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# output to static HTML file
# output_file("lines.html", title="line plot example")

# create a new plot with a title and axis labels
p = figure(title="simple line example", x_axis_label='x', y_axis_label='y')

# add a line renderer with legend and line thickness
p.line(x, y, legend="Temp.", line_width=15)

# show the results
show(p)

Grammar of Graphics

“the fundamental principles or rules of an art or science” (OED Online 1989). A good grammar will allow us to gain insight into the composition of complicated graphics, and reveal unexpected connections between seemingly different graphics (Cox 1978). A grammar provides a strong foundation for understanding a diverse range of graphics. A grammar may also help guide us on what a well-formed or correct graphic looks like, but there will still be many grammatically correct but nonsensical graphics.

-- Wickham (A Layered Grammar of Graphics)

Components

Visual index (Courtesy of yHat)

- Layers
  - data -> aesthetic mapping
  - statistical transforms
  - geometric objects (geom)
  - position
  - different "view" of the same data
- Scales
- Coordinate system
- Faceting

Each component can be changed in relative isolation.
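
As a rough illustration of that isolation, the sketch below keeps one statistical transform (binning) and swaps between two geometric objects. The function names `bin_stat`, `bar_geom`, and `point_geom` and the price values are hypothetical, invented for this example, and the "geoms" are plain text rather than a real plotting layer.

```python
from collections import Counter

def bin_stat(values, bin_width):
    """Stat: count how many values fall in each bin, keyed by the bin's left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())  # bins with no values are simply absent

def bar_geom(binned):
    """Geom 1: draw each bin as a bar whose length is the count."""
    return ["{:>4} | {}".format(left, "#" * n) for left, n in binned]

def point_geom(binned):
    """Geom 2: draw each bin as a single marker placed at the count."""
    return ["{:>4} | {}".format(left, " " * (n - 1) + "o") for left, n in binned]

prices = [12, 15, 18, 22, 25, 27, 29, 41]  # made-up values
binned = bin_stat(prices, bin_width=10)

# Same data, same stat, two different geoms.
print("\n".join(bar_geom(binned)))
print("\n".join(point_geom(binned)))
```

Changing the bin width touches only the stat, and changing the marker touches only the geom, which is the separation the grammar is meant to provide.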

(source)

Histogram: binning stat + bar geom

ggplot(data = diamonds, mapping = aes(price)) +
    layer(geom = "bar", stat = "bin",
    mapping = aes(y = ..count..))

In Python



In [5]:

    
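# bokeh.charts is the high-level charting API (later split out of Bokeh as the bkcharts package)
from bokeh.charts import Histogram
from bokeh.sampledata.autompg import autompg as df

# sort by cylinder count so groups appear in a stable order
df.sort_values('cyl', inplace=True)

# the chart applies the binning stat and the bar geom in one call
hist = Histogram(df, values='hp', title="HP Distribution", legend='top_right')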

show(hist)



In [6]:

    
import numpy as np
from bokeh.models import HoverTool, BoxSelectTool

TOOLS = [BoxSelectTool(), HoverTool()]

# create our canvas
p1 = figure(title="HP Distribution", background_fill_color="#E8DDCB", tools=TOOLS)

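# stat: bin the values (density=True scales the bars so their total area is 1)
hist, edges = np.histogram(df.hp, density=True, bins=50)

# geom: draw one rectangle per bin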
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
        fill_color="#036564", line_color="#033649")

show(p1)

Customizing Tooltips
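
A minimal sketch of one way to customize hover tooltips on a histogram built from a stat and a geom. It assumes a recent Bokeh and NumPy, uses synthetic values as a stand-in for `df.hp`, and the column names `count`, `left`, and `right` are choices made for this example rather than anything required by the library.

```python
import numpy as np
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Synthetic stand-in for df.hp, so the sketch runs without the sample data.
rng = np.random.default_rng(0)
hp = rng.normal(loc=100, scale=30, size=200)

# stat: bin the values, then store the counts and bin edges as named columns
counts, edges = np.histogram(hp, bins=10)
source = ColumnDataSource(data=dict(count=counts, left=edges[:-1], right=edges[1:]))

# each tooltip row is a (label, template) pair; "@name" reads a source column
hover = HoverTool(tooltips=[
    ("hp range", "@left{0.0} to @right{0.0}"),
    ("count", "@count"),
])

p = figure(title="HP Distribution (synthetic data)", tools=[hover])

# geom: one rectangle per bin, with coordinates taken from the source columns
p.quad(top="count", bottom=0, left="left", right="right", source=source,
       fill_color="#036564", line_color="#033649")
```

In a notebook, passing `p` to `show()` after `output_notebook()` renders the plot; hovering over a bar should then display the bin range and its count.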



In [7]:

    
df.sort_values('cyl', inplace=True)

hist = Histogram(df, values='hp', color='cyl',
                 title="HP Distribution by Cylinder Count", legend='top_right')

show(hist)

Let's use some real (interesting) data!



In [3]:

    
from bokeh.models import GeoJSONDataSource
from bokeh.plotting import figure
from bokeh.sampledata.sample_geojson import geojson


geo_source = GeoJSONDataSource(geojson=geojson)

p = figure()
p.circle(x='x', y='y', alpha=0.9, source=geo_source)
show(p)



In [5]:

    
import pandas as pd
# parsing dates costs extra time and compute, but we know we need these columns as dates
df = pd.read_csv('data/sf_listings.csv', parse_dates=['last_review'], infer_datetime_format=True)
df_reviews = pd.read_csv('data/reviews.csv', parse_dates=['date'], infer_datetime_format=True)



In [6]:

    
# index DataFrame on listing_id in order to join datasets
reindexed_df = df_reviews.set_index('listing_id')
reindexed_df.head()



In [7]:

    
# remember the original id in a column to group on
df['listing_id'] = df['id']
df_listing = df.set_index('id')
df_listing.head()









    Out[7]:

	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	listing_id
id
1162609	Lovely One Bedroom Apartment	6368122	Taylor	NaN	Seacliff	37.785217	-122.488655	Entire home/apt	350	4	8	2015-09-17	0.28	1	90	1162609
6032828	Historic Seacliff Home	30384615	Patricia	NaN	Seacliff	37.783658	-122.489398	Entire home/apt	300	1	0	NaT	NaN	1	173	6032828
6938818	Best Secret in Town	36381578	Harris	NaN	Seacliff	37.781505	-122.504754	Private room	119	1	10	2015-10-08	2.33	1	341	6938818
8087607	Single Room Beautiful Beach Condo	3264449	Keith	NaN	Seacliff	37.775318	-122.511621	Private room	79	1	0	NaT	NaN	1	40	8087607
4781448	3 Bd 2.5 Ba Full Flat Condo w Views	13112558	Pam	NaN	Seacliff	37.781797	-122.492492	Entire home/apt	695	2	1	2015-08-11	0.36	1	23	4781448



In [8]:

    
# join the listing information with the review information
review_timeseries = df_listing.join(reindexed_df)

print(review_timeseries.columns)
review_timeseries.head()









    



Index([u'name', u'host_id', u'host_name', u'neighbourhood_group',
       u'neighbourhood', u'latitude', u'longitude', u'room_type', u'price',
       u'minimum_nights', u'number_of_reviews', u'last_review',
       u'reviews_per_month', u'calculated_host_listings_count',
       u'availability_365', u'listing_id', u'date'],
      dtype='object')






    Out[8]:

	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365	listing_id	date
958	Bright, Modern Garden Unit - 1BR/1B	1169	Holly	NaN	Western Addition	37.76931	-122.433856	Entire home/apt	170	2	38	2015-08-31	0.5	1	314	958	2009-07-23
958	Bright, Modern Garden Unit - 1BR/1B	1169	Holly	NaN	Western Addition	37.76931	-122.433856	Entire home/apt	170	2	38	2015-08-31	0.5	1	314	958	2009-08-03
958	Bright, Modern Garden Unit - 1BR/1B	1169	Holly	NaN	Western Addition	37.76931	-122.433856	Entire home/apt	170	2	38	2015-08-31	0.5	1	314	958	2009-09-27
958	Bright, Modern Garden Unit - 1BR/1B	1169	Holly	NaN	Western Addition	37.76931	-122.433856	Entire home/apt	170	2	38	2015-08-31	0.5	1	314	958	2009-11-05
958	Bright, Modern Garden Unit - 1BR/1B	1169	Holly	NaN	Western Addition	37.76931	-122.433856	Entire home/apt	170	2	38	2015-08-31	0.5	1	314	958	2010-02-13



In [9]:

    
# let's try a pivot table (crosstab counts reviews per date and neighbourhood)
reviews_over_time = pd.crosstab(review_timeseries.date, review_timeseries.neighbourhood)
reviews_over_time.head()









    Out[9]:

neighbourhood	Bayview	Bernal Heights	Castro/Upper Market	Chinatown	Crocker Amazon	Diamond Heights	Downtown/Civic Center	Excelsior	Financial District	Glen Park	...	Presidio	Presidio Heights	Russian Hill	Seacliff	South of Market	Treasure Island/YBI	Twin Peaks	Visitacion Valley	West of Twin Peaks	Western Addition
date
2009-03-29	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2009-05-03	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2009-05-23	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2009-06-12	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2009-07-15	0	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 37 columns



In [10]:

    
# smooth the daily counts by resampling to monthly means
reviews_over_time.resample('M').mean()[['Mission', 'South of Market', 'Noe Valley']].plot(figsize=(12,6))
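
The join, crosstab, and resample steps above can be traced on a toy dataset small enough to check by hand. The two frames below are hypothetical stand-ins for the listings and reviews files, and the sketch assumes the "MS" (month start) resampling alias, so months are labelled by their first day rather than their last.

```python
import pandas as pd

# Hypothetical stand-in for the listings file, indexed by listing id.
listings = pd.DataFrame(
    {"id": [1, 2], "neighbourhood": ["Mission", "Noe Valley"]}
).set_index("id")

# Hypothetical stand-in for the reviews file: one row per review.
reviews = pd.DataFrame(
    {
        "listing_id": [1, 1, 2, 1],
        "date": pd.to_datetime(
            ["2015-01-05", "2015-01-05", "2015-01-20", "2015-02-10"]
        ),
    }
).set_index("listing_id")

# join aligns on the index, so each listing row repeats once per review
joined = listings.join(reviews)

# crosstab counts reviews for every (date, neighbourhood) pair it has seen
counts = pd.crosstab(joined["date"], joined["neighbourhood"])

# monthly mean of the daily counts; only dates with a review contribute rows
monthly = counts.resample("MS").mean()
print(monthly)
```

Because the crosstab only has rows for dates with at least one review, the monthly mean here averages over those dates, not over every calendar day.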












In [20]:

    
TOOLS = "pan,wheel_zoom,box_zoom,reset,save,hover"
d = reviews_over_time.resample('M').mean()

p = figure(x_axis_type="datetime", tools=TOOLS)

p.line(d.index, d['Mission'])
show(p)



In [23]:

    
import bokeh.charts as charts

line = charts.Line(d, y=['Mission', 'South of Market', 'Noe Valley'],
            color=['Mission', 'South of Market', 'Noe Valley'],
            title="Monthly Mean of Daily Reviews", ylabel='Reviews', legend=True, tools=TOOLS)

show(line)



In [27]:

    
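from bokeh.models.widgets import Select
from bokeh.io import output_file, show, vform

# offer every neighbourhood column as an option and start on the first one
options = list(reviews_over_time)
select = Select(title="Neighbourhood:", value=options[0], options=options)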

show(vform(select))

To the Server!

neighborhood_line.py

Output of reindexed_df.head() (In [6]):

	date
listing_id
1994427	2014-02-27
1994427	2015-10-07
1994427	2015-10-12
1994427	2015-10-17
1994427	2015-10-26
