End-to-end tutorial

This walkthrough covers four steps, from raw data to a styled plot:

  1. Data Collection
  2. Data Wrangling
  3. Plotting
  4. Plot Styling

First off, let's gather our imports. We're going to use the knockout duo of numpy and pandas to aggregate our data, and our featured module bokeh to plot directly in the notebook.


In [177]:
# Conventional import naming
import numpy as np
import pandas as pd

# Import plotting objects into our namespace for easy access
from bokeh.plotting import *
# Set the target of the plots to the current notebook
output_notebook()

# Utilities we'll reference later in the notebook
from bokeh.objects import HoverTool, Range1d

# Miscellaneous
import datetime as dt
from collections import OrderedDict


Configuring embedded BokehJS mode.

Data Collection

These data were obtained with a nifty tool called astronomer that pulls data on GitHub stargazers for any given repository. ("Stargazers" here refers to those GitHub users who have starred a particular repository.)

Example usage: astronomer --format csv --token PASTE_TOKEN_HERE --outfile bokeh-stats.csv ContinuumIO/bokeh

I first obtained a snapshot of Bokeh stargazers on April 9th, 2014; let's read it in with pandas and take a peek.


In [178]:
df = pd.read_csv('data/bokeh-stats_2014-4-9.csv', header=0)
df.head()


Out[178]:
  | id | bio | name | blog | email | login | company | hireable | location | followers | following | created_at | updated_at | gravatar_id | public_repos | public_gists
0 | 1504875 | NaN | Eric | NaN | NaN | etdub | NaN | False | NaN | 4 | 4 | 2012-03-05T23:23:19Z | 2014-04-07T16:28:10Z | 7b0fe706afbf759c8fbbfbc6f9ed59b0 | 12 | 0
1 | 1047 | NaN | Pradeep Gowda | NaN | pradeep@btbytes.com | btbytes | NaN | True | Indianapolis, IN | 49 | 105 | 2008-02-27T08:49:03Z | 2014-04-09T20:53:20Z | 84c3eab99b7425d6b614ba6d11402d6b | 92 | 160
2 | 2581 | NaN | Leonard | http://randomfoo.net/ | NaN | lhl | Lensley | False | Los Angeles, CA | 80 | 60 | 2008-03-08T09:29:25Z | 2014-04-09T08:23:51Z | 719d80396d9c18a83e11f43de27a924c | 39 | 3
3 | 4290 | I'm a computer scientist! | Carter Tazio Schonwald | www.cstheory.net | first name dot last name at the google mail thing | cartazio | wellposed.com | True | NYC, USA | 87 | 23 | 2008-04-02T18:45:11Z | 2014-04-10T01:10:22Z | 274a52c00e4f37979f33423fc8acf371 | 44 | 153
4 | 4881 | NaN | Sergey Karayev | http://sergeykarayev.com | NaN | sergeyk | UC Berkeley | False | Berkeley, CA | 27 | 3 | 2008-04-04T02:15:30Z | 2014-04-10T01:31:00Z | 74753165ec3a98f6c6539d1d2d6ba1fd | 19 | 24

5 rows × 16 columns

There are a lot of interesting columns worth exploring here, but for the purpose of this tutorial we're going to analyze the followers-following relationship of Bokeh stargazers.

Data Wrangling

First things first: data cleansing. Let's convert the 'created_at' column data from strings to a proper datetime format.


In [179]:
print 'Before: ', type(df['created_at'][0])
df['created_at'] = pd.to_datetime(df['created_at'])
print 'After: ', type(df['created_at'][0])


Before:  <type 'str'>
After:  <class 'pandas.tslib.Timestamp'>
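
For intuition, the same conversion can be reproduced for a single value with the standard library. This is only a sketch: the format string is an assumption inferred from the created_at values in the table, whereas pd.to_datetime works out the format on its own.

```python
import datetime as dt

# Assumed format, inferred from values such as '2012-03-05T23:23:19Z'
ISO_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

raw = '2012-03-05T23:23:19Z'  # created_at of the first row in the table
parsed = dt.datetime.strptime(raw, ISO_FORMAT)

# Once parsed, the value supports comparisons and date arithmetic,
# e.g. the account's age on the day the snapshot was taken
account_age = dt.datetime(2014, 4, 9) - parsed
```

Having real datetime values is what makes the min/max and bucketing steps that follow possible; none of that works reliably on raw strings.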

And let's only keep the columns we're interested in analyzing.


In [180]:
df = df[['name','login','followers','following','created_at']] # Pandas is nifty!

We're also going to use color to get a little insight into the "age" of the GitHub users, partitioning the 'created_at' column into even intervals and color-coding them appropriately. What's our spread for this column? Let's take a look.


In [181]:
youngest = df['created_at'].max()
oldest = df['created_at'].min()

print oldest
print youngest


2008-01-15 04:47:24
2014-03-26 20:41:49

Looks like we have roughly a six-year interval between the "oldest" and "youngest" GitHub members, so let's partition them into six buckets of equal width.


In [182]:
delta = youngest - oldest
period = delta/6
bucket_ends = [oldest+period*(x+1) for x in range(6)]
bucket_ends


Out[182]:
[Timestamp('2009-01-26 07:26:28.166666', tz=None),
 Timestamp('2010-02-07 10:05:32.333332', tz=None),
 Timestamp('2011-02-19 12:44:36.499998', tz=None),
 Timestamp('2012-03-02 15:23:40.666664', tz=None),
 Timestamp('2013-03-14 18:02:44.833330', tz=None),
 Timestamp('2014-03-26 20:41:48.999996', tz=None)]
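
The same arithmetic can be sketched without pandas, using only the standard library. The two endpoints are copied from the printed output of the oldest/youngest cell; NUM_BUCKETS is a name introduced here for illustration.

```python
import datetime as dt

# Endpoints copied from the printed oldest/youngest values
oldest = dt.datetime(2008, 1, 15, 4, 47, 24)
youngest = dt.datetime(2014, 3, 26, 20, 41, 49)

NUM_BUCKETS = 6  # same bucket count as in the text

# Width of one bucket; timedelta division rounds to the nearest microsecond
period = (youngest - oldest) / NUM_BUCKETS

# Upper bound of each bucket: oldest + 1*period, oldest + 2*period, ...
bucket_ends = [oldest + period * (i + 1) for i in range(NUM_BUCKETS)]

for end in bucket_ends:
    print(end.isoformat())
```

Because each step is rounded, the final endpoint can land a few microseconds before or after youngest, which is why the bucketing step needs a fallback for the youngest member.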

Let's also select a color palette. If you work with color plots consistently it may be worthwhile to explore options like brewer2mpl or cubehelix, but for simplicity I just used a popular online utility, ColorBrewer. Here's a pleasing purple gradient, ordered from lightest to darkest:


In [183]:
COLORS = ['#f2f0f7',
          '#dadaeb',
          '#bcbddc',
          '#9e9ac8',
          '#756bb1',
          '#54278f']

Aside: just for fun, let's also take advantage of IPython's HTML capabilities to render six <div> elements, one per color in the palette.


In [184]:
from IPython.display import HTML
boxes = ['<div style="width: 100px; height: 100px; float: left; background-color:%s"></div>' %c for c in COLORS]
HTML(''.join(boxes))
# Pretty nifty!!


Out[184]:

The following is a naïve algorithm that assigns each user a numerical index, 0 through 5, based on which bucket their account creation date falls into.


In [185]:
def index(date):
    for y in range(6):
        if date <= bucket_ends[y]:
            return y
    return 5 # Edge case for youngest member

df['age'] = [index(date) for date in df['created_at']]

# Confirm that all dates were bucketed properly
#check = [i for i, x in enumerate(df['age']) if x is None]
#if check:
#    print [x for i, x in enumerate(df['created_at']) if i in check]
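
Since the bucket endpoints are sorted, the linear scan can also be expressed as a binary search. The sketch below is an alternative, not the notebook's method; bucket_index is a hypothetical name, and the endpoints are rebuilt with the standard library so the block runs on its own.

```python
import bisect
import datetime as dt

# Rebuild the six bucket endpoints from the printed oldest/youngest values
oldest = dt.datetime(2008, 1, 15, 4, 47, 24)
youngest = dt.datetime(2014, 3, 26, 20, 41, 49)
period = (youngest - oldest) / 6
bucket_ends = [oldest + period * (i + 1) for i in range(6)]


def bucket_index(date, ends=bucket_ends):
    """Return the index of the first bucket whose end is >= date."""
    # bisect_left finds the leftmost position where `date` could be inserted
    # while keeping `ends` sorted, i.e. the first i with date <= ends[i]
    i = bisect.bisect_left(ends, date)
    # Clamp dates past the final endpoint into the last bucket
    return min(i, len(ends) - 1)
```

For six buckets the speed difference is negligible; the main appeal is that the clamp makes the "youngest member" edge case explicit.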

Plotting

This is where the magic happens.

We'll take full advantage of Bokeh's plotting capabilities in the Plot Styling section; first, let's see how to generate a basic plot.


In [187]:
circle(
        df['followers'], # X axis
        df['following']  # Y axis
)

show()


[Bokeh plot: scatter of followers (x) against following (y)]

So what exactly happened here?

We called circle() with two arguments, the x- and y-coordinate arrays, followed by show(), and we got a plot! It's interactive to boot. But let's finish the process: let's style.

Plot Styling


In [120]:
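figure(plot_width=800,
       background_fill='#eeeeee')

# Marker size grows with the log of the follower count
# (note: a user with 0 followers yields -inf here)
size = (1 + 4 * np.log10(df['followers']))
# Look up each user's color from the palette by age bucket
colors = [COLORS[x] for x in df['age']]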

source = ColumnDataSource(
    data=dict(
        x=df['followers'],
        y=df['following'],
        size=size,
        colors=colors,
        name=df['name'],
        username=df['login'],
    )
)
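
To get a feel for the size formula, here is a plain-Python sketch of the same scaling. marker_size is a hypothetical helper, and the zero-follower guard is an assumption added here; the notebook's NumPy expression has no such guard.

```python
import math


def marker_size(followers, base=1.0, scale=4.0):
    """Log-scaled marker size: base + scale * log10(followers)."""
    # Assumption: accounts with no followers get the base size, since
    # log10(0) is undefined (NumPy returns -inf for it)
    if followers <= 0:
        return base
    return base + scale * math.log10(followers)


for n in (0, 1, 10, 100, 1000):
    print('%5d followers -> size %.1f' % (n, marker_size(n)))
```

Each tenfold increase in followers adds a fixed 4 units of size, so a handful of very popular accounts cannot swamp the rest of the plot.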

In [ ]:
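# Tool list; "hover" is assumed here because a later cell looks up the HoverTool
TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,previewsave"

circle('x', 'y',
       source=source, tools=TOOLS,
       size='size',
       fill_color='colors', fill_alpha=0.4,
       line_color=None, title="Hoverful Scatter")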

In [ ]:
hover = [t for t in curplot().tools if isinstance(t, HoverTool)][0]

In [ ]:
# Variables from the data source are available with a "@" prefix, e.g., "@x" will display the
# x value under the cursor. There are also some special known values that
# start with "$" symbol:
#   - $index     index of selected point in the data source
#   - $x, $y     "data" coordinates under cursor
#   - $sx, $sy   canvas coordinates under cursor
#   - $color     color data from data source, syntax: $color[options]:field_name
# NOTE: we use an OrderedDict to preserve the order in the displayed tooltip
hover.tooltips = OrderedDict([
    # add to this
    ("(Followers, Following)", "(@x, @y)"),
    ("Name", "@name"),
    ("Username", "@username"),
])
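
As a mental model for the "@" prefix, the sketch below substitutes field values into a tooltip template in plain Python. It is not how Bokeh implements tooltips (that happens in BokehJS in the browser); render_tooltip and FIELD are hypothetical names, and the "$" special values are not handled.

```python
import re

# Matches "@" followed by a field name made of letters, digits, or underscores
FIELD = re.compile(r'@(\w+)')


def render_tooltip(template, row):
    """Replace each @field in `template` with the matching value from `row`."""
    # Unknown fields are left untouched (an assumption of this sketch)
    return FIELD.sub(lambda m: str(row.get(m.group(1), m.group(0))), template)


# One row of the data source, using values from the table shown earlier
row = {'x': 49, 'y': 105, 'name': 'Pradeep Gowda', 'username': 'btbytes'}

lines = [
    '%s: %s' % (label, render_tooltip(template, row))
    for label, template in [
        ('(Followers, Following)', '(@x, @y)'),
        ('Name', '@name'),
        ('Username', '@username'),
    ]
]
```

The field names after "@" are the keys of the dict passed to ColumnDataSource, which is why the source carries name and username columns even though they are never plotted.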

In [ ]:
curplot().title = "Bokeh Stargazers"
xaxis().axis_label = "Followers"
yaxis().axis_label = "Following"

In [ ]:
grid().grid_line_color = "white"
grid().grid_line_width = 1
axis().major_label_text_font_size = "12pt"
axis().major_label_standoff = 10            # distance of tick labels from ticks
axis().axis_line_color = None               # color, or None, to suppress the line
#xaxis().major_label_orientation = np.pi/4   # radians, "horizontal", "vertical", "normal"

axis().major_tick_line_alpha = 0
axis().axis_line_alpha = 0

In [ ]:
show()

In [204]:
x = df['followers']
y = df['following']

binner = np.histogram2d(x, y, bins=1000)
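
np.histogram2d returns a tuple of the count grid and the two arrays of bin edges, which is why a later cell indexes binner[0]. The pure-Python sketch below shows the counting idea on a tiny grid; histogram2d_sketch is a hypothetical name, and unlike NumPy it takes the ranges explicitly instead of deriving them from the data.

```python
def histogram2d_sketch(xs, ys, bins, x_range, y_range):
    """Count points falling into a bins-by-bins grid of equal-width cells."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the requested range
        # Scale each coordinate to a cell index; values on the upper
        # edge are folded into the last cell
        i = min(int((x - x_min) * bins / (x_max - x_min)), bins - 1)
        j = min(int((y - y_min) * bins / (y_max - y_min)), bins - 1)
        counts[i][j] += 1
    return counts


# (followers, following) pairs taken from the five rows shown earlier
followers = [4, 49, 80, 87, 27]
following = [4, 105, 60, 23, 3]
grid = histogram2d_sketch(followers, following, bins=2,
                          x_range=(0, 100), y_range=(0, 200))
```

As in NumPy, the first index of the grid follows x and the second follows y, so an image built from it may need transposing depending on how the plotting call lays out rows and columns.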

In [205]:
image(image=[binner[0]], x=[0], y=[0], dw=[800], dh=[20], palette=["Spectral-11"],
      x_range = Range1d(start=0, end=800), y_range = Range1d(start=0, end=4000),
      tools="pan,wheel_zoom,box_zoom,reset,previewsave")


Out[205]:
<bokeh.objects.Plot at 0x10821dd10>

In [206]:
show()


[Bokeh plot: 2D histogram image of followers against following]

In [ ]:
xr1 = Range1d(start=0, end=800)
xr2 = Range1d(start=0, end=100)
yr = Range1d(start=0, end=200)
       x_range=xr1,
       y_range=yr,