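This walkthrough covers several steps: gathering imports, loading and cleaning the stargazer data, bucketing users by account age, picking a color palette, and building first a basic and then a styled, hover-enabled Bokeh plot.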
First off, let's gather our imports. We're going to use the knockout duo of numpy and pandas to aggregate our data, and our featured module bokeh to plot directly in the notebook.
In [177]:
# Conventional import naming
import numpy as np
import pandas as pd
# Import plotting objects into our namespace for easy access
from bokeh.plotting import *
# Set the target of the plots to the current notebook
output_notebook()
# Utilities we'll reference later in the notebook
from bokeh.objects import HoverTool, Range1d
# Miscellaneous
import datetime as dt
from collections import OrderedDict
These data were obtained with a nifty tool called astronomer that pulls data on GitHub stargazers for any given repository. ("Stargazers" here refers to those GitHub users who have starred a particular repository.)
Example usage: astronomer --format csv --token PASTE_TOKEN_HERE --outfile bokeh-stats.csv ContinuumIO/bokeh
I first obtained a snapshot of Bokeh stargazers on April 9th, 2014; let's read it in with pandas and take a peek.
In [178]:
df = pd.read_csv('data/bokeh-stats_2014-4-9.csv', header=0)
df.head()
Out[178]:
There are a lot of interesting columns worth exploring here, but for the purpose of this tutorial we're going to analyze the followers-following relationship of Bokeh stargazers.
First things first: data cleansing. Let's convert the 'created_at' column data from strings to a proper datetime format.
In [179]:
print 'Before: ', type(df['created_at'][0])
df['created_at'] = pd.to_datetime(df['created_at'])
print 'After: ', type(df['created_at'][0])
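To make the conversion concrete, here is a minimal standard-library sketch of what happens to a single value. The timestamp format and the sample string are assumptions about what the CSV holds, and `parse_created_at` is a hypothetical helper for illustration, not part of pandas.

```python
from datetime import datetime

# Assumed format of the 'created_at' strings; adjust if the CSV differs.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def parse_created_at(text):
    """Turn one timestamp string into a datetime (hypothetical helper)."""
    return datetime.strptime(text, ISO_FORMAT)

sample = "2011-03-04T12:34:56Z"  # made-up value for illustration
parsed = parse_created_at(sample)

# Unlike strings, datetimes support arithmetic such as subtraction.
age_on_snapshot = datetime(2014, 4, 9) - parsed
```

Once the column holds real datetimes, operations like min(), max(), and subtraction behave the way the bucketing step needs.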
And let's only keep the columns we're interested in analyzing.
In [180]:
df = df[['name','login','followers','following','created_at']] # Pandas is nifty!
We're also going to use color to get a little insight into the "age" of the GitHub users, partitioning the 'created_at' column into even intervals and color-coding them appropriately. What's our spread for this column? Let's take a look.
In [181]:
youngest = df['created_at'].max()
oldest = df['created_at'].min()
print oldest
print youngest
Looks like we have about a six-year interval between the "oldest" and "youngest" GitHub members, so let's partition them into six buckets of equal length.
In [182]:
delta = youngest - oldest
period = delta/6
bucket_ends = [oldest+period*(x+1) for x in range(6)]
bucket_ends
Out[182]:
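The same arithmetic can be checked by hand with plain datetime objects. This is a sketch under assumed endpoints (the two dates are made up, not the real data), and `n_buckets` is a name introduced here for illustration.

```python
from datetime import datetime

# Made-up endpoints spanning six years (not the real data).
oldest = datetime(2008, 1, 1)
youngest = datetime(2014, 1, 1)

n_buckets = 6
# Dividing a timedelta by an int gives the length of one bucket.
period = (youngest - oldest) / n_buckets

# Upper edge of each bucket, in chronological order.
bucket_ends = [oldest + period * (i + 1) for i in range(n_buckets)]
```

With these endpoints the last edge lands exactly on `youngest`. On real data, rounding in the division can leave the last edge fractionally short of the maximum, so it is worth having a fallback for the youngest member.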
Let's also select a color palette. If you work with color plots consistently it may be worthwhile to explore options like brewer2mpl or cubehelix, but for simplicity I just used a popular online utility: ColorBrewer. Here's a pleasing purple gradient, ordered from lightest to darkest:
In [183]:
COLORS = ['#f2f0f7',
          '#dadaeb',
          '#bcbddc',
          '#9e9ac8',
          '#756bb1',
          '#54278f']
Aside: just for fun, let's also take advantage of IPython's HTML capabilities to render six <div> elements, one per color in the palette.
In [184]:
from IPython.display import HTML
boxes = ['<div style="width: 100px; height: 100px; float: left; background-color:%s"></div>' %c for c in COLORS]
HTML(''.join(boxes))
# Pretty nifty!!
Out[184]:
The following is a naïve algorithm that assigns each user a numerical index based on which bucket they fall into, [0..5].
In [185]:
def index(date):
    # Return the first bucket whose upper edge is on or after the date
    for y in range(6):
        if date <= bucket_ends[y]:
            return y
    return 5 # Edge case for youngest member
df['age'] = [index(date) for date in df['created_at']]
# Confirm that all dates were bucketed properly
#check = [i for i, x in enumerate(df['age']) if x is None]
#if check:
# print [x for i, x in enumerate(df['created_at']) if i in check]
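As an alternative to the linear scan, the same lookup can be sketched with the standard library's bisect module, which does a binary search over the sorted edges. The edges and dates below are made up for illustration, and `bucket_index` is a hypothetical name, not something from the notebook.

```python
from bisect import bisect_left
from datetime import datetime

# Made-up bucket edges, one per year end (assumption for illustration).
bucket_ends = [datetime(year, 12, 31) for year in range(2008, 2014)]

def bucket_index(date, edges):
    """Index of the first edge on or after date, clamped to the last bucket."""
    return min(bisect_left(edges, date), len(edges) - 1)

dates = [datetime(2008, 5, 1), datetime(2010, 12, 31), datetime(2014, 2, 1)]
ages = [bucket_index(d, bucket_ends) for d in dates]
```

Here `ages` comes out as [0, 2, 5]: a date equal to an edge stays in that edge's bucket, and anything past the final edge is clamped into the last one. For six buckets the speed difference is negligible, so this is mostly a matter of taste.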
This is where the magic happens.
Although we will take full advantage of Bokeh's plotting capabilities in the Plot Styling section, let's see how to generate a basic plot.
In [187]:
circle(
    df['followers'], # X axis
    df['following']  # Y axis
)
show()
So what exactly happened here?
We called circle() with two arguments, the x- and y-coordinate arrays, followed by show(), and we got a plot! Plus it's interactive to boot. But let's finish the process: let's style.
In [120]:
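figure(plot_width=800,
       background_fill='#eeeeee')
# Log-scale the marker size; note that np.log10(0) is -inf for users with zero followers
size = (1 + 4 * np.log10(df['followers']))
# Map each user's age bucket (0..5) to its shade in the COLORS palette
colors = [COLORS[x] for x in df['age']]
source = ColumnDataSource(
    data=dict(
        x=df['followers'],
        y=df['following'],
        size=size,
        colors=colors,
        name=df['name'],
        username=df['login'],
    )
)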
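To get a feel for the size formula, here is a small plain-Python sketch of the same log scaling. Clamping at one follower is my own assumption to keep the logarithm finite, and `marker_size` is a hypothetical helper, not part of the notebook.

```python
import math

def marker_size(followers):
    """Log-scaled marker size; clamps at 1 follower so log10 stays finite."""
    return 1 + 4 * math.log10(max(followers, 1))

# Zero, one, ten, and a thousand followers.
sizes = [marker_size(n) for n in (0, 1, 10, 1000)]
```

Each tenfold jump in followers adds four units of size, so a user with a thousand followers is drawn only about thirteen times larger than one with a single follower, rather than a thousand times larger.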
In [ ]:
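# TOOLS is not defined earlier in this notebook; this tool list is an assumption
# (it must include "hover" for the HoverTool lookup in the next cell to succeed)
TOOLS = "pan,wheel_zoom,box_zoom,reset,hover,previewsave"
circle('x', 'y',
       source=source, tools=TOOLS,
       size='size',
       fill_color='colors', fill_alpha=0.4,
       line_color=None, title="Hoverful Scatter")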
In [ ]:
hover = [t for t in curplot().tools if isinstance(t, HoverTool)][0]
In [ ]:
# Variables from the data source are available with a "@" prefix, e.g., "@x" will display the
# x value under the cursor. There are also some special known values that
# start with "$" symbol:
# - $index index of selected point in the data source
# - $x, $y "data" coordinates under cursor
# - $sx, $sy canvas coordinates under cursor
# - $color color data from data source, syntax: $color[options]:field_name
# NOTE: we use an OrderedDict to preserve the order in the displayed tooltip
hover.tooltips = OrderedDict([
    # add to this
    ("(Followers, Following)", "(@x, @y)"),
    ("Name", "@name"),
    ("Username", "@username"),
])
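If the "@" syntax feels opaque, the toy function below mimics the idea in plain Python: each "@field" in a template is swapped for that field's value at the hovered point. This is only an illustrative stand-in, not Bokeh's actual rendering code; `render_tooltip` and the sample row are hypothetical.

```python
import re
from collections import OrderedDict

def render_tooltip(tooltips, row):
    """Toy stand-in for tooltip rendering: swap each @field for row[field]."""
    def fill(template):
        return re.sub(r"@(\w+)", lambda m: str(row[m.group(1)]), template)
    return ["%s: %s" % (label, fill(template))
            for label, template in tooltips.items()]

tooltips = OrderedDict([
    ("(Followers, Following)", "(@x, @y)"),
    ("Name", "@name"),
    ("Username", "@username"),
])

# Made-up row standing in for one point of the data source.
row = {"x": 120, "y": 45, "name": "Ada Example", "username": "ada-example"}
lines = render_tooltip(tooltips, row)
```

The lines come back in the same order the entries were added, which is the reason for using an OrderedDict in the first place.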
In [ ]:
curplot().title = "Bokeh Stargazers"
xaxis().axis_label = "Followers"
yaxis().axis_label = "Following"
In [ ]:
grid().grid_line_color = "white"
grid().grid_line_width = 1
axis().major_label_text_font_size = "12pt"
axis().major_label_standoff = 10 # distance of tick labels from ticks
axis().axis_line_color = None # color, or None, to suppress the line
#xaxis().major_label_orientation = np.pi/4 # radians, "horizontal", "vertical", "normal"
axis().major_tick_line_alpha = 0
axis().axis_line_alpha = 0
In [ ]:
show()
In [204]:
x = df['followers']
y = df['following']
binner = np.histogram2d(x, y, bins=1000)
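np.histogram2d returns a tuple of (counts, x edges, y edges), which is why the next cell plots binner[0]. To show what those counts mean, here is a rough pure-Python sketch of equal-width 2D binning. `count_grid` is a hypothetical helper, the sample points are made up, and it assumes the data are not all identical along either axis.

```python
def count_grid(xs, ys, bins):
    """Count points on a bins-by-bins grid spanning each axis's min..max."""
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        # Scale into [0, bins); the maximum value belongs to the last bin.
        i = min(int((x - x_lo) * bins / (x_hi - x_lo)), bins - 1)
        j = min(int((y - y_lo) * bins / (y_hi - y_lo)), bins - 1)
        counts[i][j] += 1
    return counts

# Made-up followers/following pairs for illustration.
xs = [0, 1, 2, 9, 10]
ys = [0, 0, 5, 10, 10]
grid = count_grid(xs, ys, bins=2)
```

Rows of the grid follow the first argument (followers here) and columns follow the second, matching the orientation of the array numpy returns.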
In [205]:
image(image=[binner[0]], x=[0], y=[0], dw=[800], dh=[20], palette=["Spectral-11"],
      x_range=Range1d(start=0, end=800), y_range=Range1d(start=0, end=4000),
      tools="pan,wheel_zoom,box_zoom,reset,previewsave")
Out[205]:
In [206]:
show()
In [ ]:
xr1 = Range1d(start=0, end=800)
xr2 = Range1d(start=0, end=100)
yr = Range1d(start=0, end=200)
x_range=xr1,
y_range=yr,