This notebook walks you through importing a data set, wrangling the data into the desired shape, and plotting it with Bokeh. It also demonstrates some basic styling commands.
In [1]:
import numpy as np
import pandas as pd
Now let's import our plotting tool: Bokeh. The following import convention makes it easy to reference Bokeh tools in our namespace.
In [2]:
from bokeh.plotting import *
Now for some data. This tutorial will attempt to recreate a particular Variance Chart example, which lends some insight into the global temperature trend for the past century. The monthly data were accessed from this URL.
A key feature of Python data analytics is the ability to iteratively interact with your source data, dynamically reshaping and cleaning data. So whereas Variance Charts requires independent data files in a precise format for each of their plotting layers (monthly data, annual mean, and five year mean), we can perform numerical operations to derive the second and third categories from the first. And now with the introduction of Bokeh, we can enjoy an end-to-end analytics workflow in one programming language on one platform.
But talk is cheap. The rest of this tutorial will focus on demonstrating a common Bokeh use case: plotting scatter data and a couple of derived lines.
In [3]:
a = pd.read_csv('global_temp_monthly_recordings.csv', index_col=0, header=0)
That was easy!
A perk of being in an IPython Notebook environment: we can quickly get pretty-printed views into our data. For example, let's take a look at the top few rows in this data.
In [4]:
a.head()
Out[4]:
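If you don't have the CSV handy, here is a minimal sketch of how read_csv() treats a file in this kind of layout. The column names and values are invented for illustration; we are only assuming the real file has one row per year and one column per month. The index_col and header arguments are the ones used in the cell above.

```python
import io

import pandas as pd

# Hypothetical excerpt: the values below are made up, not real recordings.
csv_text = """Year,Jan,Feb,Mar
1880,-0.30,-0.21,-0.18
1881,-0.10,-0.14,0.01
"""

# index_col=0 turns the first column (Year) into the row index;
# header=0 takes the column names from the first line of the file.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)
print(sample.head())
```

With the year in the index, every remaining column holds one month of data, which is the shape the stacking step expects.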
Pretty nifty! Now let's wrangle this data into a usable format. First thing we'll do is stack the data.
In [5]:
b = a.stack()
b.head()
Out[5]:
Now we can see that pandas has automatically collapsed the DataFrame into a Series, to better represent the new 1-dimensional data. The Series has a MultiIndex, which is currently missing a title for the months, so let's add that in.
In [6]:
b.index.set_names(['year', 'month'], inplace=True)
b.head()
Out[6]:
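To see the stacking step in isolation, here is a small sketch on a made-up two-year, two-month frame (the numbers are invented). It shows stack() turning a wide DataFrame into a Series with a two-level index, and then names both levels.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month.
wide = pd.DataFrame(
    {"Jan": [-0.30, -0.10], "Feb": [-0.21, -0.14]},
    index=[1880, 1881],
)

# stack() moves the column labels into a second index level,
# so each (year, month) pair becomes one entry in a Series.
stacked = wide.stack()

# Name both index levels so they are easy to refer to later.
stacked.index = stacked.index.set_names(["year", "month"])
print(stacked)
```

Assigning the renamed index back is equivalent to the in-place call above; it is just a matter of taste.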
Bokeh supports a datetime axis, so let's format our data a little bit.
First we'll reset the index, so that the year and month become data columns that we can more easily access.
In [7]:
c = b.reset_index()
c.head()
Out[7]:
And let's add a header for our data column, so we can keep track of it with a label.
In [8]:
c.rename(columns={0:'temp_delta'}, inplace=True)
c.head()
Out[8]:
Now for some magic: pandas has a top-level function, to_datetime(), which returns a Series of timestamps when given a Series of strings; we're going to assign that to a new column in our DataFrame.
In [9]:
c[['year','month']] = c[['year', 'month']].astype(str) # A little type coercion so our to_datetime() call won't complain
c['date'] = pd.Series(pd.to_datetime(c['year'] + c['month'], format="%Y%b"))
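Here is the same idea as a tiny standalone sketch. It assumes the month labels are English three-letter abbreviations such as "Jan", which is what the "%b" format code matches in an English locale; the sample rows are invented.

```python
import pandas as pd

# Hypothetical rows: integer years and abbreviated month names.
frame = pd.DataFrame({"year": [1880, 1880], "month": ["Jan", "Feb"]})

# Concatenate year and month into strings such as "1880Jan".
combined = frame["year"].astype(str) + frame["month"]

# "%Y" matches the four-digit year, "%b" the abbreviated month name;
# the day defaults to the first of the month.
dates = pd.to_datetime(combined, format="%Y%b")
print(dates)
```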
We no longer need the year and month columns, so we can drop them from the DataFrame.
In [10]:
c.drop(['year', 'month'], inplace=True, axis=1)
c.head()
Out[10]:
Great; we now have the data we want in the format we want: datetime objects for each month and a corresponding temperature delta. It's time to visualize!
First let's specify the output format: we want to display the plots directly in the notebook!
In [11]:
output_notebook()
(That previous call should return with a jaunty response of "Configuring embedded BokehJS mode.")
Let's begin plotting: our first layer will be a scatter plot of the monthly temperature deltas. Below you can see the scatter() command, which accepts data in several formats. Here we're passing in the date and temp_delta columns from our DataFrame, declaring the x_axis_type to be "datetime", setting a legend, and selecting which tools we want available to interact with the plot.
In [12]:
months = c['date'] # Refer to the 'date' column of the DataFrame as 'months'
monthly_data = c['temp_delta'] # Refer to the 'temp_delta' column of the DataFrame as 'monthly_data'
scatter(
months, # X coordinates
monthly_data, # Y coordinates
x_axis_type = "datetime",
legend='Temperature Delta (monthly)',
tools="pan,wheel_zoom,box_zoom,reset,previewsave" # Declare available plot tools
)
Out[12]:
There's our Plot instance! We can render it directly in the notebook with a show() command:
In [13]:
show()
From the above plot you should see that Bokeh has automatically determined the bounds based on the data, with appropriately scaled tick marks.
Next let's explore some common modifications. First we will declare a new figure() with an increased plot width, and call hold() so that new renderers are added to the same plot.
In [14]:
figure(plot_width=1000)
hold()
Now let's get a view into average annual temperature data.
We want the mean temperature delta per year, which involves averaging across twelve months for each year. (This is not entirely accurate, given that we are averaging across months of unequal length, but it is accurate enough for the purpose of demonstration.)
Lucky for us, pandas lets us group on an index level and take the mean() of each group:
In [15]:
annual = b.groupby(level=0).mean()
annual.head()
Out[15]:
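As a quick sanity check of per-year averaging, here is a sketch on a made-up Series with a (year, month) index. It uses groupby(level=...) followed by mean(), which is one way to average across an index level; the values are invented so the means are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: two years with two months each.
index = pd.MultiIndex.from_product(
    [[1880, 1881], ["Jan", "Feb"]], names=["year", "month"]
)
values = pd.Series([-0.30, -0.20, -0.10, 0.10], index=index)

# Group the entries by the 'year' level and average each group.
annual_mean = values.groupby(level="year").mean()
print(annual_mean)
```

The result is indexed by year alone, with one mean per year.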
The fellows over at Variance decided to align their monthly data along the same vertical axis for each year; this helps clean up the presentation a little bit at the slight cost of horizontal accuracy. We can handily reproduce this view with a couple of lines of Python: first we convert the annual index values to Timestamps, then we repeat each Timestamp 12 times.
In [16]:
years = pd.Series(pd.to_datetime(annual.index.values, format="%Y"))
years_expanded = np.array([12*[x] for x in years]).flatten()
years_expanded[0:20]
Out[16]:
You can see above that each Timestamp is repeated 12 times; monthly_data will now render vertically aligned by year.
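If the nested list comprehension feels a bit opaque, numpy's repeat() is an alternative way to get the same effect. The sketch below uses two made-up years and, like the cell above, assumes every year has exactly 12 monthly values.

```python
import numpy as np
import pandas as pd

# Two hypothetical years, parsed as the first day of each year.
years_demo = pd.to_datetime(["1880", "1881"], format="%Y")

# np.repeat repeats each element in place: a, a, ..., b, b, ...
expanded = np.repeat(years_demo.values, 12)
print(expanded[:3], expanded[-3:])
```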
Now let's call our scatter() method again with a few tweaks.
In [17]:
# Development for Bokeh's datetime axis support is still ongoing,
# so in the meantime we have to set the radius by milliseconds.
MS_IN_YEAR = 31556952000 # Number of milliseconds in a year
# This lets us scale along the x-axis (datetime) in units of years
scatter(
years_expanded,
monthly_data,
color='#BBBBBB',
line_alpha=0.2,
fill_alpha=0.5,
radius=MS_IN_YEAR/12*4, # Workaround; radius of points should be a third of a year (four months)
legend='Monthly Mean',
x_axis_type = "datetime",
tools="pan,wheel_zoom,box_zoom,reset,previewsave"
)
Out[17]:
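In case the MS_IN_YEAR constant looks like a magic number, the sketch below shows one way to arrive at it. We are assuming it is based on the mean Gregorian year of 365.2425 days, which reproduces the value exactly; the variable names here are our own.

```python
# Milliseconds in one day: 24 h * 60 min * 60 s * 1000 ms.
MS_PER_DAY = 24 * 60 * 60 * 1000

# Assumption: mean length of a Gregorian calendar year, in days.
DAYS_PER_YEAR = 365.2425

ms_in_year = int(round(DAYS_PER_YEAR * MS_PER_DAY))

# A third of a year (four months), as used for the scatter radius.
radius = ms_in_year / 12 * 4

print(ms_in_year, radius)
```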
Perfect! Now let's add our annual scatter and line renderers to our plot as well.
In [18]:
# Scatter points of annual temperature deltas (calculated)
scatter(
years,
annual,
color='#464678',
alpha=1,
radius=MS_IN_YEAR/12*4,
)
# Line to connect scatter points; refers to same data sources (years, annual)
line(
years,
annual,
color='#464678',
alpha=1,
legend='Annual Mean'
)
show()
This is a good start, and it's quickly beginning to resemble the plot we are targeting, but here we will diverge a bit from the Variance chart.
Instead of plotting a line of the five year mean, we will use Kaiser window smoothing with the help of numpy's numerical convolution.
In [19]:
# This technique is heavily derived from glowingpython.blogspot.com/2012/02/convolution-with-numpy.html
def smooth(x, beta):
    """Smoothing with a Kaiser window function."""
    # Set window length to five years (5 * 12 months)
    window_len = 60
    # Extend the data at the beginning and at the end to apply the window at the borders
    s = np.r_[x[window_len-1:0:-1], x, x[-1:-window_len:-1]]
    w = np.kaiser(window_len, beta)
    y = np.convolve(w/w.sum(), s, mode='valid')
    # Trim the padding so the result has the same length as x and lines up with it
    start = window_len // 2
    return y[start:start + len(x)]
We can pass that function an array of data and specify a beta factor, and it will return a smoothed array.
In [20]:
smoothed = smooth(monthly_data, 2)
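To get a feel for what Kaiser smoothing does, here is a self-contained sketch on synthetic data (a noisy sine wave, not the temperature series). The helper name kaiser_smooth and its window_len parameter are our own for illustration; it mirrors the padding-and-convolution approach and trims the result back to the input length, and it assumes the input has at least window_len points.

```python
import numpy as np


def kaiser_smooth(x, beta, window_len=60):
    """Smooth x with a normalized Kaiser window (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    # Mirror the data at both ends so the window can cover the borders.
    padded = np.r_[x[window_len - 1:0:-1], x, x[-1:-window_len:-1]]
    window = np.kaiser(window_len, beta)
    # Dividing by the window sum makes this a weighted moving average.
    smoothed = np.convolve(window / window.sum(), padded, mode="valid")
    # Drop the extra padding so the output lines up with the input.
    start = window_len // 2
    return smoothed[start:start + len(x)]


# Synthetic signal: a sine wave plus random noise (seeded for repeatability).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 600)
noisy = np.sin(t) + rng.normal(scale=0.3, size=t.size)

result = kaiser_smooth(noisy, beta=2)
print(result.shape)
```

A larger beta narrows the window's main lobe in time, so the average leans more heavily on the nearest points; beta=0 gives a plain rectangular moving average.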
Let's put it all together now.
First we declare a new figure, turn on hold(), and draw our first four renderers: monthly scatter data, annual scatter points (calculated), a line to connect those points, and our Kaiser smoothed line.
In [21]:
figure(plot_width=1000)
hold()
scatter(
years_expanded,
monthly_data,
color='#BBBBBB',
line_alpha=0.2,
fill_alpha=0.5,
radius=MS_IN_YEAR/12*4, # Workaround; radius of points should be a third of a year (four months)
legend='Monthly Mean',
x_axis_type = "datetime",
background_fill='#F7F7EF',
tools="pan,wheel_zoom,box_zoom,reset,previewsave"
)
line(
years,
annual,
color='#464678',
alpha=1,
legend='Annual Mean'
)
scatter(
years,
annual,
color='#464678',
alpha=1,
radius=MS_IN_YEAR/12*4,
)
line(
months,
smoothed,
color='#F13239',
line_width=2,
legend='Kaiser Smoothed',
x_axis_type = "datetime",
tools="pan,wheel_zoom,box_zoom,reset,previewsave"
)
Out[21]:
Let's also add in the text for completeness.
In [22]:
text([0], # X baseline: 0 milliseconds from start of UNIX time = 1970 C.E.
[-0.5], # Y baseline: -0.5 °C
"Global Surface Temperature",
text_font='helvetica',
text_font_style='bold',
text_color='#3A3A3A',
angle=0)
text([0, 0],[-0.6, -0.67], # List of x- and y-index positions
["Change in global surface temperature relative",
"to 1951-1980 average temperatures."],
text_font_size='10pt',
text_font='helvetica neue',
text_color='#3A3A3A',
angle=0)
Out[22]:
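The x position of 0 above places the text at the start of UNIX time. If you wanted to anchor text at some other date, one option is to work out that date's offset from the epoch in milliseconds yourself. The helper below is hypothetical (not part of Bokeh or pandas) and assumes the datetime axis reads plain numbers as milliseconds since 1970.

```python
import pandas as pd


def to_epoch_ms(date_string):
    """Milliseconds from the UNIX epoch to the given date (hypothetical helper)."""
    delta = pd.Timestamp(date_string) - pd.Timestamp("1970-01-01")
    # Integer division by a one-millisecond Timedelta yields a plain int.
    return delta // pd.Timedelta(milliseconds=1)


print(to_epoch_ms("1970-01-01"))
print(to_epoch_ms("1950-01-01"))
```

Dates before 1970 come out negative, which is why the earlier part of the century sits to the left of x = 0.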
With our renderers declared, let's apply some more advanced plot styling.
In [23]:
legend().orientation = 'top_left' # Put the legend in an empty corner
# Set axis line, label, and tick color
gray = "#DDDDDD"
axis().axis_line_color = gray
axis().axis_label_color = gray
axis().axis_major_tick_label_color = gray
yaxis().location = 'right' # Move y-axis to the right side of the plot
yaxis().bounds = (-1, 1) # Set y-axis bounds to -1..1 °C
# As Bokeh is still under development, we do not yet have an intelligent way to
# dynamically add units to axis labels. Instead we will give the y-axis a descriptive label.
yaxis().axis_label = "Temperature Delta (°C)"
ygrid().grid_line_color = gray
ygrid().grid_line_dash = "2 2" # Set line dash pattern (2px on, 2px off)
xgrid().grid_line_color = None # Disable x-grid lines
Our renderers have been declared and the plot styles have been applied; now let's show() the plot!
In [24]:
show()