Hello! Welcome!

Did you see my talk at PyCon 2014? You can use this notebook to try it out for yourself!

If you didn't, and you'd like to, the video is here: Diving into Open Data with IPython Notebook and Pandas. The pandas cookbook I mentioned is at http://github.com/jvns/pandas-cookbook.

IMPORTANT: To make this work, you'll need to

  1. Install IPython Notebook and Pandas (easiest way: use the free/open source Anaconda)
  2. Download this notebook
  3. Download the data and put it in the same directory: 2012.csv (original source)
  4. In that directory, run ipython notebook


This work is licensed under a Creative Commons Attribution 4.0 International License.


In [1]:
# Some imports we'll need
import numpy as np
import pandas as pd
julia = {'email': 'julia@jvns.ca', 'twitter': 'http://twitter.com/b0rk', 'slides': 'http://bit.ly/pycon-pandas', 'website': 'http://jvns.ca'}

Diving into Open Data with IPython Notebook & Pandas

I'm Julia Evans

Data at Stripe, work on Montréal All-Girl Hack Night, PyLadies MTL

You can follow along with this talk at:

http://bit.ly/pycon-pandas


In [2]:
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']


Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca

This talk

1. What are IPython Notebook & Pandas?

2. Practical examples of using them!

3. Advice.

IPython Notebook + Pandas + Numpy: what are they?

IPython Notebook

  • web-based user interface to IPython
  • pretty graphs
  • literate programming
  • Can make slideshows :) (this presentation)
  • version controlled science!

Pandas: No loops!

Imagine:

  • a dataset with every complaint call ever made in New York
  • you want to know how many noise complaints are made in each borough, each hour

You don't need to write loops to do this!

And the solution is only 5 lines of code
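A sketch of what that solution looks like, using a tiny made-up dataset (the column names here are assumptions for illustration, not the real NYC complaints schema):

```python
import pandas as pd

# Tiny stand-in for the NYC complaints dataset (column names are guesses)
complaints = pd.DataFrame({
    'Complaint Type': ['Noise', 'Noise', 'Rodent', 'Noise'],
    'Borough': ['BROOKLYN', 'QUEENS', 'BROOKLYN', 'BROOKLYN'],
    'Created Date': pd.to_datetime(['2012-01-01 02:00', '2012-01-01 02:30',
                                    '2012-01-01 03:00', '2012-01-01 03:15']),
})

# No loops: filter to noise complaints, then count per borough per hour
noise = complaints[complaints['Complaint Type'] == 'Noise']
counts = noise.groupby(['Borough', noise['Created Date'].dt.hour]).size()
print(counts)
```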

Numpy makes Pandas fast

  • numpy: fast array computations
  • pandas is built on top of numpy

In [3]:
py_list = range(20000000)
numpy_array = np.arange(20000000)

In [4]:
%%timeit
total = 0
for x in py_list:
    total += x * x


1 loops, best of 3: 1.22 s per loop

In [5]:
%%timeit
np.sum(numpy_array * numpy_array)


10 loops, best of 3: 83.4 ms per loop
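The numpy cell above computes a sum of squares. At a smaller scale, it's easy to check that a plain Python loop and the vectorized expression agree:

```python
import numpy as np

n = 1000
# Sum of squares with an explicit loop...
py_total = 0
for x in range(n):
    py_total += x * x

# ...and as a single vectorized numpy expression
np_total = np.sum(np.arange(n) * np.arange(n))
print(py_total, int(np_total))  # both are 332833500
```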

How do I install it?

Don't: Use the Ubuntu packages

sudo apt-get install ipython-notebook

Do: Use pip or Anaconda

pip install ipython tornado pyzmq
pip install numpy pandas matplotlib

How to run IPython Notebook

$ ipython notebook

Practical examples!!!

Sensors on bike paths measure the number of cyclists

Part 1: Import the 2012 bike path data from a CSV


In [6]:
import pandas as pd
import numpy as np

In [7]:
import matplotlib
# display graphs inline
%matplotlib inline 

# Make graphs prettier
pd.set_option('display.max_columns', 15)
pd.set_option('display.line_width', 400)
pd.set_option('display.mpl_style', 'default')

# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))
matplotlib.rc('font', family='normal', weight='bold', size=22)

In [8]:
bike_data = pd.read_csv("./2012.csv")
bike_data[:5]


Out[8]:
Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

After: tell read_csv about the encoding, separator, and dates


In [9]:
bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)

In [10]:
# Get rid of missing columns
bike_data = bike_data.dropna(axis=1)
# Only use 3 of the columns so it all fits on the screen

In [11]:
bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
bike_data[:5]


Out[11]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-02 83 1 68
2012-01-03 135 2 104
2012-01-04 144 1 116
2012-01-05 197 2 124

5 rows × 3 columns

Part 2: Take a look at the data

We have a dataframe:


In [12]:
bike_data[:3]


Out[12]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-02 83 1 68
2012-01-03 135 2 104

3 rows × 3 columns


In [13]:
bike_data.plot()


Out[13]:
<matplotlib.axes.AxesSubplot at 0x21100390>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [13]:
bike_data.median()


Out[13]:
Berri 1                  3128.0
Côte-Sainte-Catherine    1269.0
Maisonneuve 1            2019.5
dtype: float64

In [14]:
bike_data.median().plot(kind='bar')


Out[14]:
<matplotlib.axes.AxesSubplot at 0x21e3c090>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Slicing dataframes


In [15]:
# column slice
column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
# row slice
column_slice[:3]


Out[15]:
Berri 1 Maisonneuve 1
Date
2012-01-01 35 38
2012-01-02 83 68
2012-01-03 135 104

3 rows × 2 columns

Which days are the worst?


In [16]:
bike_data['Berri 1'] < 75


Out[16]:
Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15     True
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310

Which days are the worst?


In [17]:
bike_data[bike_data['Berri 1'] < 75]


Out[17]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-14 32 0 54
2012-01-15 54 0 33
2012-01-21 53 0 47
2012-01-22 71 0 41
2012-02-05 72 0 46
2012-02-11 71 0 63
2012-02-25 62 0 48

8 rows × 3 columns

Part 3: Do more people bike on weekdays or weekends?

Step 1: add a 'weekday' column to our dataframe


In [18]:
bike_data['weekday'] = bike_data.index.weekday
bike_data.head()


Out[18]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 weekday
Date
2012-01-01 35 0 38 6
2012-01-02 83 1 68 0
2012-01-03 135 2 104 1
2012-01-04 144 1 116 2
2012-01-05 197 2 124 3

5 rows × 4 columns

Step 2: Use .groupby() and .aggregate() to get the counts


In [19]:
counts_by_day = bike_data.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[19]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
weekday
0 134298 60329 90051
1 135305 58708 92035
2 152972 67344 104891
3 160131 69028 111895
4 141771 56446 98568
5 101578 34018 62067
6 99310 36466 55324

7 rows × 3 columns

Step 3: draw a graph!


In [20]:
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_day['Berri 1'].plot(kind='bar')


Out[20]:
<matplotlib.axes.AxesSubplot at 0x21aad6d0>

ZERO LOOPS!

There's more going on, though


In [19]:
bike_data['Berri 1'].plot()


Out[19]:
<matplotlib.axes.AxesSubplot at 0x57c2650>

Part 4: Grab some weather data and look at the temperatures


In [20]:
def get_weather_data(year):
    url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
    # mctavish station: 10761, airport station: 5415
    data_by_month = []
    for month in range(1, 13):
        url = url_template.format(year=year, month=month)
        weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
        weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
        weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
        data_by_month.append(weather_data.dropna())
    # Concatenate and drop any empty columns
    return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()

In [21]:
weather_data = get_weather_data(2012)

In [22]:
weather_data[:5]


Out[22]:
Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather
Date/Time
2012-01-01 00:00:00 -1.8 -3.9 86 4 8.0 101.24 Fog
2012-01-01 01:00:00 -1.8 -3.7 87 4 8.0 101.24 Fog
2012-01-01 02:00:00 -1.8 -3.4 89 7 4.0 101.26 Freezing Drizzle,Fog
2012-01-01 03:00:00 -1.5 -3.2 88 6 4.0 101.27 Freezing Drizzle,Fog
2012-01-01 04:00:00 -1.5 -3.3 88 7 4.8 101.23 Fog

5 rows × 7 columns

We need the temperatures every day, not every hour...


In [23]:
bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')

In [24]:
bike_data.head()


Out[24]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 weekday mean temp
Date
2012-01-01 35 0 38 6 0.629167
2012-01-02 83 1 68 0 0.041667
2012-01-03 135 2 104 1 -14.416667
2012-01-04 144 1 116 2 -13.645833
2012-01-05 197 2 124 3 -6.750000

5 rows × 5 columns

Bikers per day and temperature


In [25]:
bike_data[['Berri 1', 'mean temp']].plot(subplots=True)


Out[25]:
array([<matplotlib.axes.AxesSubplot object at 0x5bae610>,
       <matplotlib.axes.AxesSubplot object at 0x616d2d0>], dtype=object)

Do people bike when it's raining?


In [26]:
bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').resample('D', how='mean')

In [27]:
bike_data[['Berri 1', 'Rain']].plot(subplots=True)


Out[27]:
array([<matplotlib.axes.AxesSubplot object at 0x6855b90>,
       <matplotlib.axes.AxesSubplot object at 0x6548b50>], dtype=object)

Some advice

  • Read (some of) the documentation
  • http://pandas.pydata.org/ has a 460-page PDF with lots of examples
  • Python for Data Analysis by Wes McKinney is great
  • Use numpy/pandas built-in operations, try not to write your own loops (though: see Numba, Cython, ...)
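To illustrate that last point, here's a minimal comparison (made-up data, not from the talk): the vectorized version gives the same answer as the loop, but the work happens in C instead of the Python interpreter:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100000))

# Explicit Python loop: interpreted, element by element
looped = sum(x * 2 for x in s)

# Vectorized: one pandas/numpy operation implemented in C
vectorized = int((s * 2).sum())

print(looped == vectorized)  # True
```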

Thanks! Questions?


In [78]:
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']
print 'Slides:',  julia['slides']


Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca
Slides: http://bit.ly/pycon-pandas