Hello! Welcome!

Did you see my talk at PyCon 2014? You can use this notebook to try it out for yourself!

If you didn't, and you'd like to, the video is here: Diving into Open Data with IPython Notebook and Pandas. The pandas cookbook I mentioned is at http://github.com/jvns/pandas-cookbook.

IMPORTANT: To make this work, you'll need to

  1. Install IPython Notebook and Pandas (easiest way: use the free/open source Anaconda)
  2. Download this notebook
  3. Download the data and put it in the same directory: 2012.csv (original source)
  4. In that directory, run ipython notebook


This work is licensed under a Creative Commons Attribution 4.0 International License.


In [1]:
# Some imports we'll need
import numpy as np
import pandas as pd
julia = {'email': 'julia@jvns.ca', 'twitter': 'http://twitter.com/b0rk', 'slides': 'http://bit.ly/pycon-pandas', 'website': 'http://jvns.ca'}

Diving into Open Data with IPython Notebook & Pandas

I'm Julia Evans

Data at Stripe, work on Montréal All-Girl Hack Night, PyLadies MTL

You can follow along with this talk at:

http://bit.ly/pycon-pandas


In [2]:
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']


Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca

This talk

1. What are IPython Notebook & Pandas?

2. Practical examples of using them!

3. Advice.

IPython Notebook + Pandas + Numpy: what are they?

IPython Notebook

  • web-based user interface to IPython
  • pretty graphs
  • literate programming
  • Can make slideshows :) (this presentation)
  • version controlled science!

Pandas: No loops!

Imagine:

  • a dataset with every complaint call ever made in New York
  • you want to know how many noise complaints are made in each borough, each hour

You don't need to write loops to do this!

And the solution is only 5 lines of code
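A sketch of what that solution looks like, using a tiny made-up dataset (the column names here are assumptions for illustration, not the real NYC complaints schema):

```python
import pandas as pd

# Tiny stand-in for the NYC complaints dataset (column names are guesses)
complaints = pd.DataFrame({
    'Complaint Type': ['Noise', 'Noise', 'Rodent', 'Noise'],
    'Borough': ['BROOKLYN', 'QUEENS', 'BROOKLYN', 'BROOKLYN'],
    'Created Date': pd.to_datetime(['2012-01-01 02:00', '2012-01-01 02:30',
                                    '2012-01-01 03:00', '2012-01-01 03:15']),
})

# No loops: filter to noise complaints, then count per borough per hour
noise = complaints[complaints['Complaint Type'] == 'Noise']
counts = noise.groupby(['Borough', noise['Created Date'].dt.hour]).size()
print(counts)
```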

Numpy makes Pandas fast

  • numpy: fast array computations
  • pandas is built on top of numpy

In [3]:
py_list = range(20000000)
numpy_array = np.arange(20000000)

In [4]:
%%timeit
total = 0
for x in py_list:
    total += x * x


1 loops, best of 3: 1.22 s per loop

In [5]:
%%timeit
np.sum(numpy_array * numpy_array)


10 loops, best of 3: 83.4 ms per loop
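The numpy cell above computes a sum of squares. At a smaller scale, it's easy to check that a plain Python loop and the vectorized expression agree:

```python
import numpy as np

n = 1000
# Sum of squares with an explicit loop...
py_total = 0
for x in range(n):
    py_total += x * x

# ...and as a single vectorized numpy expression
np_total = np.sum(np.arange(n) * np.arange(n))
print(py_total, int(np_total))  # both are 332833500
```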

How do I install it?

Don't: Use the Ubuntu packages

sudo apt-get install ipython-notebook

Do: Use pip or Anaconda

pip install ipython tornado pyzmq
pip install numpy pandas matplotlib

How to run IPython Notebook

$ ipython notebook

Practical examples!!!

Sensors on bike paths measure the number of cyclists

Part 1: Import the 2012 bike path data from a CSV


In [6]:
import pandas as pd
import numpy as np

In [7]:
import matplotlib
# display graphs inline
%matplotlib inline 

# Make graphs prettier
pd.set_option('display.max_columns', 15)
pd.set_option('display.line_width', 400)
pd.set_option('display.mpl_style', 'default')

# Make the fonts bigger
matplotlib.rc('figure', figsize=(14, 7))
matplotlib.rc('font', family='normal', weight='bold', size=22)

In [8]:
bike_data = pd.read_csv("./2012.csv")
bike_data[:5]


Out[8]:
Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0 01/01/2012;35;;0;38;51;26;10;16;
1 02/01/2012;83;;1;68;153;53;6;43;
2 03/01/2012;135;;2;104;248;89;3;58;
3 04/01/2012;144;;1;116;318;111;8;61;
4 05/01/2012;197;;2;124;330;97;13;95;

5 rows × 1 columns

After: tell read_csv about the encoding, separator, and dates


In [9]:
bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)

In [10]:
# Get rid of missing columns
bike_data = bike_data.dropna(axis=1)
# Only use 3 of the columns so it all fits on the screen

In [11]:
bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
bike_data[:5]


Out[11]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-02 83 1 68
2012-01-03 135 2 104
2012-01-04 144 1 116
2012-01-05 197 2 124

5 rows × 3 columns

Part 2: Take a look at the data

We have a dataframe:


In [12]:
bike_data[:3]


Out[12]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-02 83 1 68
2012-01-03 135 2 104

3 rows × 3 columns


In [13]:
bike_data.plot()


Out[13]:
<matplotlib.axes.AxesSubplot at 0x21100390>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

In [13]:
bike_data.median()


Out[13]:
Berri 1                  3128.0
Côte-Sainte-Catherine    1269.0
Maisonneuve 1            2019.5
dtype: float64

In [14]:
bike_data.median().plot(kind='bar')


Out[14]:
<matplotlib.axes.AxesSubplot at 0x21e3c090>
/opt/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))

Slicing dataframes


In [15]:
# column slice
column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
# row slice
column_slice[:3]


Out[15]:
Berri 1 Maisonneuve 1
Date
2012-01-01 35 38
2012-01-02 83 68
2012-01-03 135 104

3 rows × 2 columns

Which days are the worst?


In [16]:
bike_data['Berri 1'] < 75


Out[16]:
Date
2012-01-01     True
2012-01-02    False
2012-01-03    False
2012-01-04    False
2012-01-05    False
2012-01-06    False
2012-01-07    False
2012-01-08    False
2012-01-09    False
2012-01-10    False
2012-01-11    False
2012-01-12    False
2012-01-13    False
2012-01-14     True
2012-01-15     True
...
2012-10-22    False
2012-10-23    False
2012-10-24    False
2012-10-25    False
2012-10-26    False
2012-10-27    False
2012-10-28    False
2012-10-29    False
2012-10-30    False
2012-10-31    False
2012-11-01    False
2012-11-02    False
2012-11-03    False
2012-11-04    False
2012-11-05    False
Name: Berri 1, Length: 310

Which days are the worst?


In [17]:
bike_data[bike_data['Berri 1'] < 75]


Out[17]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
Date
2012-01-01 35 0 38
2012-01-14 32 0 54
2012-01-15 54 0 33
2012-01-21 53 0 47
2012-01-22 71 0 41
2012-02-05 72 0 46
2012-02-11 71 0 63
2012-02-25 62 0 48

8 rows × 3 columns

Part 3: Do more people bike on weekdays or weekends?

Step 1: add a 'weekday' column to our dataframe


In [18]:
bike_data['weekday'] = bike_data.index.weekday
bike_data.head()


Out[18]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 weekday
Date
2012-01-01 35 0 38 6
2012-01-02 83 1 68 0
2012-01-03 135 2 104 1
2012-01-04 144 1 116 2
2012-01-05 197 2 124 3

5 rows × 4 columns

Step 2: Use .groupby() and .aggregate() to get the counts


In [19]:
counts_by_day = bike_data.groupby('weekday').aggregate(np.sum)
counts_by_day


Out[19]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1
weekday
0 134298 60329 90051
1 135305 58708 92035
2 152972 67344 104891
3 160131 69028 111895
4 141771 56446 98568
5 101578 34018 62067
6 99310 36466 55324

7 rows × 3 columns

Step 3: draw a graph!


In [20]:
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_day['Berri 1'].plot(kind='bar')


Out[20]:
<matplotlib.axes.AxesSubplot at 0x21aad6d0>

ZERO LOOPS!

There's more going on, though


In [19]:
bike_data['Berri 1'].plot()


Out[19]:
<matplotlib.axes.AxesSubplot at 0x57c2650>

Part 4: Grab some weather data and look at the temperatures


In [20]:
def get_weather_data(year):
    url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
    # mctavish station: 10761, airport station: 5415
    data_by_month = []
    for month in range(1, 13):
        url = url_template.format(year=year, month=month)
        weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
        weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
        weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
        data_by_month.append(weather_data.dropna())
    # Concatenate and drop any empty columns
    return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()

In [21]:
weather_data = get_weather_data(2012)

In [22]:
weather_data[:5]


Out[22]:
Temp (C) Dew Point Temp (C) Rel Hum (%) Wind Spd (km/h) Visibility (km) Stn Press (kPa) Weather
Date/Time
2012-01-01 00:00:00 -1.8 -3.9 86 4 8.0 101.24 Fog
2012-01-01 01:00:00 -1.8 -3.7 87 4 8.0 101.24 Fog
2012-01-01 02:00:00 -1.8 -3.4 89 7 4.0 101.26 Freezing Drizzle,Fog
2012-01-01 03:00:00 -1.5 -3.2 88 6 4.0 101.27 Freezing Drizzle,Fog
2012-01-01 04:00:00 -1.5 -3.3 88 7 4.8 101.23 Fog

5 rows × 7 columns

We need the temperatures every day, not every hour...


In [23]:
bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')

In [24]:
bike_data.head()


Out[24]:
Berri 1 Côte-Sainte-Catherine Maisonneuve 1 weekday mean temp
Date
2012-01-01 35 0 38 6 0.629167
2012-01-02 83 1 68 0 0.041667
2012-01-03 135 2 104 1 -14.416667
2012-01-04 144 1 116 2 -13.645833
2012-01-05 197 2 124 3 -6.750000

5 rows × 5 columns

Bikers per day and temperature


In [25]:
bike_data[['Berri 1', 'mean temp']].plot(subplots=True)


Out[25]:
array([<matplotlib.axes.AxesSubplot object at 0x5bae610>,
       <matplotlib.axes.AxesSubplot object at 0x616d2d0>], dtype=object)

Do people bike when it's raining?


In [26]:
bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').resample('D', how='mean')

In [27]:
bike_data[['Berri 1', 'Rain']].plot(subplots=True)


Out[27]:
array([<matplotlib.axes.AxesSubplot object at 0x6855b90>,
       <matplotlib.axes.AxesSubplot object at 0x6548b50>], dtype=object)

Some advice

  • Read (some of) the documentation
  • http://pandas.pydata.org/ has a 460-page PDF with lots of examples
  • Python for Data Analysis by Wes McKinney is great
  • Use numpy/pandas built-in operations, try not to write your own loops (though: see Numba, Cython, ...)
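To illustrate that last point, here's a minimal comparison (made-up data, not from the talk): the vectorized version gives the same answer as the loop, but the work happens in C instead of the Python interpreter:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100000))

# Explicit Python loop: interpreted, element by element
looped = sum(x * 2 for x in s)

# Vectorized: one pandas/numpy operation implemented in C
vectorized = int((s * 2).sum())

print(looped == vectorized)  # True
```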

Thanks! Questions?


In [78]:
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Blog:', julia['website']
print 'Slides:',  julia['slides']


Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Blog: http://jvns.ca
Slides: http://bit.ly/pycon-pandas