Diving into Open Data with IPython Notebook & Pandas

I'm Julia Evans

http://twitter.com/b0rk

http://jvns.ca

http://github.com/jvns

Data scientist, programmer, co-organize Montréal All-Girl Hack Night, PyLadies MTL

You can follow along with this talk at:

http://bit.ly/pyconca-pandas

IPython Notebook

web-based user interface to IPython
pretty graphs
literate programming
Can make slideshows :) (this presentation)
version controlled science!

Pandas

"R for Python"
Provides easy to use data structures & a ton of useful helper functions for data cleanup and transformations
Fast! (backed by numpy arrays)
integrates well with scikit-learn

An installation warning

Don't: Use the Ubuntu packages

sudo apt-get install ipython-notebook sudo apt-get install python-pandas

Do: Use pip or Anaconda

pip install ipython tornado pyzmq pip install pandas

Anaconda is amazing.

How to run IPython Notebook

$ ipython notebook --pylab inline

</code>

The open data

Taken from http://donnees.ville.montreal.qc.ca/fiche/velos-comptage/ (click "Vélos - comptage")

Number of people per day on 7 bike paths (collected using sensors)

Part 1: Import the 2012 bike path data from a CSV

Before

Download and unzip the zip file from this page to run this yourself.



In [1]:

    
import pandas as pd



In [2]:

    
# some display options to make figures bigger
# hide this
pd.set_option('display.max_columns', 15)
pd.set_option('display.line_width', 400)
pd.set_option('display.mpl_style', 'default')
rcParams['figure.figsize'] = (14, 7)
import matplotlib
font = {'family' : 'normal',
        'weight' : 'bold',
        'size'   : 22}

matplotlib.rc('font', **font)



In [3]:

    
bike_data = pd.read_csv("./2012.csv")
bike_data[:5]









    Out[3]:






  
    
      
      Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
    
  
  
    
      0
          01/01/2012;35;;0;38;51;26;10;16;
    
    
      1
          02/01/2012;83;;1;68;153;53;6;43;
    
    
      2
        03/01/2012;135;;2;104;248;89;3;58;
    
    
      3
       04/01/2012;144;;1;116;318;111;8;61;
    
    
      4
       05/01/2012;197;;2;124;330;97;13;95;

After



In [4]:

    
bike_data = pd.read_csv("./2012.csv", encoding='latin1', sep=';', index_col='Date', parse_dates=True, dayfirst=True)



In [5]:

    
# Get rid of missing columns
bike_data = bike_data.dropna(axis=1)
# Only use 3 of the columns so it all fits on the screen



In [6]:

    
bike_data = bike_data[['Berri 1', u'Côte-Sainte-Catherine', 'Maisonneuve 1']]
bike_data[:5]









    Out[6]:






  
    
      
      Berri 1
      Côte-Sainte-Catherine
      Maisonneuve 1
    
    
      Date
      
      
      
    
  
  
    
      2012-01-01
        35
       0
        38
    
    
      2012-01-02
        83
       1
        68
    
    
      2012-01-03
       135
       2
       104
    
    
      2012-01-04
       144
       1
       116
    
    
      2012-01-05
       197
       2
       124

Exercise: Parse the CSVs from 2011 and earlier (warning: it's annoying)

Part 2: take a look at the data

We have a dataframe:



In [7]:

    
bike_data[:3]









    Out[7]:






  
    
      
      Berri 1
      Côte-Sainte-Catherine
      Maisonneuve 1
    
    
      Date
      
      
      
    
  
  
    
      2012-01-01
        35
       0
        38
    
    
      2012-01-02
        83
       1
        68
    
    
      2012-01-03
       135
       2
       104



In [8]:

    
bike_data.plot()









    Out[8]:





<matplotlib.axes.AxesSubplot at 0x3e59a90>






    



/opt/anaconda/envs/ipython-1.0.0a1/lib/python2.7/site-packages/matplotlib/font_manager.py:1224: UserWarning: findfont: Font family ['normal'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))



In [9]:

    
bike_data.median()









    Out[9]:





Berri 1                  3128.0
Côte-Sainte-Catherine    1269.0
Maisonneuve 1            2019.5
dtype: float64



In [10]:

    
bike_data.median().plot(kind='bar')









    Out[10]:





<matplotlib.axes.AxesSubplot at 0x3fd98d0>

Slicing dataframes



In [11]:

    
# column slice
column_slice = bike_data[['Berri 1', 'Maisonneuve 1']]
# row slice
column_slice[:3]









    Out[11]:






  
    
      
      Berri 1
      Maisonneuve 1
    
    
      Date
      
      
    
  
  
    
      2012-01-01
        35
        38
    
    
      2012-01-02
        83
        68
    
    
      2012-01-03
       135
       104



In [12]:

    
column_slice.plot()









    Out[12]:





<matplotlib.axes.AxesSubplot at 0x43bcbd0>

Part 2: Do more people bike on weekdays or weekends?

Step 1: add a 'weekday' column to our dataframe



In [13]:

    
bike_data['weekday'] = bike_data.index.weekday
bike_data.head()









    Out[13]:






  
    
      
      Berri 1
      Côte-Sainte-Catherine
      Maisonneuve 1
      weekday
    
    
      Date
      
      
      
      
    
  
  
    
      2012-01-01
        35
       0
        38
       6
    
    
      2012-01-02
        83
       1
        68
       0
    
    
      2012-01-03
       135
       2
       104
       1
    
    
      2012-01-04
       144
       1
       116
       2
    
    
      2012-01-05
       197
       2
       124
       3

Step 2: Use .groupby() and .aggregate() to get the counts



In [14]:

    
counts_by_day = bike_data.groupby('weekday').aggregate(numpy.sum)
counts_by_day









    Out[14]:






  
    
      
      Berri 1
      Côte-Sainte-Catherine
      Maisonneuve 1
    
    
      weekday
      
      
      
    
  
  
    
      0
       134298
       60329
        90051
    
    
      1
       135305
       58708
        92035
    
    
      2
       152972
       67344
       104891
    
    
      3
       160131
       69028
       111895
    
    
      4
       141771
       56446
        98568
    
    
      5
       101578
       34018
        62067
    
    
      6
        99310
       36466
        55324

Step 3: draw a graph!



In [15]:

    
counts_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts_by_day.plot()









    Out[15]:





<matplotlib.axes.AxesSubplot at 0x4a6f950>

There's more going on, though



In [16]:

    
bike_data['Berri 1'].plot()









    Out[16]:





<matplotlib.axes.AxesSubplot at 0x47cc710>

Part 3: Grab some weather data and look at the temperatures



In [17]:

    
def get_weather_data(year):
    url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
    # mctavish station: 10761, airport station: 5415
    data_by_month = []
    for month in range(1, 13):
        url = url_template.format(year=year, month=month)
        weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True).dropna(axis=1)
        weather_data.columns = map(lambda x: x.replace('\xb0', ''), weather_data.columns)
        weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
        data_by_month.append(weather_data.dropna())
    # Concatenate and drop any empty columns
    return pd.concat(data_by_month).dropna(axis=1, how='all').dropna()



In [18]:

    
weather_data = get_weather_data(2012)



In [19]:

    
weather_data[:5]









    Out[19]:






  
    
      
      Dew Point Temp (C)
      Rel Hum (%)
      Stn Press (kPa)
      Temp (C)
      Visibility (km)
      Weather
      Wind Spd (km/h)
    
    
      Date/Time
      
      
      
      
      
      
      
    
  
  
    
      2012-01-01 00:00:00
      -3.9
       86
       101.24
      -1.8
       8.0
                        Fog
       4
    
    
      2012-01-01 01:00:00
      -3.7
       87
       101.24
      -1.8
       8.0
                        Fog
       4
    
    
      2012-01-01 02:00:00
      -3.4
       89
       101.26
      -1.8
       4.0
       Freezing Drizzle,Fog
       7
    
    
      2012-01-01 03:00:00
      -3.2
       88
       101.27
      -1.5
       4.0
       Freezing Drizzle,Fog
       6
    
    
      2012-01-01 04:00:00
      -3.3
       88
       101.23
      -1.5
       4.8
                        Fog
       7

We need the temperatures every day, not every hour...



In [20]:

    
bike_data['mean temp'] = weather_data['Temp (C)'].resample('D', how='mean')



In [21]:

    
bike_data.head()









    Out[21]:






  
    
      
      Berri 1
      Côte-Sainte-Catherine
      Maisonneuve 1
      weekday
      mean temp
    
    
      Date
      
      
      
      
      
    
  
  
    
      2012-01-01
        35
       0
        38
       6
        0.629167
    
    
      2012-01-02
        83
       1
        68
       0
        0.041667
    
    
      2012-01-03
       135
       2
       104
       1
      -14.416667
    
    
      2012-01-04
       144
       1
       116
       2
      -13.645833
    
    
      2012-01-05
       197
       2
       124
       3
       -6.750000

Bikers per day and temperature



In [22]:

    
bike_data[['Berri 1', 'mean temp']].plot(subplots=True)









    Out[22]:





array([<matplotlib.axes.AxesSubplot object at 0x52efed0>,
       <matplotlib.axes.AxesSubplot object at 0x5525a90>], dtype=object)

Do people bike when it's raining?



In [23]:

    
bike_data['Rain'] = weather_data['Weather'].str.contains('Rain').map(lambda x: int(x)).resample('D', how='mean')



In [24]:

    
bike_data[['Berri 1', 'Rain']].plot(subplots=True)









    Out[24]:





array([<matplotlib.axes.AxesSubplot object at 0x5900b10>,
       <matplotlib.axes.AxesSubplot object at 0x6289ed0>], dtype=object)

Let's look at unpopular days in the summer



In [25]:

    
# Look at everything between May and September
summertime_data = bike_data['2012-05-01':'2012-09-01']



In [26]:

    
summertime_data['Berri 1'][:5] < 2500









    Out[26]:





Date
2012-05-01     True
2012-05-02    False
2012-05-03    False
2012-05-04    False
2012-05-05    False
Name: Berri 1, dtype: bool



In [27]:

    
summertime_data = bike_data['2012-05-01':'2012-09-01']
bad_days = summertime_data[summertime_data['Berri 1'] < 2500]
bad_days[['Berri 1', 'Rain', 'mean temp', 'weekday']]

Some advice

Read (some of) the documentation
http://pandas.pydata.org/ has a 460-page PDF with lots of examples
Python for Data Analysis by Wes McKinney is great
Always use vectorized operations, try not to write your own loops (though: see Numba)



In [28]:

    
julia = {'email': 'julia@jvns.ca', 'twitter': 'http://twitter.com/b0rk'}

Thanks! Questions?



In [29]:

    
print 'Email:', julia['email']
print 'Twitter:', julia['twitter']
print 'Slides: http://bit.ly/pyconca-pandas'









    



Email: julia@jvns.ca
Twitter: http://twitter.com/b0rk
Slides: http://bit.ly/pyconca-pandas

	Berri 1	Rain	mean temp	weekday
Date
2012-05-01	1986	0.416667	9.437500	1
2012-05-08	1241	0.666667	12.645833	1
2012-05-22	2315	0.583333	18.279167	1
2012-06-02	943	0.583333	13.566667	5
2012-06-25	2245	0.208333	17.270833	0
2012-08-05	1864	0.166667	25.783333	6
2012-08-10	2414	0.458333	19.841667	4
2012-08-11	2453	0.125000	20.891667	5

	Date;Berri 1;Br�beuf (donn�es non disponibles);C�te-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (donn�es non disponibles)
0	01/01/2012;35;;0;38;51;26;10;16;
1	02/01/2012;83;;1;68;153;53;6;43;
2	03/01/2012;135;;2;104;248;89;3;58;
3	04/01/2012;144;;1;116;318;111;8;61;
4	05/01/2012;197;;2;124;330;97;13;95;

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
Date
2012-01-01	35	0	38
2012-01-02	83	1	68
2012-01-03	135	2	104
2012-01-04	144	1	116
2012-01-05	197	2	124

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1
weekday
0	134298	60329	90051
1	135305	58708	92035
2	152972	67344	104891
3	160131	69028	111895
4	141771	56446	98568
5	101578	34018	62067
6	99310	36466	55324

	Dew Point Temp (C)	Rel Hum (%)	Stn Press (kPa)	Temp (C)	Visibility (km)	Weather	Wind Spd (km/h)
Date/Time
2012-01-01 00:00:00	-3.9	86	101.24	-1.8	8.0	Fog	4
2012-01-01 01:00:00	-3.7	87	101.24	-1.8	8.0	Fog	4
2012-01-01 02:00:00	-3.4	89	101.26	-1.8	4.0	Freezing Drizzle,Fog	7
2012-01-01 03:00:00	-3.2	88	101.27	-1.5	4.0	Freezing Drizzle,Fog	6
2012-01-01 04:00:00	-3.3	88	101.23	-1.5	4.8	Fog	7

	Berri 1	Côte-Sainte-Catherine	Maisonneuve 1	weekday	mean temp
Date
2012-01-01	35	0	38	6	0.629167
2012-01-02	83	1	68	0	0.041667
2012-01-03	135	2	104	1	-14.416667
2012-01-04	144	1	116	2	-13.645833
2012-01-05	197	2	124	3	-6.750000