Hi! I'm Julia.
Right now: Hacker School.
Before: Data scientist.
I'm on the internet at http://jvns.ca, http://twitter.com/b0rk
You can follow along by downloading this presentation and running the code yourself:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
In [1]:
# Some display stuff. Don't mind this for now.
import pandas as pd
pd.set_option('display.mpl_style', 'default') # Make graphs pretty
figsize(15, 6) # Make graphs a good size for my screen
# Display all the columns instead of a summary
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
In [2]:
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
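To make the "table of numbers" idea concrete, here's a minimal sketch of building a dataframe by hand (the names and numbers are invented for illustration, not from the 311 data):

```python
import pandas as pd

# Columns are named, rows are indexed -- just like a database table
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30, 25, 35],
})
print(df.shape)          # (3, 2) -- 3 rows, 2 columns
print(df['age'].mean())  # 30.0 -- each column is a Series you can compute on
```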
In [4]:
people = pd.read_csv('tiny.csv')
people
Out[4]:
This is like a SQL table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 5 rows.
In [5]:
# Load the first 5 rows of our CSV
requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [6]:
# How to get a column
requests['Complaint Type']
Out[6]:
In [7]:
# How to get a subset of the columns
requests[['Complaint Type', 'Created Date']]
Out[7]:
In [8]:
# How to get 3 rows
requests[:3]
Out[8]:
In [9]:
requests['Agency Name'][:3]
Out[9]:
In [10]:
requests[:3]['Agency Name']
Out[10]:
In [11]:
requests['Complaint Type']
Out[11]:
In [12]:
requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[12]:
That's NumPy in action! Using == on a column of a dataframe gives us a Series of True and False values, which we can then use to select rows.
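Here's that boolean-Series machinery on a toy dataframe (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'kind': ['Noise', 'Rodent', 'Noise'],
                   'count': [1, 2, 3]})
mask = df['kind'] == 'Noise'   # a Series of booleans, one per row
subset = df[mask]              # keeps only the rows where the mask is True
print(list(mask))              # [True, False, True]
print(len(subset))             # 2
```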
In [13]:
noise_complaints = requests[requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[13]:
In [14]:
# How to get a specific row
requests.ix[0]
Out[14]:
In [15]:
# How not to get a row -- this raises a KeyError, because plain [] indexing looks up columns, not rows
requests[0]
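(A side note if you're on a recent pandas: .ix has since been removed; the replacements are .iloc for positional access and .loc for label access. A small sketch on made-up data:)

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]})
row = df.iloc[0]   # first row, by position
print(row['a'])    # 10
# df[0] would raise a KeyError: [] on a dataframe selects columns, not rows
```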
In [16]:
requests = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [17]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[17]:
In [18]:
noise_complaints[:3]
Out[18]:
In [19]:
noise_complaints = noise_complaints.set_index('Created Date')
In [20]:
noise_complaints[:3]
Out[20]:
Pandas is awesome for datetime index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
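Here's the datetime-index-plus-resample idea on a toy Series. (If you're on a newer pandas, the how= keyword is gone, so this sketch uses the .resample(...).count() spelling instead of resample('H', how=len); the timestamps are invented.)

```python
import pandas as pd

times = pd.to_datetime(['2013-10-09 11:05',
                        '2013-10-09 11:45',
                        '2013-10-09 12:10'])
s = pd.Series(1, index=times)
# Bucket the timestamps by hour and count how many fall in each bucket
per_hour = s.resample('h').count()
print(list(per_hour))  # [2, 1]
```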
In [21]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[21]:
In [22]:
noise_complaints.resample('H', how=len)[:3]
Out[22]:
In [23]:
noise_complaints.resample('H', how=len).plot()
Out[23]:
In [25]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index('Created Date')
noise_complaints = noise_complaints.sort_index()
noise_complaints = noise_complaints.resample('H', how=len)
#noise_complaints.plot()
In [26]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[26]:
In [46]:
noise_complaints.set_index('Created Date').sort_index().resample('D', how=len).plot(kind='bar')
Out[46]:
In [48]:
noise_complaints = noise_complaints.set_index('Created Date').sort_index()
noise_complaints['weekday'] = noise_complaints.index.weekday
In [53]:
complaints_by_day = noise_complaints.groupby('weekday').aggregate(len)
complaints_by_day
Out[53]:
In [55]:
complaints_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
complaints_by_day.plot(kind='bar')
Out[55]:
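One thing worth checking here: pandas numbers weekdays with Monday as 0 and Sunday as 6, so the day labels assigned to complaints_by_day.index need to start with Monday. A quick way to verify (the dates are arbitrary):

```python
import pandas as pd

# DatetimeIndex.weekday runs Monday=0 through Sunday=6
idx = pd.to_datetime(['2013-10-07', '2013-10-13'])  # a Monday and a Sunday
print(list(idx.weekday))  # [0, 6]
```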
In [27]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[27]:
In [28]:
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
normalized_complaints = top_5_complaints / top_5_complaints.sum()
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')
Out[28]:
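The unstack() in that chain is what does the pivoting: it takes the inner level of the group keys and turns it into columns. A tiny illustration with invented data (using .size(), which counts rows the same way aggregate(len) does):

```python
import pandas as pd

df = pd.DataFrame({'zip':  ['10001', '10001', '10002'],
                   'type': ['Noise', 'Rodent', 'Noise']})
counts = df.groupby(['zip', 'type']).size().unstack()
# rows are zip codes, columns are complaint types, values are counts
print(counts.loc['10001', 'Noise'])
```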
In [56]:
rodent_complaints = orig_data[orig_data['Complaint Type'] == 'Rodent']['Borough'].value_counts()
total_complaints = orig_data['Borough'].value_counts()
(rodent_complaints / total_complaints ).plot(kind='bar')
Out[56]: