Hi! I'm Julia.
Right now: Hacker School.
Before: Data scientist.
I'm on the internet at http://jvns.ca, http://twitter.com/b0rk
Follow along by downloading this presentation and running the code yourself:
Setup:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
In [1]:
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
In [78]:
# Download and read the data
!wget -O 311-data.tar.gz "http://bit.ly/311-data-tar-gz"  # -O sets the filename; otherwise wget names the file after the URL
!tar -xzf 311-data.tar.gz
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [81]:
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")
Out[81]:
In [3]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[3]:
In [4]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[4]:
In [5]:
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
normalized_complaints = top_5_complaints / top_5_complaints.sum()
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')
Out[5]:
In [6]:
import numpy as np
In [7]:
np.array([1,2,8.0, 3])
Out[7]:
In [8]:
np.arange(10)
Out[8]:
In [9]:
# Generate random numbers
np.random.random(10)
Out[9]:
In [10]:
prices = np.array([31, 40, 12, 40])
prices
Out[10]:
In [11]:
# Change the type
prices.astype(np.float32)
Out[11]:
In [12]:
prices.astype(np.int64)
Out[12]:
In [13]:
# Find which ones are even
prices % 2 == 0
Out[13]:
In [14]:
# Get only the even prices
prices[prices % 2 == 0]
Out[14]:
In [15]:
# Find the mean
np.mean(prices)
Out[15]:
In [16]:
prices * prices
Out[16]:
In [17]:
v1 = np.array([1, 2, 3, 4, 5])
v2 = np.array([1, 2, 3, 8, 9])
In [18]:
result = np.zeros_like(v1)
for i in xrange(len(v1)):
result[i] = 2 * v1[i] + 3 * v2[i]
print result
In [19]:
result = 2 * v1 + 3 * v2
print result
In [20]:
# Your code here
In [21]:
# Your code here
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
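To make the rows-and-columns idea concrete, here's a minimal sketch that builds a dataframe directly from made-up data (not the tiny.csv file we load next):

```python
import pandas as pd

# A dataframe is like a small database table: named columns, one value per row
# (made-up people data, just to illustrate)
people = pd.DataFrame({
    'name': ['Anna', 'Bob', 'Carol'],
    'age': [34, 28, 45],
})

print(people.shape)           # (3, 2): 3 rows, 2 columns
print(list(people.columns))   # ['name', 'age']
```

Reading a CSV with pd.read_csv, as we do below, produces exactly this kind of object.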
In [22]:
people = pd.read_csv('tiny.csv')
people
Out[22]:
This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 6 rows.
In [23]:
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [24]:
# How to get a column
small_requests['Complaint Type']
Out[24]:
In [25]:
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
Out[25]:
In [26]:
# How to get 3 rows
small_requests[:3]
Out[26]:
In [27]:
small_requests['Agency Name'][:3]
Out[27]:
In [28]:
small_requests[:3]['Agency Name']
Out[28]:
In [29]:
small_requests['Complaint Type']
Out[29]:
In [30]:
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[30]:
That's numpy in action! Using == on a column of a dataframe gives us a Series of True and False values.
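The same comparison-then-filter pattern works on any Series; a tiny sketch with made-up complaint types:

```python
import pandas as pd

s = pd.Series(['Noise', 'Rodent', 'Noise', 'Heating'])

# == gives a boolean Series, one True/False per row
mask = (s == 'Noise')
print(mask.tolist())      # [True, False, True, False]

# Indexing with that mask keeps only the True rows
print(s[mask].tolist())   # ['Noise', 'Noise']
```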
In [31]:
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[31]:
Any DataFrame has an index, which is an integer or date or something else associated with each row.
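A quick sketch of the index idea, with made-up data (using .loc, the modern spelling of the .ix lookup used in the next cell):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'Boston'], 'pop': [8.4, 0.7]})
print(df.index.tolist())        # default index: the integers [0, 1]

# set_index relabels the rows, so you can look them up by name
by_city = df.set_index('city')
print(by_city.loc['NYC']['pop'])  # 8.4
```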
In [32]:
# How to get a specific row
small_requests.ix[0]
Out[32]:
In [33]:
# How not to get a row -- this raises a KeyError, because [] looks up columns by name
small_requests[0]
Exercise: Find out which values the Descriptor column can have when the Complaint Type is "Noise - Street/Sidewalk".
In [34]:
# Your code here
In [35]:
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [36]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[36]:
In [37]:
noise_complaints[:3]
Out[37]:
In [38]:
noise_complaints = noise_complaints.set_index('Created Date')
In [39]:
noise_complaints[:3]
Out[39]:
Pandas is awesome for datetime index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
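Here's a small taste of what a DatetimeIndex buys you, on a synthetic hourly series (made-up data, not the 311 dataset):

```python
import pandas as pd
import numpy as np

# One value per hour over three days
times = pd.date_range('2013-10-01', periods=72, freq='h')
ts = pd.Series(np.arange(72), index=times)

# Partial-string indexing: select a whole day by its date
one_day = ts.loc['2013-10-02']
print(len(one_day))   # 24 hourly rows

# Group by hour of day -- the index knows its own components
by_hour = ts.groupby(ts.index.hour).size()
print(by_hour.iloc[0])  # each hour appears once per day, so 3
```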
In [40]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[40]:
In [41]:
noise_complaints.resample('H', how=len)[:3]
Out[41]:
In [42]:
noise_complaints.resample('H', how=len).plot()
Out[42]:
In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
In [44]:
orig_data['Complaint Type'].value_counts()
Out[44]:
In [45]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[45]:
In [46]:
# Your code here.
In [50]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
In [63]:
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
Out[63]:
In [64]:
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
Out[64]:
In [65]:
# Change the index to actual day names (in pandas, weekday 0 is Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
In [66]:
counts_by_weekday.plot(kind='bar')
Out[66]:
In [67]:
# Your code here
In [77]:
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
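Why the fillna matters: .str.contains returns a missing value for missing entries, and you can't filter with a mask that contains NaN. A tiny sketch with made-up street names:

```python
import pandas as pd

streets = pd.Series(['MANHATTAN AVE', None, 'BROADWAY'])

# Without cleaning, the missing value propagates into the mask
mask = streets.str.contains('MANHATTAN')
print(mask.tolist())   # [True, nan, False] -- not usable as a filter

# After fillna(''), every entry is a string and the mask is clean booleans
clean = streets.fillna('')
print(clean[clean.str.contains('MANHATTAN')].tolist())  # ['MANHATTAN AVE']
```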
In [75]:
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
Out[75]:
In [76]:
manhattan_streets.value_counts()
Out[76]:
In [91]:
# Our current latitude and longitude
our_lat, our_long = 40.714151,-74.00878
In [94]:
# Squared distance in degrees -- not a real distance, but fine for ranking nearby points
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2
In [96]:
pd.Series(distance_from_us).hist()
Out[96]:
In [103]:
close_complaints = orig_data[distance_from_us < 0.00005]
In [106]:
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[106]: