Hi! I'm Julia.
Right now: Hacker School.
Before: Data scientist.
I'm on the internet at http://jvns.ca, http://twitter.com/b0rk
You can follow along by downloading this presentation and running the code yourself:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
In [1]:
# Some display stuff. Don't mind this for now.
import pandas as pd
pd.set_option('display.mpl_style', 'default') # Make graphs pretty
figsize(15, 6) # Make graphs a good size for my screen
# Display all the columns instead of a summary
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
In [2]:
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
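To make the "table of numbers" idea concrete, here's a minimal sketch of building a dataframe by hand (the names and numbers are invented for illustration, not from the 311 data):

```python
import pandas as pd

# Columns are named, rows are indexed -- just like a database table
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [30, 25, 35],
})
print(df.shape)          # (3, 2) -- 3 rows, 2 columns
print(df['age'].mean())  # 30.0 -- each column is a Series you can compute on
```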
In [4]:
people = pd.read_csv('tiny.csv')
people
Out[4]:
This is like a SQL table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 5 rows.
In [5]:
# Load the first 5 rows of our CSV
requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [6]:
# How to get a column
requests['Complaint Type']
Out[6]:
In [7]:
# How to get a subset of the columns
requests[['Complaint Type', 'Created Date']]
Out[7]:
In [8]:
# How to get 3 rows
requests[:3]
Out[8]:
In [9]:
requests['Agency Name'][:3]
Out[9]:
In [10]:
requests[:3]['Agency Name']
Out[10]:
In [11]:
requests['Complaint Type']
Out[11]:
In [12]:
requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[12]:
That's NumPy in action! Using == on a column of a dataframe gives us a Series of True and False values, which we can then use to select rows.
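Here's that boolean-Series machinery on a toy dataframe (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'kind': ['Noise', 'Rodent', 'Noise'],
                   'count': [1, 2, 3]})
mask = df['kind'] == 'Noise'   # a Series of booleans, one per row
subset = df[mask]              # keeps only the rows where the mask is True
print(list(mask))              # [True, False, True]
print(len(subset))             # 2
```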
In [13]:
noise_complaints = requests[requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[13]:
In [14]:
# How to get a specific row
requests.ix[0]
Out[14]:
In [15]:
# How not to get a row -- this raises a KeyError, because plain [] indexing looks up columns, not rows
requests[0]
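(A side note if you're on a recent pandas: .ix has since been removed; the replacements are .iloc for positional access and .loc for label access. A small sketch on made-up data:)

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60]})
row = df.iloc[0]   # first row, by position
print(row['a'])    # 10
# df[0] would raise a KeyError: [] on a dataframe selects columns, not rows
```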
In [16]:
requests = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [17]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[17]:
In [18]:
noise_complaints[:3]
Out[18]:
In [19]:
noise_complaints = noise_complaints.set_index('Created Date')
In [20]:
noise_complaints[:3]
Out[20]:
Pandas is awesome for datetime index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
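Here's the datetime-index-plus-resample idea on a toy Series. (If you're on a newer pandas, the how= keyword is gone, so this sketch uses the .resample(...).count() spelling instead of resample('H', how=len); the timestamps are invented.)

```python
import pandas as pd

times = pd.to_datetime(['2013-10-09 11:05',
                        '2013-10-09 11:45',
                        '2013-10-09 12:10'])
s = pd.Series(1, index=times)
# Bucket the timestamps by hour and count how many fall in each bucket
per_hour = s.resample('h').count()
print(list(per_hour))  # [2, 1]
```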
In [21]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[21]:
In [22]:
noise_complaints.resample('H', how=len)[:3]
Out[22]:
In [23]:
noise_complaints.resample('H', how=len).plot()
Out[23]:
In [25]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index('Created Date')
noise_complaints = noise_complaints.sort_index()
noise_complaints = noise_complaints.resample('H', how=len)
#noise_complaints.plot()
In [26]:
complaints = requests[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[26]:
In [46]:
noise_complaints.set_index('Created Date').sort_index().resample('D', how=len).plot(kind='bar')
Out[46]:
In [48]:
noise_complaints = noise_complaints.set_index('Created Date').sort_index()
noise_complaints['weekday'] = noise_complaints.index.weekday
In [53]:
complaints_by_day = noise_complaints.groupby('weekday').aggregate(len)
complaints_by_day
Out[53]:
In [55]:
complaints_by_day.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
complaints_by_day.plot(kind='bar')
Out[55]:
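One thing worth checking here: pandas numbers weekdays with Monday as 0 and Sunday as 6, so the day labels assigned to complaints_by_day.index need to start with Monday. A quick way to verify (the dates are arbitrary):

```python
import pandas as pd

# DatetimeIndex.weekday runs Monday=0 through Sunday=6
idx = pd.to_datetime(['2013-10-07', '2013-10-13'])  # a Monday and a Sunday
print(list(idx.weekday))  # [0, 6]
```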
In [27]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[27]:
In [28]:
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
normalized_complaints = top_5_complaints / top_5_complaints.sum()
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')
Out[28]:
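The unstack() in that chain is what does the pivoting: it takes the inner level of the group keys and turns it into columns. A tiny illustration with invented data (using .size(), which counts rows the same way aggregate(len) does):

```python
import pandas as pd

df = pd.DataFrame({'zip':  ['10001', '10001', '10002'],
                   'type': ['Noise', 'Rodent', 'Noise']})
counts = df.groupby(['zip', 'type']).size().unstack()
# rows are zip codes, columns are complaint types, values are counts
print(counts.loc['10001', 'Noise'])
```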
In [56]:
rodent_complaints = orig_data[orig_data['Complaint Type'] == 'Rodent']['Borough'].value_counts()
total_complaints = orig_data['Borough'].value_counts()
(rodent_complaints / total_complaints ).plot(kind='bar')
Out[56]: