Hi! I'm Julia.
Right now: Hacker School.
Before: Data scientist.
I'm on the internet at http://jvns.ca, http://twitter.com/b0rk
Follow along by downloading this presentation and running the code yourself:
Setup:
sudo apt-get install ipython-notebook
pip install ipython tornado pyzmq
or install Anaconda from http://store.continuum.io (what I do)
You can start IPython notebook by running
ipython notebook --pylab inline
In [1]:
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
In [78]:
# Download and read the data
!wget -O 311-data.tar.gz "http://bit.ly/311-data-tar-gz"  # -O sets the filename; otherwise wget names the file after the URL
!tar -xzf 311-data.tar.gz
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [81]:
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")
Out[81]:
In [3]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[3]:
In [4]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[4]:
In [5]:
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
normalized_complaints = top_5_complaints / top_5_complaints.sum()
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')
Out[5]:
In [6]:
import numpy as np
In [7]:
np.array([1,2,8.0, 3])
Out[7]:
In [8]:
np.arange(10)
Out[8]:
In [9]:
# Generate random numbers
np.random.random(10)
Out[9]:
In [10]:
prices = np.array([31, 40, 12, 40])
prices
Out[10]:
In [11]:
# Change the type
prices.astype(np.float32)
Out[11]:
In [12]:
prices.astype(np.int64)
Out[12]:
In [13]:
# Find which ones are even
prices % 2 == 0
Out[13]:
In [14]:
# Get only the even prices
prices[prices % 2 == 0]
Out[14]:
In [15]:
# Find the mean
np.mean(prices)
Out[15]:
In [16]:
prices * prices
Out[16]:
In [17]:
v1 = np.array([1, 2, 3, 4, 5])
v2 = np.array([1, 2, 3, 8, 9])
In [18]:
result = np.zeros_like(v1)
for i in xrange(len(v1)):
result[i] = 2 * v1[i] + 3 * v2[i]
print result
In [19]:
result = 2 * v1 + 3 * v2
print result
In [20]:
# Your code here
In [21]:
# Your code here
This is what lets you manipulate data easily -- the dataframe is basically the whole reason for pandas. It's a powerful concept from the statistical computing language R.
If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.
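To make the rows-and-columns idea concrete, here's a minimal sketch that builds a dataframe directly from made-up data (not the tiny.csv file we load next):

```python
import pandas as pd

# A dataframe is like a small database table: named columns, one value per row
# (made-up people data, just to illustrate)
people = pd.DataFrame({
    'name': ['Anna', 'Bob', 'Carol'],
    'age': [34, 28, 45],
})

print(people.shape)           # (3, 2): 3 rows, 2 columns
print(list(people.columns))   # ['name', 'age']
```

Reading a CSV with pd.read_csv, as we do below, produces exactly this kind of object.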
In [22]:
people = pd.read_csv('tiny.csv')
people
Out[22]:
This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 6 rows.
In [23]:
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [24]:
# How to get a column
small_requests['Complaint Type']
Out[24]:
In [25]:
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
Out[25]:
In [26]:
# How to get 3 rows
small_requests[:3]
Out[26]:
In [27]:
small_requests['Agency Name'][:3]
Out[27]:
In [28]:
small_requests[:3]['Agency Name']
Out[28]:
In [29]:
small_requests['Complaint Type']
Out[29]:
In [30]:
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[30]:
That's numpy in action! Using == on a column of a dataframe gives us a Series of True and False values.
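The same comparison-then-filter pattern works on any Series; a tiny sketch with made-up complaint types:

```python
import pandas as pd

s = pd.Series(['Noise', 'Rodent', 'Noise', 'Heating'])

# == gives a boolean Series, one True/False per row
mask = (s == 'Noise')
print(mask.tolist())      # [True, False, True, False]

# Indexing with that mask keeps only the True rows
print(s[mask].tolist())   # ['Noise', 'Noise']
```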
In [31]:
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[31]:
Any DataFrame has an index, which is an integer or date or something else associated with each row.
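A quick sketch of the index idea, with made-up data (using .loc, the modern spelling of the .ix lookup used in the next cell):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NYC', 'Boston'], 'pop': [8.4, 0.7]})
print(df.index.tolist())        # default index: the integers [0, 1]

# set_index relabels the rows, so you can look them up by name
by_city = df.set_index('city')
print(by_city.loc['NYC']['pop'])  # 8.4
```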
In [32]:
# How to get a specific row
small_requests.ix[0]
Out[32]:
In [33]:
# How not to get a row -- this raises a KeyError, because [] looks up columns by name
small_requests[0]
Exercise: Find out which values the Descriptor column can have when the Complaint Type is "Noise - Street/Sidewalk".
In [34]:
# Your code here
In [35]:
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [36]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[36]:
In [37]:
noise_complaints[:3]
Out[37]:
In [38]:
noise_complaints = noise_complaints.set_index('Created Date')
In [39]:
noise_complaints[:3]
Out[39]:
Pandas is awesome for datetime index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.
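Here's a small taste of what a DatetimeIndex buys you, on a synthetic hourly series (made-up data, not the 311 dataset):

```python
import pandas as pd
import numpy as np

# One value per hour over three days
times = pd.date_range('2013-10-01', periods=72, freq='h')
ts = pd.Series(np.arange(72), index=times)

# Partial-string indexing: select a whole day by its date
one_day = ts.loc['2013-10-02']
print(len(one_day))   # 24 hourly rows

# Group by hour of day -- the index knows its own components
by_hour = ts.groupby(ts.index.hour).size()
print(by_hour.iloc[0])  # each hour appears once per day, so 3
```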
In [40]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[40]:
In [41]:
noise_complaints.resample('H', how=len)[:3]
Out[41]:
In [42]:
noise_complaints.resample('H', how=len).plot()
Out[42]:
In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
In [44]:
orig_data['Complaint Type'].value_counts()
Out[44]:
In [45]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[45]:
In [46]:
# Your code here.
In [50]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
In [63]:
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
Out[63]:
In [64]:
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
Out[64]:
In [65]:
# Change the index to actual day names (in pandas, weekday 0 is Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
In [66]:
counts_by_weekday.plot(kind='bar')
Out[66]:
In [67]:
# Your code here
In [77]:
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
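Why the fillna matters: .str.contains returns a missing value for missing entries, and you can't filter with a mask that contains NaN. A tiny sketch with made-up street names:

```python
import pandas as pd

streets = pd.Series(['MANHATTAN AVE', None, 'BROADWAY'])

# Without cleaning, the missing value propagates into the mask
mask = streets.str.contains('MANHATTAN')
print(mask.tolist())   # [True, nan, False] -- not usable as a filter

# After fillna(''), every entry is a string and the mask is clean booleans
clean = streets.fillna('')
print(clean[clean.str.contains('MANHATTAN')].tolist())  # ['MANHATTAN AVE']
```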
In [75]:
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
Out[75]:
In [76]:
manhattan_streets.value_counts()
Out[76]:
In [91]:
# Our current latitude and longitude
our_lat, our_long = 40.714151,-74.00878
In [94]:
# Squared distance in degrees -- not a real distance, but fine for ranking nearby points
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2
In [96]:
pd.Series(distance_from_us).hist()
Out[96]:
In [103]:
close_complaints = orig_data[distance_from_us < 0.00005]
In [106]:
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[106]: