Homework 11: PARKING TICKETS
Data is here: https://www.dropbox.com/s/4cbpwdcd802av1z/11-homework-data.zip?dl=0 [ www.dropbox.com/s/4cbpwdcd802av1z/11-homework-data.zip?dl=0 ]
Read this: http://iquantny.tumblr.com/post/144197004989/the-nypd-was-systematically-ticketing-legally
And the then we can take a peek at some data on parking. After you've cleaned up the data as instructed below, feel free to browse in other directions than the ones I say. When fixing up the data, you'll probably making a lot of use of date parsing and .apply.
In [562]:
import pandas as pd
import numpy as np
from dateutil import parser
from time import strptime
from time import mktime
from datetime import datetime
import math
import matplotlib.pyplot
%matplotlib inline
In [342]:
def timefixer(time):
return_time=time
return_time.replace('+', '0')
if time == 'nan':
return np.nan
if time[0:2]=='00':
hour='12'
rest=time[2:]
return_time=hour+rest
#strptime(row, '%I%M%p')
try:
return strptime(return_time, '%I%M%p')
except:
return np.nan
In [346]:
violations_df=pd.read_csv('violations.csv', nrows=100000)
In [347]:
violations_df.columns
Out[347]:
I want to make sure my Plate ID is a string. Can't lose the leading zeroes!
In [348]:
type(violations_df['Plate ID'][0])
Out[348]:
I don't think anyone's car was built in 0AD. Discard the '0's as NaN.
In [349]:
violations_df['Vehicle Year']=violations_df['Vehicle Year'].apply(lambda row: row if row!=0 else np.nan)
In [350]:
violations_df['Vehicle Year'].head()
Out[350]:
I want the dates to be dates! Read the read_csv documentation to find out how to make pandas automatically parse dates.
In [351]:
violations_df['Issue Date']=pd.to_datetime(violations_df['Issue Date'])
In [352]:
violations_df['Issue Date'][0]
Out[352]:
"Date first observed" is a pretty weird column, but it seems like it has a date hiding inside. Using a function with .apply, transform the string (e.g. "20140324") into a Python date. Make the 0's show up as NaN.
In [353]:
violations_df['Date First Observed']=violations_df['Date First Observed'].apply(lambda row: parser.parse(str(row)) if row!=0 else np.nan)
"Violation time" is... not a time. Make it a time.
In [355]:
strptime(thing, '%I%M%p')
Out[355]:
In [356]:
violations_df['Violation Time'].head(1)
Out[356]:
In [357]:
violations_df['Violation Time']=violations_df['Violation Time'].apply(lambda row: str(row).replace('A', 'am')).apply(lambda row: str(row).replace('P', 'pm'))
In [358]:
violations_df['Violation Time'].head(1)
Out[358]:
In [359]:
np.nan is np.nan
Out[359]:
In [360]:
violations_df['Violation Time']=violations_df['Violation Time'].apply(lambda row: timefixer(row))
There sure are a lot of colors of cars, too bad so many of them are the same. Make "BLK" and "BLACK", "WT" and "WHITE", and any other combinations that you notice.
In [438]:
def colorfixer(color):
if color in ['BLK', 'BALCK', 'BLACL', 'BK', 'BLCK', 'BLAC']:
return 'BLACK'
elif color in ['WH', 'WITE', 'WHT', 'WIHTE', 'W', 'WT', 'WHI', 'WYH', 'WHTE']:
return 'WHITE'
elif color in ['GY', 'GR', 'GRY', 'GREY']:
return 'GRAY'
elif color in ['GN', 'GRN', 'GREEB']:
return 'GREEN'
elif color in ['BR', 'SROWN', 'BN', 'BRW', 'BROW', 'BRN', 'BRWN', 'BRO', 'BWN']:
return 'BROWN'
elif color=='RD':
return 'RED'
elif color in ['BLU', 'BL', 'BLE', 'B LUE']:
return 'BLUE'
elif color=='TN':
return 'TAN'
elif color in ['YELL', 'YEL', 'YELLO', 'YL', 'YW', 'YELLW', 'YLW']:
return 'YELLOW'
elif color in ['ONG', 'ORAN', 'OR', 'ORANG', 'ORG']:
return 'ORANGE'
elif color in ['SIL', 'SL', 'SILV', 'SILVE', 'SILVR', 'SIVLE', 'SLVR', 'SLIVE', 'SLV', 'SIV', 'SLR']:
return 'SILVER'
elif color in ['GL', 'GLD', 'GD']:
return 'GOLD'
elif color in ['PURPL', 'PUR', 'PURP']:
return 'PURPLE'
elif color in ['BURGA', 'BURGU', 'BIRG', 'BURG', 'BUR']:
return 'BURGUNDY'
elif color in ['MAR', 'MARON', 'MAROO', 'MR']:
return 'MAROON'
elif color in ['BG', 'BIEGE']:
return 'BEIGE'
else:
return color
In [439]:
violations_df['Vehicle Color']=violations_df['Vehicle Color'].apply(lambda row: colorfixer(row))
In [440]:
violations_df['Vehicle Color'].value_counts()
Out[440]:
Join the data with the Parking Violations Code dataset from the NYC Open Data site.
In [441]:
codes_df=pd.read_csv('DOF_Parking_Violation_Codes.csv')
In [442]:
codes_df.columns
Out[442]:
In [443]:
violations_df.columns
Out[443]:
In [468]:
pd.merge?
In [446]:
codes_df['Violation Code']=codes_df['CODE']
In [480]:
violations_df['CODE']=violations_df['Violation Code'].apply(lambda row: str(row))
In [483]:
df=pd.merge(left=violations_df, right=codes_df, on='CODE')#, how='outer')
In [487]:
df.columns
Out[487]:
How much money did NYC make off of parking violations?
In [499]:
def dollarfixer(amount):
import re
dollar=re.search(r'[0-9]+', amount)
if dollar:
return int(dollar.group())
else:
return 0
In [500]:
dollarfixer('$115')
Out[500]:
In [501]:
df['ticket cost']=df['Manhattan\xa0 96th St. & below'].apply(lambda row: dollarfixer(row))
In [502]:
df['ticket cost'].sum()
Out[502]:
What's the most lucrative kind of parking violation? The most frequent?
In [513]:
df.groupby(by='DEFINITION')['ticket cost'].sum().sort_values(ascending=False).head(1)
Out[513]:
In [514]:
df.groupby(by='DEFINITION')['DEFINITION'].count().sort_values(ascending=False).head(1)
Out[514]:
New Jersey has bad drivers, but does it have bad parkers, too? How much money does NYC make off of all non-New York vehicles? Make a chart of the top few.
In [518]:
df[df['Registration State']!='NY']['ticket cost'].sum()
Out[518]:
In [530]:
out_of_state=df[df['Registration State']!='NY'].groupby(by='Registration State')['ticket cost'].sum().sort_values(ascending=False).head(10)
In [532]:
out_of_state.plot(kind='barh').invert_yaxis()
What time of day do people usually get their tickets? You can break the day up into several blocks - for example 12am-6am, 6am-12pm, 12pm-6pm, 6pm-12am. What's the average ticket cost in NYC? Make a graph of the number of tickets per day. Make a graph of the amount of revenue collected per day. Manually construct a dataframe out of https://dmv.ny.gov/statistic/2015licinforce-web.pdf (only NYC boroughts - bronx, queens, manhattan, staten island, brooklyn), having columns for borough name, abbreviation, and number of licensed drivers. What's the parking-ticket-$-per-licensed-driver in each borough of NYC? Do this with pandas and the dataframe you just made, not with your head!
In [ ]:
dt =
In [563]:
df['Violation Time'].apply(lambda x: datetime.fromtimestamp(mktime(x)))#groupby(by=tm_hour)
In [ ]: