Homework 11: PARKING TICKETS



In [1]:

    
import pandas as pd

1. I want to make sure my Plate ID is a string. Can't lose the leading zeroes!



In [2]:

    
#As the datatypes in all of the columns vary, I decided to to make all the values, except for the ones I specify
#into str. This also takes care of questions 1. -> dtype=str

2. I don't think anyone's car was built in 0AD. Discard the '0's as NaN.



In [3]:

    
# The syntax for this is really very nice and clear, an example na_values= {'Vehicle Year' : ['0']}
# A lot more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

3. I want the dates to be dates! Read the read_csv documentation to find out how to make pandas automatically parse dates.



In [4]:

    
#parse_dates=True,keep_date_col = True

import datetime
import datetime as dt

dt.datetime.strptime('08/04/2013', '%m/%d/%Y')
datetime.datetime(2013, 8, 4, 0, 0)

parser = lambda date: pd.datetime.strptime(date, '%m/%d/%Y')

#taken from here: http://stackoverflow.com/questions/23797491/parse-dates-in-pandas
#adding this, column counts starts at 0, without index, parse_dates=[4], date_parser=parser

#nrows=100000,



In [5]:

    
df = pd.read_csv("violations.csv", na_values= {'Vehicle Year' : ['0']}, parse_dates=[4], date_parser=parser, dtype=str)



In [6]:

    
df.head()









    Out[6]:






  
    
      
      Summons Number
      Plate ID
      Registration State
      Plate Type
      Issue Date
      Violation Code
      Vehicle Body Type
      Vehicle Make
      Issuing Agency
      Street Code1
      ...
      Vehicle Color
      Unregistered Vehicle?
      Vehicle Year
      Meter Number
      Feet From Curb
      Violation Post Code
      Violation Description
      No Standing or Stopping Violation
      Hydrant Violation
      Double Parking Violation
    
  
  
    
      0
      1283294138
      GBB9093
      NY
      PAS
      2013-08-04
      46
      SUBN
      AUDI
      P
      37250
      ...
      GY
      0
      2013
      -
      0
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      1
      1283294151
      62416MB
      NY
      COM
      2013-08-04
      46
      VAN
      FORD
      P
      37290
      ...
      WH
      0
      2012
      -
      0
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      2
      1283294163
      78755JZ
      NY
      COM
      2013-08-05
      46
      P-U
      CHEVR
      P
      37030
      ...
      NaN
      0
      NaN
      -
      0
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      3
      1283294175
      63009MA
      NY
      COM
      2013-08-05
      46
      VAN
      FORD
      P
      37270
      ...
      WH
      0
      2010
      -
      0
      NaN
      NaN
      NaN
      NaN
      NaN
    
    
      4
      1283294187
      91648MC
      NY
      COM
      2013-08-08
      41
      TRLR
      GMC
      P
      37240
      ...
      BR
      0
      2012
      -
      0
      NaN
      NaN
      NaN
      NaN
      NaN
    
  

5 rows × 43 columns



In [7]:

    
#df.info()

4. "Date first observed" is a pretty weird column, but it seems like it has a date hiding inside. Using a function with .apply, transform the string (e.g. "20140324") into a Python date. Make the 0's show up as NaN.



In [8]:

    
#df['Date First Observed'].value_counts()



In [9]:

    
import dateutil.parser



In [10]:

    
def date_transfer(x):
    x = str(x)
    if x == '0':
        return 'NaN'
    try:
        date = dateutil.parser.parse(x)
        return str(date.strftime('%d/%m/%Y'))
    except:
        return 'NaN'    
    #    dt.datetime.strptime('20130719', '%Y%m%d')
    #    datetime.datetime(2013, 7, 19, 0, 0)
    #    year_month_tag = dateutil.parser.parse(x)
    return x
print(date_transfer(20151198))

NaN



In [11]:

    
df['Date First Observed'].apply(date_transfer).head()









    Out[11]:





0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: Date First Observed, dtype: object

5. "Violation time" is... not a time. Make it a time.



In [12]:

    
#adding n_values = {'Violation Time' : ['0']}
#df = pd.read_csv("violations.csv", na_values= {'Vehicle Year' : ['0'], 'Violation Time' : ['0']}, parse_dates=[4], date_parser=parser, dtype=str)



In [13]:

    
df['Violation Time'].head(5)









    Out[13]:





0    0752A
1    1240P
2    1243P
3    0232P
4    1239P
Name: Violation Time, dtype: object



In [14]:

    
#df['Violation Time'].value_counts()



In [15]:

    
import dateutil

def pmam(x):
    x = str(x)
    #x = (':'.join(a+b for a,b in zip(x[::2], x[1::2])))
    if x == 'NaN':
        pass
    try:
        x = str(x[:2] + ':' + x[2:])
        date = dateutil.parser.parse(x)
        return str(date.strftime('%H:%M %p'))
    except:
        return 'NaN'

print(pmam('NaN'))
print(pmam("0752A"))
print(pmam("1240P"))
print(pmam("0232P"))









    



NaN
07:52 AM
12:40 PM
14:32 PM



In [16]:

    
df['Violation Time pmam'] = df['Violation Time'].apply(pmam)

6. There sure are a lot of colors of cars, too bad so many of them are the same. Make "BLK" and "BLACK", "WT" and "WHITE", and any other combinations that you notice.



In [17]:

    
df['Vehicle Color'].head(10)









    Out[17]:





0       GY
1       WH
2      NaN
3       WH
4       BR
5       RD
6       GN
7    WHITE
8    WHITE
9    BLACK
Name: Vehicle Color, dtype: object



In [18]:

    
def black_and_white(car_color):
    car_color = str(car_color)
    if car_color == 'NaN':
        return 'NaN'
    elif car_color == 'WHITE':
        return car_color
    elif car_color == 'BLACK':
        return car_color
    else:
        return car_color.replace("BK","BLACK").replace("WH", "WHITE")



In [19]:

    
def rainbow(car_color):
    car_color = str(car_color)
    if car_color == 'NaN':
        return car_color
    elif car_color == 'BROWN':
        return car_color
    elif car_color == 'BLUE':
        return car_color
    elif car_color == 'RED':
        return car_color
    elif car_color == 'GREY':
        return car_color
    elif car_color == 'GREEN':
        return car_color
    else:
        return car_color.replace("BR","BROWN").replace("RD", "RED").replace("GY", "GRAY").replace("TN", "TAN").replace("BLU", "BLUE").replace("GN", "GREEN")



In [20]:

    
df['Vehicle Color'].apply(black_and_white).apply(rainbow).head(5)









    Out[20]:





0     GRAY
1    WHITE
2      nan
3    WHITE
4    BROWN
Name: Vehicle Color, dtype: object

7.Join the data with the Parking Violations Code dataset from the NYC Open Data site.



In [21]:

    
DOF_Parking_Violation_Codes_df = pd.read_csv("DOF_Parking_Violation_Codes.csv")
DOF_Parking_Violation_Codes_df.head()









    Out[21]:






  
    
      
      CODE
      DEFINITION
      Manhattan  96th St. & below
      All Other Areas
    
  
  
    
      0
      10
      Stopping, standing or parking where a sign, st...
      $115
      $115
    
    
      1
      11
      Hotel Loading/Unloading: Standing or parking w...
      $115
      $115
    
    
      2
      12
      Snow Emergency: Standing or parking where stan...
      $95
      $95
    
    
      3
      13
      Taxi Stand: Standing or parking where standing...
      $115
      $115
    
    
      4
      14
      General No Standing: Standing or parking where...
      $115
      $115



In [22]:

    
df_merged = df.merge(DOF_Parking_Violation_Codes_df, left_on='Violation Code', right_on='CODE')

8. How much money did NYC make off of parking violations?



In [23]:

    
def money_to_int(money_str):
    if money_str == '200 (Heavy Tow plus violation fine)':
        money_str_Heavy_Tow = money_str[:3]
        return int(money_str_Heavy_Tow)
    if money_str == '100\n(Regular Tow, plus violation fine)':
        money_str_Heavy_Tow = money_str[:3]
        return int(money_str_Heavy_Tow)
    try:
        return int(money_str.replace("$","").replace(",",""))
    except:
        return None



In [24]:

    
df_merged['All Other Areas int'] = df_merged['All Other Areas'].apply(money_to_int)



In [25]:

    
df_merged['Manhattan\xa0 96th St. & below int'] = df_merged['Manhattan\xa0 96th St. & below'].apply(money_to_int)



In [26]:

    
Mininum_Amount = df_merged['All Other Areas int'].sum()



In [27]:

    
print("NYC made at least: ${:,.2f} on parking fines.".format(Mininum_Amount))









    



NYC made at least: $296,225,480.00 on parking fines.

9. What's the most lucrative kind of parking violation? The most frequent?



In [28]:

    
def def_strip(definition):
    return definition.strip()



In [29]:

    
df_merged['DEFINITION_stripped'] = df_merged['DEFINITION'].apply(def_strip)



In [30]:

    
#Most frequent kind of parking violation
df_merged['DEFINITION'].value_counts().head(1)









    Out[30]:





Street Cleaning: No parking where parking is not allowed by sign, street marking or traffic control device.    706228
Name: DEFINITION, dtype: int64



In [31]:

    
count_by_violation_code = df_merged['CODE'].value_counts()



In [32]:

    
count_by_violation_code.head()









    Out[32]:





21    706228
14    468987
20    299391
71    281492
46    257686
Name: CODE, dtype: int64



In [33]:

    
df_merged.groupby('CODE')['All Other Areas int'].sum().sort_values(ascending=False).head()









    Out[33]:





CODE
14    53933505.0
21    31780260.0
46    29633890.0
40    28869025.0
71    18296980.0
Name: All Other Areas int, dtype: float64



In [34]:

    
Most_lucrative_amount = df_merged.groupby('CODE')['All Other Areas int'].sum().sort_values(ascending=False)[1]



In [35]:

    
Most_lucrative_fine_in_words = df_merged[df_merged['CODE'] == '14']



In [36]:

    
Most_lucrative_fine_in_words = Most_lucrative_fine_in_words['DEFINITION'].head(1)



In [37]:

    
print("The most lucrative fine regarded ${:,.2f} on parking fines.".format(Most_lucrative_amount)) 
print("It regarded " + Most_lucrative_fine_in_words)









    



The most lucrative fine regarded $31,780,260.00 on parking fines.
257686    It regarded General No Standing: Standing or p...
Name: DEFINITION, dtype: object

10. New Jersey has bad drivers, but does it have bad parkers, too? How much money does NYC make off of all non-New York vehicles?



In [38]:

    
df_merged['Registration State'].value_counts().head(10)









    Out[38]:





NY    2686324
NJ     366192
PA      92566
CT      52905
FL      43229
MA      32860
VA      25023
99      22545
MD      20699
IN      20042
Name: Registration State, dtype: int64



In [39]:

    
Fines_for_New_York_Cars = df_merged.groupby('Registration State')['All Other Areas int'].sum().sort_values(ascending=False).head(1)



In [40]:

    
Fines_for_New_York_Cars









    Out[40]:





Registration State
NY    220943620.0
Name: All Other Areas int, dtype: float64



In [41]:

    
Mininum_Amount #Total fines









    Out[41]:





296225480.0



In [42]:

    
print("New York City makes ${:,.2f} off non-New-York cars.".format(Mininum_Amount - 220943620.0))









    



New York City makes $75,281,860.00 off non-New-York cars.



In [43]:

    
#Double Check:
df_merged[df_merged['Registration State'] != 'NY'].sum()









    Out[43]:





All Other Areas int                75281860.0
Manhattan  96th St. & below int    80444775.0
dtype: float64

11. Make a chart of the top few.



In [44]:

    
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")



In [45]:

    
df_merged['Registration State'].value_counts().sort_values(ascending=True).tail(10).plot(kind='barh')









    Out[45]:





<matplotlib.axes._subplots.AxesSubplot at 0x14049f668>

12. What time of day do people usually get their tickets? You can break the day up into several blocks - for example 12am-6am, 6am-12pm, 12pm-6pm,6pm-12am.



In [46]:

    
import dateutil

def Time_of_day(x):
    x = str(x)
    #x = (':'.join(a+b for a,b in zip(x[::2], x[1::2])))
    if x == 'NaN':
        pass
    try:
        x = str(x[:2] + ':' + x[2:])
        date = dateutil.parser.parse(x)
        time = int(str(date.strftime('%H')))
        if time <= 6:
            return '12am - 6am'
        elif time <= 12:
            return '6am - 12pm'
        elif time <= 18:
            return '12pm - 6pm'
        elif time <= 24:
            return '6pm - 12am'
    except:
        return 'NaN'



In [47]:

    
df_merged['Time of Day'] = df['Violation Time'].apply(Time_of_day)



In [48]:

    
df_merged['Time of Day'].value_counts()









    Out[48]:





6am - 12pm    1746577
12pm - 6pm    1338139
12am - 6am     251461
6pm - 12am     220209
NaN              1200
Name: Time of Day, dtype: int64

14. Make a graph of the number of tickets per day.



In [49]:

    
df_merged['Issue Date'].head()









    Out[49]:





0   2013-08-04
1   2013-08-04
2   2013-08-05
3   2013-08-05
4   2013-07-23
Name: Issue Date, dtype: datetime64[ns]



In [50]:

    
#Filtering out all the dates that were in the future and before 2013, obviously wrong dates.
df_merged_only_2013_dates = df_merged[(df_merged['Issue Date'] > '2013-01-01') & (df_merged['Issue Date'] < '2013-12-31')]



In [51]:

    
#More on histograms
#http://stackoverflow.com/questions/27365467/python-pandas-plot-histogram-of-dates
#http://stackoverflow.com/questions/29672375/histogram-in-matplotlib-time-on-x-axis
df_merged_only_2013_dates.groupby('Issue Date')['Issue Date'].count().plot(kind="bar", figsize=(15, 6))









    Out[51]:





<matplotlib.axes._subplots.AxesSubplot at 0x140a72048>

15. Make a graph of the amount of revenue collected per day.



In [52]:

    
df_merged['Issue Date'].head()









    Out[52]:





0   2013-08-04
1   2013-08-04
2   2013-08-05
3   2013-08-05
4   2013-07-23
Name: Issue Date, dtype: datetime64[ns]



In [53]:

    
df_merged_only_2013_dates.groupby('Issue Date')['All Other Areas int'].sum().plot(kind="bar", figsize=(15, 6))









    Out[53]:





<matplotlib.axes._subplots.AxesSubplot at 0x14062b6d8>

16. Manually construct a dataframe out of https://dmv.ny.gov/statistic/2015licinforce-web.pdf (only NYC boroughts - bronx, queens, manhattan, staten island, brooklyn), having columns for borough name, abbreviation, and number of licensed drivers.



In [54]:

    
df_boroughs = pd.read_csv("NYC_Boroughs.csv")



In [55]:

    
df_boroughs.head()









    Out[55]:






  
    
      
      borough name
      abbreviation
      Total
    
  
  
    
      0
      Manhattan (New York)
      NY
      748583
    
    
      1
      Bronx
      BX
      456847
    
    
      2
      Brooklyn
      K
      983769
    
    
      3
      Queens
      Q
      1134064
    
    
      4
      Staten Island (Richmond)
      R
      304250

17. What's the parking-ticket-$-per-licensed-driver in each borough of NYC? Do this with pandas and the dataframe you just made, not with your head!



In [56]:

    
df_merged['Violation County'].value_counts()









    Out[56]:





NY       1577324
K         824241
Q         695722
BX        389760
R          40140
RICH           4
QUEEN          2
RC             2
NYC            2
103            1
BRONX          1
KINGS          1
Name: Violation County, dtype: int64



In [57]:

    
df_merged.groupby('Violation County')['All Other Areas int'].sum()









    Out[57]:





Violation County
103            115.0
BRONX          115.0
BX        30345650.0
K         63160360.0
KINGS          115.0
NY       143169105.0
NYC            175.0
Q         53535510.0
QUEEN          230.0
R          3236915.0
RC             280.0
RICH           375.0
Name: All Other Areas int, dtype: float64



In [58]:

    
def boroughs_trans(borough):
    borough = str(borough)
    if borough == 'NaN':
        pass
    elif borough == 'RICH':
        return 'R'
    elif borough == 'BRONX':
        return 'BX'
    else:
        return None

	Summons Number	Plate ID	Registration State	Plate Type	Issue Date	Violation Code	Vehicle Body Type	Vehicle Make	Issuing Agency	Street Code1	...	Vehicle Color	Vehicle Year	Meter Number	Violation Post Code	Violation Description	No Standing or Stopping Violation	Hydrant Violation	Double Parking Violation
0	1283294138	GBB9093	NY	PAS	2013-08-04	46	SUBN	AUDI	P	37250	...	GY	2013	-	NaN	NaN	NaN	NaN	NaN
1	1283294151	62416MB	NY	COM	2013-08-04	46	VAN	FORD	P	37290	...	WH	2012	-	NaN	NaN	NaN	NaN	NaN
2	1283294163	78755JZ	NY	COM	2013-08-05	46	P-U	CHEVR	P	37030	...	NaN	NaN	-	NaN	NaN	NaN	NaN	NaN
3	1283294175	63009MA	NY	COM	2013-08-05	46	VAN	FORD	P	37270	...	WH	2010	-	NaN	NaN	NaN	NaN	NaN
4	1283294187	91648MC	NY	COM	2013-08-08	41	TRLR	GMC	P	37240	...	BR	2012	-	NaN	NaN	NaN	NaN	NaN

	CODE	DEFINITION	Manhattan 96th St. & below	All Other Areas
0	10	Stopping, standing or parking where a sign, st...	$115	$115
1	11	Hotel Loading/Unloading: Standing or parking w...	$115	$115
2	12	Snow Emergency: Standing or parking where stan...	$95	$95
3	13	Taxi Stand: Standing or parking where standing...	$115	$115
4	14	General No Standing: Standing or parking where...	$115	$115

	borough name	abbreviation	Total
0	Manhattan (New York)	NY	748583
1	Bronx	BX	456847
2	Brooklyn	K	983769
3	Queens	Q	1134064
4	Staten Island (Richmond)	R	304250