In [1]:
import pandas as pd
1. I want to make sure my Plate ID is a string. Can't lose the leading zeroes!
In [2]:
#As the datatypes in all of the columns vary, I decided to make all the values, except for the ones I specify,
#into str. This also takes care of questions 1. -> dtype=str
2. I don't think anyone's car was built in 0AD. Discard the '0's as NaN.
In [3]:
# The syntax for this is really very nice and clear, an example na_values= {'Vehicle Year' : ['0']}
# A lot more here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
3. I want the dates to be dates! Read the read_csv documentation to find out how to make pandas automatically parse dates.
In [4]:
#parse_dates=True,keep_date_col = True
import datetime
import datetime as dt
dt.datetime.strptime('08/04/2013', '%m/%d/%Y')
datetime.datetime(2013, 8, 4, 0, 0)
# pd.datetime was deprecated and removed (pandas 2.0+); parse with the
# datetime module imported above instead. A def is also clearer than
# assigning a lambda to a name.
def parser(date):
    """Parse an Issue Date string like '08/04/2013' (month/day/year)."""
    return dt.datetime.strptime(date, '%m/%d/%Y')
#taken from here: http://stackoverflow.com/questions/23797491/parse-dates-in-pandas
#adding this, column counts starts at 0, without index, parse_dates=[4], date_parser=parser
#nrows=100000,
In [5]:
# dtype=str keeps leading zeros in Plate ID (question 1), the na_values
# dict turns Vehicle Year 0 into NaN (question 2), and column 4 is parsed
# as a date with the %m/%d/%Y parser defined above (question 3).
df = pd.read_csv("violations.csv", na_values= {'Vehicle Year' : ['0']}, parse_dates=[4], date_parser=parser, dtype=str)
In [6]:
df.head()
Out[6]:
In [7]:
#df.info()
4. "Date first observed" is a pretty weird column, but it seems like it has a date hiding inside. Using a function with .apply, transform the string (e.g. "20140324") into a Python date. Make the 0's show up as NaN.
In [8]:
#df['Date First Observed'].value_counts()
In [9]:
import dateutil.parser
In [10]:
def date_transfer(x):
    """Turn a 'Date First Observed' value like 20140324 into 'DD/MM/YYYY'.

    The raw column stores dates as YYYYMMDD. Zeros and anything that
    cannot be parsed come back as the string 'NaN' so missing values are
    easy to spot in the column.
    """
    x = str(x)
    if x == '0':
        return 'NaN'
    try:
        # Fixed YYYYMMDD layout -- stdlib strptime needs no third-party parser.
        date = dt.datetime.strptime(x, '%Y%m%d')
        return date.strftime('%d/%m/%Y')
    except ValueError:
        # Covers 'nan', empty strings, and malformed dates.
        return 'NaN'

print(date_transfer(20151118))
In [11]:
df['Date First Observed'].apply(date_transfer).head()
Out[11]:
5. "Violation time" is... not a time. Make it a time.
In [12]:
#adding n_values = {'Violation Time' : ['0']}
#df = pd.read_csv("violations.csv", na_values= {'Vehicle Year' : ['0'], 'Violation Time' : ['0']}, parse_dates=[4], date_parser=parser, dtype=str)
In [13]:
df['Violation Time'].head(5)
Out[13]:
In [14]:
#df['Violation Time'].value_counts()
In [15]:
import dateutil
def pmam(x):
    """Convert a raw violation time like '0752A' into 'HH:MM AM/PM'.

    The raw column stores 12-hour times as HHMM plus a single 'A'/'P'
    meridiem letter. Anything that does not fit that shape (including
    NaN) comes back as the string 'NaN'.
    """
    x = str(x)
    try:
        # '0752A' -> '0752AM' so strptime's %p can read the meridiem.
        date = dt.datetime.strptime(x + 'M', '%I%M%p')
        # %I keeps the 12-hour clock so the AM/PM marker stays meaningful;
        # the old '%H:%M %p' mixed a 24-hour hour with an AM/PM suffix.
        return date.strftime('%I:%M %p')
    except ValueError:
        # Covers 'NaN'/'nan' inputs and malformed times.
        return 'NaN'

print(pmam('NaN'))
print(pmam("0752A"))
print(pmam("1240P"))
print(pmam("0232P"))
In [16]:
df['Violation Time pmam'] = df['Violation Time'].apply(pmam)
6. There sure are a lot of colors of cars, too bad so many of them are the same. Make "BLK" and "BLACK" the same, "WT" and "WHITE" the same, and merge any other combinations that you notice.
In [17]:
df['Vehicle Color'].head(10)
Out[17]:
In [18]:
def black_and_white(car_color):
    """Normalize common black/white color abbreviations to full names.

    Uses whole-value matching instead of chained substring .replace, so
    a code like 'WHT' cannot be corrupted into 'WHITET'. Also handles
    'WT'/'BLK' from the assignment. Missing values come back as 'NaN'.
    """
    car_color = str(car_color)
    # str(float('nan')) is 'nan', not 'NaN' -- the old check never fired.
    if car_color in ('nan', 'NaN'):
        return 'NaN'
    normalized = {
        'BK': 'BLACK',
        'BLK': 'BLACK',
        'WH': 'WHITE',
        'WT': 'WHITE',
        'WHT': 'WHITE',
    }
    return normalized.get(car_color, car_color)
In [19]:
def rainbow(car_color):
    """Normalize the remaining common color abbreviations to full names.

    Whole-value matching avoids the substring bugs of chained .replace
    (e.g. 'BRN'.replace('BR', 'BROWN') -> 'BROWNN'), and 'GY' now maps
    to 'GREY' so it actually unifies with the spelling the old guard
    clause checked for. Missing values come back as 'NaN'.
    """
    car_color = str(car_color)
    # str(float('nan')) is 'nan', not 'NaN' -- the old check never fired.
    if car_color in ('nan', 'NaN'):
        return 'NaN'
    normalized = {
        'BR': 'BROWN',
        'BRN': 'BROWN',
        'RD': 'RED',
        'GY': 'GREY',
        'TN': 'TAN',
        'BLU': 'BLUE',
        'GN': 'GREEN',
    }
    return normalized.get(car_color, car_color)
In [20]:
df['Vehicle Color'].apply(black_and_white).apply(rainbow).head(5)
Out[20]:
7. Join the data with the Parking Violations Code dataset from the NYC Open Data site.
In [21]:
DOF_Parking_Violation_Codes_df = pd.read_csv("Parking_Fine_Codes_Man_Quota.csv")
DOF_Parking_Violation_Codes_df.head()
Out[21]:
In [22]:
def make_code_int(code):
    """Convert a violation code (read in as a string) to an integer."""
    numeric_code = int(code)
    return numeric_code
In [23]:
df['Violation Code int'] = df['Violation Code'].apply(make_code_int)
In [24]:
# Default inner join: rows whose violation code has no match in the
# fine-code table are dropped from df_merged here.
df_merged = df.merge(DOF_Parking_Violation_Codes_df, left_on='Violation Code int', right_on='CODE')
In [25]:
df_merged.head(1)
Out[25]:
8. How much money did NYC make off of parking violations?
In [26]:
Mininum_Amount = df_merged['All Other Areas'].sum()
In [27]:
Mininum_Amount
Out[27]:
In [28]:
print("NYC made at least: ${:,.2f} off parking fines.".format(Mininum_Amount))
9. What's the most lucrative kind of parking violation? The most frequent?
In [29]:
def def_strip(definition):
    """Trim surrounding whitespace from a violation DEFINITION string."""
    trimmed = definition.strip()
    return trimmed
In [30]:
df_merged['DEFINITION_stripped'] = df_merged['DEFINITION'].apply(def_strip)
In [31]:
#Most frequent kind of parking violation
df_merged['DEFINITION'].value_counts().head(1)
Out[31]:
In [32]:
count_by_violation_code = df_merged['CODE'].value_counts()
In [33]:
count_by_violation_code.head()
Out[33]:
In [34]:
df_merged.groupby('CODE')['All Other Areas'].sum().sort_values(ascending=False).head()
Out[34]:
In [35]:
Most_lucrative_amount = df_merged.groupby('CODE')['All Other Areas'].sum().sort_values(ascending=False).head(1)
Most_lucrative_amount
Out[35]:
In [36]:
Most_lucrative_fine_in_words = df_merged[df_merged['CODE'] == 14]
In [37]:
Most_lucrative_fine_in_words = Most_lucrative_fine_in_words['DEFINITION'].head(1)
Most_lucrative_fine_in_words
Out[37]:
10. New Jersey has bad drivers, but does it have bad parkers, too? How much money does NYC make off of all non-New York vehicles?
In [38]:
df_merged['Registration State'].value_counts().head(10)
Out[38]:
In [39]:
Fines_for_New_York_Cars = df_merged.groupby('Registration State')['All Other Areas'].sum().sort_values(ascending=False).head(1)
In [40]:
Fines_for_New_York_Cars
Out[40]:
In [41]:
Mininum_Amount #Total fines
Out[41]:
In [66]:
# NOTE(review): 221829545.0 is presumably the NY-registered total from
# Fines_for_New_York_Cars above -- hardcoding it means this cell silently
# goes stale if the data changes; confirm and compute it from the frame.
print("New York City makes ${:,.2f} off non-New-York cars.".format(Mininum_Amount - 221829545.0))
In [43]:
#Double Check:
df_merged[df_merged['Registration State'] != 'NY'].sum()
Out[43]:
11. Make a chart of the top few.
In [44]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("ggplot")
In [45]:
df_merged['Registration State'].value_counts().sort_values(ascending=True).tail(10).plot(kind='barh')
Out[45]:
12. What time of day do people usually get their tickets? You can break the day up into several blocks - for example 12am-6am, 6am-12pm, 12pm-6pm,6pm-12am.
In [46]:
import dateutil
def Time_of_day(x):
    """Bucket a raw 'Violation Time' like '0752A' into a quarter-of-day label.

    The raw values are HHMM on a 12-hour clock followed by an 'A'/'P'
    meridiem letter. Returns one of '12am - 6am', '6am - 12pm',
    '12pm - 6pm', '6pm - 12am', or the string 'NaN' when the value
    cannot be read as a time.
    """
    x = str(x)
    if len(x) < 4:
        # Too short to hold HHMM (also catches NaN -> 'nan' of length 3).
        return 'NaN'
    try:
        hour = int(x[:2])
    except ValueError:
        return 'NaN'
    meridiem = x[4:5].upper()
    # Convert the 12-hour value to 24-hour before bucketing; the old code
    # ignored the A/P letter, so every PM time was mis-bucketed.
    if meridiem == 'P' and hour != 12:
        hour += 12
    elif meridiem == 'A' and hour == 12:
        hour = 0
    if not 0 <= hour <= 23:
        return 'NaN'
    # Half-open buckets: hour 6 belongs to '6am - 12pm', 12 to '12pm - 6pm',
    # 18 to '6pm - 12am' (the old <= tests put each boundary hour one block early).
    if hour < 6:
        return '12am - 6am'
    elif hour < 12:
        return '6am - 12pm'
    elif hour < 18:
        return '12pm - 6pm'
    else:
        return '6pm - 12am'
In [47]:
df_merged['Time of Day'] = df['Violation Time'].apply(Time_of_day)
In [48]:
df_merged['Time of Day'].value_counts()
Out[48]:
14. Make a graph of the number of tickets per day.
In [49]:
df_merged['Issue Date'].head()
Out[49]:
In [50]:
#Filtering out obviously wrong dates: anything in the future or before 2013 (the strict bounds also exclude Jan 1 and Dec 31, 2013).
df_merged_only_2013_dates = df_merged[(df_merged['Issue Date'] > '2013-01-01') & (df_merged['Issue Date'] < '2013-12-31')]
In [51]:
#More on histograms
#http://stackoverflow.com/questions/27365467/python-pandas-plot-histogram-of-dates
#http://stackoverflow.com/questions/29672375/histogram-in-matplotlib-time-on-x-axis
#plt.axis('off')
import matplotlib.pyplot as plt
plt.plot(range(5))
plt.gca().xaxis.set_major_locator(plt.NullLocator())
df_merged_only_2013_dates.groupby('Issue Date')['Issue Date'].count().plot(kind="bar", figsize=(15, 6), legend=None, title='Amount of daily parking tickets in 2013')
plt.show()
15. Make a graph of the amount of revenue collected per day.
In [52]:
df_merged['Issue Date'].head()
Out[52]:
In [53]:
plt.plot(range(5))
plt.gca().xaxis.set_major_locator(plt.NullLocator())
df_merged_only_2013_dates.groupby('Issue Date')['All Other Areas'].sum().plot(kind="bar", figsize=(15, 6), legend=None, title='Total daily fine amount in 2013')
plt.show()
16. Manually construct a dataframe out of https://dmv.ny.gov/statistic/2015licinforce-web.pdf (only NYC boroughts - bronx, queens, manhattan, staten island, brooklyn), having columns for borough name, abbreviation, and number of licensed drivers.
In [54]:
df_boroughs = pd.read_csv("NYC_Boroughs3.csv")
In [55]:
df_boroughs.head()
Out[55]:
17. What's the parking-ticket-$-per-licensed-driver in each borough of NYC? Do this with pandas and the dataframe you just made, not with your head!
In [56]:
def county_harm(county):
    """Harmonize 'Violation County' spellings to one abbreviation per borough.

    Returns None for missing values so the subsequent groupby drops them.
    Whole-value mapping replaces the chained substring .replace, which
    could corrupt longer codes (e.g. 'QUEENS' -> 'QS').
    """
    county = str(county)
    # str(float('nan')) is 'nan', not 'NaN' -- the old check never matched,
    # so NaNs leaked through as the literal string 'nan'.
    if county in ('nan', 'NaN'):
        return None
    harmonized = {
        'BRONX': 'BX',
        'KINGS': 'K',
        'NYC': 'NY',
        'QUEEN': 'Q',
        'QUEENS': 'Q',
        'QNS': 'Q',
        'RC': 'R',
        'RICH': 'R',
    }
    return harmonized.get(county, county)
In [57]:
df_merged['Violation County harm'] = df_merged['Violation County'].apply(county_harm)
In [58]:
summed_fine_amounts_by_county = df_merged.groupby('Violation County harm')['All Other Areas'].sum()
In [59]:
#creating dataframe
fine_amounts = pd.DataFrame(summed_fine_amounts_by_county)
In [60]:
#changing the index into column
fine_amounts.reset_index(inplace=True)
In [61]:
fine_amounts
Out[61]:
In [62]:
#Joining the two new datasets
df_ticket_dollars_per_licence = fine_amounts.merge(df_boroughs, left_on='Violation County harm', right_on='abbreviation')
In [63]:
df_ticket_dollars_per_licence
Out[63]:
In [64]:
df_ticket_dollars_per_licence['Ticket Dollars per Licence'] = df_ticket_dollars_per_licence['All Other Areas'] / df_ticket_dollars_per_licence['Total']
In [65]:
#Dividing one dataset by the other
df_ticket_dollars_per_licence[['borough name', 'abbreviation', 'Ticket Dollars per Licence']]
Out[65]: