Examining Los Angeles Grand Theft Auto Dataset

Please unzip the CSV file before running this notebook.



In [1]:

    
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from collections import Counter
from sklearn.preprocessing import LabelEncoder

data = pd.read_csv("VIEW_-_Grand_theft_auto_2004-2014.csv")



In [2]:

    
data.columns









    Out[2]:





Index([u'CRIME_DATE', u'CRIME_YEAR', u'CRIME_CATEGORY_NUMBER', u'CRIME_CATEGORY_DESCRIPTION', u'STATISTICAL_CODE', u'STATISTICAL_CODE_DESCRIPTION', u'VICTIM_COUNT', u'STREET', u'CITY', u'STATE', u'ZIP', u'LATITUDE', u'LONGITUDE', u'GANG_RELATED', u'REPORTING_DISTRICT', u'STATION_IDENTIFIER', u'STATION_NAME', u'CRIME_IDENTIFIER', u'GEO_CRIME_LOCATION'], dtype='object')

So these are all of the features in the dataset, which is worth noting before any modeling, visualization, or analysis. Some features are numeric (e.g. LATITUDE, LONGITUDE) while others are categorical (e.g. CITY, GANG_RELATED).
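As a rough illustration of the numeric versus categorical split, the sketch below separates columns by the dtype pandas assigns them. The rows are made up; only the column names come from the dataset. A numeric dtype is only a first guess: code-like columns such as ZIP may be parsed as numbers yet still behave as categories.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "CITY": ["COMPTON", "NORWALK", "COMPTON"],
    "GANG_RELATED": ["NO", "YES", "NO"],
    "LATITUDE": [33.89, 33.91, 33.90],
    "LONGITUDE": [-118.22, -118.08, -118.23],
})

# Columns pandas parsed as numbers versus everything else (strings, etc.).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Running the same two select_dtypes calls on the full data frame would give a starting list to review by hand.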



In [3]:

    
len(data)









    Out[3]:





157983

Furthermore, this is a sizable dataset, with nearly 160k records.

Most Common Car Thefts Listed by City



In [4]:

    
cityThefts = Counter(data['CITY']).most_common(20)
cities = [i[0] for i in cityThefts]
thefts = [i[1] for i in cityThefts]
print(cities)
print(thefts)









    



['LOS ANGELES', 'COMPTON', 'LANCASTER', 'NORWALK', 'LYNWOOD', 'BELLFLOWER', 'PALMDALE', 'PARAMOUNT', 'CARSON', 'EAST LOS ANGELES', 'PICO RIVERA', 'WHITTIER', 'LAKEWOOD', 'COMMERCE', 'INDUSTRY', 'ROSEMEAD', 'CERRITOS', 'LA PUENTE', 'WEST HOLLYWOOD', 'SOUTH EL MONTE']
[17411, 12398, 7874, 7838, 7782, 6673, 6593, 6516, 6190, 5167, 4496, 4457, 4127, 3396, 3274, 3059, 2932, 2350, 1718, 1697]



In [5]:

    
plt.figure(figsize=(12,8))
plt.bar(range(len(thefts)),thefts)
plt.xticks(range(len(cities)),cities,rotation=68)
plt.title("Total GTA count per city from 2004-2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")
plt.show()

Note that the chart above shows only the raw theft counts for the 20 cities with the most recorded thefts. It does not adjust for population, so to some extent it reflects how populous each city is rather than how likely a car is to be stolen there. Given population figures for each city over 2004-2014, it would be possible to chart a proper theft rate per city instead.
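A minimal sketch of that adjustment, assuming a population lookup were available. The incident list and the population figures below are invented purely for illustration (they are not census data), and the name rate_per_1000 is hypothetical.

```python
from collections import Counter

# Made-up incident records; in the notebook this would be data['CITY'].
incident_cities = ["COMPTON"] * 6 + ["NORWALK"] * 3 + ["COMMERCE"] * 2

# HYPOTHETICAL populations, for illustration only (not real census figures).
population = {"COMPTON": 3000, "NORWALK": 3000, "COMMERCE": 500}

counts = Counter(incident_cities)

# Thefts per 1,000 residents for every city with a known population.
rate_per_1000 = {
    city: 1000.0 * n / population[city]
    for city, n in counts.items()
    if city in population
}

# Rank by rate rather than by raw count.
ranked = sorted(rate_per_1000.items(), key=lambda kv: kv[1], reverse=True)
for city, rate in ranked:
    print(city, round(rate, 2))
```

In this toy data the city with the fewest thefts ends up with the highest rate, which is exactly the kind of reordering a raw count chart hides.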



In [6]:

    
data2014 = data[data.CRIME_YEAR == 2014]
cityThefts2014 = Counter(data2014['CITY']).most_common(20)
cities2014 = [i[0] for i in cityThefts2014]
thefts2014 = [i[1] for i in cityThefts2014]



In [7]:

    
plt.figure(figsize=(18,6))
plt.subplot(1,2,1)
plt.bar(range(len(thefts2014)),thefts2014)
plt.xticks(range(len(cities2014)),cities2014,rotation=68)
plt.title("Total GTA count per city in 2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")

plt.subplot(1,2,2)
plt.bar(range(len(thefts)),thefts)
plt.xticks(range(len(cities)),cities,rotation=68)
plt.title("Total GTA count per city from 2004-2014")
plt.xlabel("Top 20 Cities")
plt.ylabel("Thefts")

plt.show()

The above charts would be better visualized if adjusted for population.

Encoding the Labels for Modeling

Let's look at the features again and count the unique values in each one to get a sense of how complex the dataset is at this point.



In [8]:

    
data.columns









    Out[8]:





Index([u'CRIME_DATE', u'CRIME_YEAR', u'CRIME_CATEGORY_NUMBER', u'CRIME_CATEGORY_DESCRIPTION', u'STATISTICAL_CODE', u'STATISTICAL_CODE_DESCRIPTION', u'VICTIM_COUNT', u'STREET', u'CITY', u'STATE', u'ZIP', u'LATITUDE', u'LONGITUDE', u'GANG_RELATED', u'REPORTING_DISTRICT', u'STATION_IDENTIFIER', u'STATION_NAME', u'CRIME_IDENTIFIER', u'GEO_CRIME_LOCATION'], dtype='object')



In [26]:

    
# One (unique_values, counts) pair per column, in column order
uniqueLists = []

for column in data:
    uniqueLists.append(np.unique(data[column],return_counts=True))

The code above relies on the 'return_counts' parameter of np.unique, added in NumPy 1.9, which makes the call return two arrays: the unique values and the number of times each one occurs. I only recently updated my NumPy package, which is why the earlier cells still use Counter.
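To make the two approaches concrete, here is a small sketch on made-up values standing in for a single column. It shows what np.unique returns with return_counts=True and checks that Counter produces the same tallies.

```python
import numpy as np
from collections import Counter

# Made-up values standing in for one column such as data['CITY'].
values = np.array(["COMPTON", "NORWALK", "COMPTON", "CARSON", "COMPTON"])

# return_counts=True (NumPy 1.9+) returns the sorted unique values
# and a parallel array with the number of occurrences of each.
uniques, counts = np.unique(values, return_counts=True)
print(uniques.tolist())
print(counts.tolist())

# Counter gives the same tallies as a dict-like object.
tally = Counter(values.tolist())
assert dict(zip(uniques.tolist(), counts.tolist())) == dict(tally)
```

One practical difference: np.unique returns values in sorted order, while Counter.most_common orders them by frequency, which is what the bar charts earlier needed.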



In [32]:

    
for column, (values, counts) in zip(data.columns, uniqueLists):
    print("%s %d" % (column, len(values)))









    



CRIME_DATE 96966
CRIME_YEAR 11
CRIME_CATEGORY_NUMBER 1
CRIME_CATEGORY_DESCRIPTION 1
STATISTICAL_CODE 10
STATISTICAL_CODE_DESCRIPTION 10
VICTIM_COUNT 1
STREET 92406
CITY 286
STATE 10
ZIP 85272
LATITUDE 110303
LONGITUDE 110305
GANG_RELATED 2
REPORTING_DISTRICT 946
STATION_IDENTIFIER 39
STATION_NAME 42
CRIME_IDENTIFIER 157983
GEO_CRIME_LOCATION 103533
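The counts above show three columns with a single unique value (CRIME_CATEGORY_NUMBER, CRIME_CATEGORY_DESCRIPTION, VICTIM_COUNT), and the notebook imports LabelEncoder without using it yet. One possible next step, sketched here on made-up rows and not taken from the notebook itself, is to drop the constant columns and map the string columns to integer codes. The "_CODE" suffix and the encoders dict are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "CITY": ["COMPTON", "NORWALK", "COMPTON", "CARSON"],
    "GANG_RELATED": ["NO", "YES", "NO", "NO"],
    "VICTIM_COUNT": [1, 1, 1, 1],
})

# Drop columns with a single unique value; they cannot help a model.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
toy = toy.drop(columns=constant_cols)

# Map each remaining string column to integer codes, keeping the encoders
# so the codes can be turned back into labels later.
encoders = {}
for col in ["CITY", "GANG_RELATED"]:
    encoders[col] = LabelEncoder()
    toy[col + "_CODE"] = encoders[col].fit_transform(toy[col])

print(constant_cols)
print(toy)
```

Integer codes imply an ordering that does not exist for city names, so one-hot encoding may suit some models better. High-cardinality columns such as STREET would likely need a different treatment altogether.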

work in progress


