This notebook demos a visualization of citations in the St. Louis area. Specifically, I was wondering whether there were any interesting patterns between the location of a citation and where the defendant lives.
The data used in this demo is not attested by any authority, so the conclusions here must be taken with a grain of salt.
In [1]:
## MAGIC & IMPORTS
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
print("pandas version: ", pd.__version__)
print("numpy version: ", np.__version__)
print("matplotlib version: ", matplotlib.__version__)
My two data sources are the dummy data files that Global Hack V teams received in mid-September. Two files were provided, citations.csv and violations.csv, and both are read into the program.
In [2]:
citations = pd.read_csv('citations.csv')
violations = pd.read_csv('violations.csv')
In [3]:
citations['defendant_city'] = citations['defendant_city__c']
citations['court_location'] = citations['court_location__c']
One thing I was wondering about the municipal court system is whether there is a link between where someone lives and where the citation is issued. The first 20 values of defendant_city and court_location are listed below.
In [4]:
citations[['defendant_city','court_location']].head(n=20)
Out[4]:
In [5]:
matching_cities = citations[citations['defendant_city'] == citations['court_location']]
print("Number of matching cities: ", len(matching_cities))
unmatching_cities = citations[citations['defendant_city'] != citations['court_location']]
print("Number of unmatching cities: ", len(unmatching_cities))
plt.figure(figsize=(8,8))
plt.pie([len(matching_cities), len(unmatching_cities)], labels=["Matching Cities: " + str(len(matching_cities)),
"Unmatching Cities: " + str(len(unmatching_cities))])
Out[5]:
I was actually surprised by this result. Were there really only 11 records where the locations matched?
This could be the result of how the dummy data was generated or sourced. However, this entire demo is based on the assumption that this data represents real data, so that explanation doesn't raise any interesting questions.
So, what else could be causing this result? I have not done any processing on the data I'm looking at, so it could be that the values in the two columns I'm interested in don't align.
Let's check that.
In [6]:
defendant_cities = [defendant for defendant in citations['defendant_city'].astype("category").cat.categories]
court_cities = [court for court in citations['court_location'].astype("category").cat.categories]
print("Defendant cities: ", len(defendant_cities))
print("Court cities: ", len(court_cities))
unmatched_court_cities = [city for city in court_cities if city not in defendant_cities]
unmatched_defendant_cities = [city for city in defendant_cities if city not in court_cities]
print("Cities that are in court_location but not in defendant_city: \n", unmatched_court_cities, "\n")
print("Cities that are in defendant_city but not in court_location: \n", unmatched_defendant_cities, "\n")
Well, this is interesting. There are two BERKELEY court locations that are marked differently, and then there is the slight difference between TOWN AND COUNTRY and TOWN & COUNTRY.
However, the biggest discrepancy here is the UNINCORPORATED records in the court locations. Clearly, these areas have names that the locals use, but the county bundles them together.
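One way to deal with spelling differences like TOWN AND COUNTRY versus TOWN & COUNTRY is to normalize both columns before comparing them. The following is only a minimal sketch: `normalize_city` is a hypothetical helper that is not part of this notebook, and the toy rows are made up for illustration rather than taken from citations.csv. It does nothing about the UNINCORPORATED records or the two BERKELEY labels, which would need a hand-written mapping.

```python
import pandas as pd

def normalize_city(name):
    """Hypothetical helper: upper-case, collapse whitespace, and unify '&' with 'AND'."""
    if pd.isna(name):
        return name  # leave missing values untouched
    cleaned = " ".join(str(name).upper().split())
    return cleaned.replace(" & ", " AND ")

# Toy data, made up for illustration; not taken from citations.csv
toy = pd.DataFrame({
    "defendant_city": ["Town and Country", "BERKELEY ", None],
    "court_location": ["TOWN & COUNTRY", "BERKELEY", "UNINCORPORATED"],
})

# Add a normalized copy of each column instead of overwriting the original
for col in ["defendant_city", "court_location"]:
    toy[col + "_norm"] = toy[col].map(normalize_city)

matches = (toy["defendant_city_norm"] == toy["court_location_norm"]).sum()
print("Matching rows after normalization:", matches)
```

Keeping the normalized values in separate columns leaves the raw labels available for inspection, which seems useful while I am still figuring out which differences are meaningful.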
In [7]:
court_location_value_counts = citations['court_location'].astype("category").value_counts()
print("Number of citations in unmatched court locations: ", court_location_value_counts[unmatched_court_cities].sum())
print("Breakdown by location:\n")
print(court_location_value_counts[unmatched_court_cities])
In [8]:
defendant_city_value_counts = citations['defendant_city'].astype("category").value_counts()
print("Number of citations in unmatched defendant cities: ", defendant_city_value_counts[unmatched_defendant_cities].sum())
print("Breakdown by location:\n")
print(defendant_city_value_counts[unmatched_defendant_cities])
Take a look at Town and Country's data. According to our data, people with a home address in Town and Country were cited 4 times for traffic violations. However, there were 12 citations issued from that same police department.
The total number of citations issued in the two Berkeleys is 24, and the number of citations issued to Berkeley home addresses is 8.
I cannot draw true conclusions from either of these observations. I don't know how to calculate statistical significance, but more importantly, there is no way that these are enough observations.
The next thing to do is to construct a matrix. One axis is going to be the court locations, and the other is going to be the defendant cities. Each cell counts how many citations were issued for that combination of court location and defendant city.
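A matrix like this could be sketched with `pd.crosstab`, which counts how often each pair of values occurs. This is one possible approach and not necessarily the one the finished notebook would use. The toy frame below is made up for illustration; in the notebook the two columns of `citations` would be passed instead, and rows with a missing city would likely be left out of the counts.

```python
import pandas as pd

# Toy data, made up for illustration; the notebook would use `citations` instead
toy = pd.DataFrame({
    "court_location": ["A", "A", "B", "B", "B"],
    "defendant_city": ["A", "B", "B", "B", "C"],
})

# Rows: court locations; columns: defendant cities; cells: number of citations
matrix = pd.crosstab(toy["court_location"], toy["defendant_city"])
print(matrix)
```

Cells where the row and column labels are the same correspond to citations issued in the defendant's home city, so that is where I would expect the matching records to show up.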
In [ ]: