OpenDataSTL demo

This notebook demos a visualization of citations in the St. Louis area. Specifically, I was wondering whether there are any interesting patterns linking the location of a citation to where the defendant lives.

WARNING

The data used in this demo is not attested by any authority. The conclusions here should be taken with a grain of salt.


In [1]:
## MAGIC & IMPORTS
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

print("pandas version: ", pd.__version__)
print("numpy version: ", np.__version__)
print("matplotlib version: ", matplotlib.__version__)


pandas version:  0.17.0
numpy version:  1.8.2
matplotlib version:  1.4.2

My two data sources are the dummy data files that Global Hack V teams received in mid-September: citations.csv and violations.csv. Both are read into the program below.


In [2]:
citations = pd.read_csv('citations.csv')
violations = pd.read_csv('violations.csv')

In [3]:
# Alias the Salesforce-style column names to shorter names for convenience.
citations['defendant_city'] = citations['defendant_city__c']
citations['court_location'] = citations['court_location__c']

Relation between defendant's residence and location of citation

One thing I was wondering about with regard to the municipal court system is whether there is a link between where someone lives and where the citation is issued. The defendant_city and court_location columns are listed below.


In [4]:
citations[['defendant_city','court_location']].head(n=20)


Out[4]:
    defendant_city      court_location
0   HAZELWOOD           ST. LOUIS CITY
1   PASADENA PARK       ST. LOUIS CITY
2   NORTHWOODS          ST. LOUIS CITY
3   MAPLEWOOD           ST. LOUIS CITY
4   CALVERTON PARK      OLIVETTE
5   FENTON              OLIVETTE
6   CLAYTON             OLIVETTE
7   CRESTWOOD           OAKLAND
8   CRYSTAL LAKE PARK   WELLSTON
9   MACKENZIE           WELLSTON
10  BRECKENRIDGE HILLS  OAKLAND
11  CREVE COEUR         WELLSTON
12  BRECKENRIDGE HILLS  OAKLAND
13  MOLINE ACRES        OAKLAND
14  CLAYTON             OAKLAND
15  ROCK HILL           WELLSTON
16  OLIVETTE            WELLSTON
17  JENNINGS            OAKLAND
18  CHESTERFIELD        OAKLAND
19  BELLA VILLA         WELLSTON

Suppose they are mostly equal

If citations are mostly issued to residents of the municipality issuing them, I would expect a large percentage of the defendant_city values to contain the same text as court_location. Keep in mind that some municipalities do not have their own courts.


In [5]:
matching_cities = citations[citations['defendant_city'] == citations['court_location']]
print("Number of matching cities: ", len(matching_cities))

unmatching_cities = citations[citations['defendant_city'] != citations['court_location']]
print("Number of unmatching cities: ", len(unmatching_cities))

plt.figure(figsize=(8,8))
plt.pie([len(matching_cities), len(unmatching_cities)], labels=["Matching Cities: " + str(len(matching_cities)),
                                                                "Unmatching Cities: " + str(len(unmatching_cities))])


Number of matching cities:  11
Number of unmatching cities:  987
Out[5]:
([<matplotlib.patches.Wedge at 0x7fcc90533dd8>,
  <matplotlib.patches.Wedge at 0x7fcc9053da20>],
 [<matplotlib.text.Text at 0x7fcc9053d588>,
  <matplotlib.text.Text at 0x7fcc90546208>])

Suppose that my intuition is way off, or that I don't know what I'm doing

I was actually surprised by this result. Were there really only 11 records where the locations matched?

This could be an artifact of how the dummy data was generated or sourced. However, this entire demo is built on the assumption that the data is representative of real data, so that possibility doesn't raise any interesting questions.

So, what else could be causing this result? I have not done any processing at all on the data I'm looking at, so it could be that the values in the two columns I'm interested in simply don't align.

Let's check that.


In [6]:
defendant_cities = list(citations['defendant_city'].astype("category").cat.categories)
court_cities = list(citations['court_location'].astype("category").cat.categories)
print("Defendant cities: ", len(defendant_cities))
print("Court cities: ", len(court_cities))

unmatched_court_cities = [city for city in court_cities if city not in defendant_cities]
unmatched_defendant_cities = [city for city in defendant_cities if city not in court_cities]
print("Cities that are in court_location but not in defendant_city: \n", unmatched_court_cities, "\n")
print("Cities that are in defandent_city but not in court_location: \n", unmatched_defendant_cities, "\n")


Defendant cities:  91
Court cities:  85
Cities that are in court_location but not in defendant_city: 
 ['BERKELEY 1', 'BERKELEY 2', 'ST. JOHN', 'TOWN AND COUNTRY', 'UNINCORPORATED CENTRAL ST. LOUIS COUNTY', 'UNINCORPORATED NORTH ST. LOUIS COUNTY', 'UNINCORPORATED SOUTH ST. LOUIS COUNTY', 'UNINCORPORATED WEST ST. LOUIS COUNTY'] 

Cities that are in defendant_city but not in court_location: 
 ['BELLERIVE', 'BERKELEY', 'CHAMP', 'COUNTRY LIFE ACRES', 'CRYSTAL LAKE PARK', 'GLEN ECHO PARK', 'GREEN PARK', 'HUNTLEIGH', 'NORWOOD COURT', 'ST. GEORGE', 'TOWN & COUNTRY', 'TWIN OAKS', 'WESTWOOD', 'WILBUR PARK'] 

Well, this is interesting. There are two BERKELEY court locations that are marked differently, and then there is the slight difference between TOWN AND COUNTRY and TOWN & COUNTRY.

However, the biggest discrepancy here is the set of UNINCORPORATED records among the court locations. These areas clearly have names that the locals use, but the county bundles them together.
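
As an aside, a few of these mismatches look fixable with some light normalization. Below is a minimal sketch of what that could look like; the mapping is my own guess at reasonable equivalences (collapsing the two BERKELEY courts, aligning the two spellings of Town & Country), not something defined by the data source.

# Hypothetical normalization of a few court_location labels so they line up
# with the defendant_city values. The mapping is an assumption on my part.
court_location_fixes = {
    'BERKELEY 1': 'BERKELEY',
    'BERKELEY 2': 'BERKELEY',
    'TOWN AND COUNTRY': 'TOWN & COUNTRY',
}

citations['court_location_normalized'] = citations['court_location'].replace(court_location_fixes)

# Re-count exact matches using the normalized labels.
normalized_matches = citations[citations['defendant_city'] == citations['court_location_normalized']]
print("Matching cities after normalization: ", len(normalized_matches))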


In [7]:
court_location_value_counts = citations['court_location'].astype("category").value_counts()
print("Number of citations in unmatched court locations: ", court_location_value_counts[unmatched_court_cities].sum())
print("Breakdown by location:\n")
print(court_location_value_counts[unmatched_court_cities])


Number of citations in unmatched court locations:  80
Breakdown by location:

BERKELEY 1                                 14
BERKELEY 2                                 10
ST. JOHN                                    5
TOWN AND COUNTRY                           13
UNINCORPORATED CENTRAL ST. LOUIS COUNTY    11
UNINCORPORATED NORTH ST. LOUIS COUNTY      12
UNINCORPORATED SOUTH ST. LOUIS COUNTY       7
UNINCORPORATED WEST ST. LOUIS COUNTY        8
dtype: int64

In [8]:
defendant_city_value_counts = citations['defendant_city'].astype("category").value_counts()
print("Number of citations in unmatched defendant cities: ", defendant_city_value_counts[unmatched_defendant_cities].sum())
print("Breakdown by location:\n")
print(defendant_city_value_counts[unmatched_defendant_cities])


Number of citations in unmatched defendant cities:  144
Breakdown by location:

BELLERIVE              5
BERKELEY               8
CHAMP                 10
COUNTRY LIFE ACRES     8
CRYSTAL LAKE PARK     21
GLEN ECHO PARK         5
GREEN PARK            15
HUNTLEIGH              1
NORWOOD COURT          8
ST. GEORGE             7
TOWN & COUNTRY         4
TWIN OAKS             11
WESTWOOD              15
WILBUR PARK           26
dtype: int64

Incomplete, inconclusive conclusions

Take a look at Town and Country's data. Our data says that people who have a home address in Town and Country were cited 4 times for traffic violations, yet 13 citations list Town and Country as the court location.

The total number of citations issued at the two Berkeley court locations is 24, while the number of citations to Berkeley home addresses is 8.
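
Those figures can be double-checked against the value-count Series computed in In [7] and In [8]; a quick sketch:

# Recompute the Town & Country and Berkeley comparisons from the Series above.
town_court = court_location_value_counts['TOWN AND COUNTRY']      # citations with that court location
town_defendant = defendant_city_value_counts['TOWN & COUNTRY']    # citations to home addresses there
berkeley_court = court_location_value_counts[['BERKELEY 1', 'BERKELEY 2']].sum()
berkeley_defendant = defendant_city_value_counts['BERKELEY']
print("Town & Country -- court: ", town_court, " defendant: ", town_defendant)
print("Berkeley       -- court: ", berkeley_court, " defendant: ", berkeley_defendant)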

I cannot draw firm conclusions from either of these observations. I don't know how to calculate statistical significance here, but more importantly, there are simply not enough observations.

The next thing to do is to construct a matrix. One axis will be the court locations, and the other will be the defendant cities. The value of each cell will be the number of citations issued for that pair of locations. One way to build it is sketched below.
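
A minimal sketch of one way to build that matrix with pandas' crosstab:

# court_location x defendant_city matrix of citation counts.
# pd.crosstab counts how often each pair of labels occurs together.
location_matrix = pd.crosstab(citations['court_location'], citations['defendant_city'])

print(location_matrix.shape)         # rows: court locations, columns: defendant cities
print(location_matrix.iloc[:5, :5])  # peek at the top-left corner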


In [ ]: