In [1]:
%matplotlib inline
import as read
import diogenes.display as display
import diogenes.modify as modify
import diogenes.utils as utils
import diogenes.grid_search as grid_search
import numpy as np


  1. Data obtained from the Citizens Police Data Project.
  2. This data includes only the FOIA dataset from 2011 to present (i.e. the Bond and Moore datasets have been removed).
  3. This was accomplished by entering FOIA in the search bar.
  4. The resulting table was saved to GitHub as a .xslx.
  5. The Allegations, Complaining Witnesses, and Officer Profile tabs were then saved as allegations.csv, citizens.csv, and officers.csv respectively.


The following disclaimer is included with the data by the Invisible Institute.

This dataset is compiled from three lists of allegations against Chicago Police Department officers, spanning approximately 2002 - 2008 and 2010 - 2014, produced by the City of Chicago in response to litigation and to FOIA requests.

The City of Chicago's production of this information is accompanied by a disclaimer that not all information contained in the City's database may be correct.

No independent verification of the City's records has taken place and this dataset does not purport to be an accurate reflection of either the City's database or its veracity.

In [120]:
#Record arrays

allegations = read.open_csv_url('',parse_datetimes=['IncidentDate','StartDate','EndDate'])
citizens = read.open_csv_url('')
officers = read.open_csv_url('')

What data do we have?

We can see the column names for the three tables below.

  1. The Allegations table includes data on each allegation, including an ID for the complaint witness, the officer, and the outcome of the allegation.
  2. The Citizens table includes additional information for each complaint witness.
  3. The Officers table includes additional information for each officer.

In [121]:
#I shouldn't have to nest function calls just to get a summary of my data. This needs to be a single call.
#Most of the data isn't numeric, so we should find a way to be more helpful than this.
#Also, what is the "None" printing at the end of this?
print display.pprint_sa(display.describe_cols(allegations))
print display.pprint_sa(display.describe_cols(citizens))
print display.pprint_sa(display.describe_cols(officers))

          Column Name   Count          Mean Standard Dev Minimum     Maximum
 0               CRID 28575.0 1072720.56308 916800.77455  1074.0 107000321.0
 1          OfficerID 28575.0           nan          nan     1.0      9172.0
 2        OfficeFirst     nan           nan          nan     nan         nan
 3        OfficerLast     nan           nan          nan     nan         nan
 4     AllegationCode     nan           nan          nan     nan         nan
 5           Category     nan           nan          nan     nan         nan
 6         Allegation     nan           nan          nan     nan         nan
 7 RecommendedFinding     nan           nan          nan     nan         nan
 8 RecommendedOutcome 28575.0           nan          nan     0.0       800.0
 9       FinalFinding     nan           nan          nan     nan         nan
10       FinalOutcome 28575.0           nan          nan     0.0       999.0
11            Finding     nan           nan          nan     nan         nan
12            Outcome     nan           nan          nan     nan         nan
13               Beat 28575.0           nan          nan     nan         nan
14           Location     nan           nan          nan     nan         nan
15               Add1 28575.0           nan          nan     nan         nan
16               Add2     nan           nan          nan     nan         nan
17               City     nan           nan          nan     nan         nan
18       IncidentDate     nan           nan          nan     nan         nan
19          StartDate     nan           nan          nan     nan         nan
20            EndDate     nan           nan          nan     nan         nan
21       Investigator     nan           nan          nan     nan         nan
  Column Name   Count          Mean  Standard Dev   Minimum     Maximum
0        CRID 17016.0 1083452.33662 1429359.24633 1043909.0 107000321.0
1      Gender     nan           nan           nan       nan         nan
2        Race     nan           nan           nan       nan         nan
   Column Name  Count          Mean  Standard Dev Minimum Maximum
0    OfficerID 7743.0 4319.36820354 2511.64217192     1.0  9172.0
1 OfficerFirst    nan           nan           nan     nan     nan
2  OfficerLast    nan           nan           nan     nan     nan
3       Gender    nan           nan           nan     nan     nan
4         Race    nan           nan           nan     nan     nan
5     ApptDate 7743.0           nan           nan 20323.0 42058.0
6         Unit 7743.0           nan           nan     1.0   712.0
7         Rank    nan           nan           nan     nan     nan
8         Star 7743.0           nan           nan     0.0 21797.0

For this analysis, we will be removing several columns for the following reasons:

  1. To anonymize our data, names of officers and investiagtors have been removed.
  2. Many of the columns in Allegations are redundant as they code for other columns. We will preserve only the human readable columns.
  3. The Beat column has no data, so it will be removed.
  4. We will only focus on final outcomes, so the "recommended" columns have been removed from Allegations.
  5. We will be limiting our geographic analysis to Location, so the address information has been removed.

We will also translate ApptDate, which specifies the number of days between the hire date and 1900-1-1, to the number of years working.

In [122]:
import datetime

#TODO: there is a typo in the "OfficerFirst" column in allegations.
#Should pass this on to Kalven at Invisible Institute along with questions about data.
allegations = utils.remove_cols(allegations,['OfficeFirst','OfficerLast','Investigator','AllegationCode','RecommendedFinding','RecommendedOutcome','FinalFinding','FinalOutcome','Beat','Add1','Add2','City'])  
officers = utils.remove_cols(officers,['OfficerFirst','OfficerLast','Star'])

#Convert appointment date days since 1900-1-1 to years prior to today
def tenure(vector):
    today = datetime.datetime.strftime(,'%Y-%m-%d')
    started = np.add(np.datetime64('1900-01-01'),map(lambda x: np.timedelta64(int(x), 'D'),vector))
    tenure = np.subtract(np.datetime64(today),started)
    return np.divide(tenure,np.timedelta64(1,'D')) / 365

#Impute median date for missing values
officers['ApptDate'] = modify.replace_missing_vals(officers['ApptDate'], strategy='median')
tenure_days = modify.combine_cols(officers,tenure,['ApptDate'])
officers = utils.append_cols(officers,[tenure_days],['Tenure'])

For ease of use, let's join our tables.

In [123]:
master = utils.join(allegations,citizens,'left',['CRID'],['CRID'])

#Rename Race and Gender, since citizens and officers have these columns
temp_col_names = list(master.dtype.names)
gender_index = temp_col_names.index("Gender")
race_index = temp_col_names.index("Race")
temp_col_names[gender_index] = "CitizenGender"
temp_col_names[race_index] = "CitizenRace"
master.dtype.names = tuple(temp_col_names)

master = utils.join(master,officers,'left',['OfficerID'],['OfficerID'])

temp_col_names = list(master.dtype.names)
gender_index = temp_col_names.index("Gender")
race_index = temp_col_names.index("Race")
temp_col_names[gender_index] = "OfficerGender"
temp_col_names[race_index] = "OfficerRace"
master.dtype.names = tuple(temp_col_names)

There are some allegations where no officer ID was provided. For this analysis, we will discard those allegations.

In [124]:
#This is a pretty awkward way to remove nan, is there a better way I missed?
master = modify.choose_rows_where(master,[{'func': modify.row_val_between, 'col_name': 'OfficerID', 'vals': [-np.inf,np.inf]}])

Now, let's encode our data numerically

In [125]:
#Unit is interpreted as numeric, but we really want to analyze it categorically
#There should be an easier way to treat a numeric column as categorical data
master = utils.append_cols(master,master['Unit'].astype('|S10'),['UnitCat'])
master = utils.remove_cols(master,['Unit'])
master_data, master_classes = modify.label_encode(master)

For convenience, we'll build every possible categorical directive

In [126]:

def cat_directives(array,classes):
    cat_directives = {}
    for column in classes:
        cat_directives[column] = {v:[{'func': modify.row_val_eq, 'col_name': column, 'vals': i}] for i,v in enumerate(classes[column])}
    return cat_directives

where = cat_directives(master_data,master_classes)

Now, we can build intuitive masks as combinations of our human-readable directives

In [127]:

female_officers = modify.where_all_are_true(master_data,where['OfficerGender']['F'])
male_officers = modify.where_all_are_true(master_data,where['OfficerGender']['M'])

female_citizens = modify.where_all_are_true(master_data,where['CitizenGender']['F'])
male_citizens = modify.where_all_are_true(master_data,where['CitizenGender']['M'])

white_officers = modify.where_all_are_true(master_data,where['OfficerRace']['White'])
black_officers = modify.where_all_are_true(master_data,where['OfficerRace']['Black'])
hispanic_officers = modify.where_all_are_true(master_data,where['OfficerRace']['Hispanic'])

white_citizens = modify.where_all_are_true(master_data,where['CitizenRace']['White'])
black_citizens = modify.where_all_are_true(master_data,where['CitizenRace']['Black'])
hispanic_citizens = modify.where_all_are_true(master_data,where['CitizenRace']['Hispanic'])

white_M_officers_black_F_citizens = modify.where_all_are_true(master_data,where['OfficerRace']['White']+

Let's generate a potentially interesting new feature from our existing data, and pull out all non-numeric data

In [128]:
duration = modify.combine_cols(master_data,np.subtract,['EndDate','StartDate'])
durationDays = duration / np.timedelta64(1, 'D')
duration_data = utils.append_cols(master_data,[durationDays],['InvestigationDuration'])
numeric_data = utils.remove_cols(master_data,['StartDate','EndDate','IncidentDate'])

We understand what data we have, and we have some tools to easily slice and dice. Let's dive in and learn something.

In [129]:
#Ex 1: What percentage of allegations have a black female citizen and a white male officer?
print np.sum(white_M_officers_black_F_citizens.astype(np.float))/np.size(white_M_officers_black_F_citizens.astype(np.float))


In [130]:
#Ex 2: What is the breakdown of officers with complaints by race?
#This seems a little clunky to me
#Would be nice if plot_simple_histogram could handle categorical labels for me
display.plt.xticks(range(len(master_classes['OfficerRace'])), master_classes['OfficerRace'])

([<matplotlib.axis.XTick at 0x10a668b90>,
  <matplotlib.axis.XTick at 0x10f7ea8d0>,
  <matplotlib.axis.XTick at 0x110052cd0>,
  <matplotlib.axis.XTick at 0x10fc28810>,
  <matplotlib.axis.XTick at 0x10fc28d90>,
  <matplotlib.axis.XTick at 0x10fba32d0>],
 <a list of 6 Text xticklabel objects>)

In [131]:
#Ex 3: What does the distribution of complaints look like?
complaint_counter = display.Counter(numeric_data['OfficerID']) 
officer_list, complaint_counts = zip(*complaint_counter.items())    


In [132]:
#Ex 4: What can we learn from the 100 officers who receive the most complaints?
#FYI: Wikipedia says 12,244 officers total, so this is roughly the top 1% of all Chicago officers.
#Obviously, all officers do not have the same quantity and quality of interactions with citizens.
#Need to account for this fact for any real analysis.
#Median imputation makes histogram look unnatural
#Top 100 Officers
top_100 = counts.most_common(100)
top_100_officers = map(lambda x: x[0],top_100)

#We should add this to for categorical data
def row_val_in(M,col_name,boundary):
    return [x in boundary for x in M[col_name]]

top_100_profile = modify.choose_rows_where(officers,[{'func': row_val_in, 'col_name': 'OfficerID', 'vals': top_100_officers}])

#Can't check this against CPDB, their allegation counts are for the whole time period
#Not just 2011 - present.


In [12]:
#Ex 5: What does the distribution of outcomes look like?
#Hastily written, possibly not useful. Just curious.
#Almost everything is unknown or no action taken
def sortedFrequencies(array,classes,col_name):
    if col_name not in classes:
        raise ValueError('col_name must be categorical')
    counts = display.Counter(array[col_name])
    total = float(sum(counts.values()))
    for key in counts:
         counts[key] /= total
    count_dict = {}    
    for value in counts:
        count_dict[classes[col_name][value]] = counts[value]
    return sorted(count_dict.items(), key=lambda x: x[1],reverse=True)

print sortedFrequencies(numeric_data,master_classes,'Outcome')

[('No Action Taken', 0.7631110897466712), ('Unknown', 0.19896168821804547), ('Reprimand', 0.010623467769071769), ('Violation Noted', 0.006874008556458203), ('1 Day Suspension', 0.00475892900062491), ('Resigned', 0.0038936691823294716), ('5 Day Suspension', 0.002355429505359804), ('2 Day Suspension', 0.002018939576022689), ('Penalty Not Served', 0.0016343796567802721), ('3 Day Suspension', 0.001201749747632553), ('10 Day Suspension', 0.0007691198384848339), ('15 Day Suspension', 0.0007210498485795318), ('', 0.0004326299091477191), ('20 Day Suspension', 0.0004326299091477191), ('30 Day Suspension', 0.00038455991924241697), ('25 Day Suspension', 0.00033648992933711484), ('Administrative Termination', 0.00033648992933711484), ('4 Day Suspension', 0.0002884199394318127), ('Separation', 0.00014420996971590636), ('45 Day Suspension', 0.00014420996971590636), ('Suspended over 30 Days', 0.00014420996971590636), ('Reinstated by Police Board', 9.613997981060424e-05), ('21 Day Suspension', 9.613997981060424e-05), ('7 Day Suspension', 4.806998990530212e-05), ('23 Day Suspension', 4.806998990530212e-05), ('12 Day Suspension', 4.806998990530212e-05), ('8 Day Suspension', 4.806998990530212e-05), ('6 Day Suspension', 4.806998990530212e-05)]

In [133]:
#Ex 6: What has the number of complaints over time been like?
#Looks seasonal (peaking in summer), and declining over time (coud the decline just be a collection issue?)
def numpy_to_month(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    dt = datetime.datetime.utcfromtimestamp(ts)
    d =, dt.month, 1) #round to month
    return d

months, counts = zip(*display.Counter(map(numpy_to_month,duration_data['IncidentDate'])).items())


[<matplotlib.lines.Line2D at 0x11200d890>]

In [134]:
#How does it look to split complaints by location?
#Very disproportionate. Locations 17,19,3,4 have almost all complaints.
display.plt.xticks(range(len(master_classes['Location'])), master_classes['Location'])
#Still uneven, but more even than location.
display.plt.xticks(range(len(master_classes['UnitCat'])), master_classes['UnitCat'])

([<matplotlib.axis.XTick at 0x10aab6710>,
  <matplotlib.axis.XTick at 0x10abc1f50>,
  <matplotlib.axis.XTick at 0x10fb57610>,
  <matplotlib.axis.XTick at 0x10c7c6b10>,
  <matplotlib.axis.XTick at 0x10c7d0290>,
  <matplotlib.axis.XTick at 0x10c7d09d0>,
  <matplotlib.axis.XTick at 0x10c7da150>,
  <matplotlib.axis.XTick at 0x10c7da890>,
  <matplotlib.axis.XTick at 0x10c7dafd0>,
  <matplotlib.axis.XTick at 0x10c7e6750>,
  <matplotlib.axis.XTick at 0x10c7e6e90>,
  <matplotlib.axis.XTick at 0x10c7ef610>,
  <matplotlib.axis.XTick at 0x10c7efd50>,
  <matplotlib.axis.XTick at 0x10c7f94d0>,
  <matplotlib.axis.XTick at 0x10c7f9c10>,
  <matplotlib.axis.XTick at 0x10c804390>,
  <matplotlib.axis.XTick at 0x10c804ad0>,
  <matplotlib.axis.XTick at 0x10d2cb250>,
  <matplotlib.axis.XTick at 0x10d2cb990>,
  <matplotlib.axis.XTick at 0x10d2d5110>,
  <matplotlib.axis.XTick at 0x10d2d5850>,
  <matplotlib.axis.XTick at 0x10d2d5f90>,
  <matplotlib.axis.XTick at 0x10d2e1710>,
  <matplotlib.axis.XTick at 0x10d2e1e50>,
  <matplotlib.axis.XTick at 0x10d2e95d0>,
  <matplotlib.axis.XTick at 0x10d2e9d10>,
  <matplotlib.axis.XTick at 0x10d2f2490>,
  <matplotlib.axis.XTick at 0x10d2f2bd0>,
  <matplotlib.axis.XTick at 0x10d2fd350>,
  <matplotlib.axis.XTick at 0x10d2fda90>,
  <matplotlib.axis.XTick at 0x10d308210>,
  <matplotlib.axis.XTick at 0x10d308950>,
  <matplotlib.axis.XTick at 0x10d3120d0>,
  <matplotlib.axis.XTick at 0x10d312810>,
  <matplotlib.axis.XTick at 0x10d312f50>,
  <matplotlib.axis.XTick at 0x10d31c6d0>,
  <matplotlib.axis.XTick at 0x10d31ce10>,
  <matplotlib.axis.XTick at 0x10d327590>,
  <matplotlib.axis.XTick at 0x10d327cd0>,
  <matplotlib.axis.XTick at 0x10d330450>,
  <matplotlib.axis.XTick at 0x10d330b90>,
  <matplotlib.axis.XTick at 0x10d33d310>,
  <matplotlib.axis.XTick at 0x10d33da50>,
  <matplotlib.axis.XTick at 0x10d3af1d0>,
  <matplotlib.axis.XTick at 0x10d3af910>,
  <matplotlib.axis.XTick at 0x10d3b7090>,
  <matplotlib.axis.XTick at 0x10d3b77d0>,
  <matplotlib.axis.XTick at 0x10d3b7f10>,
  <matplotlib.axis.XTick at 0x10d3c2690>,
  <matplotlib.axis.XTick at 0x10d3c2dd0>,
  <matplotlib.axis.XTick at 0x10d3cc550>,
  <matplotlib.axis.XTick at 0x10d3ccc90>,
  <matplotlib.axis.XTick at 0x10d3d8410>,
  <matplotlib.axis.XTick at 0x10d3d8b50>,
  <matplotlib.axis.XTick at 0x10d3e12d0>,
  <matplotlib.axis.XTick at 0x10d3e1a10>,
  <matplotlib.axis.XTick at 0x10d3ec190>,
  <matplotlib.axis.XTick at 0x10d3ec8d0>,
  <matplotlib.axis.XTick at 0x10d3f6050>,
  <matplotlib.axis.XTick at 0x10d3f6790>,
  <matplotlib.axis.XTick at 0x10d3f6ed0>,
  <matplotlib.axis.XTick at 0x10d401650>,
  <matplotlib.axis.XTick at 0x10d401d90>,
  <matplotlib.axis.XTick at 0x10d40b510>,
  <matplotlib.axis.XTick at 0x10d40bc50>,
  <matplotlib.axis.XTick at 0x10d4153d0>,
  <matplotlib.axis.XTick at 0x10d415b10>,
  <matplotlib.axis.XTick at 0x10d41f290>,
  <matplotlib.axis.XTick at 0x10d41f9d0>,
  <matplotlib.axis.XTick at 0x10d42a150>,
  <matplotlib.axis.XTick at 0x10d42a890>,
  <matplotlib.axis.XTick at 0x10d42afd0>,
  <matplotlib.axis.XTick at 0x10d434750>,
  <matplotlib.axis.XTick at 0x10d434e90>,
  <matplotlib.axis.XTick at 0x10d43f610>,
  <matplotlib.axis.XTick at 0x10d43fd50>,
  <matplotlib.axis.XTick at 0x10d4494d0>,
  <matplotlib.axis.XTick at 0x10d449c10>,
  <matplotlib.axis.XTick at 0x10d452390>,
  <matplotlib.axis.XTick at 0x10d452ad0>,
  <matplotlib.axis.XTick at 0x10d45e250>,
  <matplotlib.axis.XTick at 0x10d45e990>,
  <matplotlib.axis.XTick at 0x10d468110>,
  <matplotlib.axis.XTick at 0x10d468850>,
  <matplotlib.axis.XTick at 0x10d468f90>,
  <matplotlib.axis.XTick at 0x10fae7710>,
  <matplotlib.axis.XTick at 0x10fae7e50>,
  <matplotlib.axis.XTick at 0x10faf05d0>,
  <matplotlib.axis.XTick at 0x10faf0d10>,
  <matplotlib.axis.XTick at 0x10fafa490>,
  <matplotlib.axis.XTick at 0x10fafabd0>,
  <matplotlib.axis.XTick at 0x10fb04350>,
  <matplotlib.axis.XTick at 0x10fb04a90>,
  <matplotlib.axis.XTick at 0x10fb11210>,
  <matplotlib.axis.XTick at 0x10fb11950>,
  <matplotlib.axis.XTick at 0x10fb1a0d0>,
  <matplotlib.axis.XTick at 0x10fb1a810>,
  <matplotlib.axis.XTick at 0x10fb1af50>,
  <matplotlib.axis.XTick at 0x10fc656d0>,
  <matplotlib.axis.XTick at 0x10fc65e10>,
  <matplotlib.axis.XTick at 0x10fc70590>,
  <matplotlib.axis.XTick at 0x10fc70cd0>,
  <matplotlib.axis.XTick at 0x10fc7b450>,
  <matplotlib.axis.XTick at 0x10fc7bb90>,
  <matplotlib.axis.XTick at 0x10fc85310>,
  <matplotlib.axis.XTick at 0x10fc85a50>,
  <matplotlib.axis.XTick at 0x10fc901d0>,
  <matplotlib.axis.XTick at 0x10fc90910>,
  <matplotlib.axis.XTick at 0x10fc9b090>,
  <matplotlib.axis.XTick at 0x10fc9b7d0>,
  <matplotlib.axis.XTick at 0x10fc9bf10>,
  <matplotlib.axis.XTick at 0x110243690>,
  <matplotlib.axis.XTick at 0x110243dd0>],
 <a list of 113 Text xticklabel objects>)

In [135]:
#Are there officers getting a lot of complaints not from the high yield locations?
#What does the social network of concomitant officers look like?

In [ ]: