Fatal Encounters. Justified?

Problem

Using the Fatal Encounters dataset, build a classifier that takes text and other attributes as input and predicts the target variable "Official disposition of death (justified or other)".

Questions

  • What are the relevant socially sensitive attributes (including protected class attributes) in the dataset?
  • Does the data contain potentially discriminatory patterns? With respect to which definition of fairness?
  • Which features are being chosen to predict the target?

Get Data


In [7]:
import numpy as np
import pandas as pd
import seaborn as sns

from IPython.display import Markdown, display

sns.set_style("white")

%matplotlib inline

In [8]:
data = pd.read_csv("data/fatal_encounters_dataset.csv")
# clean data column names
data.columns = (
    data.columns
    .str.replace("'", "")
    .str.replace("[^a-zA-Z]", "_", regex=True)
    .str.replace("_+", "_", regex=True)
    .str.strip("_")
    .str.lower()
)
data = data[data.columns[~data.columns.str.startswith("unnamed")]]

In [9]:
def examine(df, n_sample=3):
    """Summarize each column (count/unique/mean/std) alongside a few
    randomly sampled values."""
    return (
        df.describe(include="all").T
        [["count", "unique", "mean", "std"]]
        .merge(
            df.apply(
                lambda s: s.sample(
                    n_sample, random_state=90).reset_index(drop=True))
                .T.rename(columns={
                    i: "sample_%s" % (i + 1) for i in range(n_sample)}),
            how="left", left_index=True, right_index=True))

examine(data, n_sample=2)


Out[9]:
count unique mean std sample_1 sample_2
comments 23428 NaN 11714.5 6763.23 620 11227
subjects_name 23429 22444 NaN NaN Richard Lawrence Holtz Tamon Robinson
subjects_age 22716 122 NaN NaN 45 27
subjects_gender 23365 4 NaN NaN Male Male
subjects_race 23428 7 NaN NaN African-American/Black African-American/Black
url_of_image_of_deceased 8133 8054 NaN NaN NaN http://i.huffpost.com/gen/584683/thumbs/o-TAMO...
date_of_injury_resulting_in_death_month_day_year 23429 6254 NaN NaN 12/07/2000 04/12/2012
location_of_injury_address 22910 21502 NaN NaN 399 Monmouth St. Seaview Avenue and E. 102nd St.
location_of_death_city 23391 5311 NaN NaN East Windsor Brooklyn
location_of_death_state 23428 51 NaN NaN NJ NY
location_of_death_zip_code 23239 NaN 58550.6 28092.8 8520 11236
location_of_death_county 23415 1442 NaN NaN Mercer Kings
full_address 23428 22148 NaN NaN 399 Monmouth St. East Windsor NJ 08520 Mercer Seaview Avenue and E. 102nd St. Brooklyn NY 11...
latitude 23427 NaN 36.7663 5.12076 40.2709 40.6363
longitude 23427 NaN -95.4954 16.3897 -74.5071 -73.886
agency_responsible_for_death 23154 6411 NaN NaN Fort Lee Police Department City of New York Police Department
cause_of_death 23428 14 NaN NaN Gunshot Vehicle
a_brief_description_of_the_circumstances_surrounding_the_death 23428 22317 NaN NaN Richard Holtz was suspected of stabbing to dea... Robinson was collecting paving stones from a y...
official_disposition_of_death_justified_or_other 23393 38 NaN NaN Justified Unreported
link_to_news_article_or_photo_of_official_document 23427 21751 NaN NaN http://www.nytimes.com/2000/12/10/nyregion/pol... http://www.nydailynews.com/new-york/brooklyn/c...
symptoms_of_mental_illness 23370 4 NaN NaN No No
video 1 1 NaN NaN NaN NaN
date_description 23394 22361 NaN NaN 12/7/2000: Richard Holtz was suspected of stab... 4/12/2012: Robinson was collecting paving ston...
26 NaN 23095.4 98.6392 NaN NaN
unique_identifier 23428 NaN 11714.5 6763.23 620 11227
date_year 23429 NaN 2009.54 5.0848 2000 2012

Exploration


In [10]:
# TARGET VARIABLE
JUSTIFIED = "official_disposition_of_death_justified_or_other"

# Sensitive attributes of interest
SENSITIVE_ATTRIBUTES = [
    "subjects_name",
    "subjects_age",
    "subjects_gender",
    "subjects_race",
    "url_of_image_of_deceased",
    "symptoms_of_mental_illness"
]

# Candidate (non-sensitive) predictive features
FEATURES = [
    "agency_responsible_for_death",
    "cause_of_death",
    "a_brief_description_of_the_circumstances_surrounding_the_death",
    "location_of_death_city",
    "location_of_death_state",
    "location_of_death_zip_code",
    "location_of_death_county",
]

In [11]:
def plot_categorical(s, top_n=15, **kwargs):
    # horizontal bar chart of the top_n most frequent values
    ax = s.value_counts().sort_values().tail(top_n).plot.barh(**kwargs)
    ax.set_xlabel("frequency")
    sns.despine()
    return ax


plot_categorical(data[JUSTIFIED], figsize=(8, 7));


Preprocess

Encode the target variable into three buckets: JUSTIFIED, OTHER, and UNKNOWN. For a first modeling pass, drop the UNKNOWN records.

Target Variable


In [12]:
# note: the misspelled variants below occur verbatim in the raw data
JUSTIFIED_STRINGS = [
    "Justified",
    "Justifed",
    "Jusified",
    "Justified by internal review",
    "Justified by outside agency",
    "Justified by District Attorney",
    "Other justified (Civilian board/Prosecutor/District Attorney/Coroner)"
]
UNKNOWN_STRINGS = [
    "Unreported",
    "Unknown",
]

RACE = "subjects_race"
GENDER = "subjects_gender"


def encode_target(s):
    if pd.isnull(s):
        return "UNKNOWN"
    s = s.strip()
    if s in JUSTIFIED_STRINGS:
        return "JUSTIFIED"
    elif s in UNKNOWN_STRINGS:
        return "UNKNOWN"
    else:
        return "OTHER"

gender_encoding_map = {
    "Female": "FEMALE",
    "Femalr": "FEMALE",  # typo in the raw data
    "Transgender": "TRANSGENDER",
    "Male": "MALE",
}
race_encoding_map = {
    "Race unspecified": "RACE_UNSPECIFIED",
    "European-American/White": "WHITE",
    "African-American/Black": "BLACK",
    "Hispanic/Latino": "LATINO",
    "Asian/Pacific Islander": "ASIAN_PACIFIC_ISLANDER",
    "Native American/Alaskan": "NATIVE_AMERICAN_ALASKAN",
    "Middle Eastern": "MIDDLE_EASTERN",
}

clean_data = data.copy()
clean_data[JUSTIFIED] = data[JUSTIFIED].map(encode_target)
clean_data[GENDER] = data[GENDER].map(gender_encoding_map)
clean_data[RACE] = data[RACE].map(race_encoding_map)

# exclude records with UNKNOWN disposition or RACE_UNSPECIFIED race
clean_data = clean_data[clean_data[JUSTIFIED] != "UNKNOWN"]
clean_data = clean_data[clean_data[RACE] != "RACE_UNSPECIFIED"]

clean_data[JUSTIFIED].value_counts().to_frame()


Out[12]:
official_disposition_of_death_justified_or_other
OTHER 5986
JUSTIFIED 2957

Sensitive Attributes: Gender and Race


In [13]:
clean_data.subjects_gender.value_counts().to_frame()


Out[13]:
subjects_gender
MALE 8120
FEMALE 806
TRANSGENDER 6

In [14]:
clean_data.subjects_race.value_counts().to_frame()


Out[14]:
subjects_race
WHITE 4433
BLACK 2616
LATINO 1550
ASIAN_PACIFIC_ISLANDER 179
NATIVE_AMERICAN_ALASKAN 136
MIDDLE_EASTERN 29

Features


In [15]:
examine(clean_data[FEATURES])


Out[15]:
count unique mean std sample_1 sample_2 sample_3
agency_responsible_for_death 8933 3417 NaN NaN Davenport Police Department Madison Police Department. Corona Police Department
cause_of_death 8943 14 NaN NaN Vehicle Gunshot Gunshot
a_brief_description_of_the_circumstances_surrounding_the_death 8943 8642 NaN NaN Just after 7 a.m., police began following a Je... Charles Carll was armed with a knife and repor... A man was shot and killed by a Corona police o...
location_of_death_city 8933 3021 NaN NaN Davenport Madison Norco
location_of_death_state 8943 51 NaN NaN IA WI CA
location_of_death_zip_code 8897 NaN 59714.8 27746 52802 53711 92880
location_of_death_county 8935 1106 NaN NaN Scott Dane Riverside

In [16]:
clean_data.cause_of_death.value_counts().to_frame()


Out[16]:
cause_of_death
Gunshot 7534
Vehicle 758
Tasered 266
Asphyxiated/Restrained 99
Beaten/Bludgeoned with instrument 71
Medical emergency 70
Other 35
Drug overdose 32
Drowned 16
Stabbed 16
Fell from a height 15
Undetermined 14
Chemical agent/Pepper spray 10
Burned/Smoke inhalation 7

TODO: tokenize a_brief_description_of_the_circumstances_surrounding_the_death so that the text is represented as a word vector (rough sketch below).
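A minimal sketch of what that tokenization might look like with scikit-learn's TfidfVectorizer (the vectorizer settings here are illustrative assumptions, not tuned choices):

from sklearn.feature_extraction.text import TfidfVectorizer

DESCRIPTION = "a_brief_description_of_the_circumstances_surrounding_the_death"

# bag-of-words representation of the free-text descriptions;
# missing descriptions become empty strings
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
description_vectors = vectorizer.fit_transform(
    clean_data[DESCRIPTION].fillna(""))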

Assess Potentially Discriminatory (PD) Patterns

Compute the mean_difference score (see the sketch after this list) for each of the following sensitive attributes:

  • subjects_gender
  • subjects_race
  • symptoms_of_mental_illness
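
For reference, here is a minimal sketch of the statistic I believe themis_ml.metrics.mean_difference computes (the library also returns a confidence interval; treat this as an assumption about the core idea, not the actual implementation):

import numpy as np

def mean_difference_sketch(y, s):
    """Rate of the positive outcome (y == 1) in the non-protected
    group (s == 0) minus the rate in the protected group (s == 1)."""
    y, s = np.asarray(y, dtype=float), np.asarray(s)
    return y[s == 0].mean() - y[s == 1].mean()

Under this convention, a positive value means the protected group receives the positive outcome less often.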

In [17]:
from themis_ml.metrics import mean_difference


def report_mean_difference(y, s_list):
    report = []
    index = []
    for s_name, s in s_list:
        s_notnull = s.notnull()
        # scale the mean difference and its confidence bounds to percentages
        report.append(
            [x * 100 for x in mean_difference(y[s_notnull], s[s_notnull])])
        index.append("{s_name} vs. NOT {s_name}".format(s_name=s_name))
    return pd.DataFrame(
        report, columns=["mean difference", "lower bound", "upper bound"],
        index=index)


is_justified = clean_data[JUSTIFIED] == "JUSTIFIED"
gender_vectors = [
    (g, (clean_data.subjects_gender == g).astype(int))
    for g in clean_data.subjects_gender.dropna().unique()]
gender_report = report_mean_difference(is_justified, gender_vectors)
gender_report


Out[17]:
mean difference lower bound upper bound
MALE vs. NOT MALE -13.666524 -17.028157 -10.304891
FEMALE vs. NOT FEMALE 13.159141 9.764708 16.553574
TRANSGENDER vs. NOT TRANSGENDER 33.087166 -4.569341 70.743672

In [18]:
def plot_report(report):
    # error bars span the confidence interval around the mean difference
    margin = (report["mean difference"] - report["lower bound"]).abs()
    ax = report[["mean difference"]].plot(
        kind="barh", xerr=margin, legend=False)
    ax.axvline(0, color="k")
    ax.set_xlabel("mean difference")
    sns.despine(bottom=True, left=True)
    
plot_report(gender_report)


If the mean difference is negative with respect to some sensitive attribute value $s \in \{d, a\}$ and some outcome $y \in \{y^{+}, y^{-}\}$, it implies that members of the putatively disadvantaged class $d$ experience the beneficial outcome $y^{+}$ more often than the advantaged class $a$.

Conversely, if the mean difference is positive, it implies that members of the putatively disadvantaged class $d$ experience the harmful outcome $y^{-}$ more often than the advantaged class $a$.
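
Concretely, for the MALE row in the table above, with $y^{+}$ = JUSTIFIED and MALE encoded as $s = 1$:

$$\mathrm{MD}_{\mathrm{MALE}} = p(y^{+} \mid \mathrm{NOT\ MALE}) - p(y^{+} \mid \mathrm{MALE}) \approx -13.7\%$$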

Interestingly, MALEs experience JUSTIFIED fatal encounters more often than their NOT MALE counterparts.


In [19]:
race_vectors = [
    (r, (clean_data.subjects_race == r).astype(int))
    for r in clean_data.subjects_race.dropna().unique()]
race_report = report_mean_difference(is_justified, race_vectors)
race_report


Out[19]:
mean difference lower bound upper bound
BLACK vs. NOT BLACK -6.755053 -8.894059 -4.616047
WHITE vs. NOT WHITE 3.434001 1.484907 5.383095
NATIVE_AMERICAN_ALASKAN vs. NOT NATIVE_AMERICAN_ALASKAN 3.709616 -4.258527 11.677759
ASIAN_PACIFIC_ISLANDER vs. NOT ASIAN_PACIFIC_ISLANDER 0.106199 -6.856564 7.068962
LATINO vs. NOT LATINO 3.239307 0.663956 5.814657
MIDDLE_EASTERN vs. NOT MIDDLE_EASTERN 5.496584 -11.655494 22.648663

In [20]:
plot_report(race_report)



In [21]:
mental_illness_vectors = [
    (r, (clean_data.symptoms_of_mental_illness == r).astype(int))
    for r in clean_data.symptoms_of_mental_illness.dropna().unique()]
mental_illness_report = report_mean_difference(
    is_justified, mental_illness_vectors)
mental_illness_report


Out[21]:
mean difference lower bound upper bound
Unknown vs. NOT Unknown 2.123112 -0.552404 4.798628
Drug or alcohol use vs. NOT Drug or alcohol use -8.490975 -12.735800 -4.246151
No vs. NOT No 6.085401 4.053744 8.117058
Yes vs. NOT Yes -10.050927 -12.851683 -7.250171

In [22]:
plot_report(mental_illness_report)


Interestingly, MALE and BLACK people experience JUSTIFIED fatal encounters more often than their NOT MALE and NOT BLACK counterparts, respectively.

This leads me to suspect that the official_disposition_of_death_justified_or_other labels are somehow skewed against these two sensitive attribute values.

WHO LABELLED THESE RECORDS?

Train Models

Train a logistic regression baseline and themis-ml's LinearACFClassifier to predict JUSTIFIED = 1, OTHER = 0.
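
LinearACFClassifier is themis-ml's linear "additive counterfactually fair" model. As I understand it (an assumption about the library's internals, in the spirit of Kusner et al.'s counterfactual fairness), it regresses each input feature on the sensitive attribute and fits the classifier on the residuals, which is why it needs s at both fit and predict time. A minimal sketch of that idea, not the library's actual code:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def acf_residuals(X, s):
    # remove the component of each feature that is linearly
    # predictable from the sensitive attribute s
    s = np.asarray(s).reshape(-1, 1)
    return X - LinearRegression().fit(s, X).predict(s)

# hypothetical usage: train on residuals instead of the raw features
# clf = LogisticRegression().fit(acf_residuals(X_train, s_train), y_train)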


In [23]:
import numpy as np
import pandas as pd

from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from themis_ml.linear_model import LinearACFClassifier

In [24]:
FAIRNESS_UNAWARE_FEATURES = [
    ("subjects_age", "NUMERIC"),
    ("subjects_gender", "CATEGORICAL"),
    ("subjects_race", "CATEGORICAL"),
    ("symptoms_of_mental_illness", "CATEGORICAL"),
    ("agency_responsible_for_death", "CATEGORICAL"),
    ("cause_of_death", "CATEGORICAL"),
    ("location_of_death_city", "CATEGORICAL"),
    ("location_of_death_state", "CATEGORICAL"),
    ("location_of_death_zip_code", "CATEGORICAL"),
    ("location_of_death_county", "CATEGORICAL"),
]


training_data = []
for feature, dtype in FAIRNESS_UNAWARE_FEATURES:
    if dtype == "NUMERIC":
        # strip non-digit characters from ages, then mean-impute missing values
        f = clean_data[feature].str.replace(
            "[^0-9]", "", regex=True).astype(float)
        training_data.append(f.fillna(f.mean()))
    elif dtype == "CATEGORICAL":
        # one-hot encode, with an explicit NULL level for missing values
        training_data.append(
            pd.get_dummies(clean_data[[feature]].fillna("NULL")))
training_data = pd.concat(training_data, axis=1)
features = training_data.columns
training_data = training_data.assign(
    target=(clean_data[JUSTIFIED] == "JUSTIFIED").astype(int))
assert training_data.notnull().all().all()
training_data.head()


Out[24]:
subjects_age subjects_gender_FEMALE subjects_gender_MALE subjects_gender_NULL subjects_gender_TRANSGENDER subjects_race_ASIAN_PACIFIC_ISLANDER subjects_race_BLACK subjects_race_LATINO subjects_race_MIDDLE_EASTERN subjects_race_NATIVE_AMERICAN_ALASKAN ... location_of_death_county_Yalobusha location_of_death_county_Yamhill location_of_death_county_Yavapai location_of_death_county_Yellowstone location_of_death_county_Yolo location_of_death_county_York location_of_death_county_Yuba location_of_death_county_Yuma location_of_death_county_kings target
4 45.0 0 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
5 20.0 0 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
8 19.0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
15 35.0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 36.0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 13016 columns


In [32]:
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10)
estimators = [
    ("logistic_regression", LogisticRegression()),
    ("linear_acf", LinearACFClassifier()),
]
X = training_data[features].values
y = training_data["target"].values
s = training_data["subjects_race_BLACK"].values
# stratify folds on the joint target x sensitive-attribute distribution
strata = training_data["target"].astype(int).astype(str).str.cat(
    training_data["subjects_race_BLACK"].astype(int).astype(str), sep="_")
preds = []
for i, (train, test) in enumerate(cv.split(X, strata)):
    print(".", end="")
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    s_train, s_test = s[train], s[test]
    for est_name, estimator in estimators:
        # the linear ACF model takes the sensitive attribute at both
        # fit and predict time; logistic regression does not
        fit_args = (X_train, y_train, s_train) if est_name == "linear_acf" \
            else (X_train, y_train)
        predict_args = (X_test, s_test) if est_name == "linear_acf" \
            else (X_test, )
        estimator.fit(*fit_args)
        preds.append(
            pd.DataFrame({
                "pred_y": estimator.predict_proba(*predict_args)[:, 1],
                "pred_label": estimator.predict(*predict_args).astype(int),
                "true_y": y_test.astype(int),
                "sensitive_attribute": s_test,
                "rep_fold": i,
                "estimator": est_name,
            }))
preds = pd.concat(preds)


 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [76]:
def compute_metrics(df):
    """Per-fold accuracy and mean difference of the predicted labels."""
    accuracy = accuracy_score(df.true_y, df.pred_label)
    mean_diff, _, _ = mean_difference(df.pred_label, df.sensitive_attribute)
    return pd.Series({
        "accuracy": accuracy,
        "mean difference": mean_diff,
    })

metrics = (
    preds
    .groupby(["estimator", "rep_fold"])
    .apply(compute_metrics)
    .reset_index(0)
    .pipe(pd.melt, id_vars="estimator", var_name="metric",
          value_name="value")
)
sns.factorplot(
    x="value", y="estimator",
    hue="metric",
    row="metric",
    sharex=False,
    data=metrics,
    size=3, aspect=1.5,
    join=False);



In [79]:
(
    metrics
    .groupby(["metric", "estimator"])
    .agg(["mean", "std"]))


Out[79]:
value
mean std
metric estimator
accuracy linear_acf 0.663043 0.006527
logistic_regression 0.701297 0.005744
mean difference linear_acf -0.074298 0.025480
logistic_regression -0.111183 0.022025

The fairness-aware linear ACF model shrinks the mean difference of the predictions by roughly a third (-0.074 vs. -0.111) at the cost of about four percentage points of accuracy (0.663 vs. 0.701).