In this tutorial, we'll discuss how to formulate a research question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn
package in Python to model the data you're familiar with from previous tutorials.
This tutorial is based on chapter 6 of Big Data and Social Science.
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with numpy, pandas, and psycopg2 from previous tutorials. Here we'll also be using scikit-learn to fit models.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from sklearn.metrics import precision_recall_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier,
AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
sns.set_style("white")
In [ ]:
db_name =
hostname =
conn = psycopg2.connect(database=db_name, host = hostname) #database connection
The database connection allows us to make queries to a database from Python.
In [ ]:
df_tables = pd.read_sql("""SELECT table_schema, table_name
FROM information_schema.tables
order by table_schema, table_name;""", conn)
In [ ]:
df_tables.head()
The table information_schema.tables contains information about all tables in the database. This comes in handy when you forget a table name.
First, we need to turn our question into a real objective function. What do you care about? Do you have data on that thing? What action can you take based on your findings? Do you risk introducing any bias based on the way you model something?
Of all prisoners released, we would like to predict who is likely to re-enter jail within 5 years of the day we make our prediction. For instance, say it is Dec 31, 2010 and we want to identify which prisoners are likely to re-enter jail between now and 2015. We can run our predictive model and identify who is most at risk of recidivism. This is an example of a binary classification problem.
Note that the outcome window of 5 years is completely arbitrary. You could use a window of 5, 3, or 1 years, or even 1 day. The right outcome window depends on how often you receive new data (there is no sense in making the same predictions on the same data), how accurate your predictions are for a given time period, and the time scale on which you can act on the model's output.
We have already explored the data in the first module and the database modules. In order to predict recidivism, we will be using data from the ildoc_admit and ildoc_exit tables to create labels and features.
We need to munge our dataset into features (predictors, independent variables, or $X$ variables) and labels (outcomes, dependent variables, or $Y$ variables). For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "X_", and names of variables that pertain to outcome variables will start with "y_".
Labels are the dependent variables, or $Y$ variables, that we are trying to predict. In the machine learning framework, your labels are usually binary: true or false, encoded as 1 or 0.
We can write SQL code in psql, dbeaver, or pgAdmin, or programmatically generate the SQL and pass it to the database using psycopg2, to create the labels. Here is the code for developing labels:
In [ ]:
def create_labels(as_of_today, prediction_range, conn, overwrite=False):
    """
    Generate the table of labels and return it as a dataframe.

    Parameters
    ----------
    as_of_today: int
        year to make the prediction from
        (e.g., 2005 corresponds to Dec 31, 2005)
    prediction_range: list[int]
        range of years to predict over (e.g., [2006, 2010])
    conn: obj
        psycopg2 connection object to the database
    overwrite: bool
        if True, re-run the query even if the table already exists

    Returns
    -------
    df_label: DataFrame
        DataFrame of labels
    """
    begin_range, end_range = prediction_range
    cursor = conn.cursor()
    # Check whether the label table for this time period already exists.
    query = """
    select exists(select * from information_schema.tables
    where table_name='recidivism_labels_{as_of_today}_{end_range}')
    """.format(as_of_today=as_of_today,
               end_range=end_range)
    cursor.execute(query)
    table_exists = cursor.fetchone()[0]  # the EXISTS query returns a single boolean
    if not table_exists or overwrite:
        print('generating labels')
        # SQL script that builds the recidivism_labels table (left blank here)
        sql_script = """
        """.format(as_of_today=as_of_today,
                   begin_range=begin_range,
                   end_range=end_range)
        cursor.execute(sql_script)
        conn.commit()  # commit so the new table is visible to the read below
    else:
        print('Table already generated')
    cursor.close()
    df_label = pd.read_sql('select * from recidivism_labels_{as_of_today}_{end_range}'.format(
        as_of_today=as_of_today,
        end_range=end_range), conn)
    return df_label
In [ ]:
#df_label_2005_2010 = create_labels(2005, [2006,2008], conn)
#df_label_2010_2015 = create_labels(2008, [2009, 2011], conn)
df_label_2005_2010 = create_labels(2005, [2006,2010], conn, overwrite=True)
df_label_2010_2015 = create_labels(2010, [2011, 2015], conn)
In [ ]:
df_label_2005_2010.head()
Our features are our independent variables, or predictors. Good features make machine learning systems effective. The better the features, the easier it is to capture the structure of the data. You generate features using a combination of domain knowledge and data exploration. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand, compared with extensively searching for the "right" model and the "right" set of parameters.
Machine learning algorithms learn a solution to a problem from sample data. The set of features is the representation of that sample data from which the algorithm learns.
Examples of feature engineering include aggregations over a person's history (e.g., the number of prior admissions) and transformations of raw fields (e.g., age at first admission).
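To make this concrete, here is a minimal sketch of the kind of aggregation we have in mind, done in pandas rather than SQL. The column names admit_year and birth_year are assumptions for illustration only; the actual feature tables for this class are built by the SQL script inside create_features below.
In [ ]:
# A sketch of feature engineering in pandas (admit_year and birth_year are
# hypothetical column names): count each person's admissions up to the
# prediction date and compute an age at first admission.
df_admit = pd.read_sql('select docnbr, admit_year, birth_year from ildoc_admit', conn)
df_admit = df_admit[df_admit['admit_year'] <= 2005]   # only use information known as of Dec 31, 2005

grouped = df_admit.groupby('docnbr')
df_feat_sketch = pd.DataFrame({
    'nadmits': grouped['admit_year'].count(),                                        # number of prior admissions
    'age_first_admit': grouped['admit_year'].min() - grouped['birth_year'].first()   # age at first admission
})
df_feat_sketch.head()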
In [ ]:
def create_features(as_of_today, prediction_range, train_or_test, conn, overwrite=False):
    """
    Generate a table of features.

    Note: a table of labels for the same time period has to exist already.

    Parameters
    ----------
    as_of_today: int
        year to make the prediction from
        (e.g., 2005 corresponds to Dec 31, 2005)
    prediction_range: list[int]
        range of years in which someone may re-enter
        prison (e.g., [2006, 2010])
    train_or_test: str
        'training' or 'testing' matrix
    conn: obj
        psycopg2 connection object to the database
    overwrite: bool
        if True, re-run the SQL script even if the table already exists
    """
    begin_range, end_range = prediction_range
    print(train_or_test)
    cursor = conn.cursor()
    # Check whether the feature table for this time period already exists.
    query = """
    select exists(select * from information_schema.tables
    where table_name='{train_or_test}_set_1989_{as_of_today}')
    """.format(train_or_test=train_or_test,
               as_of_today=as_of_today)
    cursor.execute(query)
    table_exists = cursor.fetchone()[0]  # the EXISTS query returns a single boolean
    if not table_exists or overwrite:
        # SQL script that builds the feature table (left blank here)
        sql_script = """
        """.format(as_of_today=as_of_today,
                   begin_range=begin_range,
                   end_range=end_range,
                   train_or_test=train_or_test)
        cursor.execute(sql_script)
        conn.commit()  # commit so the new table is available for later reads
        print('created {train_or_test}_set_1989_{as_of_today}'.format(
            train_or_test=train_or_test,
            as_of_today=as_of_today))
    else:
        print('Table already generated')
    cursor.close()
In [ ]:
#create_features(2005,[2006,2008],'training',conn)
#create_features(2008,[2009,2011], 'testing',conn)
create_features(2005,[2006,2010],'training',conn)
create_features(2010,[2011,2015], 'testing',conn)
It's not enough to just build the model; we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the most challenging part of an analysis. Making repeatable, well-documented work with clear success metrics makes all the difference.
To convince ourselves - and others - that our modeling results will generalize, we need to hold some data back (not using it to train the model), then apply our model to that hold-out set and "blindly" predict, comparing the model's predictions to what we actually observed. This is called cross-validation, and it's the best way we have to estimate how a model will perform on entirely novel data. We call the data used to build the model the training set, and the rest the test set.
In general, we'd like our training set to be as large as possible, to give our model more information. However, you also want to be as confident as possible that your model will be applicable to new data, or else the model is useless. In practice, you'll have to balance these two objectives in a reasonable way.
There are also many ways to split up your data into training and testing sets. Since you're trying to evaluate how your model will perform in practice, it's best to emulate the true use case of your model as closely as possible when you decide how to evaluate it. A good tutorial on cross-validation can be found on the scikit-learn site.
One simple and commonly used method is k-fold cross-validation, which entails splitting up our dataset into k groups, holding out one group while training a model on the rest of the data, evaluating model performance on the held-out "fold," and repeating this process k times (we'll get back to this in the text-analysis tutorial). Another method is temporal cross-validation, which involves building a model using all the data up until a given point in time, and then testing the model on observations that happened after that point. Our recidivism problem is a prediction problem in time: we are trying to predict an event in the future. If you use the future to predict the past, temporal effects leak into the model and artificially inflate the accuracy of your predictions. We cannot use the future to predict the past in real life, so it is important to use temporal cross-validation and create our training and test sets accordingly.
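For reference, here is a minimal sketch of k-fold cross-validation on made-up toy data (the arrays, the 5-fold split, and the logistic regression are all arbitrary choices for illustration); in this tutorial we instead build the temporal split directly in SQL, as the next cell shows.
In [ ]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X_toy = np.random.rand(100, 3)          # toy feature matrix
y_toy = np.random.randint(0, 2, 100)    # toy binary labels

# Each of the 5 folds is held out once while a model is trained on the other 4
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    clf = LogisticRegression()
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    print('fold {}: accuracy {:.2f}'.format(fold, clf.score(X_toy[test_idx], y_toy[test_idx])))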
In [ ]:
#grab the training and testing set.
df_training = pd.read_sql('select * from training_set_1989_2005', conn)
df_testing = pd.read_sql('select * from testing_set_1989_2010', conn)
In [ ]:
isnan_training_rows = df_training.isnull().any(axis=1) # Find the rows where there are NaNs
In [ ]:
df_training[isnan_training_rows].head()
In [ ]:
nrows_training = df_training.shape[0]
nrows_training_isnan = df_training[isnan_training_rows].shape[0]
In [ ]:
print('% of rows with NaNs: {}'.format(float(nrows_training_isnan)/nrows_training))
In [ ]:
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['recidivism'].value_counts(normalize=True)
Let's take a look at our testing set.
In [ ]:
df_testing.head()
In [ ]:
isnan_testing_rows = df_testing.isnull().any(axis=1) # Find the rows where there are NaNs
nrows_testing = df_testing.shape[0]
nrows_testing_isnan = df_testing[isnan_testing_rows].shape[0]
print('% of rows with NaNs: {}'.format(float(nrows_testing_isnan)/nrows_testing))
We see that about 2% of the rows in our testing set have missing values. This matches what we'd expect based on what we saw in the training set.
In [ ]:
df_testing[isnan_testing_rows].head()
In [ ]:
df_testing = df_testing[~isnan_testing_rows]
In [ ]:
# filter out rows where age_first_admit equals 1 (likely a data artifact)
mask = ~(df_testing['age_first_admit'] == 1)
df_testing = df_testing[mask]
In [ ]:
df_testing.head()
In [ ]:
print('Number of rows: {}'.format(df_testing.shape[0]))
df_testing['recidivism'].value_counts(normalize=True)
In [ ]:
sel_features =
sel_label =
In [ ]:
sel_features =
In [ ]:
X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_testing[sel_features].values
y_test = df_testing[sel_label].values
In this phase, you take the predictors from your test set and apply your model to them, then assess the quality of the model by comparing the predicted values to the actual values for each record in your testing data set.
Now let's fit a model, use it to make predictions on our test dataset, and see what our accuracy score is.
Python's scikit-learn is a commonly used, well-documented library for machine learning. It can help you split your data into training and test sets, fit models and use them to predict results on new data, and evaluate your results.
We will start with the simplest model, LogisticRegression, and see how well it does.
You can use any number of metrics to judge your models (see model evaluation), but we'll use accuracy_score() (the ratio of correct predictions to the total number of predictions) as our measure.
In [ ]:
# Let's fit the model
from sklearn import linear_model
# 'liblinear' is the solver that supports the l1 penalty
model = linear_model.LogisticRegression(penalty='l1', C=1e5, solver='liblinear')
model.fit(X_train, y_train)
print(model)
When we print the model results, we see different parameters we can adjust as we refine the model based on running it against test data (values such as intercept_scaling, max_iter, penalty, and solver). Example output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr',
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0)
To adjust these parameters, you would alter the call that creates the LogisticRegression() model instance, passing it one or more of these parameters with a value other than the default. So, to re-fit the model with a max_iter of 1000, an intercept_scaling of 2, and a solver of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:
model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )
The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameters, and see which perform the best on the test set. An obvious drawback is that you can also overfit to your test set; in this case, you can alter your method of cross-validation.
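As a rough sketch of what this looks like in practice (the particular values of C are arbitrary), you might loop over a handful of settings and compare their scores on the test set:
In [ ]:
# Try several values of the regularization strength C and compare test-set accuracy.
# Remember that repeatedly tuning against the same test set risks overfitting to it.
for C in [0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C = {}: test accuracy = {:.3f}'.format(C, clf.score(X_test, y_test)))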
Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score (that can sometimes be interpreted as a probability) between 0 and 1, which lets you rank all of the examples from most likely to least likely to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 as positive (1), but there's no reason that has to be the cutoff.
In [ ]:
# Generate predicted scores from our test predictors using the model.
y_score = model.predict_proba(X_test)[:,1]
In [ ]:
y_score
Let's take a look at the distribution of scores and see if it makes sense to us.
In [ ]:
sns.distplot(y_score, kde=False, rug=False)
In [ ]:
df_testing['y_score'] = y_score
In [ ]:
df_testing[['docnbr', 'y_score']].head()
Tools like sklearn often have a default threshold of 0.5, but a good threshold is selected based on the data, the model, and the specific problem you are solving. As a trial run, let's set a threshold of 0.5.
In [ ]:
calc_threshold = lambda x, y: 0 if x < y else 1
predicted = np.array([calc_threshold(score, 0.5) for score in y_score])
expected = y_test
Once we have turned our scores into 0s and 1s for classification, we create a confusion matrix, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.
In [ ]:
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(expected,predicted)
In [ ]:
conf_matrix
The count of true negatives is conf_matrix[0,0], false negatives conf_matrix[1,0], true positives conf_matrix[1,1], and false positives conf_matrix[0,1].
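For a binary problem, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], so one way to double-check the mapping above is to unpack the four cells directly:
In [ ]:
# Unpack the four cells of the binary confusion matrix
tn, fp, fn, tp = conf_matrix.ravel()
print('true negatives: {}   false positives: {}'.format(tn, fp))
print('false negatives: {}  true positives: {}'.format(fn, tp))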
Accuracy is the ratio of the correct predictions (both positive and negative) to all predictions. $$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
In [ ]:
# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )
Two additional metrics that are often used are precision and recall.
Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive.
$$ Precision = \frac{TP}{TP+FP} $$
Recall measures the ability of the classifier to find the positive examples in the data.
$$ Recall = \frac{TP}{TP+FN} $$
By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (with a threshold of, say, 0.99) will classify a case as 1 only when it is very sure, leading to high precision but low recall. On the other end of the spectrum, a low threshold (e.g., 0.01) will lead to higher recall at the cost of precision.
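To see this tradeoff concretely, here is a short sketch that recomputes precision and recall at a few thresholds using the scores from our logistic regression (the particular thresholds are arbitrary):
In [ ]:
# Precision and recall at several score thresholds: raising the threshold makes the
# classifier more conservative (precision tends to rise while recall falls).
for threshold in [0.01, 0.25, 0.5, 0.75, 0.99]:
    pred_at_t = np.array([calc_threshold(score, threshold) for score in y_score])
    print('threshold {:.2f}: precision = {:.3f}, recall = {:.3f}'.format(
        threshold,
        precision_score(expected, pred_at_t),
        recall_score(expected, pred_at_t)))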
In [ ]:
from sklearn.metrics import precision_score, recall_score
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))
If we care about our whole precision-recall space, we can optimize for a metric known as the area under the curve (AUC-PR), which is the area under the precision-recall curve. The maximum AUC-PR is 1.
In [ ]:
def plot_precision_recall(y_true,y_score):
"""
Plot a precision recall curve
Parameters
----------
y_true: ls
ground truth labels
y_score: ls
score output from model
"""
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
plt.plot(recall_curve, precision_curve)
plt.xlabel('Recall')
plt.ylabel('Precision')
auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:.2f}'.format(auc_val))
plt.show()
plt.clf()
In [ ]:
plot_precision_recall(expected, y_score)
In [ ]:
def plot_precision_recall_n(y_true, y_prob, model_name):
"""
y_true: ls
ls of ground truth labels
y_prob: ls
ls of predic proba from model
model_name: str
str of model name (e.g, LR_123)
"""
from sklearn.metrics import precision_recall_curve
y_score = y_prob
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
precision_curve = precision_curve[:-1]
recall_curve = recall_curve[:-1]
pct_above_per_thresh = []
number_scored = len(y_score)
for value in pr_thresholds:
num_above_thresh = len(y_score[y_score>=value])
pct_above_thresh = num_above_thresh / float(number_scored)
pct_above_per_thresh.append(pct_above_thresh)
pct_above_per_thresh = np.array(pct_above_per_thresh)
plt.clf()
fig, ax1 = plt.subplots()
ax1.plot(pct_above_per_thresh, precision_curve, 'b')
ax1.set_xlabel('percent of population')
ax1.set_ylabel('precision', color='b')
ax1.set_ylim(0,1.05)
ax2 = ax1.twinx()
ax2.plot(pct_above_per_thresh, recall_curve, 'r')
ax2.set_ylabel('recall', color='r')
ax2.set_ylim(0,1.05)
name = model_name
plt.title(name)
plt.show()
plt.clf()
In [ ]:
def precision_at_k(y_true, y_scores, k):
    """
    Compute precision among the top k fraction
    of examples, ranked by score.
    """
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores])
    return precision_score(y_true, y_pred)
In [ ]:
plot_precision_recall_n(expected,y_score, 'LR')
In [ ]:
p_at_5 = precision_at_k(expected,y_score, 0.05)
print('Precision at 5%: {:.2f}'.format(p_at_5))
When working on machine learning projects, it is a good idea to structure your code as a modular pipeline that contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. Among other advantages, this makes your results reproducible, makes each step easier to debug and audit, and makes it straightforward to swap in new features or models.
In [ ]:
clfs = {'RF': RandomForestClassifier(n_estimators=50, n_jobs=-1),
        'ET': ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy'),
        'LR': LogisticRegression(penalty='l1', C=1e5, solver='liblinear'),
        'SGD': SGDClassifier(loss='log'),
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=10),
        'NB': GaussianNB()}
In [ ]:
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB']
In [ ]:
max_p_at_k = 0
for clfNM in sel_clfs:
clf = clfs[clfNM]
clf.fit( X_train, y_train )
    print(clf)
y_score = clf.predict_proba(X_test)[:,1]
predicted = np.array(y_score)
expected = np.array(y_test)
plot_precision_recall_n(expected,predicted, clfNM)
p_at_5 = precision_at_k(expected,y_score, 0.05)
if max_p_at_k < p_at_5:
max_p_at_k = p_at_5
print('Precision at 5%: {:.2f}'.format(p_at_5))
It is important to check our model against a reasonable baseline to know how well it is doing. Without any context, 83% accuracy can sound really great... but it's not so great when you remember that you could do almost as well by declaring everyone a non-recidivist, which would be a useless model.
A good place to start is checking against a random baseline, assigning every example a label (positive or negative) completely at random.
In [ ]:
max_p_at_k
In [ ]:
random_score = np.random.uniform(0, 1, len(y_test))   # assign every example a random score
random_predicted = np.array([calc_threshold(score, 0.5) for score in random_score])
random_p_at_5 = precision_at_k(expected, random_score, 0.05)
print('Random baseline precision at 5%: {:.2f}'.format(random_p_at_5))
Another good practice is checking against an "expert" or rule of thumb baseline.
In [ ]:
# "expert" baseline: predict recidivism for anyone with more than one prior admission
recidivism_predicted = np.array([1 if nadmit > 1 else 0 for nadmit in df_testing.nadmits.values])
recidivism_p_at_5 = precision_at_k(expected, recidivism_predicted, 0.05)
In [ ]:
# baseline that predicts no one will recidivate
all_non_recidivist = np.array([0 for nadmit in df_testing.nadmits.values])
all_non_recidivist_p_at_5 = precision_at_k(expected, all_non_recidivist, 0.05)
In [ ]:
sns.set_style("whitegrid")
fig, ax = plt.subplots(1, figsize=(22,12))
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
# compare the baselines (gray bars) against the best model (maroon bar)
sns.barplot(x=['random', 'all non-recidivist', 'prior admission', 'best model'],
            y=[random_p_at_5, all_non_recidivist_p_at_5, recidivism_p_at_5, max_p_at_k],
            palette=['#6F777D','#6F777D','#6F777D','#800000'])
plt.ylim(0,1)
plt.ylabel('precision at 5%')