In this tutorial, we'll discuss how to formulate a research question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the scikit-learn package in Python to model the data you're familiar with from previous tutorials.
This tutorial is based on chapter 6 of Big Data and Social Science.
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with numpy, pandas, and psycopg2 from previous tutorials. Here we'll also be using scikit-learn to fit and evaluate models.
In [ ]:
    
%pylab inline
import pandas as pd
import psycopg2
import sklearn
import seaborn as sns
from sklearn.metrics import precision_recall_curve, auc
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, 
AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sqlalchemy import create_engine
sns.set_style("white")
    
In [ ]:
    
db_name = 
hostname = 
conn = psycopg2.connect(database=db_name, host = hostname) #database connection
    
The database connection allows us to make queries to a database from Python.
In [ ]:
    
df_tables = pd.read_sql("""SELECT table_schema, table_name
                          FROM information_schema.tables
                          order by table_schema, table_name;""", conn)
    
In [ ]:
    
df_tables.head()
    
The table information_schema.tables contains information about all tables in the database. This comes in handy when you forget a table name.
The first step is to turn your question into a concrete objective. What do you care about? Do you have data on that thing? What action can you take based on your findings? Do you risk introducing any bias based on the way you model something?
Of all prisoners released, we would like to predict who is likely to re-enter jail within 5 years of the day we make our prediction. For instance, say it is Dec 31, 2005 and we want to identify which prisoners are likely to re-enter jail between 2006 and 2010. We can run our predictive model and identify who is most at risk of recidivism. This is an example of a binary classification problem.
Note that the 5-year outcome window is completely arbitrary; you could just as well use a window of 3 years, 1 year, or 1 day. The right window depends on how often you receive new data (there is no sense in making the same predictions on the same data), how accurate your predictions are for a given time period, and on what time scale you can act on the output.
We have already explored the data in the first module and database modules. In order to predict recidivism, we will be using data from the ildoc_admit and ildoc_exit tables to create labels and features.
We need to munge our dataset into features (predictors, also known as independent variables or $X$ variables) and labels (outcomes, also known as dependent variables or $Y$ variables). For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "X_", and names of variables that pertain to outcome variables will start with "y_".
Labels are the dependent variables, or $Y$ variables, that we are trying to predict. In the machine learning framework, your labels are usually binary: true or false, encoded as 1 or 0.
We can write SQL code in psql, dbeaver, or pgAdmin, or programmatically generate the SQL and pass it to the database using psycopg2 to create the labels. Here is the SQL code for developing labels:
In [ ]:
    
def create_labels(as_of_today, prediction_range,conn, overwrite=False):
    """
    Generate a table of labels and return it
    as a dataframe.
    
    Parameters
    ----------
    as_of_today: int
        year to make prediction from
        (e.g., 2005 corresponds to Dec 31, 2005)
    prediction_range: list[int]
        start and end year of the prediction window
        (e.g., [2006, 2010])
    conn: obj
        psycopg2 connection object to database
    overwrite: bool
        if True, regenerate the table even if it
        already exists
        
    Returns
    -------
    df_labels: DataFrame
        Dataframe of labels
    """
    begin_range, end_range = prediction_range
    
    
    cursor = conn.cursor()
    query = """
            select exists(select * from information_schema.tables 
            where table_name=\'recidivism_labels_{as_of_today}_{end_range}\')
            """.format(as_of_today=as_of_today,
                      end_range=end_range)
    cursor.execute(query)
    # the query above returns a single boolean: does the labels table exist?
    table_exists = cursor.fetchone()[0]
    
    
    if not table_exists or overwrite:
        print('generating labels')
        sql_script="""
    """.format(as_of_today=as_of_today, 
               begin_range=begin_range,
               end_range=end_range)
    
        cursor.execute(sql_script)
    else:
        print('Table already generated')
    
    cursor.close()
    df_label = pd.read_sql('select * from recidivism_labels_{as_of_today}_{end_range}'.format(
                                                                                    as_of_today=as_of_today,
                                                                                    end_range=end_range), conn)
    
    return df_label
    
In [ ]:
    
#df_label_2005_2010 = create_labels(2005, [2006,2008], conn)
#df_label_2010_2015 = create_labels(2008, [2009, 2011], conn)
df_label_2005_2010 = create_labels(2005, [2006,2010], conn, overwrite=True)
df_label_2010_2015 = create_labels(2010, [2011, 2015], conn)
    
In [ ]:
    
df_label_2005_2010.head()
    
Our features are our independent variables, or predictors. Good features make machine learning systems effective. The better the features, the easier it is to capture the structure of the data. You generate features using a combination of domain knowledge and data exploration. In general, it is better to have more complex features and a simpler model rather than vice versa: keeping the model simple makes it faster to train and easier to understand than extensively searching for the "right" model and "right" set of parameters.
Machine learning algorithms learn a solution to a problem from sample data. The set of features is the representation of the sample data from which that solution is learned.
Examples of feature engineering include counts, aggregations, and other transformations of the raw admission and exit data.
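As a rough illustration (the column names docnbr, birth_year, and admit_year below are hypothetical and may not match the actual ildoc_admit schema), two such features, age at admission and number of prior admissions, could be built in pandas like this:
In [ ]:
    
# A minimal sketch of feature engineering on a toy admissions table.
# The column names here are hypothetical, not the actual ildoc_admit schema.
import pandas as pd

admits = pd.DataFrame({
    'docnbr':     [101, 101, 102, 103],
    'birth_year': [1970, 1970, 1985, 1960],
    'admit_year': [1995, 2001, 2003, 1999],
})

# age at each admission
admits['age_at_admit'] = admits['admit_year'] - admits['birth_year']

# number of prior admissions for each person (a simple count feature)
admits = admits.sort_values(['docnbr', 'admit_year'])
admits['n_prior_admits'] = admits.groupby('docnbr').cumcount()

print(admits)
    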
In [ ]:
    
def create_features(as_of_today, prediction_range, train_or_test, conn, overwrite=False):
    """
    Generate the feature table for the given time
    period and store it in the database.
    
    Note: There has to be a table of labels that
    correspond with the same time period. 
    
    Parameters
    ----------
    as_of_today: int
        year to make prediction from
        (e.g., 2005 corresponds to Dec 31, 2005)
    prediction_range: list[int]
        start and end year of the window in which
        someone may re-enter prison (e.g., [2006, 2010])
    train_or_test: str
        whether to build the training or testing matrix
    conn: obj
        psycopg2 connection object to database
    overwrite: bool
        if True, regenerate the table even if it
        already exists
        
    Returns
    -------
    None
        creates the {train_or_test}_set_1989_{as_of_today}
        table in the database
    """
    begin_range, end_range = prediction_range
    
    print(train_or_test)
    
    cursor = conn.cursor()
    query = """
            select exists(select * from information_schema.tables 
            where table_name=\'{train_or_test}_set_1989_{as_of_today}\')
            """.format(train_or_test=train_or_test,
                      as_of_today=as_of_today)
    cursor.execute(query)
    # the query above returns a single boolean: does the feature table exist?
    table_exists = cursor.fetchone()[0]
    
    if not table_exists or overwrite:
    
        sql_script="""
    
    """.format(as_of_today=as_of_today, 
               begin_range=begin_range,
               end_range=end_range,
               train_or_test=train_or_test)
    
    
        cursor.execute(sql_script)
    
    cursor.close()
    
    print('created {train_or_test}_set_1989_{as_of_today}'.format(
        train_or_test=train_or_test,
        as_of_today=as_of_today))
    
In [ ]:
    
#create_features(2005,[2006,2008],'training',conn)
#create_features(2008,[2009,2011], 'testing',conn)
create_features(2005,[2006,2010],'training',conn)
create_features(2010,[2011,2015], 'testing',conn)
    
It's not enough to just build the model; we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the most challenging part of an analysis. Making repeatable, well-documented work with clear success metrics makes all the difference.
To convince ourselves - and others - that our modeling results will generalize, we need to hold some data back (not using it to train the model), then apply our model to that hold-out set and "blindly" predict, comparing the model's predictions to what we actually observed. This is called cross-validation, and it's the best way we have to estimate how a model will perform on entirely novel data. We call the data used to build the model the training set, and the rest the test set.
In general, we'd like our training set to be as large as possible, to give our model more information. However, you also want to be as confident as possible that your model will be applicable to new data, or else the model is useless. In practice, you'll have to balance these two objectives in a reasonable way.
There are also many ways to split up your data into training and testing sets. Since you're trying to evaluate how your model will perform in practice, it's best to emulate the true use case of your model as closely as possible when you decide how to evaluate it. A good tutorial on cross-validation can be found on the scikit-learn site.
One simple and commonly used method is k-fold cross-validation, which entails splitting up our dataset into k groups, holding out one group while training a model on the rest of the data, evaluating model performance on the held-out "fold," and repeating this process k times (we'll get back to this in the text-analysis tutorial). Another method is temporal cross-validation, which involves building a model using all the data up until a given point in time, and then testing the model on observations that happened after that point. Our recidivism problem is a problem in time: we are trying to predict an event in the future. If you use the future to predict the past, information leaks across time and artificially inflates the apparent accuracy of your predictions. Since we cannot use the future to predict the past in real life, it is important to use temporal cross-validation and create our training and test sets accordingly, as sketched below.
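As a rough sketch of the temporal idea (using a toy DataFrame with hypothetical exit_year and recidivism columns, not our actual tables, which are built in SQL above), a temporal split keeps everything up to a cutoff year for training and everything after it for testing:
In [ ]:
    
# A minimal sketch of a temporal train/test split on a toy DataFrame.
# The column names are hypothetical; our real matrices are built in SQL.
import pandas as pd

df = pd.DataFrame({
    'exit_year':  [2003, 2004, 2005, 2006, 2007],
    'recidivism': [0, 1, 0, 1, 0],
})

cutoff = 2005
train = df[df['exit_year'] <= cutoff]   # only information available as of the cutoff
test = df[df['exit_year'] > cutoff]     # events that happen after the cutoff

print(len(train), 'training rows;', len(test), 'testing rows')
    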
In [ ]:
    
#grab the training and testing set. 
df_training = pd.read_sql('select * from training_set_1989_2005', conn)
df_testing = pd.read_sql('select * from testing_set_1989_2010', conn)
    
In [ ]:
    
isnan_training_rows = df_training.isnull().any(axis=1) # Find the rows where there are NaNs
    
In [ ]:
    
df_training[isnan_training_rows].head()
    
In [ ]:
    
nrows_training = df_training.shape[0]
nrows_training_isnan = df_training[isnan_training_rows].shape[0]
    
In [ ]:
    
print('Percent of rows with NaNs: {:.2%}'.format(float(nrows_training_isnan)/nrows_training))
    
In [ ]:
    
print('Number of rows: {}'.format(df_training.shape[0]))
df_training['recidivism'].value_counts(normalize=True)
    
Let's take a look at our testing set.
In [ ]:
    
df_testing.head()
    
In [ ]:
    
isnan_testing_rows = df_testing.isnull().any(axis=1) # Find the rows where there are NaNs
nrows_testing = df_testing.shape[0]
nrows_testing_isnan = df_testing[isnan_testing_rows].shape[0]
print('Percent of rows with NaNs: {:.2%}'.format(float(nrows_testing_isnan)/nrows_testing))
    
We see that about 2% of the rows in our testing set have missing values. This matches what we'd expect based on what we saw in the training set.
In [ ]:
    
df_testing[isnan_testing_rows].head()
    
In [ ]:
    
df_testing = df_testing[~isnan_testing_rows]
    
In [ ]:
    
# drop records where age at first admission is recorded as 1, which is not a plausible value
mask = ~(df_testing['age_first_admit'] == 1)
df_testing = df_testing[mask]
    
In [ ]:
    
df_testing.head()
    
In [ ]:
    
print('Number of rows: {}'.format(df_testing.shape[0]))
df_testing['recidivism'].value_counts(normalize=True)
    
In [ ]:
    
sel_features = 
sel_label =
    
In [ ]:
    
sel_features =
    
In [ ]:
    
X_train = df_training[sel_features].values
y_train = df_training[sel_label].values
X_test = df_testing[sel_features].values
y_test = df_testing[sel_label].values
    
In this phase, you take the predictors from your test set and apply your model to them, then assess the quality of the model by comparing the predicted values to the actual values for each record in your testing data set.
Once we fit a model, we'll use it to make predictions on our test dataset and see what our accuracy score is.
Python's scikit-learn is a commonly used, well documented Python library for machine learning. This library can help you split your data into training and test sets, fit models and use them to predict results on new data, and evaluate your results.
We will start with the simplest LogisticRegression model and see how well that does.
You can use any number of metrics to judge your models (see model evaluation), but we'll use accuracy_score() (ratio of correct predictions to total number of predictions) as our measure.
In [ ]:
    
# Let's fit the model
from sklearn import linear_model
model = linear_model.LogisticRegression(penalty='l1', C=1e5)
model.fit( X_train, y_train )
print(model)
    
When we print the model results, we see different parameters we can adjust as we refine the model based on running it against test data (values such as intercept_scaling, max_iter, penalty, and solver).  Example output:
LogisticRegression(C=100000.0, class_weight=None, dual=False, fit_intercept=True,
      intercept_scaling=1, max_iter=100, multi_class='ovr',
      penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
      verbose=0)
To adjust these parameters, one would alter the call that creates the LogisticRegression() model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with max_iter of 1000, intercept_scaling of 2, and solver of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:
model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )
The basic way to choose values for, or "tune," these parameters is the same as the way you choose a model: fit the model to your training data with a variety of parameters, and see which perform the best on the test set. An obvious drawback is that you can also overfit to your test set; in this case, you can alter your method of cross-validation.
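For example, one simple (if crude) way to compare a few values of the regularization parameter C is to refit the model for each value and score it on held-out data. This is only a sketch; reusing the test set this way risks overfitting to it, so in practice a separate validation set or cross-validation is preferable:
In [ ]:
    
# A rough sketch of parameter tuning: try a few values of C and compare
# accuracy on held-out data. Reusing the test set like this risks
# overfitting to it; prefer a validation set or cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = LogisticRegression(C=C)
    candidate.fit(X_train, y_train)
    acc = accuracy_score(y_test, candidate.predict(X_test))
    print('C = {:>6}: accuracy = {:.3f}'.format(C, acc))
    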
Machine learning models usually do not produce a prediction (0 or 1) directly. Rather, models produce a score (that can sometimes be interpreted as a probability) between 0 and 1, which lets you rank all of the examples from most likely to least likely to have label 1 (positive). This score is then turned into a 0 or 1 based on a user-specified threshold. For example, you might label all examples that have a score greater than 0.5 as positive (1), but there's no reason that has to be the cutoff.
In [ ]:
    
# Generate scores for the positive class from our predictors using the model.
# predict_proba returns one column per class; column 1 is the score for label 1.
y_score = model.predict_proba(X_test)[:,1]
    
In [ ]:
    
y_score
    
Let's take a look at the distribution of scores and see if it makes sense to us.
In [ ]:
    
sns.distplot(y_score, kde=False, rug=False)
    
In [ ]:
    
df_testing['y_score'] = y_score
    
In [ ]:
    
df_testing[['docnbr', 'y_score']].head()
    
Tools like sklearn often have a default threshold of 0.5, but a good threshold is selected based on the data, model and the specific problem you are solving. As a trial run, let's set a threshold of 0.5.
In [ ]:
    
calc_threshold = lambda x,y: 0 if x < y else 1 
predicted = np.array( [calc_threshold(score,0.5) for score in y_score] )
expected = y_test
    
Once we have turned our scores into 0s and 1s for classification, we create a confusion matrix, which has four cells: true negatives, true positives, false negatives, and false positives. Each data point belongs in one of these cells, because it has both a ground truth and a predicted label. If an example was predicted to be negative and is negative, it's a true negative. If an example was predicted to be positive and is positive, it's a true positive. If an example was predicted to be negative and is positive, it's a false negative. If an example was predicted to be positive and is negative, it's a false positive.
In [ ]:
    
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(expected,predicted)
    
In [ ]:
    
conf_matrix
    
The count of true negatives is conf_matrix[0,0], false negatives conf_matrix[1,0], true positives conf_matrix[1,1], and false positives conf_matrix[0,1].
Accuracy is the ratio of the correct predictions (both positive and negative) to all predictions. $$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
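To make the formula concrete, the same number can be computed by hand from the confusion matrix cells, using the indexing described above, and compared with the scikit-learn result in the next cell:
In [ ]:
    
# Compute accuracy by hand from the confusion matrix cells,
# using the indexing described above.
tn, fp = conf_matrix[0, 0], conf_matrix[0, 1]
fn, tp = conf_matrix[1, 0], conf_matrix[1, 1]

manual_accuracy = float(tp + tn) / (tp + tn + fp + fn)
print('Accuracy (by hand) = {}'.format(manual_accuracy))
    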
In [ ]:
    
# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )
    
Two additional metrics that are often used are precision and recall.
Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive.
$$ Precision = \frac{TP}{TP+FP} $$
Recall measures the ability of the classifier to find the positive examples in the data.
$$ Recall = \frac{TP}{TP+FN} $$
By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (e.g., threshold 0.99) will classify a case as 1 only when it is very sure, leading to high precision. On the other end of the spectrum, a low threshold (e.g., 0.01) will lead to higher recall.
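To see this tradeoff directly, here is a quick sketch that recomputes precision and recall from the same scores at several thresholds (at a very high threshold there may be no predicted positives, in which case scikit-learn reports a precision of 0 and issues a warning):
In [ ]:
    
# A quick sketch of how the choice of threshold trades precision against
# recall, reusing the scores (y_score) and labels (expected) from above.
from sklearn.metrics import precision_score, recall_score

for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    preds = (y_score >= threshold).astype(int)
    print('threshold {:.1f}: precision = {:.3f}, recall = {:.3f}'.format(
        threshold, precision_score(expected, preds), recall_score(expected, preds)))
    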
In [ ]:
    
from sklearn.metrics import precision_score, recall_score
precision = precision_score(expected, predicted)
recall = recall_score(expected, predicted)
print( "Precision = " + str( precision ) )
print( "Recall= " + str(recall))
    
If we care about our whole precision-recall space, we can optimize for a metric known as the area under the curve (AUC-PR), which is the area under the precision-recall curve. The maximum AUC-PR is 1.
In [ ]:
    
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score)
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:1f}'.format(auc_val))
    plt.show()
    plt.clf()
    
In [ ]:
    
plot_precision_recall(expected, y_score)
    
In [ ]:
    
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: ls
        ls of ground truth labels
    y_prob: ls
        ls of predic proba from model
    model_name: str
        str of model name (e.g, LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()
    plt.clf()
    
In [ ]:
    
def precision_at_k(y_true, y_scores, k):
    """
    Compute precision among the top k fraction
    of examples, ranked by score.
    """
    # find the score cutoff that selects the top k fraction of examples
    threshold = np.sort(y_scores)[::-1][int(k*len(y_scores))]
    y_pred = np.asarray([1 if i >= threshold else 0 for i in y_scores])
    return precision_score(y_true, y_pred)
    
In [ ]:
    
plot_precision_recall_n(expected,y_score, 'LR')
    
In [ ]:
    
p_at_5 = precision_at_k(expected,y_score, 0.05)
print('Precision at 5%: {:.2f}'.format(p_at_5))
    
When working on machine learning projects, it is a good idea to structure your code as a modular pipeline, which contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. This has many advantages: the analysis is reproducible, easier to debug, and easier to extend with new features or models.
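One way to structure this (a sketch, not a prescribed API) is to wrap the fit, score, and evaluate steps in a single function that takes a dictionary of models, so every classifier goes through the same code path:
In [ ]:
    
# A sketch of a modular "fit and evaluate" step: one function that takes
# any dictionary of classifiers and returns precision at the top k
# fraction for each, using the precision_at_k helper defined above.
def evaluate_models(clf_dict, X_train, y_train, X_test, y_test, k=0.05):
    results = {}
    for name, clf in clf_dict.items():
        clf.fit(X_train, y_train)
        scores = clf.predict_proba(X_test)[:, 1]
        results[name] = precision_at_k(y_test, scores, k)
    return results
    
With the clfs dictionary defined below, a call like evaluate_models(clfs, X_train, y_train, X_test, y_test) would return the precision at 5% for every model; the cells that follow carry out the same steps explicitly.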
In [ ]:
    
clfs = {'RF': RandomForestClassifier(n_estimators=50, n_jobs=-1),
       'ET': ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='entropy'),
        'LR': LogisticRegression(penalty='l1', C=1e5),
        'SGD':SGDClassifier(loss='log'),
        'GB': GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=10),
        'NB': GaussianNB()}
    
In [ ]:
    
sel_clfs = ['RF', 'ET', 'LR', 'SGD', 'GB', 'NB']
    
In [ ]:
    
max_p_at_k = 0
for clfNM in sel_clfs:
    clf = clfs[clfNM]
    clf.fit( X_train, y_train )
    print(clf)
    y_score = clf.predict_proba(X_test)[:,1]
    predicted = np.array(y_score)
    expected = np.array(y_test)
    plot_precision_recall_n(expected,predicted, clfNM)
    p_at_5 = precision_at_k(expected,y_score, 0.05)
    if max_p_at_k < p_at_5:
        max_p_at_k = p_at_5
    print('Precision at 5%: {:.2f}'.format(p_at_5))
    
It is important to check our model against a reasonable baseline to know how well it is doing. Without any context, 83% accuracy can sound really great... but it's not so great when you remember that you could do almost that well by declaring everyone a non-recidivist, which would be a stupid (not to mention useless) model.
A good place to start is checking against a random baseline, assigning every example a label (positive or negative) completely at random.
In [ ]:
    
max_p_at_k
    
In [ ]:
    
random_score = [random.uniform(0,1) for i in enumerate(y_test)] 
random_predicted = np.array( [calc_threshold(score,0.5) for score in random_score] )
random_p_at_5 = precision_at_k(expected, random_score, 0.05)
    
Another good practice is checking against an "expert" or rule of thumb baseline.
In [ ]:
    
recidivism_predicted = np.array([ 1 if nadmit > 1 else 0 for nadmit in df_testing.nadmits.values ])
recidivism_p_at_5 = precision_at_k(expected, recidivism_predicted, 0.05)
    
In [ ]:
    
all_non_recidivist = np.array([0 for nadmit in df_testing.nadmits.values])
all_non_recidivist_p_at_5 = precision_at_k(expected, all_non_recidivist, 0.05)
    
In [ ]:
    
sns.set_style("whitegrid")
fig, ax = plt.subplots(1, figsize=(22,12))
sns.set_context("poster", font_scale=2.25, rc={"lines.linewidth":2.25, "lines.markersize":8})
# compare the baseline models (grey) against the best classifier (maroon)
sns.barplot(['Random', 'All non-recidivist', 'Prior admits', 'Best model'],
            [random_p_at_5, all_non_recidivist_p_at_5, recidivism_p_at_5, max_p_at_k],
            palette=['#6F777D','#6F777D','#6F777D','#800000'])
plt.ylim(0,1)
plt.ylabel('precision at 5%')