Project 2: Supervised Learning

Building a Student Intervention System

1. Classification vs Regression

Your goal is to identify students who might need early intervention. Which type of supervised machine learning problem is this, classification or regression? Why?

2. Exploring the Data

Let's go ahead and read in the student dataset first.

To execute a code cell, click inside it and press Shift+Enter.


In [1]:
# Import libraries
import numpy as np
import pandas as pd

In [2]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"

# Note: The last column 'passed' is the target/label; all others are feature columns


Student data read successfully!

Now, can you find out the following facts about the dataset?

  • Total number of students
  • Number of students who passed
  • Number of students who failed
  • Graduation rate of the class (%)
  • Number of features

Use the code block below to compute these values. Instructions/steps are marked using TODOs.


In [3]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1]-1

y_df = student_data['passed']
n_passed = y_df[y_df=='yes'].shape[0]
n_failed = n_students - n_passed
grad_rate = 100.0 * n_passed / n_students
    
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)


Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%

3. Preparing the Data

In this section, we will prepare the data for modeling, training and testing.

Identify feature and target columns

It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.
Note: For this dataset, the last column ('passed') is the target or label we are trying to predict.


In [4]:
# %%capture 
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]           # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]              # feature values for all students
y_all = student_data[target_col]                # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()                              # print the first 5 rows


Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       yes      yes        no       5         3     3    1    1      3   
2   ...       yes      yes        no       4         3     2    2    3      3   
3   ...       yes      yes       yes       3         2     2    1    1      5   
4   ...       yes       no        no       4         3     2    1    2      5   

  absences  
0        6  
1        4  
2       10  
3        2  
4        4  

[5 rows x 30 columns]

Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.

Other columns, like Mjob and Fjob, have more than two values, and are known as categorical variables. The recommended way to handle such a column is to create one new column per possible value (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and for each row assign a 1 to the column matching its value and 0 to all the others.

These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation.
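
As a minimal sketch of the two conversions described above, the toy frame below (hypothetical values; only the column names are borrowed from the dataset) maps a yes/no column to 1/0 and expands a multi-valued column into dummy variables. It is an illustration, not the preprocessing used in the next cell.

```python
import pandas as pd

# Toy frame with made-up rows; column names mirror the student dataset
toy = pd.DataFrame({
    'internet': ['yes', 'no', 'yes'],
    'Fjob': ['teacher', 'other', 'services'],
})

# Binary yes/no column: map directly to 1/0
toy['internet'] = toy['internet'].map({'yes': 1, 'no': 0})

# Multi-valued column: one indicator column per distinct value
fjob_dummies = pd.get_dummies(toy['Fjob'], prefix='Fjob')

# Replace the original column with its indicator columns
encoded = toy.drop('Fjob', axis=1).join(fjob_dummies)
print(list(encoded.columns))
```

Each row has exactly one of the Fjob_* columns set, which is why this encoding is also called one-hot encoding.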


In [5]:
# Preprocess feature columns
def preprocess_features(X):

    # output dataframe, initially empty
    outX = pd.DataFrame(index=X.index)

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int
        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
# X_all = pd.get_dummies(X_all)

print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))


Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Split data into training and test sets

So far, we have converted all categorical features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.


In [6]:
# Note: newer scikit-learn releases provide train_test_split in sklearn.model_selection
from sklearn.cross_validation import train_test_split
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300                  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=num_train, random_state=11)

# Preserve this train/test split for final evaluation of model F1 score
X_train_initial, X_test_initial, y_train_initial, y_test_initial = X_train, X_test, y_train, y_test

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data


Training set: 300 samples
Test set: 95 samples
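
To make the idea of a shuffled split concrete, here is a rough sketch that uses only NumPy. It assumes the same sizes as above (395 rows, 300 for training) and is not expected to reproduce the exact rows that train_test_split selects for the same seed.

```python
import numpy as np

n_samples, n_train = 395, 300          # sizes used in the cell above
rng = np.random.RandomState(11)        # fixed seed for repeatability
order = rng.permutation(n_samples)     # shuffled row positions

train_idx = order[:n_train]            # first 300 shuffled positions
test_idx = order[n_train:]             # remaining 95 positions

# The two index sets are disjoint and together cover every row
assert len(np.intersect1d(train_idx, test_idx)) == 0
print("train: %d, test: %d" % (len(train_idx), len(test_idx)))
```

With index arrays like these, the rows could be selected with X_all.iloc[train_idx] and y_all.iloc[train_idx].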

4. Training and Evaluating Models

Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

  • What are the general applications of this model? What are its strengths and weaknesses?
  • Given what you know about the data so far, why did you choose this model to apply?
  • Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F1 score. Repeat this process with different training set sizes (100, 200, 300), keeping the test set constant.

Produce a table showing training time, prediction time, F1 score on training set and F1 score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.
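
For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels; the helper name f1_from_labels is hypothetical, and the notebook itself relies on sklearn.metrics.f1_score.

```python
def f1_from_labels(y_true, y_pred, pos_label='yes'):
    """Harmonic mean of precision and recall for pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0                      # no true positives: define F1 as 0
    precision = tp / float(tp + fp)     # share of predicted 'yes' that are right
    recall = tp / float(tp + fn)        # share of actual 'yes' that are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels, for illustration only
y_true = ['yes', 'yes', 'no', 'yes', 'no']
y_pred = ['yes', 'no', 'no', 'yes', 'yes']
print(f1_from_labels(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is the F1 score.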


In [7]:
# Train a model
import time
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score


def train_classifier(clf, X_train, y_train):
    print "\nTraining {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    duration = end - start
    print "Training time (secs): {:.4f}".format(duration)
    return duration

def predict_labels(clf, features, target):
    # print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Prediction time (secs): {:.4f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

def train_predict(clf, X_train, y_train, X_test, y_test):
    print "----------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "Training set:"
    train_f1_score = predict_labels(clf, X_train, y_train)
    print "Testing set:"
    test_f1_score = predict_labels(clf, X_test, y_test)
    print "F1 score for training set: {}".format(train_f1_score)
    print "F1 score for test set: {}".format(test_f1_score)    
    return train_f1_score, test_f1_score


# TODO: Choose a model, import it and instantiate an object
# TODO: Run the helper function above for desired subsets of training data

clfs = [DecisionTreeClassifier(random_state=42),
        KNeighborsClassifier(),
        LogisticRegression(random_state=42)]

for clf in clfs:
    print "============================================="
    
    # Fit model to training data
    train_classifier(clf, X_train, y_train)  # note: using entire training set here

    # Predict on training & testing set and compute F1 score
    train_f1_score = predict_labels(clf, X_train, y_train)
    test_f1_score  = predict_labels(clf, X_test,  y_test)
    print "F1 score for training set: {}".format(train_f1_score)
    print "F1 score for test set: {}".format(test_f1_score)
    
    for idx, train_size in enumerate([100, 200, 300]):
        X_train_temp = X_train.iloc[:train_size]
        y_train_temp = y_train.iloc[:train_size]
        train_predict(clf, X_train_temp, y_train_temp, X_test, y_test)

print "============================================="


=============================================

Training DecisionTreeClassifier...
Training time (secs): 0.0019
Prediction time (secs): 0.0002
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.738461538462
----------
Training set size: 100

Training DecisionTreeClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.764705882353
----------
Training set size: 200

Training DecisionTreeClassifier...
Training time (secs): 0.0009
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.705882352941
----------
Training set size: 300

Training DecisionTreeClassifier...
Training time (secs): 0.0013
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.738461538462
=============================================

Training KNeighborsClassifier...
Training time (secs): 0.0008
Prediction time (secs): 0.0042
Prediction time (secs): 0.0014
F1 score for training set: 0.86230248307
F1 score for test set: 0.814814814815
----------
Training set size: 100

Training KNeighborsClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0008
Testing set:
Prediction time (secs): 0.0008
F1 score for training set: 0.832214765101
F1 score for test set: 0.802816901408
----------
Training set size: 200

Training KNeighborsClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0020
Testing set:
Prediction time (secs): 0.0011
F1 score for training set: 0.871794871795
F1 score for test set: 0.797202797203
----------
Training set size: 300

Training KNeighborsClassifier...
Training time (secs): 0.0006
Training set:
Prediction time (secs): 0.0041
Testing set:
Prediction time (secs): 0.0016
F1 score for training set: 0.86230248307
F1 score for test set: 0.814814814815
=============================================

Training LogisticRegression...
Training time (secs): 0.0019
Prediction time (secs): 0.0001
Prediction time (secs): 0.0002
F1 score for training set: 0.830275229358
F1 score for test set: 0.794117647059
----------
Training set size: 100

Training LogisticRegression...
Training time (secs): 0.0007
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0002
F1 score for training set: 0.90780141844
F1 score for test set: 0.753846153846
----------
Training set size: 200

Training LogisticRegression...
Training time (secs): 0.0013
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.87417218543
F1 score for test set: 0.764705882353
----------
Training set size: 300

Training LogisticRegression...
Training time (secs): 0.0017
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.830275229358
F1 score for test set: 0.794117647059
=============================================
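
One possible way to turn the log above into the requested table is to collect the numbers in a pandas DataFrame. The sketch below does this for DecisionTreeClassifier only, with values copied from the output above (prediction time is the test-set figure); the column names are illustrative choices.

```python
import pandas as pd

# Values copied from the DecisionTreeClassifier output above
dt_table = pd.DataFrame({
    'train_size': [100, 200, 300],
    'train_time_s': [0.0005, 0.0009, 0.0013],
    'predict_time_s': [0.0001, 0.0001, 0.0001],
    'f1_train': [1.0, 1.0, 1.0],
    'f1_test': [0.764705882353, 0.705882352941, 0.738461538462],
}, columns=['train_size', 'train_time_s', 'predict_time_s',
            'f1_train', 'f1_test']).set_index('train_size')

print(dt_table.round(4))
```

The same pattern would apply to KNeighborsClassifier and LogisticRegression, giving one table per model.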

In [8]:
# %%capture
# Test the effect of training sample size on F1 score with a finer interval of 20 instead of 100.
# The results are visualized in the next cell.
# Uncomment %%capture above to suppress the output of this cell.

train_f1_scores = []
test_f1_scores = []

for clf in clfs:
    print "=============================================" 
    # Fit model to training data
    # note: using entire training set here
    train_classifier(clf, X_train, y_train)  

    # Predict on training & testing set and compute F1 score
    train_f1_score = predict_labels(clf, X_train, y_train)
    test_f1_score = predict_labels(clf, X_test, y_test)
    print "F1 score for training set: {}".format(train_f1_score)
    print "F1 score for test set: {}".format(test_f1_score)
    
    # Train and predict using different training set sizes
    train_sizes = np.arange(20, X_train.shape[0]+1, 20)
    train_f1_score = np.zeros(train_sizes.shape)
    test_f1_score = np.zeros(train_sizes.shape)
    
    for idx, train_size in enumerate(train_sizes):
        X_train_temp = X_train.iloc[:train_size]
        y_train_temp = y_train.iloc[:train_size]
        train_f1_score[idx], test_f1_score[idx] = train_predict(clf, X_train_temp, y_train_temp, X_test, y_test)
    
    # Collect f1 scores for each classifier     
    train_f1_scores.append(train_f1_score)
    test_f1_scores.append(test_f1_score)    
        
print "============================================="


=============================================

Training DecisionTreeClassifier...
Training time (secs): 0.0018
Prediction time (secs): 0.0002
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.738461538462
----------
Training set size: 20

Training DecisionTreeClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.469387755102
----------
Training set size: 40

Training DecisionTreeClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.655737704918
----------
Training set size: 60

Training DecisionTreeClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.666666666667
----------
Training set size: 80

Training DecisionTreeClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0002
F1 score for training set: 1.0
F1 score for test set: 0.746268656716
----------
Training set size: 100

Training DecisionTreeClassifier...
Training time (secs): 0.0006
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.764705882353
----------
Training set size: 120

Training DecisionTreeClassifier...
Training time (secs): 0.0006
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.728682170543
----------
Training set size: 140

Training DecisionTreeClassifier...
Training time (secs): 0.0007
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.758064516129
----------
Training set size: 160

Training DecisionTreeClassifier...
Training time (secs): 0.0007
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.75
----------
Training set size: 180

Training DecisionTreeClassifier...
Training time (secs): 0.0008
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.661417322835
----------
Training set size: 200

Training DecisionTreeClassifier...
Training time (secs): 0.0009
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.705882352941
----------
Training set size: 220

Training DecisionTreeClassifier...
Training time (secs): 0.0010
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.634146341463
----------
Training set size: 240

Training DecisionTreeClassifier...
Training time (secs): 0.0010
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.692913385827
----------
Training set size: 260

Training DecisionTreeClassifier...
Training time (secs): 0.0011
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.725806451613
----------
Training set size: 280

Training DecisionTreeClassifier...
Training time (secs): 0.0012
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.714285714286
----------
Training set size: 300

Training DecisionTreeClassifier...
Training time (secs): 0.0013
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.738461538462
=============================================

Training KNeighborsClassifier...
Training time (secs): 0.0005
Prediction time (secs): 0.0041
Prediction time (secs): 0.0015
F1 score for training set: 0.86230248307
F1 score for test set: 0.814814814815
----------
Training set size: 20

Training KNeighborsClassifier...
Training time (secs): 0.0002
Training set:
Prediction time (secs): 0.0003
Testing set:
Prediction time (secs): 0.0004
F1 score for training set: 0.666666666667
F1 score for test set: 0.246913580247
----------
Training set size: 40

Training KNeighborsClassifier...
Training time (secs): 0.0002
Training set:
Prediction time (secs): 0.0004
Testing set:
Prediction time (secs): 0.0005
F1 score for training set: 0.771929824561
F1 score for test set: 0.752
----------
Training set size: 60

Training KNeighborsClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0005
Testing set:
Prediction time (secs): 0.0006
F1 score for training set: 0.818181818182
F1 score for test set: 0.791044776119
----------
Training set size: 80

Training KNeighborsClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0007
Testing set:
Prediction time (secs): 0.0007
F1 score for training set: 0.845528455285
F1 score for test set: 0.802919708029
----------
Training set size: 100

Training KNeighborsClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0008
Testing set:
Prediction time (secs): 0.0008
F1 score for training set: 0.832214765101
F1 score for test set: 0.802816901408
----------
Training set size: 120

Training KNeighborsClassifier...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0013
Testing set:
Prediction time (secs): 0.0011
F1 score for training set: 0.845714285714
F1 score for test set: 0.776978417266
----------
Training set size: 140

Training KNeighborsClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0013
Testing set:
Prediction time (secs): 0.0010
F1 score for training set: 0.861244019139
F1 score for test set: 0.771428571429
----------
Training set size: 160

Training KNeighborsClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0014
Testing set:
Prediction time (secs): 0.0012
F1 score for training set: 0.840707964602
F1 score for test set: 0.785714285714
----------
Training set size: 180

Training KNeighborsClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0017
Testing set:
Prediction time (secs): 0.0010
F1 score for training set: 0.855018587361
F1 score for test set: 0.785714285714
----------
Training set size: 200

Training KNeighborsClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0020
Testing set:
Prediction time (secs): 0.0011
F1 score for training set: 0.871794871795
F1 score for test set: 0.797202797203
----------
Training set size: 220

Training KNeighborsClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0023
Testing set:
Prediction time (secs): 0.0012
F1 score for training set: 0.868035190616
F1 score for test set: 0.805555555556
----------
Training set size: 240

Training KNeighborsClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0026
Testing set:
Prediction time (secs): 0.0012
F1 score for training set: 0.867208672087
F1 score for test set: 0.808510638298
----------
Training set size: 260

Training KNeighborsClassifier...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0032
Testing set:
Prediction time (secs): 0.0013
F1 score for training set: 0.84693877551
F1 score for test set: 0.817518248175
----------
Training set size: 280

Training KNeighborsClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0035
Testing set:
Prediction time (secs): 0.0014
F1 score for training set: 0.858490566038
F1 score for test set: 0.814285714286
----------
Training set size: 300

Training KNeighborsClassifier...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0041
Testing set:
Prediction time (secs): 0.0015
F1 score for training set: 0.86230248307
F1 score for test set: 0.814814814815
=============================================

Training LogisticRegression...
Training time (secs): 0.0020
Prediction time (secs): 0.0001
Prediction time (secs): 0.0001
F1 score for training set: 0.830275229358
F1 score for test set: 0.794117647059
----------
Training set size: 20

Training LogisticRegression...
Training time (secs): 0.0003
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 1.0
F1 score for test set: 0.627450980392
----------
Training set size: 40

Training LogisticRegression...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.979591836735
F1 score for test set: 0.689655172414
----------
Training set size: 60

Training LogisticRegression...
Training time (secs): 0.0004
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.948717948718
F1 score for test set: 0.731707317073
----------
Training set size: 80

Training LogisticRegression...
Training time (secs): 0.0005
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.923076923077
F1 score for test set: 0.734375
----------
Training set size: 100

Training LogisticRegression...
Training time (secs): 0.0006
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.90780141844
F1 score for test set: 0.753846153846
----------
Training set size: 120

Training LogisticRegression...
Training time (secs): 0.0009
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.874251497006
F1 score for test set: 0.8
----------
Training set size: 140

Training LogisticRegression...
Training time (secs): 0.0011
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.872549019608
F1 score for test set: 0.802919708029
----------
Training set size: 160

Training LogisticRegression...
Training time (secs): 0.0011
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.850877192982
F1 score for test set: 0.782608695652
----------
Training set size: 180

Training LogisticRegression...
Training time (secs): 0.0011
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.860377358491
F1 score for test set: 0.8
----------
Training set size: 200

Training LogisticRegression...
Training time (secs): 0.0012
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.87417218543
F1 score for test set: 0.764705882353
----------
Training set size: 220

Training LogisticRegression...
Training time (secs): 0.0013
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.865030674847
F1 score for test set: 0.779411764706
----------
Training set size: 240

Training LogisticRegression...
Training time (secs): 0.0014
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.867605633803
F1 score for test set: 0.780141843972
----------
Training set size: 260

Training LogisticRegression...
Training time (secs): 0.0013
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.845144356955
F1 score for test set: 0.791366906475
----------
Training set size: 280

Training LogisticRegression...
Training time (secs): 0.0014
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.837772397094
F1 score for test set: 0.808823529412
----------
Training set size: 300

Training LogisticRegression...
Training time (secs): 0.0017
Training set:
Prediction time (secs): 0.0001
Testing set:
Prediction time (secs): 0.0001
F1 score for training set: 0.830275229358
F1 score for test set: 0.794117647059
=============================================

In [9]:
# visualize F1 score vs training sample size
# seaborn settings from [http://bebi103.caltech.edu/2015/tutorials/t0b_intro_to_jupyter_notebooks.html]

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}

rc = {'lines.linewidth': 2, 
      'axes.labelsize': 14, 
      'axes.titlesize': 14, 
      'axes.facecolor': 'DFDFE5'}
sns.set_context('notebook', font_scale=1.2, rc=rc)
sns.set_style('darkgrid', rc=rc)

plt.figure(1, figsize=(20, 5), dpi=300)
idx_subplot = 1
for idx, clf in enumerate(clfs):
    
    # each subplot corresponds to a classifier    
    plt.subplot(1, len(clfs),idx_subplot)
    plt.plot(train_sizes, train_f1_scores[idx], marker='o', label='F1 score ( train )')
    plt.plot(train_sizes, test_f1_scores[idx],  marker='s', label='F1 score ( test )')

    if idx_subplot == 1: plt.ylabel('F1 score', fontweight='bold')
    plt.xlabel('Training samples', fontweight='bold')
    plt.title('%s' % clf.__class__.__name__, fontweight='bold')
    plt.xlim(0, X_train.shape[0]+15)
    plt.ylim(0.3, 1.05)
    plt.yticks(np.arange(0.3, 1.05, 0.1))
    plt.legend(loc='lower right')

    idx_subplot += 1

plt.savefig('./F1_vs_training_size.pdf')


5. Choosing the Best Model

  • Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
  • In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
  • Fine-tune the model. Use grid search (GridSearchCV) with at least one important parameter tuned over at least 3 settings. Use the entire training set for this.
  • What is the model's final F1 score?
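
As a rough illustration of what a grid search iterates over, the sketch below expands a small parameter grid into its candidate settings using only the standard library. The grid is an illustrative subset of KNeighborsClassifier options; the scoring and cross-validation that GridSearchCV performs for each candidate are left out.

```python
from itertools import product

# Illustrative grid: a small subset of KNeighborsClassifier settings
param_grid = {
    'n_neighbors': [3, 5, 7],
    'weights': ['uniform', 'distance'],
}

# Every combination of one value per parameter is one candidate
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[n] for n in names))]

print(len(candidates))   # 3 * 2 = 6 candidate settings
```

The number of candidates is the product of the list lengths, so adding parameters or values increases the run time multiplicatively.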

In [10]:
%%capture
# Takes around 6 minutes to run on a 4 GHz, quad-core machine

# TODO: Fine-tune your model and report the best F1 score
import time
import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
from sklearn.neighbors import KNeighborsClassifier  # used in clfs_set below
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# time the script
start = time.time()

# calc_scores: return (F1, accuracy, recall, precision) for a set of predictions
def calc_scores(y, y_pred):
    return (f1_score       (y, y_pred),
            accuracy_score (y, y_pred),
            recall_score   (y, y_pred),
            precision_score(y, y_pred))

# import data
student_data = pd.read_csv("student-data.csv")

# extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])
target_col   = student_data.columns[-1]
le_y         = LabelEncoder()

X_all = pd.get_dummies(student_data[feature_cols])
y_all = student_data[target_col]
y_all = le_y.fit_transform(y_all)

# initialize classifiers for evaluations of performance
clfs_set = [AdaBoostClassifier(),
            DecisionTreeClassifier(),
            KNeighborsClassifier(),
            LogisticRegression(),
            SVC(),
            SGDClassifier(),
            RandomForestClassifier()]

clfs_best    = []
train_scores = []
test_scores  = []

# building param_grids for GridSearchCV
ada_grid = {'algorithm': ['SAMME', 'SAMME.R'],
            'n_estimators': np.linspace(1, 6, num=5).astype(int),
            'learning_rate': (0.001, 0.01, 0.1, 1, 10)}

dt_grid = {'criterion': ['gini', 'entropy'],
           'max_features': ['auto', 'sqrt', 'log2'],
           'max_depth': np.linspace(1, 10, num=10),
           'min_samples_split': np.linspace(2, 10, 1),  # note: num=1 yields only the single value 2
           'min_samples_leaf': (1, 2, 3, 4, 5)}

knn_grid = {'n_neighbors': (3, 4, 5, 6, 7, 8, 9),
            'algorithm': ['auto', 'ball_tree', 'kd_tree'],
            'p': (1, 2, 3, 4),
            'leaf_size': (10, 20, 30, 40, 50),
            'weights': ['uniform', 'distance']}

lr_grid = {'C': np.linspace(0.01, 0.2, num=200),
           'penalty': ['l1', 'l2']}

svc_grid = {'kernel': ['rbf', 'poly'],
            'gamma': np.linspace(0.01, 1, num=100)}

sgd_grid = {'loss': ['squared_hinge', 'hinge'],
            'penalty': ['l2', 'l1'],
            'alpha': np.linspace(0.001, 0.01, num=100)}

rf_grid = {'n_estimators': (10, 11, 12, 13, 14, 15, 16),
           'max_features': ['auto'],
           'criterion': ['gini', 'entropy'],
           'max_depth': (3, 4, 5, 6),
           'min_samples_split': (2, 3, 4, 5, 6)}

param_grids = [ada_grid, dt_grid, knn_grid, lr_grid, svc_grid, sgd_grid, rf_grid]

# run GridSearchCV for each classifier (maximizing f1-score)
# increase the training size to 80% of the samples

num_runs   = 25
num_clfs   = len(clfs_set)
num_scores = 4
train_size = 0.80

for num_run in np.arange(num_runs):

    # randomize train_split for each run     
    X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, train_size=train_size)
    
    print('===============================================================================')
    print('Run #%d' % (num_run+1))
    for clf, param_grid in zip(clfs_set, param_grids):
        print("%s" % clf.__class__.__name__)

        clf_opt = GridSearchCV(estimator=clf,
                               param_grid=param_grid,
                               scoring='f1',
                               n_jobs=-1)
        clf_opt.fit(X_train, y_train)
        
        y_train_pred = clf_opt.predict(X_train)
        y_test_pred  = clf_opt.predict(X_test)
        
        # collect the best estimator for each run
        clfs_best.append(clf_opt.best_estimator_)

        # calculate performance scores     
        train_scores.append(calc_scores(y_train, y_train_pred))
        test_scores.append (calc_scores(y_test,  y_test_pred))

        print('Training set: F1 score %.3f | Accuracy %.3f | Recall %.3f | Precision %.3f '
              % calc_scores(y_train, y_train_pred))

        print('Test set: F1 score %.3f | Accuracy %.3f | Recall %.3f | Precision %.3f\n '
              % calc_scores(y_test, y_test_pred))
print('===============================================================================')

train_scores = np.array(train_scores).reshape(num_runs, num_clfs, num_scores)
test_scores  = np.array(test_scores ).reshape(num_runs, num_clfs, num_scores)

# time the script
end = time.time()

In [11]:
print('\nTime elapsed: %.3f mins' % ((end-start)/60))


Time elapsed: 6.162 mins
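
Most of the run time comes from the number of candidates each grid expands to. A rough sketch for counting them follows, assuming each grid is a plain dict of sequences. `grid_size` is a hypothetical helper written for this illustration, not a scikit-learn function. `knn_grid` is copied from the cell above.

```python
from functools import reduce
from operator import mul

def grid_size(param_grid):
    """Number of parameter combinations in a GridSearchCV-style dict
    (hypothetical helper, for illustration only)."""
    return reduce(mul, (len(values) for values in param_grid.values()), 1)

# knn_grid as defined in the tuning cell
knn_grid = {'n_neighbors': (3, 4, 5, 6, 7, 8, 9),
            'algorithm': ['auto', 'ball_tree', 'kd_tree'],
            'p': (1, 2, 3, 4),
            'leaf_size': (10, 20, 30, 40, 50),
            'weights': ['uniform', 'distance']}

knn_candidates = grid_size(knn_grid)  # 7 * 3 * 4 * 5 * 2 = 840
```

Each candidate is fitted once per cross-validation fold, and the whole search is repeated in each of the 25 runs.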

In [12]:
# box plots of ['F1 score', 'Accuracy', 'Recall', 'Precision'] for both the training and testing sets

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

score_labels = ['F1 score', 'Accuracy', 'Recall', 'Precision']
clf_labels   = [s.__class__.__name__ for s in clfs_set]
for idx_score, score_label in enumerate(score_labels):
    plt.figure(figsize=[14, 4])
    plt.subplot(1, 2, 1)
    ax = sns.boxplot(data=train_scores [:,:,idx_score], palette="RdBu")
    ax.set_ylim(0.5, 1.05)
    ax.set_xticklabels(())
    ax.set_title(score_label+' ( train )')
    plt.xticks(np.arange(num_clfs), clf_labels, rotation=45)
        
    plt.subplot(1, 2, 2)
    ax = sns.boxplot(data=test_scores [:,:,idx_score], palette="RdBu")
    ax.set_ylim(0.5, 1.05)
    ax.set_xticklabels(())
    ax.set_title(score_label+' ( test )')
    plt.xticks(np.arange(num_clfs), clf_labels, rotation=45)



In [13]:
# print statistics
for idx_score, score_label in enumerate(score_labels):
    print('=====================================================================')
    print(score_label)
    print('')
    print('=== training set ===')
    print(pd.DataFrame(train_scores[:, :, idx_score], columns=clf_labels).describe().T[['count', 'mean', 'std', 'min', 'max']])
    print('')
    print('=== testing set ===')
    print(pd.DataFrame(test_scores [:, :, idx_score], columns=clf_labels).describe().T[['count', 'mean', 'std', 'min', 'max']])
print('=====================================================================')


=====================================================================
F1 score

=== training set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.816024  0.011136  0.795556  0.837719
DecisionTreeClassifier   25.0  0.800925  0.016662  0.743649  0.826087
KNeighborsClassifier     25.0  0.872261  0.066856  0.816415  1.000000
LogisticRegression       25.0  0.824706  0.009545  0.811040  0.846307
SVC                      25.0  0.944144  0.043792  0.835341  0.990385
SGDClassifier            25.0  0.741779  0.139501  0.296296  0.850299
RandomForestClassifier   25.0  0.866271  0.028207  0.824242  0.918803

=== testing set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.802154  0.039678  0.733945  0.866142
DecisionTreeClassifier   25.0  0.786271  0.045299  0.705882  0.861314
KNeighborsClassifier     25.0  0.789420  0.043317  0.678571  0.852459
LogisticRegression       25.0  0.807905  0.033598  0.736842  0.854962
SVC                      25.0  0.814394  0.044809  0.706897  0.865672
SGDClassifier            25.0  0.723079  0.174088  0.148148  0.852713
RandomForestClassifier   25.0  0.808803  0.047674  0.705882  0.874074
=====================================================================
Accuracy

=== training set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.724937  0.016391  0.680380  0.765823
DecisionTreeClassifier   25.0  0.688861  0.018034  0.648734  0.721519
KNeighborsClassifier     25.0  0.808734  0.099951  0.731013  1.000000
LogisticRegression       25.0  0.730633  0.012267  0.712025  0.756329
SVC                      25.0  0.917975  0.069018  0.740506  0.987342
SGDClassifier            25.0  0.674304  0.086684  0.398734  0.762658
RandomForestClassifier   25.0  0.794557  0.047931  0.721519  0.879747

=== testing set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.704810  0.047605  0.620253  0.797468
DecisionTreeClassifier   25.0  0.664304  0.056623  0.556962  0.759494
KNeighborsClassifier     25.0  0.681013  0.052191  0.544304  0.772152
LogisticRegression       25.0  0.702785  0.041199  0.620253  0.759494
SVC                      25.0  0.704304  0.058223  0.569620  0.772152
SGDClassifier            25.0  0.654177  0.109424  0.392405  0.759494
RandomForestClassifier   25.0  0.701266  0.062548  0.556962  0.784810
=====================================================================
Recall

=== training set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.914904  0.042952  0.864734  1.000000
DecisionTreeClassifier   25.0  0.941104  0.062025  0.785366  1.000000
KNeighborsClassifier     25.0  0.952551  0.029923  0.903846  1.000000
LogisticRegression       25.0  0.949825  0.012127  0.927184  0.975728
SVC                      25.0  0.998862  0.003144  0.985782  1.000000
SGDClassifier            25.0  0.787181  0.239134  0.183486  0.976303
RandomForestClassifier   25.0  0.991392  0.007502  0.971292  1.000000

=== testing set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.884990  0.060832  0.779661  1.000000
DecisionTreeClassifier   25.0  0.916337  0.083177  0.689655  1.000000
KNeighborsClassifier     25.0  0.882794  0.041431  0.813559  0.962264
LogisticRegression       25.0  0.922322  0.043338  0.830508  0.981132
SVC                      25.0  0.959265  0.033941  0.857143  1.000000
SGDClassifier            25.0  0.764813  0.251923  0.085106  0.981132
RandomForestClassifier   25.0  0.934526  0.051566  0.821429  1.000000
=====================================================================
Precision

=== training set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.738300  0.026062  0.680380  0.792531
DecisionTreeClassifier   25.0  0.699810  0.025531  0.655063  0.757848
KNeighborsClassifier     25.0  0.807834  0.100443  0.731343  1.000000
LogisticRegression       25.0  0.728874  0.013429  0.705263  0.757143
SVC                      25.0  0.897829  0.073916  0.724739  0.980952
SGDClassifier            25.0  0.757921  0.063606  0.679181  0.954545
RandomForestClassifier   25.0  0.770275  0.044217  0.702703  0.849802

=== testing set ===
                        count      mean       std       min       max
AdaBoostClassifier       25.0  0.737879  0.059319  0.617647  0.833333
DecisionTreeClassifier   25.0  0.695387  0.064147  0.565789  0.800000
KNeighborsClassifier     25.0  0.717075  0.063322  0.558824  0.812500
LogisticRegression       25.0  0.722151  0.055351  0.605634  0.806452
SVC                      25.0  0.710037  0.061656  0.569444  0.777778
SGDClassifier            25.0  0.733371  0.068961  0.571429  0.819672
RandomForestClassifier   25.0  0.717061  0.068539  0.558442  0.803030
=====================================================================

In [14]:
print('Best F1 score:\n')
print('=== training set ===')
print(pd.DataFrame(train_scores[:, :, 0], columns=clf_labels).describe().T['max'])
print('')
print('=== testing set ===')
print(pd.DataFrame(test_scores [:, :, 0], columns=clf_labels).describe().T['max'])


Best F1 score:

=== training set ===
AdaBoostClassifier        0.837719
DecisionTreeClassifier    0.826087
KNeighborsClassifier      1.000000
LogisticRegression        0.846307
SVC                       0.990385
SGDClassifier             0.850299
RandomForestClassifier    0.918803
Name: max, dtype: float64

=== testing set ===
AdaBoostClassifier        0.866142
DecisionTreeClassifier    0.861314
KNeighborsClassifier      0.852459
LogisticRegression        0.854962
SVC                       0.865672
SGDClassifier             0.852713
RandomForestClassifier    0.874074
Name: max, dtype: float64

6. Training logistic regression model with the whole training set


In [15]:
# Extract the best logistic regression model from clfs_best
# Since 25 independent runs generate similar optimal parameters for logistic regression, 
# the first parameter set is selected. 
lr_best = (np.array(clfs_best).reshape(num_runs, num_clfs))[:,3][0]

# fit the model on the whole training set
# le_y is the label encoder that maps "yes"/"no" to 1/0 for the target
lr_best.fit(X_train_initial, le_y.transform(y_train_initial))

print("The final F1 score using the whole training set is %.3f. " 
      %f1_score(le_y.transform(y_test_initial), lr_best.predict(X_test_initial)))


The final F1 score using the whole training set is 0.838.
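
The comment in the cell above states that the 25 runs select similar parameters for logistic regression. One way to check this is sketched below. It assumes that `clfs_best` is a flat list ordered run by run (as the tuning loop appends it) and that each entry exposes scikit-learn's `get_params()`. The names `params_per_run` and `FakeEstimator` are hypothetical and exist only so that the snippet runs on its own.

```python
from collections import Counter

def params_per_run(estimators, num_runs, num_clfs, clf_index, names):
    """Return one tuple of the named parameters per run for the classifier
    stored at position clf_index within each run (hypothetical helper)."""
    assert len(estimators) == num_runs * num_clfs
    rows = []
    for run in range(num_runs):
        params = estimators[run * num_clfs + clf_index].get_params()
        rows.append(tuple(params[name] for name in names))
    return rows

class FakeEstimator(object):
    """Stand-in for a fitted estimator; only get_params() is mimicked."""
    def __init__(self, **params):
        self._params = params

    def get_params(self):
        return dict(self._params)

# Two runs with two classifiers each; index 1 plays the role of logistic regression.
fake_best = [FakeEstimator(max_depth=1), FakeEstimator(C=0.05, penalty='l1'),
             FakeEstimator(max_depth=2), FakeEstimator(C=0.05, penalty='l1')]

chosen = params_per_run(fake_best, num_runs=2, num_clfs=2,
                        clf_index=1, names=['C', 'penalty'])
counts = Counter(chosen)  # how often each parameter combination was selected
```

In the notebook, the equivalent call would be `params_per_run(clfs_best, num_runs, num_clfs, 3, ['C', 'penalty'])`, since logistic regression sits at index 3 of `clfs_set`.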