Project 2: Supervised Learning

Building a Student Intervention System

1. Classification vs Regression

Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?

2. Exploring the Data

Let's go ahead and read in the student dataset first.

To execute a code cell, click inside it and press Shift+Enter.


In [35]:
# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn as skl
import matplotlib.pyplot as plt

In [36]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label; all others are feature columns


Student data read successfully!

Now, can you find out the following facts about the dataset?

  • Total number of students
  • Number of students who passed
  • Number of students who failed
  • Graduation rate of the class (%)
  • Number of features

Use the code block below to compute these values. Instructions/steps are marked using TODOs.


In [37]:
student_data[student_data.passed=='yes'].shape[0]


Out[37]:
265

In [38]:
student_data.dtypes


Out[38]:
school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
passed        object
dtype: object

In [39]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1]-1
n_passed = student_data[student_data.passed=='yes'].shape[0]
n_failed = student_data[student_data.passed=='no'].shape[0]
grad_rate = 100.0 * n_passed / (n_passed + n_failed)  # use a float so Python 2 integer division doesn't truncate the rate
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)


Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%

3. Preparing the Data

In this section, we will prepare the data for modeling, training and testing.

Identify feature and target columns

It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.
Note: For this dataset, the last column ('passed') is the target or label we are trying to predict.


In [40]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows


Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       yes      yes        no       5         3     3    1    1      3   
2   ...       yes      yes        no       4         3     2    2    3      3   
3   ...       yes      yes       yes       3         2     2    1    1      5   
4   ...       yes       no        no       4         3     2    1    2      5   

  absences  
0        6  
1        4  
2       10  
3        2  
4        4  

[5 rows x 30 columns]

Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.

Other columns, like Mjob and Fjob, have more than two values and are known as categorical variables. The recommended way to handle such a column is to create one new column per possible value (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and assign a 1 to the column matching the original value and 0 to all others.

These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation.
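As a quick illustration of what pandas.get_dummies() produces, here is a minimal sketch on a toy column (made-up values, not the project data):

# Toy example: one categorical column becomes one 0/1 indicator column per value
toy = pd.DataFrame({'Mjob': ['teacher', 'health', 'other', 'teacher']})
print pd.get_dummies(toy['Mjob'], prefix='Mjob')
# Produces columns Mjob_health, Mjob_other, Mjob_teacher, with a 1 marking each row's value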


In [41]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

preproc_sd = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(preproc_sd.columns), list(preproc_sd.columns))


Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Split data into training and test sets

So far, we have converted all categorical features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.
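(For reference, scikit-learn also provides a one-call shuffle-and-split; a minimal sketch using train_test_split, which in this version of the library lives in sklearn.cross_validation. It is shown only as an alternative to the manual shuffle below; the random_state is an arbitrary choice for reproducibility.)

# Alternative split using scikit-learn's helper (same intent as the manual shuffle below)
from sklearn.cross_validation import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(preproc_sd, y_all, test_size=95, random_state=42)
print "Training set: {} samples, test set: {} samples".format(X_tr.shape[0], X_te.shape[0])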


In [42]:
# First, decide how many training vs test samples you want
num_all = preproc_sd.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train
shuffled_preproc_sd = preproc_sd.reindex(np.random.permutation(preproc_sd.index))
# Change indices on the labels to match the shuffling.
shuffled_indices = shuffled_preproc_sd.index.values
shuffled_labels = y_all.reindex(shuffled_indices)

In [43]:
# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
X_train = shuffled_preproc_sd.head(num_train).values
y_train = shuffled_labels.head(num_train).values
X_test = shuffled_preproc_sd.tail(num_test).values
y_test = shuffled_labels.tail(num_test).values
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data


Training set: 300 samples
Test set: 95 samples

Based on my experience with teaching, I think a couple of features are likely to be the most important. First, attendance is key, so absences should matter. Another good pair of features to examine would be the school and family support a student receives (schoolsup and famsup).
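As a quick sanity check on those hunches, the raw data can be cross-tabulated before moving on to modeling (a rough exploratory sketch, not one of the required steps):

# Rough check of the two hunches above using the raw (unprocessed) student data
print pd.crosstab(student_data['schoolsup'], student_data['passed'])  # pass/fail counts by school support
print pd.crosstab(student_data['famsup'], student_data['passed'])     # pass/fail counts by family support
print student_data.groupby('passed')['absences'].mean()               # average absences for passers vs. non-passers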

4. Training and Evaluating Models

Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

  • What are the general applications of this model? What are its strengths and weaknesses?
  • Given what you know about the data so far, why did you choose this model to apply?
  • Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F1 score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F1 score on training set and F1 score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.


In [47]:
from sklearn.metrics import f1_score
import time

# Function for training a model
def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)
    
# Predict on training set and compute F1 score
def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target, y_pred, pos_label='yes')

In [48]:
# Training data partitioning
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]
X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

Decision Tree

The DT classifier can handle many input features, and when used with an entropy-based criterion it splits on the attribute that yields the highest information gain. A couple of its key strengths are its simplicity and its ability to handle multiple types of data. However, DTs are prone to overfitting and are sensitive to small changes in the data; for example, the structure of the tree may change greatly between training runs.

I chose this model first because it intuitively aligned with the problem: we have a lot of features, so we could ask multiple questions to determine whether a student should get assistance.


In [49]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier(criterion="entropy")
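One way to see which "questions" the tree above ends up asking is to inspect its feature importances after it has been fit; a small sketch (the top_features helper is my own, and assumes the preprocessed column names in preproc_sd):

# Sketch: list the features a fitted tree relies on most (run only after dtc has been fit)
def top_features(clf, feature_names, n=5):
    ranked = sorted(zip(clf.feature_importances_, feature_names), reverse=True)
    for score, name in ranked[:n]:
        print "{:<20s} {:.3f}".format(name, score)

# Example usage, once the classifier has been trained further below:
# top_features(dtc, list(preproc_sd.columns))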

Aside: Grid Search Testing


In [50]:
# Load up the GridSearch
from sklearn.grid_search import GridSearchCV

To understand grid search a little better, I tried it out on the single DT classifier with the following parameter selections:

  1. criterion = "gini", "entropy"
  2. min_samples_split = 2, 4, 8, 16
  3. max_features = "auto", "sqrt", "log2"
  4. max_depth = array of integers, [1 ... 30]

Note: I wrapped f1_score in a lambda so that the positive label for the F1 metric is the string 'yes'. By passing this scorer, I do not need to convert the label data into 1s and 0s.


In [51]:
dtc_params = {'criterion':("gini","entropy"), 'min_samples_split':(2,4,8,16), 'max_features':("auto","sqrt","log2"),
             'max_depth':np.arange(1,31,1)}
f1scorer = skl.metrics.make_scorer(lambda yt, yp: skl.metrics.f1_score(yt, yp, pos_label='yes'))
tuned_dtc = GridSearchCV(dtc, dtc_params, f1scorer)
tuned_dtc.fit(X_train, y_train)
tuned_dtc.best_estimator_
# print "GridSearch DT Classifier"
# train_predict(tuned_dtc, X_train_100, y_train_100, X_test, y_test)
# train_predict(tuned_dtc, X_train_200, y_train_200, X_test, y_test)
# train_predict(tuned_dtc, X_train, y_train, X_test, y_test)


Out[51]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=1,
            max_features='sqrt', max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=4, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

...now back to evaluating the DT classifier.

Evaluating the DT Classifier with Varying Training Set Sizes

From the data below, when I use entropy for splits, the test F1 scores stay in the low-to-mid 0.7 range across training set sizes, while the training F1 score is 1.0 every time. That gap is a sign that the unpruned tree overfits the training data.


In [52]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

print "Non-tuned DT Classifier"
train_predict(dtc, X_train_100, y_train_100, X_test, y_test)
train_predict(dtc, X_train_200, y_train_200, X_test, y_test)
train_predict(dtc, X_train, y_train, X_test, y_test)


Non-tuned DT Classifier
------------------------------------------
Training set size: 100
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.001
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.755244755245
------------------------------------------
Training set size: 200
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.002
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.763358778626
------------------------------------------
Training set size: 300
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.002
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.734375

Summary Results - DTC

Data Size | Training Time (s) | Prediction Time (s) | F1 Train | F1 Test
100       | 0.001             | 0.000               | 1.0      | 0.6935
200       | 0.002             | 0.000               | 1.0      | 0.6865
300       | 0.002             | 0.000               | 1.0      | 0.7218

Bayesian Model

We saw an application of Naive Bayes models in email spam filters: there, we computed the likelihood that a particular email was spam given the input data. Bayesian learning, in essence, lets us switch cause and effect, so that we can determine which sets of feature values make an outcome (pass/fail) likely. Naive Bayes algorithms are relatively fast compared to other supervised learning techniques because they make the conditional independence assumption. The main disadvantage comes from that same assumption: we cannot leverage interactions among the features. In practice, Naive Bayes works well with minimal tuning.

One can think of this problem like the spam filter example: given a passing (or failing) classification, what was the effect of each feature on the likelihood of that result? The feature data provides a chain of evidence from which to derive the likelihood of correctly classifying the student.
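To make the "chain of evidence" idea concrete, here is a minimal single-feature sketch of Bayes' rule computed on the raw data (the GaussianNB model below instead treats the numeric features as Gaussian and chains all features together under the independence assumption):

# Single-feature illustration: P(pass | internet) = P(internet | pass) * P(pass) / P(internet)
p_pass = (student_data['passed'] == 'yes').mean()
p_internet = (student_data['internet'] == 'yes').mean()
p_internet_given_pass = (student_data[student_data['passed'] == 'yes']['internet'] == 'yes').mean()
p_pass_given_internet = p_internet_given_pass * p_pass / p_internet
print "P(pass | internet access) = {:.3f}".format(p_pass_given_internet)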


In [53]:
from sklearn.naive_bayes import GaussianNB
# GaussianNB has no hyperparameters to set here (theta and sigma are learned during fitting), so I use the defaults.
gnb = GaussianNB()
train_predict(gnb, X_train_100, y_train_100, X_test, y_test)
train_predict(gnb, X_train_200, y_train_200, X_test, y_test)
train_predict(gnb, X_train, y_train, X_test, y_test)


------------------------------------------
Training set size: 100
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.828125
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.727272727273
------------------------------------------
Training set size: 200
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.769230769231
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.757575757576
------------------------------------------
Training set size: 300
Training GaussianNB...
Done!
Training time (secs): 0.001
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.782828282828
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.80303030303

It works pretty well with no tuning. The F1 scores get better with more data.

Summary Results - Gaussian Naive Bayes

Data Size | Training Time (s) | Prediction Time (s) | F1 Train | F1 Test
100       | 0.001             | 0.000               | 0.6732   | 0.3720
200       | 0.001             | 0.000               | 0.8218   | 0.7727
300       | 0.001             | 0.000               | 0.7922   | 0.7969

Bagged Ensemble Model

The BaggingClassifier in sklearn takes one type of classification algorithm and generates a set of learners (10 by default), each trained on a random subset of the samples and features. The predictions of these base learners are then combined by voting/averaging to produce the overall classification. One advantage of this method is the ability to construct a complex learner from a set of relatively simple learning algorithms. However, bagging increases the computational cost, especially for tree-based classifiers.

My last evaluated model uses a bagging ensemble of single classifiers. The data set has many features that take different values, and much like the email examples in the lectures, this problem could benefit from an ensemble of simpler classifiers. The bagged classifier uses my DT classifier as its base model, since the sklearn documentation notes that bagging works well with decision trees.
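To make the mechanism concrete, here is a rough hand-rolled sketch of what bagging does (illustrative only, not the sklearn implementation; it votes over bootstrap-sampled trees and omits the feature subsampling that max_features adds):

# Illustrative bagging sketch: bootstrap samples, one entropy tree per sample, majority vote
def bagged_predict(X_tr, y_tr, X_te, n_estimators=10, sample_frac=0.3):
    all_votes = []
    for i in range(n_estimators):
        # Bootstrap a fraction of the training rows (with replacement)
        idx = np.random.choice(len(X_tr), int(sample_frac * len(X_tr)), replace=True)
        clf = tree.DecisionTreeClassifier(criterion="entropy")
        clf.fit(X_tr[idx], y_tr[idx])
        all_votes.append(clf.predict(X_te))
    all_votes = np.array(all_votes)
    # Majority vote across the estimators for each test sample
    return np.array([max(set(col), key=list(col).count) for col in all_votes.T])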


In [54]:
# Bagger!
from sklearn.ensemble import BaggingClassifier
# I selected smaller sample and feature subsets (30% each) and left n_estimators at its default of 10. The DT uses the entropy criterion.
baggingClf_DT = BaggingClassifier(tree.DecisionTreeClassifier(criterion="entropy"), max_samples=0.3, max_features=0.3)
train_predict(baggingClf_DT, X_train_100, y_train_100, X_test, y_test)
train_predict(baggingClf_DT, X_train_200, y_train_200, X_test, y_test)
train_predict(baggingClf_DT, X_train, y_train, X_test, y_test)


------------------------------------------
Training set size: 100
Training BaggingClassifier...
Done!
Training time (secs): 0.021
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.878378378378
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.751773049645
------------------------------------------
Training set size: 200
Training BaggingClassifier...
Done!
Training time (secs): 0.019
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.870588235294
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.787878787879
------------------------------------------
Training set size: 300
Training BaggingClassifier...
Done!
Training time (secs): 0.018
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 0.859154929577
Predicting labels using BaggingClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.769230769231

Summary Results - Bagged DT Classifier

Data Size | Training Time (s) | Prediction Time (s) | F1 Train | F1 Test
100       | 0.037             | 0.002               | 0.9022   | 0.7022
200       | 0.032             | 0.002               | 0.9096   | 0.7919
300       | 0.034             | 0.002               | 0.8888   | 0.7702

The performance looks similar to that of the single DT classifier.

5. Choosing the Best Model

  • Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
  • In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
  • Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
  • What is the model's final F1 score?

Model Recommendations

Based on the coded tests above, I recommend using a Bagged Decision Tree Classifier (DTC) for identifying students in need of assistance. The basic entropy-based DTC has decent performance with minimal tuning, with test F1 scores over 0.7 for the larger training sets. A simple Gaussian Naive Bayes (GNB) classifier trained quickly, but its performance was outmatched by the bagged alternative. The tables and code snippets above demonstrate that these single classifiers do not generalize as well as the Bagged DTC. This particular classifier uses an entropy-based DTC as its base learner and builds an ensemble of such trees, each trained on a smaller subset of the samples and features.

Although the Bagged DTC takes longer to train and roughly twice as long to predict, its overall performance across training set sizes was better than that of the single classifiers. In this application the added training and execution time is worth the improved accuracy: too many false positives would drain human resources far more than the extra computing resources cost us.

Bagged Decision Tree Classification

The proposed model is built on the concept of an ensemble learner: it is a single, complex learner composed of many simple learners. Bagging ensemble models average the results from their constituent classifiers, each of which is built on a subset of the features and a subset of the data. To improve performance, this classifier should be run through a grid search over the max_samples and max_features parameters to find a higher-performing combination of simple learners.

At the core of my bagging classifier is a DTC that uses entropy as its splitting criterion. Since I am more familiar with information gain, I chose the entropy-based DTC. In general, DTCs ask a series of questions about the data, splitting it into categories as each question is answered. Entropy is a measure of how random a collection of data points is, and an entropy-based DTC chooses to split on the attributes that most reduce this randomness. For example, an ideal attribute would be one that splits the entire data set into two perfectly pure subsets.
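For reference, the entropy of a pass/fail collection can be computed directly; a small worked sketch (the 265/130 counts come from Section 2 above, and the entropy helper is my own):

# Entropy H = -sum(p * log2(p)) over the label proportions in a collection
def entropy(counts):
    probs = np.array(counts, dtype=float) / sum(counts)
    return -sum(p * np.log2(p) for p in probs if p > 0)

print "All students (265 pass / 130 fail): {:.3f} bits".format(entropy([265, 130]))
print "A perfectly pure subset:            {:.3f} bits".format(entropy([100, 0]))
print "A 50/50 split:                      {:.3f} bits".format(entropy([50, 50]))
# A good split produces child subsets whose weighted entropy is much lower than the
# parent's, i.e. a split with high information gain.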

Tuning the Bagged DT Classifier

The main attributes we can tune for a bagging classifier are the number of estimators and the maximum fraction of samples and features used in each estimator. I chose to explore a broad range of sample fractions, but I limited the maximum fraction of features to 60%. I wanted to see whether the resulting classifier would prefer more samples or more features per estimator.


In [55]:
bagged_params = {'max_samples':np.arange(0.1,1,0.1), 'max_features':np.arange(0.1,0.7,0.1),'n_estimators':np.arange(1,16,1)}
basicBaggedClf = BaggingClassifier(tree.DecisionTreeClassifier(criterion="entropy"))
tunedBaggedClf = GridSearchCV(basicBaggedClf, bagged_params, f1scorer)
tunedBaggedClf.fit(X_train, y_train)
tunedBaggedClf.best_estimator_


Out[55]:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
         bootstrap=True, bootstrap_features=False,
         max_features=0.10000000000000001, max_samples=0.30000000000000004,
         n_estimators=13, n_jobs=1, oob_score=False, random_state=None,
         verbose=0, warm_start=False)

Running this tuned algorithm on the test data gave the following results:


In [56]:
predict_labels(tunedBaggedClf, X_test, y_test)


Predicting labels using GridSearchCV...
Done!
Prediction time (secs): 0.002
Out[56]:
0.79746835443037978

The tuned bagging classifier has a slight edge over the untuned version, and it clearly outperforms the single DT classifier. Its prediction time over the full test set is only a little longer: the single DTC predicted in under 1 ms, while the bagged DTC took about 2 ms.
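For completeness, the winning parameter combination and its cross-validated score can be read back off the fitted grid search object (a short snippet; the exact values will vary between runs because the shuffling and bagging are random):

# Inspect the parameters the grid search selected and its best cross-validated F1 score
print "Best parameters: {}".format(tunedBaggedClf.best_params_)
print "Best cross-validated F1 score: {:.4f}".format(tunedBaggedClf.best_score_)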