In [35]:
# Import libraries
%matplotlib inline
import numpy as np
import pandas as pd
import sklearn as skl
import matplotlib.pyplot as plt
In [36]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label; all others are feature columns
Now, can you find out the following facts about the dataset?
Use the code block below to compute these values. Instructions/steps are marked using TODOs.
In [37]:
student_data[student_data.passed=='yes'].shape[0]
Out[37]:
In [38]:
student_data.dtypes
Out[38]:
In [39]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1]-1
n_passed = student_data[student_data.passed=='yes'].shape[0]
n_failed = student_data[student_data.passed=='no'].shape[0]
grad_rate = 100.0 * n_passed / (n_passed + n_failed)  # 100.0 avoids Python 2 integer division truncating the rate
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)
In this section, we will prepare the data for modeling, training and testing.
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.
Let's first separate our data into feature and target columns, and see if any features are non-numeric.
Note: For this dataset, the last column ('passed') is the target or label we are trying to predict.
In [40]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1]) # all columns but last are features
target_col = student_data.columns[-1] # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)
X_all = student_data[feature_cols] # feature values for all students
y_all = student_data[target_col] # corresponding targets/labels
print "\nFeature values:-"
print X_all.head() # print the first 5 rows
As you can see, there are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.
Other columns, like Mjob and Fjob, have more than two values; these are known as categorical variables. The recommended way to handle such a column is to create as many columns as there are possible values (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.), and assign a 1 to one of them and 0 to all others.
These generated columns are sometimes called dummy variables, and we will use the pandas.get_dummies() function to perform this transformation.
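To make the transformation concrete, here is a toy get_dummies call on a made-up Fjob column (illustration only, not the project data):

```python
import pandas as pd

# A small stand-in for a categorical column like Fjob
jobs = pd.Series(['teacher', 'services', 'teacher', 'other'], name='Fjob')

# One dummy column per distinct value, named with the given prefix
dummies = pd.get_dummies(jobs, prefix='Fjob')
# Columns come out sorted by value: Fjob_other, Fjob_services, Fjob_teacher,
# and exactly one dummy is set per row.
```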
In [41]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty
    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
            # Note: This should change the data type for yes/no columns to int
        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'
        outX = outX.join(col_data)  # collect column(s) in output dataframe
    return outX
preproc_sd = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(preproc_sd.columns), list(preproc_sd.columns))
In [42]:
# First, decide how many training vs test samples you want
num_all = preproc_sd.shape[0] # same as len(student_data)
num_train = 300 # about 75% of the data
num_test = num_all - num_train
shuffled_preproc_sd = preproc_sd.reindex(np.random.permutation(preproc_sd.index))
# Change indices on the labels to match the shuffling.
shuffled_indices = shuffled_preproc_sd.index.values
shuffled_labels = y_all.reindex(shuffled_indices)
In [43]:
# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset
X_train = shuffled_preproc_sd.head(num_train).values
y_train = shuffled_labels.head(num_train).values
X_test = shuffled_preproc_sd.tail(num_test).values
y_test = shuffled_labels.tail(num_test).values
print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data
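As an aside, scikit-learn can do the shuffle-and-split above in a single call. A minimal sketch on stand-in arrays (X_demo/y_demo are made up here; note that in the older scikit-learn used in this notebook the import lives in sklearn.cross_validation rather than sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape idea: 395 rows, a few numeric features
rng = np.random.RandomState(42)
X_demo = rng.rand(395, 5)
y_demo = rng.choice(['yes', 'no'], size=395)

# Shuffles and splits in one step; train_size=300 mirrors num_train above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=300, random_state=42)
```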
I think there are a couple of features that might be the most important, based on my experience with teaching. First, attendance is key! Other good features to examine would be the school and family support students receive.
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:
Produce a table showing training time, prediction time, F1 score on training set and F1 score on test set, for each training set size.
Note: You need to produce 3 such tables - one for each model.
In [47]:
from sklearn.metrics import f1_score
import time
# Function for training a model
def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

# Predict on a feature set and compute its F1 score
def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target, y_pred, pos_label='yes')
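As a quick sanity check on the pos_label argument, here is a toy call with made-up labels:

```python
from sklearn.metrics import f1_score

# 'yes' is treated as the positive class
y_true = ['yes', 'yes', 'no', 'no']
y_pred = ['yes', 'no', 'no', 'yes']
score = f1_score(y_true, y_pred, pos_label='yes')
# One true positive, one false positive, one false negative:
# precision = recall = 0.5, so F1 = 0.5
```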
In [48]:
# Training data partitioning
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]
X_train_200 = X_train[:200]
y_train_200 = y_train[:200]
The DT classifier can handle multiple inputs and, if used with an entropy-based criterion, will split according to the highest information gained from an attribute. A couple of its key strengths are its simplicity and ability to handle multiple types of data. However, DTs are prone to overfitting and are sensitive to small changes in the training data. For example, the structure of the tree may change greatly between training runs.
I chose this model first because it intuitively aligned with the problem: we have a lot of features, so we could ask multiple questions to determine whether a student should get assistance.
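The entropy/information-gain criterion mentioned above can be sketched in a few lines. This is a toy illustration, not the project data, and entropy/information_gain are hypothetical helper names:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """Entropy reduction from splitting `labels` on a discrete `attribute`."""
    total = entropy(labels)
    split = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        split += mask.mean() * entropy(labels[mask])  # weighted child entropy
    return total - split

# An ideal attribute splits the set into two perfect subsets,
# recovering the full 1 bit of entropy of a 50/50 label mix.
labels = np.array(['pass', 'pass', 'fail', 'fail'])
attr = np.array([1, 1, 0, 0])
```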
In [49]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier(criterion="entropy")
In [50]:
# Load up the GridSearch
from sklearn.grid_search import GridSearchCV
To understand grid search a little better, I tried it out on the single DT classifier with the following parameter selections:
Note: I wrapped the F1 metric in a lambda so that the positive label for the scoring metric is the string 'yes'. By passing this parameter, I do not need to convert the label data into 1s and 0s.
In [51]:
dtc_params = {'criterion':("gini","entropy"), 'min_samples_split':(2,4,8,16), 'max_features':("auto","sqrt","log2"),
'max_depth':np.arange(1,31,1)}
f1scorer = skl.metrics.make_scorer( lambda yt, yp : skl.metrics.f1_score(yt, yp, pos_label='yes') )
tuned_dtc = GridSearchCV(dtc, dtc_params, f1scorer)
tuned_dtc.fit(X_train, y_train)
tuned_dtc.best_estimator_
# print "GridSearch DT Classifier"
# train_predict(tuned_dtc, X_train_100, y_train_100, X_test, y_test)
# train_predict(tuned_dtc, X_train_200, y_train_200, X_test, y_test)
# train_predict(tuned_dtc, X_train, y_train, X_test, y_test)
Out[51]:
...now back to evaluating the DT classifier.
In [52]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))
print "Non-tuned DT Classifier"
train_predict(dtc, X_train_100, y_train_100, X_test, y_test)
train_predict(dtc, X_train_200, y_train_200, X_test, y_test)
train_predict(dtc, X_train, y_train, X_test, y_test)
We saw the application of Naive Bayes models in email spam filters. In that case, we were trying to compute the likelihood that a particular email was spam based on the input data. Bayesian learning, in essence, lets us switch cause and effect so that we can determine what sets of data make an outcome (pass/fail) likely. Naive Bayes algorithms are relatively fast compared to other supervised learning techniques, since they make the conditional independence assumption. The main disadvantage comes from that same assumption: we cannot leverage interactions among the features. In practice, though, Naive Bayes works well with minimal tuning.
One can think of this problem like the spam filter example: given a passing (or failing) student classification, what was the effect of different features on the likelihood of that result? The feature data provides a chain of evidence to help derive the likelihood of correctly classifying the student.
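The "switch cause and effect" step is just Bayes' rule. A toy numeric sketch with made-up probabilities (not estimated from the student data):

```python
# From the likelihood of a feature given the outcome,
# recover the outcome given the feature via Bayes' rule.
p_pass = 0.67                 # prior P(pass) -- made up for illustration
p_internet_given_pass = 0.9   # likelihood P(internet | pass)
p_internet_given_fail = 0.6   # likelihood P(internet | fail)

# Total probability of the evidence
p_internet = (p_internet_given_pass * p_pass
              + p_internet_given_fail * (1 - p_pass))

# Posterior: P(pass | internet) = P(internet | pass) * P(pass) / P(internet)
p_pass_given_internet = p_internet_given_pass * p_pass / p_internet
```

With these numbers the posterior rises above the 0.67 prior, since home internet access is (by assumption here) more common among passing students.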
In [53]:
from sklearn.naive_bayes import GaussianNB
# GaussianNB can accept sigma and theta as parameters, but I will try it empty.
gnb = GaussianNB()
train_predict(gnb, X_train_100, y_train_100, X_test, y_test)
train_predict(gnb, X_train_200, y_train_200, X_test, y_test)
train_predict(gnb, X_train, y_train, X_test, y_test)
The BaggingClassifier in sklearn uses one type of classification algorithm and generates a set of learners (10 by default), each trained on a subset of the data and features. The results of the trained learners are averaged to come up with a classification over all running classifiers. One advantage of this method is the ability to construct a complex learner from a set of relatively simple learning algorithms. However, bagging increases the computational complexity, especially for tree-based classifiers.
My last evaluated model will use a bagging ensemble of single classifiers. The data set has many features that take different values. Much like the email examples in the lectures, this project could benefit from an ensemble of simpler classifiers. The implementation of the bagged classifier will use my DT classifier as its simple base model, as the sklearn documentation mentions that BaggingClassifier works well with decision tree algorithms.
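Before handing this to BaggingClassifier, the bagging idea itself can be sketched by hand: bootstrap-sample the data, fit a simple tree on each sample, and majority-vote the predictions. This uses stand-in data, not the project set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # synthetic pass/fail labels

learners = []
for _ in range(10):
    idx = rng.randint(0, len(X), len(X))       # bootstrap sample (with replacement)
    clf = DecisionTreeClassifier(max_depth=2)  # deliberately simple base learner
    clf.fit(X[idx], y[idx])
    learners.append(clf)

# Average the per-learner votes, then threshold for a majority decision
votes = np.mean([clf.predict(X) for clf in learners], axis=0)
ensemble_pred = (votes >= 0.5).astype(int)
```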
In [54]:
# Bagger!
from sklearn.ensemble import BaggingClassifier
# I selected to have smaller sample and feature sets, but more estimators. The DT will use the entropy criterion.
baggingClf_DT = BaggingClassifier(tree.DecisionTreeClassifier(criterion="entropy"), max_samples=0.3, max_features=0.3)
train_predict(baggingClf_DT, X_train_100, y_train_100, X_test, y_test)
train_predict(baggingClf_DT, X_train_200, y_train_200, X_test, y_test)
train_predict(baggingClf_DT, X_train, y_train, X_test, y_test)
The performance looks similar to the single DT classifiers.
Based on the coded tests above, I recommend using a Bagged Decision Tree Classifier (DTC) for identifying students in need of assistance. The basic entropy-based DTC has decent performance with minimal tuning: F1 scores over 0.7 for larger data sets. A simple Gaussian Naive Bayes (GNB) classifier trained quickly, but its performance was outmatched by the alternative DTC. The tables and code snippets above demonstrate how these single classifiers do not generalize as well as the Bagged DTC. This particular classifier used an entropy DTC to start, but generated 20 different models over smaller sets of samples and features.
Although the Bagged DTC takes longer to train and predicts about 2x slower, its overall performance across data sets was better than the single classifiers'. In this application, the added training and prediction time are worth the improved accuracy: too many false positives would drain human resources far more than computing resources.
The proposed model is generated from the concept of an ensemble learner: it is a single, complex learner composed of many simple learners. Bagging ensemble models average the results from their constituent classifiers, and each simple classifier is built up using a subset of the features and a subset of the data. To improve performance, it is recommended that this classifier be run through a grid search over the max_samples and max_features parameters to determine a higher-performing combination of simple learners.
At the core of my bagging classifier is a DTC that uses entropy as its splitting criterion. Since I am more familiar with leveraging information gain, I chose the entropy-based DTC. In general, DTCs ask a series of questions over the data set, splitting the data into categories as each question is asked. Entropy is a measure of how random a collection of data points is, and an entropy-based DTC chooses to split on attributes that reduce this randomness. For example, an ideal attribute would be one that splits the entire data set into two perfect subsets.
The main attributes we can tune for a bagging classifier are the number of estimators as well as the maximum samples and features used in each estimator. I chose to explore across a broad range of sample sizes, but I limited the maximum number of features to 60%. I wanted to see if the resulting classifier would prefer to use more data to learn more features.
In [55]:
bagged_params = {'max_samples':np.arange(0.1,1,0.1), 'max_features':np.arange(0.1,0.7,0.1),'n_estimators':np.arange(1,16,1)}
basicBaggedClf = BaggingClassifier(tree.DecisionTreeClassifier(criterion="entropy"))
tunedBaggedClf = GridSearchCV(basicBaggedClf, bagged_params, f1scorer)
tunedBaggedClf.fit(X_train, y_train)
tunedBaggedClf.best_estimator_
Out[55]:
Running this tuned algorithm on the test data gave the following results:
In [56]:
predict_labels(tunedBaggedClf, X_test, y_test)
Out[56]:
The tuned bagging classifier has a slight edge over the single version. It performs better than the single DT classifier, and its prediction time (for the full data set) is only a little longer: the single DTC executed in under 1 ms, while the bagged DTC executed in about 2 ms.