Building a Student Intervention System

This is a binary classification problem: we want to separate students into those who need early intervention and those who do not.

Exploring the Data


In [1]:
# Import libraries
import numpy as np
import pandas as pd
from time import time
from sklearn.metrics import f1_score
from IPython.display import display
import visuals as vs
import sklearn.learning_curve as curves
import matplotlib.pyplot as pl
rstate = 10

%matplotlib inline

# Read student data
student_data = pd.read_csv("student-data.csv")
display(student_data.head())
print "Student data read successfully!"


   school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ... internet romantic famrel freetime goout Dalc Walc health absences passed
0      GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       no       no      4        3     4    1    1      3        6     no
1      GP   F   17       U     GT3       T     1     1  at_home     other  ...      yes       no      5        3     3    1    1      3        4     no
2      GP   F   15       U     LE3       T     1     1  at_home     other  ...      yes       no      4        3     2    2    3      3       10    yes
3      GP   F   15       U     GT3       T     4     2   health  services  ...      yes      yes      3        2     2    1    1      5        2    yes
4      GP   F   16       U     GT3       T     3     3    other     other  ...       no       no      4        3     2    1    2      5        4    yes

5 rows × 31 columns

Student data read successfully!

Implementation: Data Exploration

Let's begin by investigating the dataset to determine how many students we have information on, and learn about the graduation rate among these students. In the code cell below, we will compute the following:

  • The total number of students, n_students.
  • The total number of features for each student, n_features.
  • The number of those students who passed, n_passed.
  • The number of those students who failed, n_failed.
  • The graduation rate of the class, grad_rate, in percent (%).

In [2]:
from collections import Counter
# Calculate number of students
n_students = student_data.shape[0]

# Calculate number of features
n_features = student_data.shape[1]-1

# Calculate passing and failing students
passed_counts = Counter(student_data["passed"].tolist())
n_passed = passed_counts["yes"]
n_failed = passed_counts["no"]

# Calculate graduation rate
grad_rate = (float(n_passed) * 100) / n_students

# Print the results
print "Total number of students: {}".format(n_students)
print "Number of features: {}".format(n_features)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)
print "67% graudation rate indicates that data is imbalanced, there are more positive labels than negative lables."


Total number of students: 395
Number of features: 30
Number of students who passed: 265
Number of students who failed: 130
Graduation rate of the class: 67.09%
A 67% graduation rate indicates that the data is imbalanced: there are more positive labels than negative labels.
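
Given this imbalance, a useful baseline is the F1 score of a naive classifier that always predicts "yes". A minimal sketch (reusing n_passed and n_students from the cell above):

# Minimal sketch: F1 score of a naive baseline that predicts "yes" for everyone.
# Precision is the passing fraction (265/395) and recall is 1.0, so
# F1 = 2*p*r / (p + r) comes out to roughly 0.80.
p = float(n_passed) / n_students   # precision of the all-"yes" classifier
r = 1.0                            # recall: every passing student is caught
baseline_f1 = 2 * p * r / (p + r)
print "Naive all-'yes' baseline F1: {:.4f}".format(baseline_f1)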

Preparing the Data

In this section, we will prepare the data for modeling, training and testing.

Identify feature and target columns

The code cell below separates the student data into feature and target columns, which lets us check whether any features are non-numeric.


In [3]:
# Extract feature columns
feature_cols = list(student_data.columns[:-1])

# Extract target column 'passed'
target_col = student_data.columns[-1] 

# Show the list of columns
print "Feature columns:\n{}".format(feature_cols)
print "\nTarget column: {}".format(target_col)

# Separate the data into feature data and target data (X_all and y_all, respectively)
X_all = student_data[feature_cols]
y_all = student_data[target_col]

# Show the feature information by printing the first five rows
print "\nFeature values:"
print X_all.head()


Feature columns:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Target column: passed

Feature values:
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...       yes      yes        no       5         3     3    1    1      3   
2   ...       yes      yes        no       4         3     2    2    3      3   
3   ...       yes      yes       yes       3         2     2    1    1      5   
4   ...       yes       no        no       4         3     2    1    2      5   

  absences  
0        6  
1        4  
2       10  
3        2  
4        4  

[5 rows x 30 columns]

Preprocess Feature Columns

There are several non-numeric columns that need to be converted! Many of them are simply yes/no, e.g. internet. These can be reasonably converted into 1/0 (binary) values.

Other columns, like Mjob and Fjob, have more than two values and are known as categorical variables. For these, I will create one column per possible value (e.g. Fjob_teacher, Fjob_other, Fjob_services, etc.) and assign a 1 to the matching column and 0 to all others, as illustrated in the short sketch below.
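
As a quick illustration of the dummy-variable idea on toy values (not from the dataset):

# Toy sketch: one-hot encoding a small categorical series with get_dummies
toy = pd.Series(['teacher', 'other', 'health'], name='Fjob')
print pd.get_dummies(toy, prefix='Fjob')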


In [4]:
def preprocess_features(X):
    ''' Preprocesses the student data and converts non-numeric binary variables into
        binary (0/1) variables. Converts categorical variables into dummy variables. '''
    
    # Initialize new output DataFrame
    output = pd.DataFrame(index = X.index)

    # Investigate each feature column for the data
    for col, col_data in X.iteritems():
        
        # If data type is non-numeric, replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])

        # If data type is categorical, convert to dummy variables
        if col_data.dtype == object:
            # Example: 'school' => 'school_GP' and 'school_MS'
            col_data = pd.get_dummies(col_data, prefix = col)  
        
        # Collect the revised columns
        output = output.join(col_data)
    
    return output

X_all = preprocess_features(X_all)
print "Processed feature columns ({} total features):\n{}".format(len(X_all.columns), list(X_all.columns))


Processed feature columns (48 total features):
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']

Implementation: Training and Testing Data Split

So far, we have converted all categorical features into numeric values. Next, we split the data (both features and corresponding labels) into training and test sets. In the code cell below, I will:

  • Randomly shuffle and split the data (X_all, y_all) into training and testing subsets.
    • Use 300 training points (approximately 75%) and 95 testing points (approximately 25%).
    • Set a random_state for any function that accepts one.
    • Store the results in X_train, X_test, y_train, and y_test.

In [5]:
# Import additional functionality for splitting the data
from sklearn.cross_validation import ShuffleSplit
from sklearn import cross_validation
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import StratifiedShuffleSplit

# Set the number of training points
num_train = 300

# Set the number of testing points
num_test = X_all.shape[0] - num_train

# Shuffle and split the dataset into the number of training and testing points above
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=95, random_state=rstate)

# Show the results of the split
print "Training set has {} samples.".format(X_train.shape[0])
print "Testing set has {} samples.".format(X_test.shape[0])


Training set has 300 samples.
Testing set has 95 samples.
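
Given the class imbalance noted earlier, a stratified split would keep the pass/fail ratio roughly equal in both subsets. A minimal sketch, assuming this sklearn version's train_test_split accepts a stratify argument:

# Minimal sketch: a stratified split that preserves the ~67/33 pass/fail ratio
# in both the training and testing subsets.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=95, random_state=rstate, stratify=y_all)
print y_tr_s.value_counts()
print y_te_s.value_counts()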

Training and Evaluating Models

In this section, I will choose three supervised learning models that are appropriate for this problem and available in scikit-learn. I will first discuss the reasoning behind choosing these models, considering what I know about the data and each model's strengths and weaknesses. I will then fit each model to varying sizes of training data (100, 200, and 300 data points) and measure the F1 score, producing three tables (one per model) that show the training set size, training time, prediction time, F1 score on the training set, and F1 score on the testing set.

Model Application

We will use the following three methods: 1) decision trees, 2) logistic regression, and 3) support vector machines (SVM). The general application of these models is to find a classification boundary that separates the data into classes.

Logistic Regression: Logistic regression is easy to understand conceptually; it is very similar to linear regression. However, it produces only a linear boundary, which may not be useful if the data are not linearly separable. Training and prediction times can be very small, and training time is often much lower than for decision trees or SVMs. It is also easy to control the complexity of the logistic regression boundary through regularization.

Decision Trees: Decision trees are a little more complex than logistic regression. A decision tree classifier can produce a highly nonlinear boundary and can easily overfit the training data if parameters such as the maximum depth of the tree are not carefully tuned. Training time depends on the depth of the tree. We have better control over the complexity of a decision tree classifier than over the complexity of an SVC.

Support Vector Classifier: Training time for a support vector classifier can be much larger than for logistic regression or a decision tree classifier. However, an SVC can achieve very good accuracy compared to logistic regression, and with kernel methods it can produce a nonlinear boundary. Its complexity can be controlled through the kernel and regularization parameters. An SVC also tolerates outliers while maximizing the margin between the observations and the classification boundary.

Since the data contains many features, all three models (logistic regression, decision trees, and SVC) are known to perform well in such situations.

As the target label is binary and most of the input features are binary, logistic regression is a natural choice.

Decision trees can also be used for binary classification with such categorical features as input. They choose splits by maximizing the information gain (computed from entropy) at each node rather than relying on feature magnitudes. Additionally, because they can learn a nonlinear decision boundary, they offer an advantage over logistic regression.

SVCs implicitly transform the features into a higher-dimensional space (the kernel trick) to determine the decision boundary, so they are also suitable for training on these kinds of features. A short sketch of each model's main complexity knob follows below.
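
To make the complexity claims above concrete, here is a minimal sketch (illustrative, untuned values) of each model's main complexity knob in scikit-learn:

# Minimal sketch: the main complexity knob for each of the three models.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Smaller C means stronger regularization, hence a simpler boundary
lr = LogisticRegression(C=0.1)

# A shallower tree makes fewer splits and is less likely to overfit
dt = DecisionTreeClassifier(max_depth=3)

# An RBF kernel gives a nonlinear boundary; C trades margin width for errors
svc = SVC(kernel='rbf', C=1.0)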

Setup

In the code cell below, I will initialize three helper functions for training and testing the three supervised learning models chosen above. The functions are as follows:

  • train_classifier - takes as input a classifier and training data and fits the classifier to the data.
  • predict_labels - takes as input a fitted classifier, features, and target labels, and evaluates the predictions using the F1 score.
  • train_predict - takes as input a classifier and the training and testing data, and performs train_classifier and predict_labels.
    • This function will report the F1 score for both the training and testing data separately.

In [6]:
def train_classifier(clf, X_train, y_train):
    ''' Fits a classifier to the training data. '''
    
    # Start the clock, train the classifier, then stop the clock
    start = time()
    clf.fit(X_train, y_train)
    end = time()
    
    # Print the results
    print "Trained model in {:.4f} seconds".format(end - start)

    
def predict_labels(clf, features, target):
    ''' Makes predictions using a fit classifier based on F1 score. '''
    
    # Start the clock, make predictions, then stop the clock
    start = time()
    y_pred = clf.predict(features)
    end = time()
    
    # Print and return results
    print "Made predictions in {:.4f} seconds.".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')


def train_predict(clf, X_train, y_train, X_test, y_test):
    ''' Train and predict using a classifier based on F1 score. '''
    
    # Indicate the classifier and the training set size
    print "Training a {} using a training set size of {}. . .".format(clf.__class__.__name__, len(X_train))
    
    # Train the classifier
    train_classifier(clf, X_train, y_train)
    
    # Print the results of prediction for both training and testing
    print "F1 score for training set: {:.4f}.".format(predict_labels(clf, X_train, y_train))
    print "F1 score for test set: {:.4f}.".format(predict_labels(clf, X_test, y_test))
    print " "

Implementation: Model Performance Metrics


In [7]:
# Import the three supervised learning models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


# Initialize the three models
clf_A = LogisticRegression(random_state=rstate)
clf_B = DecisionTreeClassifier(random_state=rstate)
clf_C = SVC(random_state=rstate)

# Set up the training set sizes
X_train_100 = X_train[:100]
y_train_100 = y_train[:100]

X_train_200 = X_train[:200]
y_train_200 = y_train[:200]

X_train_300 = X_train[:300]
y_train_300 = y_train[:300]

# Execute the 'train_predict' function for each classifier and each training set size
for clf in [clf_A, clf_B, clf_C]:
    for X_tr, y_tr in [(X_train_100, y_train_100),
                       (X_train_200, y_train_200),
                       (X_train_300, y_train_300)]:
        train_predict(clf, X_tr, y_tr, X_test, y_test)


Training a LogisticRegression using a training set size of 100. . .
Trained model in 0.0077 seconds
Made predictions in 0.0028 seconds.
F1 score for training set: 0.8593.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.7612.
 


Training a LogisticRegression using a training set size of 200. . .
Trained model in 0.0030 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 0.8444.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7591.
 


Training a LogisticRegression using a training set size of 300. . .
Trained model in 0.0044 seconds
Made predictions in 0.0003 seconds.
F1 score for training set: 0.8263.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.8169.
 


Training a DecisionTreeClassifier using a training set size of 100. . .
Trained model in 0.0028 seconds
Made predictions in 0.0011 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0003 seconds.
F1 score for test set: 0.6870.
 


Training a DecisionTreeClassifier using a training set size of 200. . .
Trained model in 0.0016 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.7059.
 


Training a DecisionTreeClassifier using a training set size of 300. . .
Trained model in 0.0023 seconds
Made predictions in 0.0002 seconds.
F1 score for training set: 1.0000.
Made predictions in 0.0002 seconds.
F1 score for test set: 0.6720.
 


Training a SVC using a training set size of 100. . .
Trained model in 0.0037 seconds
Made predictions in 0.0016 seconds.
F1 score for training set: 0.8366.
Made predictions in 0.0011 seconds.
F1 score for test set: 0.8228.
 


Training a SVC using a training set size of 200. . .
Trained model in 0.0047 seconds
Made predictions in 0.0034 seconds.
F1 score for training set: 0.8552.
Made predictions in 0.0017 seconds.
F1 score for test set: 0.7947.
 


Training a SVC using a training set size of 300. . .
Trained model in 0.0107 seconds
Made predictions in 0.0062 seconds.
F1 score for training set: 0.8615.
Made predictions in 0.0022 seconds.
F1 score for test set: 0.8079.
 


Tabular Results

The timing and F1 results are recorded in the tables below (times in seconds; the figures come from a separate run, so they differ somewhat from the printout above):

Classifier 1 - Logistic Regression

Training Set Size   Training Time (s)   Prediction Time (test, s)   F1 Score (train)   F1 Score (test)
100                 0.0021              0.0004                      0.8571             0.7612
200                 0.0025              0.0002                      0.8380             0.7794
300                 0.0034              0.0003                      0.8381             0.7910

Classifier 2 - Decision Tree Classifier

Training Set Size   Training Time (s)   Prediction Time (test, s)   F1 Score (train)   F1 Score (test)
100                 0.0009              0.0002                      1.0                0.6667
200                 0.0017              0.0005                      1.0                0.6929
300                 0.0039              0.0005                      1.0                0.7119

Classifier 3 - Support Vector Classifier

Training Set Size   Training Time (s)   Prediction Time (test, s)   F1 Score (train)   F1 Score (test)
100                 0.0105              0.0021                      0.8591             0.7838
200                 0.0064              0.0040                      0.8693             0.7755
300                 0.0120              0.0064                      0.8692             0.7586

Choosing the Best Model

In this section, I will choose from the three supervised learning models the best model to use on the student data. I will then perform a grid search optimization for the model over the entire training set (X_train and y_train) by tuning at least one parameter to improve upon the untuned model's F1 score.

Choosing the Best Model

I initially chose logistic regression as the final model; however, as shown below, the optimized LR gives a lower F1 score (0.77) than the untuned model (0.79). I then optimized the decision tree model, and it gives the highest F1 score on the test data. Based on the available data and the limited resources, the decision tree appears to be the best model. A decision tree automatically selects features based on the information gain computed from entropy. An SVC can give a good classification boundary and high accuracy with nonlinear kernels, but on this data, running an SVC with a nonlinear kernel can take a long time.

Model in Layman's Terms

I chose the final model to be a decision tree (DT). A DT takes one feature (the root node) and splits the data into two child nodes by asking a question whose answer is of yes/no (or 0/1) type, so the data is divided into two categories. For each candidate split, we calculate the information gain, which equals the entropy of the parent node minus the weighted average entropy of the two child nodes. If we gain information by splitting on that question, i.e., if the data within each category becomes more "similar" to itself, then the split is a good one. We then refine each category further by asking another question based on another feature. Again, if the data in each category becomes more similar, we have gained information by splitting further. If a split gains no information, we go back and split on a different feature instead. This forms a tree in which each node creates two branches. We repeat the process until a certain depth of the tree is reached, a depth chosen by a human based on the bias-variance trade-off of the model. A small worked sketch of the information-gain calculation follows below.
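
A small worked sketch of that calculation, on hypothetical counts rather than the student data:

# Worked sketch: information gain of one split (hypothetical counts).
import numpy as np

def entropy(p):
    # Entropy of a node whose positive-class fraction is p
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# Parent node: 10 students, 5 pass and 5 fail -> entropy = 1.0 bit
H_parent = entropy(0.5)

# Candidate split sends 6 students left (5 pass) and 4 right (0 pass)
H_children = (6.0 / 10) * entropy(5.0 / 6) + (4.0 / 10) * entropy(0.0)

# Gain = parent entropy minus weighted child entropy (~0.61 bits here)
print "Information gain: {:.4f} bits".format(H_parent - H_children)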

Implementation: Model Tuning

I will fine-tune the chosen model using grid search (GridSearchCV), tuning at least one important parameter over at least 3 different values, and using the entire training set for the search.


In [8]:
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score
from sklearn import cross_validation
from sklearn.learning_curve import validation_curve

def my_scorer(y_true, y_predict):
    return f1_score(y_true, y_predict, pos_label='yes')

In [9]:
def ModelComplexity(X, y):
    """ Calculates the performance of the model as model complexity increases.
        The learning and testing errors rates are then plotted. """
    
    # Create 10 cross-validation sets for training and testing
    cv = ShuffleSplit(X.shape[0], n_iter = 10, test_size = 0.2, random_state = rstate)

    # Vary the max_depth parameter from 1 to 10
    max_depth = np.arange(1,11)

    # Calculate the training and testing scores
    train_scores, test_scores = curves.validation_curve(DecisionTreeClassifier(), X, y, \
        param_name = "max_depth", param_range = max_depth, cv = cv, scoring = make_scorer(my_scorer))

    # Find the mean and standard deviation for smoothing
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

    # Plot the validation curve
    pl.figure(figsize=(7, 5))
    pl.title('Decision Tree Classifier Complexity Performance')
    pl.plot(max_depth, train_mean, 'o-', color = 'r', label = 'Training Score')
    pl.plot(max_depth, test_mean, 'o-', color = 'g', label = 'Validation Score')
    pl.fill_between(max_depth, train_mean - train_std, \
        train_mean + train_std, alpha = 0.15, color = 'r')
    pl.fill_between(max_depth, test_mean - test_std, \
        test_mean + test_std, alpha = 0.15, color = 'g')
    
    # Visual aesthetics
    pl.legend(loc = 'lower right')
    pl.xlabel('Maximum Depth')
    pl.ylabel('Score')
    pl.ylim([-0.05,1.05])
    pl.show()

In [10]:
# DECISION TREES: GridSearchCV and make_scorer were imported above

# Create the parameters list you wish to tune
parameters = {"max_depth":[1,2,3,4,5,6,7,8,9,10]}

# Initialize the classifier
clf = DecisionTreeClassifier()

# Make an f1 scoring function using 'make_scorer'
rs_scorer = make_scorer(my_scorer, greater_is_better=True)

# There is a class imbalance problem (130 failed vs. 265 passed), so a stratified
# shuffle split is also created to keep the class proportions similar in each fold
cv_set = cross_validation.ShuffleSplit(X_all.shape[0], n_iter=10, test_size=0.25, random_state=rstate)
sss = StratifiedShuffleSplit(y_train, n_iter=10, test_size=0.24, random_state=rstate)
# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=rs_scorer, cv=cv_set)
grid_obj2 = GridSearchCV(clf, param_grid=parameters, scoring=rs_scorer, cv=sss)
# Fit the first grid search object to the full dataset and the second to the
# training set only. Note: fitting on X_all lets the CV folds see test rows
# (data leakage), so grid_obj2 is the leakage-free comparison.
grid_obj = grid_obj.fit(X_all, y_all)
grid_obj2 = grid_obj2.fit(X_train, y_train)
# Get the estimator
clf = grid_obj.best_estimator_
clf2 = grid_obj2.best_estimator_

print clf
print clf2


# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))

print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf2, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf2, X_test, y_test))


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
Made predictions in 0.0006 seconds.
Tuned model has a training F1 score of 0.8092.
Made predictions in 0.0002 seconds.
Tuned model has a testing F1 score of 0.8169.
Made predictions in 0.0002 seconds.
Tuned model has a training F1 score of 0.8270.
Made predictions in 0.0002 seconds.
Tuned model has a testing F1 score of 0.8105.
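
Since the tuned tree has max_depth=1, it splits on a single feature. A quick sketch to see which one (assuming clf is still the fitted best estimator from the cell above):

# Sketch: inspect which feature the depth-1 tree splits on.
# Assumes `clf` is the fitted DecisionTreeClassifier from the grid search above.
importances = pd.Series(clf.feature_importances_, index=X_all.columns)
print importances[importances > 0]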

In [11]:
# LOGISTIC REGRESSION: GridSearchCV and make_scorer were imported above

# Create the parameters list you wish to tune
parameters = {"C":[10, 1,0.1, 0.001, 0.0001, 0.00001, 0.000001]}

# Initialize the classifier
clf = LogisticRegression()

# Make an f1 scoring function using 'make_scorer'
f1_scorer = make_scorer(my_scorer, greater_is_better=True)
cv_set = cross_validation.ShuffleSplit(X_all.shape[0], n_iter=10, test_size=0.5, random_state=rstate)

# Perform grid search on the classifier using the f1_scorer as the scoring method
grid_obj = GridSearchCV(estimator=clf, param_grid=parameters, scoring=f1_scorer, cv=cv_set)

# Fit the grid search object to the full dataset and find the optimal parameters
# (note: as above, this lets the CV folds see test rows)
grid_obj = grid_obj.fit(X_all, y_all)

# Get the estimator
clf = grid_obj.best_estimator_
print clf
# Report the final F1 score for training and testing after parameter tuning
print "Tuned model has a training F1 score of {:.4f}.".format(predict_labels(clf, X_train, y_train))
print "Tuned model has a testing F1 score of {:.4f}.".format(predict_labels(clf, X_test, y_test))


LogisticRegression(C=0.0001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Made predictions in 0.0005 seconds.
Tuned model has a training F1 score of 0.7976.
Made predictions in 0.0004 seconds.
Tuned model has a testing F1 score of 0.8199.

In [12]:
# ModelComplexity(X_all, y_all)

The final training and testing scores are 0.81 and 0.80, respectively, while the untuned model's training and testing scores are 1.0 and 0.75. This clearly shows that the untuned model overfits the training data and fails to generalize to the test data. The optimized model does a better job: it does not overfit the training data (hence a training score of 0.81 instead of 1.0) and generalizes well to the test data.