There is a rising trend in using data to determine student performance and provide timely intervention for low-performing students/at-risk students. This project hopes to help highlight and contribute to these efforts in education innovation.
Having previously worked at a classroom learning/management platform edtech startup, I was inspired to do this project as a way to learn more about the process of building a data-driven student intervention system as well as to understand how data science can be used to transform education.
The dataset below was provided by Kaggle, and gives a snapshot of student engagement and student background as well as their corresponding final grades (classified by 3 categories: high-level grades, middle-levelgrades, & low-level grades.
This project aims to determine which factors are the greatest indicators in identifying a student as being at‐risk either behaviorally or academically and to make predictions, based on these factors, of which students belong to which grade class (high, middle or low).
1 Gender - student's gender (nominal: 'Male' or 'Female’)
2 Nationality- student's nationality (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)
3 Place of birth- student's Place of birth (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)
4 Educational Stages- educational level student belongs (nominal: ‘lowerlevel’,’MiddleSchool’,’HighSchool’)
5 Grade Levels- grade student belongs (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12 ‘)
6 Section ID- classroom student belongs (nominal:’A’,’B’,’C’)
7 Topic- course topic (nominal:’ English’,’ Spanish’, ‘French’,’ Arabic’,’ IT’,’ Math’,’ Chemistry’, ‘Biology’, ‘Science’,’ History’,’ Quran’,’ Geology’)
8 Semester- school year semester (nominal:’ First’,’ Second’)
9 Parent responsible for student (nominal:’mom’,’father’)
10 Raised hand- how many times the student raises his/her hand on classroom (numeric:0-100)
11- Visited resources- how many times the student visits a course content(numeric:0-100)
12 Viewing announcements-how many times the student checks the new announcements(numeric:0-100)
13 Discussion groups- how many times the student participate on discussion groups (numeric:0-100)
14 Parent Answering Survey- parent answered the surveys which are provided from school or not (nominal:’Yes’,’No’)
15 Parent School Satisfaction- the Degree of parent satisfaction from school(nominal:’Yes’,’No’)
16 Student Absence Days-the number of absence days for each student (nominal: above-7, under-7)
Low-Level: interval includes values from 0 to 69,
Middle-Level: interval includes values from 70 to 89,
High-Level: interval includes values from 90-100.
In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv('xAPI-Edu-Data.csv') #columns = ['Gender','Nationality', 'PlaceofBirth','StageID','GradeID','SectionID'
#,'Topic','Semester','Relation','RaisedHands','VisitedResources'
#,'AnnoucementsView','Discussion','ParentAnsweringSurvey',
#'ParentSchoolSatisfaction','StudentAbsenceDays','Class/FinalGrade'])
print (data.shape)
data.head(15)
Out[1]:
Summary of Statistics:
- Use
Pandas' describe
function to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution (for all numeric columns).- By looking at
info
: We have the total number of observations = 480, and we see there are no missing (non-null) values (i.e., we have a clean data set).
In [2]:
data.info()
In [3]:
data.describe()
Out[3]:
Now that we have an overview for the entire dataset, let's dig deeper and focus on the features we're primarily interested in: the grades the student receive.
Using
pandas grouby
we can split the data into groups based on some criteria (in this case --- the class column 'H, M, L').Using the
aggregate
function with thegroupby
, we are able to compute the summary statistics about each group (in this case --- the min, median, mean and max for each separate class 'H, M, L').We then compare the summary statistics for total students (table above) vs. students for each separate classes 'H, M, L' (table below). The information presented below is more granular & outlines differences in class.
For example: We see that the average number of times students with 'H' raised their hands is around 70.23, whereas for students with 'L' that number drops to around 17. This separation of mean values will become important later when we preprocess the data.
In [4]:
data.groupby('Class').aggregate(['min', np.median, np.mean, max])
Out[4]:
- Libraries used: Matplotlib & Seaborn
First step in exploring the data to classify the grades of students is to look at a simple value count.
Next we visualize the number of students in each of the separate classes to get an idea of how evenly spread the 'H, L, and M's are among the students in the dataset.
As the bar graph shows, distribution is fairly even, no major skews toward any individual class.
In [42]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
data.Class.value_counts().plot(kind='bar')
data.Class.value_counts()
Out[42]:
Now we want to pair those 'High, Low and Middle'grade value counts with other features that may indicate patterns in the distribution of grades and add value to our analysis.
Below are 4 bar graphs for which grades students received based on Parent satisfaction (good or bad) and Relation responsible for the student (Father or Mum)
There is a clear pattern where:
High grade = Parent schoool satisfaction (good) & Relation responsible (Mum)
Low grade = Parent schoool satisfaction (bad) & Relation responsible (Father)
In [41]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(18,4))
plt.subplot(141)
good_sat= data.Class[data.ParentschoolSatisfaction== 'Good'].value_counts()
good_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Good')
plt.subplot(142)
bad_sat = data.Class[data.ParentschoolSatisfaction== 'Bad'].value_counts()
bad_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Bad')
plt.subplot(143)
survey_no = data.Class[data.Relation == 'Father'].value_counts()
survey_no.plot(kind='bar')
plt.title('Father Responsible for Student')
plt.subplot(144)
survey_yes = data.Class[data.Relation == 'Mum'].value_counts()
survey_yes.plot(kind='bar')
plt.title('Mother Responsible for Student')
Out[41]:
Once again, by using the
pandas groupby
function, we can get a table showing the exact count of the distribution of students in H, L, M --- looking at the first feature "Parent School Satisfaction" only.
In [7]:
grades_count = data.groupby(['ParentschoolSatisfaction','Class'])['Class'].aggregate('count').unstack()
grades_count
Out[7]:
Using
pandas pivot_table
we can capture more complex insights from the data & further break down 'H, L, M' class structure by including two levels of analysis (both Parent school satisfaction & Relation).It is common to start with simple analysis with one feature and add complexity with multiple features as we understand how they interact with our target values of interest.
In [8]:
#shows that adding relation only adds noise, not a significant difference between father & mum
data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'],
columns = 'Class', aggfunc = 'mean')
Out[8]:
Here is a visualization of the above pivot table. The combination of two features do not vary too much within the 'H, M, L' classes, but there is a distict gap showing that students in 'H' class raised their hands at an average of around 3 times as often as 'L' students.
In [40]:
parents_hands = data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'],
columns = 'Class', aggfunc = 'mean')
parents_hands.plot()
plt.ylabel('Average number of times student raised hand')
Out[40]:
Another good feature that may add value to our analysis of factors that contributes to a student's class is their attendance record.
Again, looking at
value counts
for the separate classes, we see that the number of students who had under-7 absences and received a 'Low' class/grade is very low. (and vice versa for above-7 absences & 'H')
In [39]:
import matplotlib.pyplot as plt
attendance = pd.crosstab(index=data['StudentAbsenceDays'], columns=[data['Class']], normalize='columns')
attendance.plot(kind='bar', figsize=(6,6), stacked=True)
Out[39]:
Look at the distribution and probability density function of the 4 numerical columns:
1) raised hands
2) Visited Resources
3) Announcements Views
4) Discussion
The graphs for 'raised hands' and 'visited resources' show the most promise in differentiating between 'H' and 'L' classes. The bimodal shape of their density curves show distinct peaks at the opposite ends of the graph.
In [35]:
fig = plt.figure(figsize=(18,8))
plt.subplot(221)
data.raisedhands[data.Class == 'H'].plot(kind='kde')
data.raisedhands[data.Class == 'M'].plot(kind='kde')
data.raisedhands[data.Class == 'L'].plot(kind='kde')
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Raised Hands')
plt.subplot(222)
data.VisITedResources[data.Class == 'H'].plot(kind='kde')
data.VisITedResources[data.Class == 'M'].plot(kind='kde')
data.VisITedResources[data.Class == 'L'].plot(kind='kde')
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Visited Resources')
plt.subplot(223)
data.Discussion[data.Class == 'H'].plot(kind='kde')
data.Discussion[data.Class == 'M'].plot(kind='kde')
data.Discussion[data.Class == 'L'].plot(kind='kde')
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Discussion')
plt.subplot(224)
data.AnnouncementsView[data.Class == 'H'].plot(kind='kde')
data.AnnouncementsView[data.Class == 'M'].plot(kind='kde')
data.AnnouncementsView[data.Class == 'L'].plot(kind='kde')
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Viewed Announcements')
Out[35]:
Correlations can tell us about the direction, and the degree (strength) of the relationship between two variables (or features)
Using Pearson's correlation coefficient (Pearson's r) where r ranges from -1 (negative relationship) to 0 (no relationship to 1 (positive relationship).
We see that the only relatively strong relationship r = 0.69 is between the two strongest indications of class differences as noted above: 'raised hands' and 'visited resources'.
In [12]:
raised_hands = data['raisedhands']
discussion = data['Discussion']
v_resources = data['VisITedResources']
v_announcements = data['AnnouncementsView']
def correlation(x,y):
std_x = (x-x.mean())/x.std(ddof=0)
std_y = (y-y.mean())/y.std(ddof=0)
return (std_x * std_y).mean()
print ('Raised hands & Discussion: ', correlation(raised_hands, discussion))
print ('Visted Resources & Discussion: ', correlation(v_resources, discussion))
print ('Raised hands & Visited Resources: ', correlation(raised_hands, v_resources))
In [34]:
fig,ax= plt.subplots(figsize=(9,7))
sns.heatmap(data.corr(),annot=True)
Out[34]:
To get a basic understanding of what we are trying to predict, we narrow down the analysis to the two main features that have thus far provided the most information in determining 'H, M, L' classes.
First we need a procedure to turn the class values (categorical) into numerical values in order to plot them and show the different points which are 'H'(large green points) or 'M' (medium yellow points) or 'L' (small red points).
In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
def new_grade(grade):
if grade == 'H':
return 100
elif grade == 'M':
return 50
elif grade == 'L':
return 10
def new_grades(grades):
return grades.apply(new_grade)
print (new_grades(data['Class']).head())
converted_grades = new_grades(data['Class'])
fig = plt.figure(figsize=(12, 9))
plt.scatter(data['VisITedResources'], data['raisedhands'],c= converted_grades, s = converted_grades, cmap = 'RdYlGn')
plt.ylabel("Number of times Student Raised their hand in class")
plt.xlabel("Number of times Student Visited Resources")
plt.title ('Interaction Correlation by Class Marks')
plt.colorbar()
Out[14]:
While there is a decent amount of noise, we can still see a linear relationship where students who raise their hands and visit class resources are in the group with the higher class.
There is an evident pattern where there's a cluster of 'H's at the top right hand corner of the graph and a cluster of 'L's at the bottom left hand corner of the graph.
In [15]:
sns.lmplot(x="VisITedResources", y="raisedhands", data=data)
sns.plt.show()
We see the relationship between class is strongest between 'raisedhands' and other features as well as 'visitedresources' and other features
The 2 weaker features seem to be 'announcements view' and 'discussion' where the points for the different 'H, M, L' classes seem to be scattered all over with no particular pattern.
Now that we have this analysis, we have a better understanding of which features will help us predict the target classes we want and which will be added noise. This is an importance part of the process of feature analysis.
In [16]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.pairplot(data, hue='Class', size=2.5);
1. Preprocessing Data
- Removal of outliers
- Encoding categorical data
2. Machine Learning Models Used
Logistic Regression Classifier
Support Vector Machines Classifer
Random Forest Classifier
(ensemble)Gradient Boosting Classifier
(ensemble)Bagging Classifier
(ensemble)3. Evaluation Methods Used
- Prediction Accuracy Score (on test data)
- K-fold cross validation (on training data)
- Plotting learning curves to assess Bias vs. Variance
- Precision, Recall & F1 Scores
4. Implementation Process
- Once the independent variables (features) have been determined, we split the data set into testing and training sets using
sklearn's cross validation: train_test_split
function- Next we import the models we need from sklearn
- Scaling features (using standard scaler) & PCA (if applicable)
- Tuning hyperparameters to optimize model performance (each model could require different constraints, weights or learning rates to generalize different data patterns)
- Fit/train the data
- Predict y-values using test data
- Use
sklearn's metrics
to determine accuracy score- Use k-fold cross validation to avoid overfitting
- Get additional performance metrics such as precision, recall and F1 score for comparison
Here, the criteria used for determining outliers is based upon the summary of statistics table above (specifically the one separated by class). Using the mean for 'H' and 'L' students in the first two features, we identify a cutoff point for each. (For example, ff a student with an 'H' raised their hands less than the mean for which a student with 'L' raised their hands--they are removed from the dataset.)
Similarly, when looking at the bar graphs for attendance, we saw that very few 'H' students were absent more than 7 days, and very few 'L' students were absent less than 7 days-- also labeling them as outliers/candidates for removal.
In [17]:
#Creating the criteria for an outlier based on multiple conditions. Only dealing with 'H' and 'L' class to be safe.
outliers_1 = data[(data['Class'] == 'H') & (data['raisedhands'] <= 17)]
outliers_2 = data[(data['Class'] == 'H') & (data['VisITedResources'] <= 18)]
outliers_3 = data[(data['Class'] == 'L') & (data['raisedhands'] >= 70)]
outliers_4 = data[(data['Class'] == 'L') & (data['VisITedResources'] >= 78)]
outliers_5 = data[(data['Class'] == 'H') & (data['StudentAbsenceDays'] == 'Above-7')]
outliers_6 = data[(data['Class'] == 'L') & (data['StudentAbsenceDays'] == 'Under-7')]
#dropping the rows which contained the outliers as indicated by the above criteria
new_data = data.drop([14,47,48,72,74,80,84,86,87,88,94,96,124,128,129,190,200,205,226,227,
228,248,250,255,344,345,444,445,450])
#Using shape to check the number of outliers dropped, usually no more than 10%.
#However, since the dataset is small, better to leave more data to work with.
#451 left out of the orginal 480 observations: We only dropped ~ 6% of data.
print(new_data.shape)
Libraries used: Scikit-Learn
In Sklearn, machine learning algorithms require the input variables from the data to be real values. Therefore we must format (or transform) the data into the structure that allows us to feed it into the model.
Here, we use a
label encoder
to prepare the data where numerical dimensions are used to represent membership in the categories ('H, L, M')As we can see, once encoded:
Categorical Class | Numerical value |
---|---|
H (High grade) | 0 |
L (Low grade) | 1 |
M (Middle grade) | 2 |
In [18]:
#Preprocessing data to encode categorical values for the y-target column
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
target_value = le.fit_transform(new_data.Class)
print (target_value)
A common alternative approach to encoding categorical values is called dummy (or one-hot) encoding where the the basic strategy is to convert each category value into a new column and assigns a 1 or 0 (True/False) value to the column
Pandas
supports this feature usingget_dummies
to create dummy/indicator variables (aka 1 or 0).
In [19]:
data_dummies = pd.get_dummies(new_data, columns = ['ParentschoolSatisfaction', 'StudentAbsenceDays','Relation'])
print(data_dummies.shape)
data_dummies.head()
Out[19]:
The logistic regression classifier is a method used to generalizes logistic regression to multiclass problems (applicable here since we are trying to predict more than two possible discrete outcomes). It is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable
Dependent variable (y values) = Class ('H, L, M')
Independent variables (or features / x values) = 'raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad', 'Relation_Father', Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7','StudentAbsenceDays_Under-7'
In Logistic Regression modeling, input values (X) are combined linearly using weights or coefficient values to predict an output value (y)--- or more specifically, the probability that an input (X) belongs to the default class.
Pros:
- Low variance
- Provides probabilities for outcomes
- works well with diagonal (feature) decision boundaries
Cons:
- Doesn’t perform well when feature space is too large
- High bias
- Relies on entire data
In [30]:
feature_cols = feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad', 'Relation_Father',
'Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7',
'StudentAbsenceDays_Under-7']
X = data_dummies[feature_cols]
y = target_value
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe_lr = Pipeline([('rs', StandardScaler()), ('pca', PCA(n_components = 4)),
('logreg', LogisticRegression(C=1e9))])
pipe_lr.fit(X_train, y_train)
y_pred = pipe_lr.predict(X_test)
from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
from sklearn.metrics import classification_report
print(' ', classification_report(y_test, y_pred))
A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.
If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data.
We can use the sklearn learning_curve
function to generate the values that are required to plot such a learning curve (number of samples that have been used, the average scores on the training sets and the average scores on the validation sets)
Learning curve allows us to verify when a model has learning as much as it can about the data, indicated by: a) the performances on the training and testing sets reach a plateau and b) here is a consistent gap between the two error rates, as is consistent with our graph below.
Our Learning Curves show a decent model because a) the testing and training learning curves converge at similar values and b) the smaller the gap between curves, the better our model generalizes. The results of our graph represent moderate bias and low variance, which is an indication that we should increase model complexity--leading us to the ensemble methods we'll be using moving forward.
In [22]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, X, y, ylim=None, cv=None,
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
plt.figure()
plt.xlabel("Training examples")
plt.ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Testing score")
plt.legend(loc="best")
return plt
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = svc
plot_learning_curve(estimator, X, y, (0.7, 1.01), cv=cv, n_jobs=4)
Out[22]:
A random forest is an ensemble method (combination of learning algorithms) that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging (majority of votes to make a prediction) to improve the predictive accuracy and control over-fitting.
At the root of random forest are decision trees, which are a type of flowchart which assist in the decision making process. Internal nodes represent tests on particular attributes, while branches exiting nodes represent a single test outcome, and leaf nodes represent class labels. The goal is to split on the attributes which create the purest child nodes possible.
Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
Pros:
- Accurate and does not tend to overfit
- Robust against outliers in the predictive variables
- It gives estimates of what variables are important in the classification--> feature selection
Cons:
- Not as easy to visually interpret
- Slower runtime
In [33]:
# RANDOM FOREST CLASSIFIER model -- ensemble method No.1
feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',
'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father',
'Relation_Mum',
'StudentAbsenceDays_Under-7']
X = data_dummies[feature_cols]
y = target_value
# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)
from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=3, n_estimators=200, oob_score = True, n_jobs = -1,
random_state=50)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))
One of the best use cases for random forest is that it's a great tool for feature selection. Since random forest is built on having multiple decision tress, one of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.
Random forests measures feature importance through something called the Gini Importance or Mean Decrease in Impurity (MDI) calculates each feature importance as the sum over the number of splits (accross all tress) that include the feature, proportionaly to the number of samples it splits.
In [24]:
# Taking a look at feature importance via Random Forest
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
feat_imp = pd.Series(rfc.feature_importances_, index=X.columns)
feat_imp.sort_values(inplace=True, ascending=False)
feat_imp.head(20).plot(kind='barh', title='Feature importance')
Out[24]:
Gradient Boosting is based on the the idea of boosting, which is a method of trying to modify a weak learner into a becoming a better one. It starts with filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learns to handle the remaining difficult observations. For example, the model will build trees one at a time, where each new tree helps to correct errors made by previously trained tree.
With Gradient Boosting, the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure. This type of algorithm can be described as a stage-wise additive model. This is because one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged.
Pros:
- Can easily handle qualitative (categorical) features
- Very powerful and performs well in most cases
Cons:
- Training generally takes longer because of the fact that trees are built sequentially
- Harder to fit/tune parameters
In [27]:
# Gradient Boosting Classifer: Ensemble method No.2
feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',
'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father',
'Relation_Mum',
'StudentAbsenceDays_Under-7']
X = data_dummies[feature_cols]
y = target_value
# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)
from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(learning_rate = 0.15, n_estimators = 300, max_depth = 8, min_samples_leaf = 3,
max_features = 'log2')
gbc.fit(X_train,y_train)
y_pred = gbc.predict(X_test)
from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))
Although the majority of prediction accuracy and cross-validated scores were only in the [upper 70 - low 80] percentile range, with the best model: Gradient Boosting doing slightly better than the rest at 87% prediction accuracy. Overall, it can be said that the models performed well in a dataset where the signal to noise ratio was relatively low and little to none feature engineering was done.
However, given the nature of the project, which was to determine which students were in need of intervention, we can see that the results of the model were very favorable to our cause. In such a case, we often care more about the correct classification of 'L' students who are the ones we want to look out for, and much less about the labeling of 'H' and 'M' students.
For example: An 'H' student being labeled as an 'L' student is much less of a concern---as teachers would simply have spent a little extra time verifying that the student wasn't actually an at-risk student.
Whereas an 'L' student being labeled as an ['H' or 'M'] student would be a much bigger problem---as teachers would have skipped over them entirley and that student would have missed a chance for intervention.
Therefore, looking at the precision and recall scores for label-encoded-target-value 1 (representing 'L' students), we see the scores for correctly identifying 'L' or 1 students at a much higher percentage that the prediction accuracy/cross-validated scores of the dataset as a whole.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
Precision & Recall scores for row 1 ('L' Students) only:
Model Used | Accuracy | Cross-Validated | Precision | Recall |
---|---|---|---|---|
Logistic Regression | 0.76 | 0.77 (+/- 0.12 | 0.88 | 1.00 |
Random Forest | 0.83 | 0.76 (+/- 0.14) | 0.87 | 0.93 |
Gradient Boosting | 0.87 | 0.76 (+/- 0.15) | 0.94 | 0.94 |
Having implented several models to observe which was the best fit for our dataset, results show that the top performing model was:
Gradient Boosting Classifier with an Accuracy Score of: 87%
In evaluating the overall performance of this project, and taking into consideration the original aim in which the goal was to have the classifier find all the positive samples of 'L' students for intervention identification, it's evident that through examining the high 'Recall' scores (ranging from .92 to 1.00), our model did exceptionally well in this area.
In [ ]: