Intro: A Student Grade Classification Project

There is a rising trend in using data to determine student performance and provide timely intervention for low-performing students/at-risk students. This project hopes to help highlight and contribute to these efforts in education innovation.

Having previously worked at a classroom learning/management platform edtech startup, I was inspired to do this project as a way to learn more about the process of building a data-driven student intervention system as well as to understand how data science can be used to transform education.

The dataset below was provided by Kaggle, and gives a snapshot of student engagement and student background as well as their corresponding final grades (classified by 3 categories: high-level grades, middle-levelgrades, & low-level grades.

This project aims to determine which factors are the greatest indicators in identifying a student as being at‐risk either behaviorally or academically and to make predictions, based on these factors, of which students belong to which grade class (high, middle or low).

Data Collection & Features

Source of Dataset: Kaggle

The following dataset includes many factors which may influence a student's final grades. Below is a description of each:

1 Gender - student's gender (nominal: 'Male' or 'Female’)

2 Nationality- student's nationality (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

3 Place of birth- student's Place of birth (nominal:’ Kuwait’,’ Lebanon’,’ Egypt’,’ SaudiArabia’,’ USA’,’ Jordan’,’ Venezuela’,’ Iran’,’ Tunis’,’ Morocco’,’ Syria’,’ Palestine’,’ Iraq’,’ Lybia’)

4 Educational Stages- educational level student belongs (nominal: ‘lowerlevel’,’MiddleSchool’,’HighSchool’)

5 Grade Levels- grade student belongs (nominal: ‘G-01’, ‘G-02’, ‘G-03’, ‘G-04’, ‘G-05’, ‘G-06’, ‘G-07’, ‘G-08’, ‘G-09’, ‘G-10’, ‘G-11’, ‘G-12 ‘)

6 Section ID- classroom student belongs (nominal:’A’,’B’,’C’)

7 Topic- course topic (nominal:’ English’,’ Spanish’, ‘French’,’ Arabic’,’ IT’,’ Math’,’ Chemistry’, ‘Biology’, ‘Science’,’ History’,’ Quran’,’ Geology’)

8 Semester- school year semester (nominal:’ First’,’ Second’)

9 Parent responsible for student (nominal:’mom’,’father’)

10 Raised hand- how many times the student raises his/her hand on classroom (numeric:0-100)

11- Visited resources- how many times the student visits a course content(numeric:0-100)

12 Viewing announcements-how many times the student checks the new announcements(numeric:0-100)

13 Discussion groups- how many times the student participate on discussion groups (numeric:0-100)

14 Parent Answering Survey- parent answered the surveys which are provided from school or not (nominal:’Yes’,’No’)

15 Parent School Satisfaction- the Degree of parent satisfaction from school(nominal:’Yes’,’No’)

16 Student Absence Days-the number of absence days for each student (nominal: above-7, under-7)

The students are classified into three numerical intervals based on their total grade/mark:

Low-Level: interval includes values from 0 to 69,

Middle-Level: interval includes values from 70 to 89,

High-Level: interval includes values from 90-100.

Data Handling

Importing Data with Pandas: Read in a csv of our data

Identify the shape of the dataset: Tells us we have 480 observations/students to analyze

data --> Summary of our data contained in a Pandas DataFrame: Preview of features and values



In [1]:

    
import numpy as np
import pandas as pd


data = pd.read_csv('xAPI-Edu-Data.csv') #columns = ['Gender','Nationality', 'PlaceofBirth','StageID','GradeID','SectionID'
                                              #,'Topic','Semester','Relation','RaisedHands','VisitedResources'
                                              #,'AnnoucementsView','Discussion','ParentAnsweringSurvey',
                                              #'ParentSchoolSatisfaction','StudentAbsenceDays','Class/FinalGrade'])

print (data.shape)

    
data.head(15)









    



(480, 17)






    Out[1]:







  
    
      
      gender
      NationalITy
      PlaceofBirth
      StageID
      GradeID
      SectionID
      Topic
      Semester
      Relation
      raisedhands
      VisITedResources
      AnnouncementsView
      Discussion
      ParentAnsweringSurvey
      ParentschoolSatisfaction
      StudentAbsenceDays
      Class
    
  
  
    
      0
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      15
      16
      2
      20
      Yes
      Good
      Under-7
      M
    
    
      1
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      20
      20
      3
      25
      Yes
      Good
      Under-7
      M
    
    
      2
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      10
      7
      0
      30
      No
      Bad
      Above-7
      L
    
    
      3
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      30
      25
      5
      35
      No
      Bad
      Above-7
      L
    
    
      4
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      40
      50
      12
      50
      No
      Bad
      Above-7
      M
    
    
      5
      F
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      42
      30
      13
      70
      Yes
      Bad
      Above-7
      M
    
    
      6
      M
      KW
      KuwaIT
      MiddleSchool
      G-07
      A
      Math
      F
      Father
      35
      12
      0
      17
      No
      Bad
      Above-7
      L
    
    
      7
      M
      KW
      KuwaIT
      MiddleSchool
      G-07
      A
      Math
      F
      Father
      50
      10
      15
      22
      Yes
      Good
      Under-7
      M
    
    
      8
      F
      KW
      KuwaIT
      MiddleSchool
      G-07
      A
      Math
      F
      Father
      12
      21
      16
      50
      Yes
      Good
      Under-7
      M
    
    
      9
      F
      KW
      KuwaIT
      MiddleSchool
      G-07
      B
      IT
      F
      Father
      70
      80
      25
      70
      Yes
      Good
      Under-7
      M
    
    
      10
      M
      KW
      KuwaIT
      MiddleSchool
      G-07
      A
      Math
      F
      Father
      50
      88
      30
      80
      Yes
      Good
      Under-7
      H
    
    
      11
      M
      KW
      KuwaIT
      MiddleSchool
      G-07
      B
      Math
      F
      Father
      19
      6
      19
      12
      Yes
      Good
      Under-7
      M
    
    
      12
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      Father
      5
      1
      0
      11
      No
      Bad
      Above-7
      L
    
    
      13
      M
      lebanon
      lebanon
      MiddleSchool
      G-08
      A
      Math
      F
      Father
      20
      14
      12
      19
      No
      Bad
      Above-7
      L
    
    
      14
      F
      KW
      KuwaIT
      MiddleSchool
      G-08
      A
      Math
      F
      Mum
      62
      70
      44
      60
      No
      Bad
      Above-7
      H

Step #1 - Exploring/Cleaning the data

Summary of Statistics:

Use Pandas' describe function to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution (for all numeric columns).

By looking at info: We have the total number of observations = 480, and we see there are no missing (non-null) values (i.e., we have a clean data set).



In [2]:

    
data.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
gender                      480 non-null object
NationalITy                 480 non-null object
PlaceofBirth                480 non-null object
StageID                     480 non-null object
GradeID                     480 non-null object
SectionID                   480 non-null object
Topic                       480 non-null object
Semester                    480 non-null object
Relation                    480 non-null object
raisedhands                 480 non-null int64
VisITedResources            480 non-null int64
AnnouncementsView           480 non-null int64
Discussion                  480 non-null int64
ParentAnsweringSurvey       480 non-null object
ParentschoolSatisfaction    480 non-null object
StudentAbsenceDays          480 non-null object
Class                       480 non-null object
dtypes: int64(4), object(13)
memory usage: 63.8+ KB



In [3]:

    
data.describe()









    Out[3]:







  
    
      
      raisedhands
      VisITedResources
      AnnouncementsView
      Discussion
    
  
  
    
      count
      480.000000
      480.000000
      480.000000
      480.000000
    
    
      mean
      46.775000
      54.797917
      37.918750
      43.283333
    
    
      std
      30.779223
      33.080007
      26.611244
      27.637735
    
    
      min
      0.000000
      0.000000
      0.000000
      1.000000
    
    
      25%
      15.750000
      20.000000
      14.000000
      20.000000
    
    
      50%
      50.000000
      65.000000
      33.000000
      39.000000
    
    
      75%
      75.000000
      84.000000
      58.000000
      70.000000
    
    
      max
      100.000000
      99.000000
      98.000000
      99.000000

Summary of Statistics [separated by class]: 'High, Middle, Low'

Now that we have an overview for the entire dataset, let's dig deeper and focus on the features we're primarily interested in: the grades the student receive.

Using pandas grouby we can split the data into groups based on some criteria (in this case --- the class column 'H, M, L').

Using the aggregate function with the groupby, we are able to compute the summary statistics about each group (in this case --- the min, median, mean and max for each separate class 'H, M, L').

We then compare the summary statistics for total students (table above) vs. students for each separate classes 'H, M, L' (table below). The information presented below is more granular & outlines differences in class.

For example: We see that the average number of times students with 'H' raised their hands is around 70.23, whereas for students with 'L' that number drops to around 17. This separation of mean values will become important later when we preprocess the data.



In [4]:

    
data.groupby('Class').aggregate(['min', np.median, np.mean, max])









    Out[4]:







  
    
      
      raisedhands
      VisITedResources
      AnnouncementsView
      Discussion
    
    
      
      min
      median
      mean
      max
      min
      median
      mean
      max
      min
      median
      mean
      max
      min
      median
      mean
      max
    
    
      Class
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      H
      10
      75
      70.288732
      100
      4
      84
      78.746479
      99
      2
      52
      53.380282
      98
      2
      54
      53.661972
      99
    
    
      L
      0
      10
      16.889764
      80
      0
      11
      18.322835
      90
      0
      11
      15.574803
      66
      1
      21
      30.834646
      98
    
    
      M
      0
      50
      48.938389
      100
      2
      72
      60.635071
      99
      0
      38
      40.962085
      93
      3
      40
      43.791469
      98

Data Visualization & Exploratory Analysis

Libraries used: Matplotlib & Seaborn

First step in exploring the data to classify the grades of students is to look at a simple value count.

Next we visualize the number of students in each of the separate classes to get an idea of how evenly spread the 'H, L, and M's are among the students in the dataset.

As the bar graph shows, distribution is fairly even, no major skews toward any individual class.



In [42]:

    
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
sns.set()

data.Class.value_counts().plot(kind='bar')

data.Class.value_counts()









    Out[42]:





M    211
H    142
L    127
Name: Class, dtype: int64

Categorical Feature Analysis - Parent Satisfation & Relation [Separated by Class: H, M, L]

Now we want to pair those 'High, Low and Middle'grade value counts with other features that may indicate patterns in the distribution of grades and add value to our analysis.

Below are 4 bar graphs for which grades students received based on Parent satisfaction (good or bad) and Relation responsible for the student (Father or Mum)

There is a clear pattern where:

High grade = Parent schoool satisfaction (good) & Relation responsible (Mum)

Low grade = Parent schoool satisfaction (bad) & Relation responsible (Father)



In [41]:

    
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(18,4))

plt.subplot(141)
good_sat= data.Class[data.ParentschoolSatisfaction== 'Good'].value_counts()
good_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Good')

plt.subplot(142)
bad_sat = data.Class[data.ParentschoolSatisfaction== 'Bad'].value_counts()
bad_sat.plot(kind='bar')
plt.title('Parent school satisfaction = Bad')

plt.subplot(143)
survey_no = data.Class[data.Relation == 'Father'].value_counts()
survey_no.plot(kind='bar')
plt.title('Father Responsible for Student')

plt.subplot(144)
survey_yes = data.Class[data.Relation == 'Mum'].value_counts()
survey_yes.plot(kind='bar')
plt.title('Mother Responsible for Student')









    Out[41]:





<matplotlib.text.Text at 0x11b202588>

Once again, by using the pandas groupby function, we can get a table showing the exact count of the distribution of students in H, L, M --- looking at the first feature "Parent School Satisfaction" only.



In [7]:

    
grades_count = data.groupby(['ParentschoolSatisfaction','Class'])['Class'].aggregate('count').unstack()

grades_count









    Out[7]:







  
    
      Class
      H
      L
      M
    
    
      ParentschoolSatisfaction
      
      
      
    
  
  
    
      Bad
      24
      84
      80
    
    
      Good
      118
      43
      131

Digging a little deeper:

Using pandas pivot_table we can capture more complex insights from the data & further break down 'H, L, M' class structure by including two levels of analysis (both Parent school satisfaction & Relation).

It is common to start with simple analysis with one feature and add complexity with multiple features as we understand how they interact with our target values of interest.



In [8]:

    
#shows that adding relation only adds noise, not a significant difference between father & mum 

data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'], 
                 columns = 'Class', aggfunc = 'mean')









    Out[8]:







  
    
      
      Class
      H
      L
      M
    
    
      ParentschoolSatisfaction
      Relation
      
      
      
    
  
  
    
      Bad
      Father
      73.375000
      16.282051
      42.793103
    
    
      Mum
      75.687500
      16.000000
      45.818182
    
    
      Good
      Father
      65.764706
      17.076923
      45.240506
    
    
      Mum
      70.797619
      19.705882
      62.730769

Here is a visualization of the above pivot table. The combination of two features do not vary too much within the 'H, M, L' classes, but there is a distict gap showing that students in 'H' class raised their hands at an average of around 3 times as often as 'L' students.



In [40]:

    
parents_hands = data.pivot_table('raisedhands', index = ['ParentschoolSatisfaction','Relation'], 
                 columns = 'Class', aggfunc = 'mean')

parents_hands.plot()
plt.ylabel('Average number of times student raised hand')









    Out[40]:





<matplotlib.text.Text at 0x11b56eef0>

Categorical Feature Analysis Cont.

Another good feature that may add value to our analysis of factors that contributes to a student's class is their attendance record.

Again, looking at value counts for the separate classes, we see that the number of students who had under-7 absences and received a 'Low' class/grade is very low. (and vice versa for above-7 absences & 'H')



In [39]:

    
import matplotlib.pyplot as plt

attendance = pd.crosstab(index=data['StudentAbsenceDays'], columns=[data['Class']], normalize='columns')


attendance.plot(kind='bar', figsize=(6,6), stacked=True)









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x11b645278>

Moving on to numerical feature analysis:

Look at the distribution and probability density function of the 4 numerical columns:

1) raised hands

2) Visited Resources

3) Announcements Views

4) Discussion

The graphs for 'raised hands' and 'visited resources' show the most promise in differentiating between 'H' and 'L' classes. The bimodal shape of their density curves show distinct peaks at the opposite ends of the graph.



In [35]:

    
fig = plt.figure(figsize=(18,8))

plt.subplot(221)
data.raisedhands[data.Class == 'H'].plot(kind='kde') 
data.raisedhands[data.Class == 'M'].plot(kind='kde') 
data.raisedhands[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best') 
plt.title('Raised Hands')


plt.subplot(222)
data.VisITedResources[data.Class == 'H'].plot(kind='kde') 
data.VisITedResources[data.Class == 'M'].plot(kind='kde') 
data.VisITedResources[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best') 
plt.title('Visited Resources')


plt.subplot(223)
data.Discussion[data.Class == 'H'].plot(kind='kde') 
data.Discussion[data.Class == 'M'].plot(kind='kde') 
data.Discussion[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Discussion')

plt.subplot(224)
data.AnnouncementsView[data.Class == 'H'].plot(kind='kde') 
data.AnnouncementsView[data.Class == 'M'].plot(kind='kde') 
data.AnnouncementsView[data.Class == 'L'].plot(kind='kde') 
plt.legend(('High', 'Middle','Low'),loc='best')
plt.title('Viewed Announcements')









    Out[35]:





<matplotlib.text.Text at 0x11ba99f60>

Numerical Feature Analysis: Relationship & Correlation

Correlations can tell us about the direction, and the degree (strength) of the relationship between two variables (or features)

Using Pearson's correlation coefficient (Pearson's r) where r ranges from -1 (negative relationship) to 0 (no relationship to 1 (positive relationship).

We see that the only relatively strong relationship r = 0.69 is between the two strongest indications of class differences as noted above: 'raised hands' and 'visited resources'.



In [12]:

    
raised_hands = data['raisedhands']
discussion = data['Discussion']
v_resources = data['VisITedResources']
v_announcements = data['AnnouncementsView']

def correlation(x,y):
    std_x = (x-x.mean())/x.std(ddof=0)
    std_y = (y-y.mean())/y.std(ddof=0)
    
    return (std_x * std_y).mean()

print ('Raised hands & Discussion: ', correlation(raised_hands, discussion))

print ('Visted Resources & Discussion: ', correlation(v_resources, discussion))

print ('Raised hands & Visited Resources: ', correlation(raised_hands, v_resources))









    



Raised hands & Discussion:  0.3393859910133952
Visted Resources & Discussion:  0.24329176916115017
Raised hands & Visited Resources:  0.6915717054692965

A more detailed view of correlation: Heatmap visualization

The stronger the correlation, the deeper the shade of purple, really just confirming our calcuations previously



In [34]:

    
fig,ax= plt.subplots(figsize=(9,7))
sns.heatmap(data.corr(),annot=True)









    Out[34]:





<matplotlib.axes._subplots.AxesSubplot at 0x11b9aa908>

Scatterplot Visualization: Class ['H, M, L'] according to 'raisedhands' & 'visitedresources'

Zoom in on main features of importance:

To get a basic understanding of what we are trying to predict, we narrow down the analysis to the two main features that have thus far provided the most information in determining 'H, M, L' classes.

First we need a procedure to turn the class values (categorical) into numerical values in order to plot them and show the different points which are 'H'(large green points) or 'M' (medium yellow points) or 'L' (small red points).



In [14]:

    
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

def new_grade(grade):
    if grade == 'H':
        return 100
    elif grade == 'M':
        return 50
    elif grade == 'L':
        return 10
    
def new_grades(grades): 
    return grades.apply(new_grade)

print (new_grades(data['Class']).head())

converted_grades = new_grades(data['Class'])



fig = plt.figure(figsize=(12, 9))
plt.scatter(data['VisITedResources'], data['raisedhands'],c= converted_grades, s = converted_grades, cmap = 'RdYlGn')
plt.ylabel("Number of times Student Raised their hand in class")
plt.xlabel("Number of times Student Visited Resources")
plt.title ('Interaction Correlation by Class Marks')
plt.colorbar()









    



0    50
1    50
2    10
3    10
4    50
Name: Class, dtype: int64






    Out[14]:





<matplotlib.colorbar.Colorbar at 0x11aab6a20>

While there is a decent amount of noise, we can still see a linear relationship where students who raise their hands and visit class resources are in the group with the higher class.

There is an evident pattern where there's a cluster of 'H's at the top right hand corner of the graph and a cluster of 'L's at the bottom left hand corner of the graph.



In [15]:

    
sns.lmplot(x="VisITedResources", y="raisedhands", data=data)
sns.plt.show()

Exploring correlations between multidimensional data & plotting all pairs of values against each other

We see the relationship between class is strongest between 'raisedhands' and other features as well as 'visitedresources' and other features

The 2 weaker features seem to be 'announcements view' and 'discussion' where the points for the different 'H, M, L' classes seem to be scattered all over with no particular pattern.

Now that we have this analysis, we have a better understanding of which features will help us predict the target classes we want and which will be added noise. This is an importance part of the process of feature analysis.



In [16]:

    
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



sns.pairplot(data, hue='Class', size=2.5);

Step #2. - Data Analysis & Application of Machine Learning Classification Models with Performance Evaluation

1. Preprocessing Data

Removal of outliers

Encoding categorical data

2. Machine Learning Models Used

Logistic Regression Classifier

Support Vector Machines Classifer

Random Forest Classifier (ensemble)

Gradient Boosting Classifier (ensemble)

Bagging Classifier (ensemble)

3. Evaluation Methods Used

Prediction Accuracy Score (on test data)

K-fold cross validation (on training data)

Plotting learning curves to assess Bias vs. Variance

Precision, Recall & F1 Scores

4. Implementation Process

Once the independent variables (features) have been determined, we split the data set into testing and training sets using sklearn's cross validation: train_test_split function

Next we import the models we need from sklearn

Scaling features (using standard scaler) & PCA (if applicable)

Tuning hyperparameters to optimize model performance (each model could require different constraints, weights or learning rates to generalize different data patterns)

Fit/train the data

Predict y-values using test data

Use sklearn's metrics to determine accuracy score

Use k-fold cross validation to avoid overfitting

Get additional performance metrics such as precision, recall and F1 score for comparison

Preprocessing data: remove outliers to reduce noise and improve accuracy

Here, the criteria used for determining outliers is based upon the summary of statistics table above (specifically the one separated by class). Using the mean for 'H' and 'L' students in the first two features, we identify a cutoff point for each. (For example, ff a student with an 'H' raised their hands less than the mean for which a student with 'L' raised their hands--they are removed from the dataset.)

Similarly, when looking at the bar graphs for attendance, we saw that very few 'H' students were absent more than 7 days, and very few 'L' students were absent less than 7 days-- also labeling them as outliers/candidates for removal.



In [17]:

    
#Creating the criteria for an outlier based on multiple conditions. Only dealing with 'H' and 'L' class to be safe. 

outliers_1 = data[(data['Class'] == 'H') & (data['raisedhands'] <= 17)]
outliers_2 = data[(data['Class'] == 'H') & (data['VisITedResources'] <= 18)]
outliers_3 = data[(data['Class'] == 'L') & (data['raisedhands'] >= 70)]
outliers_4 = data[(data['Class'] == 'L') & (data['VisITedResources'] >= 78)]
outliers_5 = data[(data['Class'] == 'H') & (data['StudentAbsenceDays'] == 'Above-7')]
outliers_6 = data[(data['Class'] == 'L') & (data['StudentAbsenceDays'] == 'Under-7')]



#dropping the rows which contained the outliers as indicated by the above criteria

new_data = data.drop([14,47,48,72,74,80,84,86,87,88,94,96,124,128,129,190,200,205,226,227,
                      228,248,250,255,344,345,444,445,450])

#Using shape to check the number of outliers dropped, usually no more than 10%.
#However, since the dataset is small, better to leave more data to work with. 
#451 left out of the orginal 480 observations: We only dropped ~ 6% of data. 

print(new_data.shape)

Preprocessing Data: Label Encoding Categorical Values to transform the prediction target (y)

Libraries used: Scikit-Learn

In Sklearn, machine learning algorithms require the input variables from the data to be real values. Therefore we must format (or transform) the data into the structure that allows us to feed it into the model.

Here, we use a label encoder to prepare the data where numerical dimensions are used to represent membership in the categories ('H, L, M')

As we can see, once encoded:

Categorical Class	Numerical value
H (High grade)	0
L (Low grade)	1
M (Middle grade)	2



In [18]:

    
#Preprocessing data to encode categorical values for the y-target column


from sklearn import preprocessing
le = preprocessing.LabelEncoder()

target_value = le.fit_transform(new_data.Class)

print (target_value)









    



[2 2 1 1 2 2 1 2 2 2 0 2 1 1 2 2 2 2 0 2 2 2 1 1 1 2 1 2 2 0 1 1 1 1 1 1 2
 1 2 1 2 1 2 2 1 1 2 1 1 2 0 1 1 1 1 2 2 1 2 0 2 1 1 2 0 0 2 1 2 2 2 2 2 1
 0 1 1 2 1 1 1 0 0 0 0 2 2 2 2 0 1 1 2 1 2 0 2 2 0 2 1 1 1 1 2 0 2 2 2 1 2
 2 1 2 1 1 2 1 1 0 0 0 2 0 2 1 1 2 0 1 2 0 2 2 0 0 2 0 1 2 0 2 2 1 2 0 2 0
 2 2 0 2 0 0 2 0 2 1 1 2 1 0 2 0 2 0 1 0 2 1 0 2 2 0 2 1 2 2 2 2 0 0 1 2 0
 2 2 1 2 2 2 2 0 2 0 1 1 1 2 2 0 2 2 2 2 0 0 2 1 2 1 2 2 2 1 1 2 2 0 0 2 1
 2 0 2 0 2 2 1 2 1 0 0 2 2 1 1 2 2 2 2 0 2 2 2 2 0 2 2 0 0 0 0 0 2 2 0 0 0
 0 2 2 0 0 2 2 1 1 0 0 2 2 0 0 2 2 1 1 2 2 2 2 0 0 2 2 2 2 0 0 0 0 0 0 0 0
 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 0 0 1 1 1 1 2 2 0 0 2 2 0
 0 2 2 0 0 0 0 2 2 0 0 2 2 1 1 1 1 2 2 1 1 1 1 0 0 0 0 2 2 1 1 2 2 0 0 0 0
 2 2 0 0 2 2 0 0 0 0 1 1 2 2 0 0 2 2 1 1 0 0 0 0 0 0 2 2 0 0 2 2 1 1 0 0 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 2 2 2 2 0 0 0 0 2 2 2 2 0 0 0 0 1 1 2 2 2
 2 1 1 2 2 1 1]

Preprocessing Data: Dummy/One-hot Encoding to transform the features values (x)

A common alternative approach to encoding categorical values is called dummy (or one-hot) encoding where the the basic strategy is to convert each category value into a new column and assigns a 1 or 0 (True/False) value to the column

Pandas supports this feature using get_dummies to create dummy/indicator variables (aka 1 or 0).



In [19]:

    
data_dummies = pd.get_dummies(new_data, columns = ['ParentschoolSatisfaction', 'StudentAbsenceDays','Relation'])

print(data_dummies.shape)

data_dummies.head()









    



(451, 20)






    Out[19]:







  
    
      
      gender
      NationalITy
      PlaceofBirth
      StageID
      GradeID
      SectionID
      Topic
      Semester
      raisedhands
      VisITedResources
      AnnouncementsView
      Discussion
      ParentAnsweringSurvey
      Class
      ParentschoolSatisfaction_Bad
      ParentschoolSatisfaction_Good
      StudentAbsenceDays_Above-7
      StudentAbsenceDays_Under-7
      Relation_Father
      Relation_Mum
    
  
  
    
      0
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      15
      16
      2
      20
      Yes
      M
      0
      1
      0
      1
      1
      0
    
    
      1
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      20
      20
      3
      25
      Yes
      M
      0
      1
      0
      1
      1
      0
    
    
      2
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      10
      7
      0
      30
      No
      L
      1
      0
      1
      0
      1
      0
    
    
      3
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      30
      25
      5
      35
      No
      L
      1
      0
      1
      0
      1
      0
    
    
      4
      M
      KW
      KuwaIT
      lowerlevel
      G-04
      A
      IT
      F
      40
      50
      12
      50
      No
      M
      1
      0
      1
      0
      1
      0

Logistic Regression Classifer

The logistic regression classifier is a method used to generalizes logistic regression to multiclass problems (applicable here since we are trying to predict more than two possible discrete outcomes). It is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable

Dependent variable (y values) = Class ('H, L, M')

Independent variables (or features / x values) = 'raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad', 'Relation_Father', Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7','StudentAbsenceDays_Under-7'

In Logistic Regression modeling, input values (X) are combined linearly using weights or coefficient values to predict an output value (y)--- or more specifically, the probability that an input (X) belongs to the default class.

Pros:

Low variance

Provides probabilities for outcomes

works well with diagonal (feature) decision boundaries

Cons:

Doesn’t perform well when feature space is too large

High bias

Relies on entire data



In [30]:

    
feature_cols = feature_cols = ['raisedhands', 'VisITedResources',  'ParentschoolSatisfaction_Bad',  'Relation_Father', 
                               'Relation_Mum','ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7',
                                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)


import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe_lr = Pipeline([('rs', StandardScaler()), ('pca', PCA(n_components = 4)), 
                    ('logreg', LogisticRegression(C=1e9))])

pipe_lr.fit(X_train, y_train)

y_pred = pipe_lr.predict(X_test)

from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print(' ', classification_report(y_test, y_pred))









    



Prediction Accuracy: 0.764705882353
Cross-validated Scores: [ 0.675       0.74358974  0.76923077  0.73684211  0.81578947  0.73684211
  0.84210526  0.89473684  0.76315789  0.72972973]
CV Accuracy: 0.77 (+/- 0.12)
               precision    recall  f1-score   support

          0       0.61      0.65      0.63        17
          1       0.88      1.00      0.94        23
          2       0.75      0.64      0.69        28

avg / total       0.76      0.76      0.76        68

Visualization of Learning Curves: Test & Training Data

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.

If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data.

We can use the sklearn learning_curve function to generate the values that are required to plot such a learning curve (number of samples that have been used, the average scores on the training sets and the average scores on the validation sets)

Learning curve allows us to verify when a model has learning as much as it can about the data, indicated by: a) the performances on the training and testing sets reach a plateau and b) here is a consistent gap between the two error rates, as is consistent with our graph below.

Our Learning Curves show a decent model because a) the testing and training learning curves converge at similar values and b) the smaller the gap between curves, the better our model generalizes. The results of our graph represent moderate bias and low variance, which is an indication that we should increase model complexity--leading us to the ensemble methods we'll be using moving forward.



In [22]:

    
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

def plot_learning_curve(estimator, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Testing score")

    plt.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = svc
plot_learning_curve(estimator, X, y, (0.7, 1.01), cv=cv, n_jobs=4)









    Out[22]:





<module 'matplotlib.pyplot' from '/Users/jadeshao/anaconda/lib/python3.6/site-packages/matplotlib/pyplot.py'>

Random Forest Classifier

A random forest is an ensemble method (combination of learning algorithms) that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging (majority of votes to make a prediction) to improve the predictive accuracy and control over-fitting.

At the root of random forest are decision trees, which are a type of flowchart which assist in the decision making process. Internal nodes represent tests on particular attributes, while branches exiting nodes represent a single test outcome, and leaf nodes represent class labels. The goal is to split on the attributes which create the purest child nodes possible.

Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.

Pros:

Accurate and does not tend to overfit

Robust against outliers in the predictive variables

It gives estimates of what variables are important in the classification--> feature selection

Cons:

Not as easy to visually interpret

Slower runtime



In [33]:

    
# RANDOM FOREST CLASSIFIER model -- ensemble method No.1

feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',  
                'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father', 
                               'Relation_Mum',
                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)


from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=3, n_estimators=200, oob_score = True, n_jobs = -1, 
                                                   random_state=50)
rfc.fit(X_train, y_train)


y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))









    



Prediction Accuracy: 0.838235294118
Cross-validated Scores: [ 0.725       0.84615385  0.66666667  0.79487179  0.71052632  0.84210526
  0.68421053  0.68421053  0.78378378  0.83783784]
CV Accuracy: 0.76 (+/- 0.14)
Scores              precision    recall  f1-score   support

          0       0.86      0.79      0.83        24
          1       0.87      0.93      0.90        14
          2       0.81      0.83      0.82        30

avg / total       0.84      0.84      0.84        68

Random Forest: Feature Importance

One of the best use cases for random forest is that it's a great tool for feature selection. Since random forest is built on having multiple decision tress, one of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

Random forests measures feature importance through something called the Gini Importance or Mean Decrease in Impurity (MDI) calculates each feature importance as the sum over the number of splits (accross all tress) that include the feature, proportionaly to the number of samples it splits.



In [24]:

    
# Taking a look at feature importance via Random Forest

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

feat_imp = pd.Series(rfc.feature_importances_, index=X.columns)
feat_imp.sort_values(inplace=True, ascending=False)
feat_imp.head(20).plot(kind='barh', title='Feature importance')









    Out[24]:





<matplotlib.axes._subplots.AxesSubplot at 0x11c167eb8>

Gradient Boosting Classifier

Gradient Boosting is based on the the idea of boosting, which is a method of trying to modify a weak learner into a becoming a better one. It starts with filtering observations, leaving those observations that the weak learner can handle and focusing on developing new weak learns to handle the remaining difficult observations. For example, the model will build trees one at a time, where each new tree helps to correct errors made by previously trained tree.

With Gradient Boosting, the objective is to minimize the loss of the model by adding weak learners using a gradient descent like procedure. This type of algorithm can be described as a stage-wise additive model. This is because one new weak learner is added at a time and existing weak learners in the model are frozen and left unchanged.

Pros:

Can easily handle qualitative (categorical) features

Very powerful and performs well in most cases

Cons:

Training generally takes longer because of the fact that trees are built sequentially

Harder to fit/tune parameters



In [27]:

    
# Gradient Boosting Classifer: Ensemble method No.2

feature_cols = ['raisedhands', 'VisITedResources', 'ParentschoolSatisfaction_Bad',  
                'ParentschoolSatisfaction_Good', 'StudentAbsenceDays_Above-7', 'Relation_Father', 
                               'Relation_Mum',
                'StudentAbsenceDays_Under-7']

X = data_dummies[feature_cols]
y = target_value


# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit_transform(X_train, y_train)
scl.transform(X_test)

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate = 0.15, n_estimators = 300, max_depth = 8, min_samples_leaf = 3, 
                                 max_features = 'log2')
gbc.fit(X_train,y_train)

y_pred = gbc.predict(X_test)

from sklearn.metrics import accuracy_score
print ('Prediction Accuracy:', accuracy_score(y_test, y_pred))


from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator = pipe_lr, X= X_train, y = y_train, cv = 10, n_jobs =1)
print ('Cross-validated Scores: %s' %scores)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

from sklearn.metrics import classification_report
print('Scores', classification_report(y_test, y_pred))









    



Prediction Accuracy: 0.867647058824
Cross-validated Scores: [ 0.8974359   0.71794872  0.76923077  0.84615385  0.65789474  0.78947368
  0.73684211  0.65789474  0.73684211  0.81081081]
CV Accuracy: 0.76 (+/- 0.15)
Scores              precision    recall  f1-score   support

          0       0.77      0.89      0.83        19
          1       0.94      0.94      0.94        18
          2       0.89      0.81      0.85        31

avg / total       0.87      0.87      0.87        68

Summary of Scores & Conclusion:

Although the majority of prediction accuracy and cross-validated scores were only in the [upper 70 - low 80] percentile range, with the best model: Gradient Boosting doing slightly better than the rest at 87% prediction accuracy. Overall, it can be said that the models performed well in a dataset where the signal to noise ratio was relatively low and little to none feature engineering was done.

However, given the nature of the project, which was to determine which students were in need of intervention, we can see that the results of the model were very favorable to our cause. In such a case, we often care more about the correct classification of 'L' students who are the ones we want to look out for, and much less about the labeling of 'H' and 'M' students.

For example: An 'H' student being labeled as an 'L' student is much less of a concern---as teachers would simply have spent a little extra time verifying that the student wasn't actually an at-risk student.

Whereas an 'L' student being labeled as an ['H' or 'M'] student would be a much bigger problem---as teachers would have skipped over them entirley and that student would have missed a chance for intervention.

Therefore, looking at the precision and recall scores for label-encoded-target-value 1 (representing 'L' students), we see the scores for correctly identifying 'L' or 1 students at a much higher percentage that the prediction accuracy/cross-validated scores of the dataset as a whole.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

Precision & Recall scores for row 1 ('L' Students) only:

Model Used	Accuracy	Cross-Validated	Precision	Recall
Logistic Regression	0.76	0.77 (+/- 0.12	0.88	1.00
Random Forest	0.83	0.76 (+/- 0.14)	0.87	0.93
Gradient Boosting	0.87	0.76 (+/- 0.15)	0.94	0.94

Conclusion

Having implented several models to observe which was the best fit for our dataset, results show that the top performing model was:

Gradient Boosting Classifier with an Accuracy Score of: 87%

In evaluating the overall performance of this project, and taking into consideration the original aim in which the goal was to have the classifier find all the positive samples of 'L' students for intervention identification, it's evident that through examining the high 'Recall' scores (ranging from .92 to 1.00), our model did exceptionally well in this area.



In [ ]:

	gender	NationalITy	PlaceofBirth	StageID	GradeID	SectionID	Topic	Semester	Relation	raisedhands	VisITedResources	AnnouncementsView	Discussion	ParentAnsweringSurvey	ParentschoolSatisfaction	StudentAbsenceDays	Class
0	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	15	16	2	20	Yes	Good	Under-7	M
1	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	20	20	3	25	Yes	Good	Under-7	M
2	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	10	7	0	30	No	Bad	Above-7	L
3	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	30	25	5	35	No	Bad	Above-7	L
4	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	40	50	12	50	No	Bad	Above-7	M
5	F	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	42	30	13	70	Yes	Bad	Above-7	M
6	M	KW	KuwaIT	MiddleSchool	G-07	A	Math	F	Father	35	12	0	17	No	Bad	Above-7	L
7	M	KW	KuwaIT	MiddleSchool	G-07	A	Math	F	Father	50	10	15	22	Yes	Good	Under-7	M
8	F	KW	KuwaIT	MiddleSchool	G-07	A	Math	F	Father	12	21	16	50	Yes	Good	Under-7	M
9	F	KW	KuwaIT	MiddleSchool	G-07	B	IT	F	Father	70	80	25	70	Yes	Good	Under-7	M
10	M	KW	KuwaIT	MiddleSchool	G-07	A	Math	F	Father	50	88	30	80	Yes	Good	Under-7	H
11	M	KW	KuwaIT	MiddleSchool	G-07	B	Math	F	Father	19	6	19	12	Yes	Good	Under-7	M
12	M	KW	KuwaIT	lowerlevel	G-04	A	IT	F	Father	5	1	0	11	No	Bad	Above-7	L
13	M	lebanon	lebanon	MiddleSchool	G-08	A	Math	F	Father	20	14	12	19	No	Bad	Above-7	L
14	F	KW	KuwaIT	MiddleSchool	G-08	A	Math	F	Mum	62	70	44	60	No	Bad	Above-7	H

	raisedhands	VisITedResources	AnnouncementsView	Discussion
count	480.000000	480.000000	480.000000	480.000000
mean	46.775000	54.797917	37.918750	43.283333
std	30.779223	33.080007	26.611244	27.637735
min	0.000000	0.000000	0.000000	1.000000
25%	15.750000	20.000000	14.000000	20.000000
50%	50.000000	65.000000	33.000000	39.000000
75%	75.000000	84.000000	58.000000	70.000000
max	100.000000	99.000000	98.000000	99.000000

	raisedhands				VisITedResources				AnnouncementsView				Discussion
	min	median	mean	max	min	median	mean	max	min	median	mean	max	min	median	mean	max
Class
H	10	75	70.288732	100	4	84	78.746479	99	2	52	53.380282	98	2	54	53.661972	99
L	0	10	16.889764	80	0	11	18.322835	90	0	11	15.574803	66	1	21	30.834646	98
M	0	50	48.938389	100	2	72	60.635071	99	0	38	40.962085	93	3	40	43.791469	98

	Class	H	L	M
ParentschoolSatisfaction	Relation
Bad	Father	73.375000	16.282051	42.793103
Bad	Mum	75.687500	16.000000	45.818182
Good	Father	65.764706	17.076923	45.240506
Good	Mum	70.797619	19.705882	62.730769