In [31]:
# install.packages("rminer", repos = "http://cran.us.r-project.org") #run it once to install missing libraries
# install.packages('randomForest', repos = "http://cran.us.r-project.org")
library(rpart) # recursion trees
library(party) # needed for other trees
library(rminer) # used in paper reference
library(rattle) # for fancy plot
library(rpart.plot) # for fancy plot
library(RColorBrewer) # for fancy plot
library(randomForest)
library(data.table)
library(caret)
library(lattice)
library(ggplot2)
The first step is to load and clean the data from the maths and Portuguese data sets. In this work we use two different sets: the mathematics course data (student-mat.csv) and the Portuguese language course data (student-por.csv).
The cleaning step simply omits the NA values from each table.
In [3]:
# Load math and portuguese data sets
math = read.table("student-mat.csv", sep=";", header=TRUE)
port = read.table("student-por.csv", sep=";", header=TRUE)
# Clean it up
math = na.omit(math)
port = na.omit(port)
In [4]:
str(math)
For our initial model we will take into account the following variables, and create two new ones:
In [5]:
# Create two new columns: one with the overall weekly alcohol consumption
# (workday consumption weighted by 5 days, weekend consumption by 2 days),
# and another indicating whether the student passed the subject (G3 > 9)
math$alcohol = (math$Dalc*5) + (math$Walc*2)
math$passed = ifelse(math$G3 > 9, "pass", "fail")
port$alcohol = (port$Dalc*5) + (port$Walc*2)
port$passed = ifelse(port$G3 > 9, "pass", "fail")
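As a quick sanity check (a small sketch, not part of the original analysis), we can look at how balanced the new pass/fail target is in each set:

# Share of passing and failing students in each data set
prop.table(table(math$passed))
prop.table(table(port$passed))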
In [6]:
# Simple decision tree over a broad set of attributes
tree1 <- rpart(passed ~ sex + age + famsize + Pstatus + Medu + Fedu + Mjob + Fjob
               + traveltime + failures + schoolsup + famsup + romantic + famrel
               + freetime + absences + alcohol,
               data = math, method = "class")
# Plot a fancy and more informative tree
fancyRpartPlot(tree1)
print(tree1)
In [7]:
table(math$failures)
plot(factor(math$failures))
Out[7]:
In [8]:
head(table(math$absences)) # just some samples
plot(factor(math$absences))
Out[8]:
As we can see, the behaviour of students who have already failed at least once differs from the rest: according to the tree, they need to have missed at least one class to have a better chance of passing.
In [9]:
# Making a prediction with the previous tree, using the Portuguese set as hold-out data
prediction1 <- predict(tree1, port, type = "class")
# Share of correct (TRUE) and incorrect (FALSE) predictions
err1 <- table(prediction1 == port$passed)
prop.table(err1)
Out[9]:
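caret's confusionMatrix gives a richer view than the raw proportion, reporting the confusion table together with accuracy, sensitivity and specificity; a minimal sketch:

# Confusion matrix for the first tree; reference levels aligned with the prediction's
confusionMatrix(prediction1, factor(port$passed, levels = levels(prediction1)))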
In [10]:
# Tree
tree2 <- rpart(passed~alcohol+absences, data= math, method="class")
# Plot
fancyRpartPlot(tree2)
The tree looks pretty simple, but let's check how good our prediction turns out:
In [11]:
# Prediction
prediction2 <- predict(tree2, port, type = "class")
# Accuracy
diff2 <- table(prediction2 == port$passed)
prop.table(diff2)
Out[11]:
Not bad: using only these two variables we reached 83% accuracy.
In [12]:
# Tree using family-related attributes
tree3 <- rpart(passed ~ famrel + famsize + Pstatus + Medu + Fedu + Mjob + Fjob + famsup,
               data = math, method = "class", control = rpart.control(cp = 0.005))
# Fancy plot
fancyRpartPlot(tree3)
Unfortunately, with the default settings these attributes yield only a single node rather than a tree, as the plot error states:
{R}
Error in plot.rpart(tree3): fit is not a tree, just a root
After loosening the control of our partitioner (setting cp = 0.005) we were able to get a tree.
If we used the G3 value as a predictor instead we could also generate a tree, but analysing it is out of our scope, since we are only evaluating whether a student passed or not; so we move on and keep looking for other relevant attributes.
From the rpart manual:
cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of ‘cp’ is not attempted. For instance, with ‘anova’ splitting, this means that the overall R-squared must increase by ‘cp’ at each step. The main role of this parameter is to save computing time by pruning off splits that are obviously not worthwhile.
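To see how the complexity parameter trades off tree size against cross-validated error, and to pick a pruning point, rpart provides printcp and plotcp; a brief sketch:

# Complexity table: cross-validated error (xerror) at each candidate cp
printcp(tree3)
plotcp(tree3)
# Prune back to the cp with the lowest cross-validated error
best_cp <- tree3$cptable[which.min(tree3$cptable[, "xerror"]), "CP"]
tree3_pruned <- prune(tree3, cp = best_cp)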
In [13]:
# Prediction
prediction3 <- predict(tree3, port, type = "class")
# Accuracy
err3 <- table(prediction3 == port$passed)
prop.table(err3)
Out[13]:
As expected, this model does not give a good fit for our needs, so let's move on to other attributes.
In [14]:
# Tree
tree4 <- rpart(passed~sex+age+romantic+freetime,data= math, method="class")
# Plot
fancyRpartPlot(tree4)
This tree looks a bit unbalanced, but let's let the numbers say how well or badly we did, using a prediction again.
In [15]:
# Prediction
prediction4 <- predict(tree4, port, type = "class")
# Accuracy
diff4 <- table(prediction4 == port$passed)
prop.table(diff4)
Out[15]:
It was about 2% better than our first model, reaching almost 80% accuracy.
In [16]:
# Tree
tree5 <- rpart(passed~Mjob+Pstatus+Medu+Fjob+romantic+freetime+absences+alcohol+failures,data= math, method="class")
# Plot
fancyRpartPlot(tree5)
In [17]:
# Prediction
prediction5 <- predict(tree5, port, type = "class")
# Accuracy
diff5 <- table(prediction5 == port$passed)
prop.table(diff5)
Out[17]:
From this basic analysis we can draw a few conclusions:
It must be said that working with single trees is not the best way to make predictions, since they are usually built by greedy partitioning, which, as has been shown many times, can lead us to a suboptimal fit.
So, in order to draw more solid conclusions, we need to bring in more advanced machine-learning techniques.
Among the possibilities for future work are the following items:
To create better predictions, or to reproduce existing studies, it is very important to choose the right parameters and features, but also to optimize the data chosen: having many parameters can make the analysis more difficult. In this study we have 33 features, but the decision trees above did not consider all of them, so let's review some recommended formal techniques for improving the selection of variables. It has been shown that a correct variable reduction can improve the final results of an analysis.
The techniques chosen to try in this study are:
Sometimes data capture is limited by access to information, and not every sample has every field filled in; to handle this, it is important to check whether we have enough data to perform the analysis.
The Missing Values Ratio rule recommends that a feature's share of missing values not exceed 40-50% of the total number of samples; if the percentage is larger, it is better to discard the feature (a compact version of this check is sketched after the per-feature tables below).
In [18]:
# Load math and portuguese data sets with a new name
math_mv = read.table("student-mat.csv", sep=";", header=TRUE)
port_mv = read.table("student-por.csv", sep=";", header=TRUE)
In [19]:
# Missing-value ratio per feature: for each level of the feature, the share of
# NA entries among all values grouped under that level
setDT(math_mv)
features <- c("sex", "age", "famsize", "Pstatus", "Fedu", "Mjob", "Fjob",
              "traveltime", "failures", "famsup", "schoolsup", "romantic",
              "famrel", "freetime", "absences", "Dalc", "Walc")
for (f in features) {
  print(math_mv[, list(.N, unlist(.SD)), by = f][
                , list(Count = sum(is.na(V2)) / (.N + N[1])), by = f])
}
Out[19]:
Having verified this, we can say our data set is very well prepared: the missing-value ratio for every feature is 0%, meaning the data is complete and we can continue using all of it.
Note that this technique does not help us decide which features are worth collecting in the first place; rather, once we have the data, it tells us which features have enough information to be used.
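For reference, the same check can be written much more compactly; a sketch using colMeans, not part of the original notebook:

# Overall missing-value ratio per column, applying the 40-50% rule directly
mv_ratio <- colMeans(is.na(math_mv))
sort(mv_ratio, decreasing = TRUE)
names(mv_ratio)[mv_ratio > 0.4]  # candidates to discard (empty for this data set)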
data columns with little changes in the data carry little information. Thus all data columns with variance lower than a given threshold are removed. A word of caution: variance is range dependent; therefore normalization is required before applying this technique.
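A minimal manual version of this filter might look like the following sketch; the numeric columns are min-max scaled before computing the variance, and the 0.01 threshold is purely illustrative:

# Min-max scale each numeric column to [0, 1], then compute its variance
num_cols <- names(math_mv)[sapply(math_mv, is.numeric)]
scaled_var <- sapply(math_mv[, num_cols, with = FALSE],
                     function(x) var((x - min(x)) / (max(x) - min(x))))
sort(scaled_var)
# Features below the illustrative threshold would be removal candidates
names(scaled_var)[scaled_var < 0.01]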
R offers several libraries for this kind of analysis; caret is a very useful one that provides many functions related to variable reduction.
Let's use the nearZeroVar function, which computes the main variance diagnostics for each feature.
In [20]:
x = nearZeroVar(math_mv, saveMetrics = TRUE)
x
y = nearZeroVar(port_mv, saveMetrics = TRUE)
y
Out[20]:
The table can be read as follows. freqRatio is the ratio of the frequency of the most common value to that of the second most common one; features are recommended to have a frequency ratio lower than about 15.
The other parameter is percentUnique, the percentage of unique values across the samples; it is recommended to be higher than 10%, but since most of our features take only a narrow range of values, it is understandable that almost all fall below 10%.
Finally, applying further rules, the function decides whether predictors are constant (zeroVar) or close to constant (nzv).
For our data set, even though some parameters do not pass the recommended thresholds, the function declares every feature as non-near-zero-variance, meaning more consideration would be needed before dropping anything. Reducing variables is good, but we want to be sure which ones to drop and not reduce more than necessary, so we decided to keep all features.
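If any features had been flagged, the saveMetrics output could be used to drop them; a sketch (no columns are actually flagged here):

# Drop any columns flagged as near-zero variance (none in this data set)
flagged <- rownames(x)[x$nzv]
math_reduced <- math_mv[, setdiff(names(math_mv), flagged), with = FALSE]
dim(math_reduced)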
Beyond the keep-or-drop decision, it is useful in itself to know how much the data varies across samples.
Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.
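caret implements this filter directly through findCorrelation, which suggests one column from each highly correlated pair for removal; a sketch, where the 0.75 cutoff is an illustrative choice:

# Correlation matrix over the numeric columns, then flag redundant ones
num_math <- math_mv[, names(math_mv)[sapply(math_mv, is.numeric)], with = FALSE]
findCorrelation(cor(num_math), cutoff = 0.75, names = TRUE)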
In this case, correlation can also guide how we relate information, since we use math grades to train and Portuguese grades to evaluate, and it tells us how to treat the partial grades G1, G2 and G3.
In [21]:
# Average of the three period grades for each subject
math_avg = (math$G1 + math$G2 + math$G3)/3
port_avg = (port$G1 + port$G2 + port$G3)/3
In [22]:
# Pairwise correlations between the math period grades
cor(math$G1, math$G2, use = "everything", method = "pearson")
cor(math$G1, math$G3, use = "everything", method = "pearson")
cor(math$G2, math$G3, use = "everything", method = "pearson")
# Math period grades vs. their average
cor(math_avg, math$G1, use = "everything", method = "pearson")
cor(math_avg, math$G2, use = "everything", method = "pearson")
cor(math_avg, math$G3, use = "everything", method = "pearson")
# Pairwise correlations between the Portuguese period grades
cor(port$G1, port$G2, use = "everything", method = "pearson")
cor(port$G1, port$G3, use = "everything", method = "pearson")
cor(port$G2, port$G3, use = "everything", method = "pearson")
# Portuguese period grades vs. their average
cor(port_avg, port$G1, use = "everything", method = "pearson")
cor(port_avg, port$G2, use = "everything", method = "pearson")
cor(port_avg, port$G3, use = "everything", method = "pearson")
# Portuguese vs. math, both truncated to the first 394 rows (1-based indexing)
cor(port$G1[1:394], math$G2[1:394], use = "everything", method = "pearson")
cor(port$G1[1:394], math$G3[1:394], use = "everything", method = "pearson")
cor(port$G2[1:394], math$G3[1:394], use = "everything", method = "pearson")
cor(port_avg[1:394], math_avg[1:394], use = "everything", method = "pearson")
Out[22]:
In [23]:
# Same set of correlations, using Kendall's tau
cor(math$G1, math$G2, use = "everything", method = "kendall")
cor(math$G1, math$G3, use = "everything", method = "kendall")
cor(math$G2, math$G3, use = "everything", method = "kendall")
# Math period grades vs. their average
cor(math_avg, math$G1, use = "everything", method = "kendall")
cor(math_avg, math$G2, use = "everything", method = "kendall")
cor(math_avg, math$G3, use = "everything", method = "kendall")
# Portuguese period grades
cor(port$G1, port$G2, use = "everything", method = "kendall")
cor(port$G1, port$G3, use = "everything", method = "kendall")
cor(port$G2, port$G3, use = "everything", method = "kendall")
# Portuguese period grades vs. their average
cor(port_avg, port$G1, use = "everything", method = "kendall")
cor(port_avg, port$G2, use = "everything", method = "kendall")
cor(port_avg, port$G3, use = "everything", method = "kendall")
# Portuguese vs. math, truncated to the first 394 rows
cor(port$G1[1:394], math$G2[1:394], use = "everything", method = "kendall")
cor(port$G1[1:394], math$G3[1:394], use = "everything", method = "kendall")
cor(port$G2[1:394], math$G3[1:394], use = "everything", method = "kendall")
cor(port_avg[1:394], math_avg[1:394], use = "everything", method = "kendall")
Out[23]:
In [24]:
# Same set of correlations, using Spearman's rank correlation
cor(math$G1, math$G2, use = "everything", method = "spearman")
cor(math$G1, math$G3, use = "everything", method = "spearman")
cor(math$G2, math$G3, use = "everything", method = "spearman")
# Math period grades vs. their average
cor(math_avg, math$G1, use = "everything", method = "spearman")
cor(math_avg, math$G2, use = "everything", method = "spearman")
cor(math_avg, math$G3, use = "everything", method = "spearman")
# Portuguese period grades
cor(port$G1, port$G2, use = "everything", method = "spearman")
cor(port$G1, port$G3, use = "everything", method = "spearman")
cor(port$G2, port$G3, use = "everything", method = "spearman")
# Portuguese period grades vs. their average
cor(port_avg, port$G1, use = "everything", method = "spearman")
cor(port_avg, port$G2, use = "everything", method = "spearman")
cor(port_avg, port$G3, use = "everything", method = "spearman")
# Portuguese vs. math, truncated to the first 394 rows
cor(port$G1[1:394], math$G2[1:394], use = "everything", method = "spearman")
cor(port$G1[1:394], math$G3[1:394], use = "everything", method = "spearman")
cor(port$G2[1:394], math$G3[1:394], use = "everything", method = "spearman")
cor(port_avg[1:394], math_avg[1:394], use = "everything", method = "spearman")
Out[24]:
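Since the three cells above differ only in the method argument, a compact sketch can produce a side-by-side comparison of a few representative pairs:

# Representative correlations computed under all three methods at once
sapply(c("pearson", "kendall", "spearman"), function(m) c(
  math_G1_G3 = cor(math$G1, math$G3, method = m),
  port_G1_G3 = cor(port$G1, port$G3, method = m),
  cross_avg  = cor(port_avg[1:394], math_avg[1:394], method = m)))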
Looking at the cor results, the Spearman coefficient gives the highest correlations and Kendall the lowest, but the difference stays within a range of about 10%.
According to the values obtained, there is a strong correlation between the period grades, and an even stronger one between each period grade and the average grade, for both subjects.
However, when we compare math grades against Portuguese grades, the correlation is no greater than 35%. Printing the values shows that math grades are noticeably lower than Portuguese ones; moreover, the truncated rows are not guaranteed to correspond to the same students, which helps explain the weaker correlation.
However, in this paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.540.8151&rep=rep1&type=pdf the author classifies grades into groups and finds that in most cases the math classification corresponds to the Portuguese one, which confirms that using math grades to predict Portuguese ones is a reasonable experiment.
Earlier in this study we used decision trees as predictors while experimenting with different variables to find the most useful ones; even the most overfit trees did not include every variable, so trees can be an easy tool for getting an idea of which variables may be useful.
However, data scientists' points of view differ here, since some consider single trees only as a building block of the random forest technique we will see next.
In [25]:
# The seed value itself does not matter; fixing it keeps results reproducible
set.seed(444)
In [26]:
# Random forest over the same broad set of attributes
rndtree1 <- randomForest(as.factor(passed) ~ sex + age + famsize + Pstatus + Medu
+ Fedu + Mjob + Fjob + traveltime + failures + schoolsup
+ famsup + romantic + famrel + freetime + absences + alcohol
,data = math, importance=TRUE, ntree=2000)
varImpPlot(rndtree1)
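The importance plot can also be read programmatically; a sketch extracting the top five variables by mean decrease in accuracy (the same set reused further below):

# Variable importance as a matrix; type = 1 is mean decrease in accuracy
imp <- importance(rndtree1, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 5)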
In [27]:
# Prediction (predict.randomForest returns class labels with type = "response")
rfpred1 <- predict(rndtree1, port, type = "response")
# Accuracy
rferr1 <- table(rfpred1 == port$passed)
prop.table(rferr1)
Out[27]:
This was actually not the best prediction we have made so far (76% vs 84%), but as we will see in the next step, it can be improved.
In [28]:
# Conditional inference forest over the same broad set of attributes
rndtree2 <- cforest(as.factor(passed) ~ sex + age + famsize + Pstatus + Medu
+ Fedu + Mjob + Fjob + traveltime + failures + schoolsup
+ famsup + romantic + famrel + freetime + absences + alcohol
,data = math, controls=cforest_unbiased(ntree=2000, mtry=5))
# Prediction
rfpred2 <- predict(rndtree2, port, OOB=TRUE, type = "response")
# Accuracy
rferr2 <- table(rfpred2 == port$passed)
prop.table(rferr2)
Out[28]:
And we did it again: we improved our guesses by 1%, for a total of 85.5%. Now we can discard, or add, some other variables to try to push this indicator higher.
In [29]:
# Conditional inference forest with a reduced set of attributes
rndtree3 <- cforest(as.factor(passed) ~ Mjob + Pstatus + Medu +
Fjob + romantic + freetime + absences + alcohol + failures
,data = math, controls=cforest_unbiased(ntree=2000, mtry=5))
# Prediction
rfpred3 <- predict(rndtree3, port, OOB=TRUE, type = "response")
# Accuracy
rferr3 <- table(rfpred3 == port$passed)
prop.table(rferr3)
Out[29]:
The result varies from 85.8% to 86%, which is slightly better overall, but not by much. So let's use the top 5 most relevant variables from point 7.0.
In [30]:
# Conditional inference forest using only the top 5 variables by importance
rndtree4 <- cforest(as.factor(passed) ~ failures + absences + schoolsup + alcohol + age
,data = math, controls=cforest_unbiased(ntree=2000, mtry=3))
# Prediction
rfpred4 <- predict(rndtree4, port, OOB=TRUE, type = "response")
# Accuracy
rferr4 <- table(rfpred4 == port$passed)
prop.table(rferr4)
Out[30]:
We had to decrease mtry to 3, since it sets the number of variables sampled at each split and so cannot usefully exceed the number of predictors; nevertheless our prediction did not improve as we hoped, remaining at 85%.
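To check how sensitive the result is to this parameter, a small sweep can be run; a sketch, with ntree reduced to 500 to keep the run short:

# Try each feasible mtry for the 5-predictor model and report hold-out accuracy
for (m in 2:5) {
  fit <- cforest(as.factor(passed) ~ failures + absences + schoolsup + alcohol + age,
                 data = math, controls = cforest_unbiased(ntree = 500, mtry = m))
  pred <- predict(fit, port, OOB = TRUE, type = "response")
  cat("mtry =", m, " accuracy =", round(mean(pred == port$passed), 3), "\n")
}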
Random forests did increase our chances of predicting correctly, yet we still did not reach a much higher success rate. We should remember, though, that we train the model on one subject and apply it to another, so one factor is missing entirely: the difficulty of the subject itself. Taking that into account, our result seems very acceptable.