It is common for students to have a Facebook group on which they post course-relevant discussion and announcements to help each other keep up with various aspects of the course. Among the different types of posts, it is customary for people to announce when assessment marks are available to check online, in lieu of a formal notification from the administration. The premise of this short study is that the number of likes that this sort of post gets can be used to predict the average mark of the assessment.
The reasoning behind this is that students may be less likely to react positively when they check their mark and it turns out to be low, whereas with a high mark they may feel a need to express their happiness and reward the messenger with a like. Yet the number of likes may be an imperfect measure in many ways:
This is just a one-day project, so only the 2nd point will be addressed.
This study was done for a UK BSc course, so some aspects may not make sense if you are not familiar with the academic environment there. For example, 70 is considered quite a good mark (first class), and one can pass with a mark over 30 provided it can be compensated with marks from a different module. Furthermore, when I talk about "closed assessment" or "closed exam" I refer to an exam of about 2 hours, typically done with no internet access, and "open assessment" typically refers to a project lasting a few weeks or months (report/experiment/coding/coursework). Also, a bachelor's degree in the UK lasts 3 years.
Due to an unstandardised method of announcing that the marks are up, the collection of the data was accomplished via a tedious process using Facebook group search and manually storing relevant data. I had access to 2 groups, so I gathered data from several academic years and connected every exam to its average mark as released by the university department. I will not make the data public because I'm not sure about the privacy policies involved.
In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 12, 10
plt.rcParams.update({'font.size': 15})
In [2]:
data = pd.read_csv('../data/train.csv')
In [3]:
data.describe()
Out[3]:
As shown above, I managed to collect data for 56 exams. The lowest mean mark was 37, and the highest was 83.66. On average, exams tend to end up with a mark slightly above 60 (std of 9.39), about 66% have been closed exams, and I collected slightly more data from 3rd years (as seen by the mean of the cohort column). The minimum number of students that an exam had was 13 (I was there!), and the maximum was 131 (I was there too!).
Anyway, let's get to what is likely the most interesting question right now: is there a distinguishable structure in the data?
Let's plot the mean marks against the number of likes.
In [4]:
# Create a quick function to allow reusing
def scatter_plot(independent, dependent, title="", xlabel="", ylabel=""):
    plt.scatter(independent, dependent)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
In [5]:
scatter_plot(data['likes'],data['mean'],
title="Relationship between number of likes and mean mark",
xlabel="Number of students who liked the announcements",
ylabel="Mean mark")
Alright, there is very little structure here, but let's see how it looks when we use the percentage of students.
In [6]:
data['likes_normalised'] = (100 * data['likes'])/data['number_of_students']
scatter_plot(data['likes_normalised'],data['mean'],
title="Relationship between percentage of likes and mean mark",
xlabel="Percentage of students who liked the announcements",
ylabel="Mean mark")
Quite a bit better, but let's do it for the median as it seems like a more intuitive measurement here: half of the students got less than the mark indicated on the y axis. Also, the median is more resilient to outliers, which may be helpful in this case.
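To make the outlier point concrete, here is a tiny sketch on made-up marks (these numbers are hypothetical and are not taken from the collected data). It only shows how much one extreme mark moves the mean compared with the median.

```python
import numpy as np

# Hypothetical marks for one small module (invented for illustration only)
marks = np.array([52, 55, 58, 60, 61, 63, 65])

# The same module with a single extreme mark added
marks_with_outlier = np.append(marks, 100)

# How far each summary statistic moves because of that one mark
mean_shift = marks_with_outlier.mean() - marks.mean()
median_shift = np.median(marks_with_outlier) - np.median(marks)

print("Mean moves by {:.2f} marks".format(mean_shift))
print("Median moves by {:.2f} marks".format(median_shift))
```

On numbers like these the median barely moves, which is the property I am relying on when switching the target from the mean to the median.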
In [7]:
scatter_plot(data['likes_normalised'],data['median'],
title="Relationship between percentage of likes and median mark",
xlabel="Percentage of students who liked the announcements",
ylabel="Median mark")
Alright, without starting to remove outliers, this seems to be the best I have. I will stay away from outlier detection, and from the graph there doesn't seem to be any really obvious one (maybe besides that one at 80+ median, but I won't go there).
The structure looks like it can be fitted with a polynomial curve, so without further ado, I'll just go ahead and try some polynomial regression.
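In case it is not obvious why a linear model can fit a curve: polynomial regression is still linear in its coefficients, we just hand it extra columns (x squared, x cubed, and so on). The sketch below illustrates this on synthetic data with plain NumPy; the data, the coefficients used to generate it and the variable names are all assumptions for illustration and have nothing to do with the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the "percentage of likes" feature
x = np.linspace(0, 40, 30)
# Synthetic target: an assumed quadratic trend plus some noise
y = 50 + 1.2 * x - 0.02 * x**2 + rng.normal(0, 1.0, size=x.size)

# Design matrix with an intercept, a linear and a squared column
A = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded columns
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# np.polyfit solves the same problem (it returns the highest power first)
poly_coef = np.polyfit(x, y, deg=2)

print("Least squares on [1, x, x^2]:", coef)
print("np.polyfit, reordered:       ", poly_coef[::-1])
```

Both approaches should give the same coefficients up to numerical precision, which is why adding a squared column and calling an ordinary linear regression is enough for a quadratic fit.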
In [8]:
from sklearn import linear_model
# Convert series to dataframes to prepare them for fitting
X = pd.DataFrame(data['likes_normalised'])
y = pd.DataFrame(data['median'])
# Add a squared feature
X = pd.concat([X,pow(X,2)],axis=1)
# Fit data with linear regression
model_linear = linear_model.LinearRegression()
model_linear.fit(X,y)
# Create some new points to graph the curve
new_x = np.linspace(X.iloc[:,0].min(),X.iloc[:,0].max())
test_x = pd.DataFrame([new_x, pow(new_x,2)]).T
# Plot curve
plt.plot(test_x.iloc[:,0],model_linear.predict(test_x))
# Plot points
scatter_plot(X.iloc[:,0],y,
title="Relationship between percentage of likes and median mark",
xlabel="Percentage of students who liked the announcements",
ylabel="Median mark")
Looks about right to me. I don't think stopping here would be a bad decision, but for the sake of it, I will try to fit it with a cubic polynomial.
In [9]:
# Add a 3rd power and fit the new data
X_ = pd.concat([X,pow(X.iloc[:,0],3)],axis=1)
model_linear.fit(X_,y)
test_x_ = pd.concat([test_x, pow(test_x.iloc[:,0],3)],axis=1)
plt.plot(test_x_.iloc[:,0],model_linear.predict(test_x_))
# Plot points
scatter_plot(X_.iloc[:,0],y,
title="Relationship between percentage of likes and median mark",
xlabel="Percentage of students who liked the announcements",
ylabel="Median mark")
Looks pretty, but it likely overfits. So let's try to add L2 regularization and find the regularization parameter through 10-fold cross validation.
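For reference, this is a minimal sketch of the same idea on synthetic data: a cubic ridge regression whose penalty is picked by 10-fold cross-validation. It is not the setup used in this notebook. The log-spaced alpha grid, the `StandardScaler` step and the generated data are my own assumptions, added because the powers of a percentage live on very different scales and the L2 penalty treats all coefficients alike.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature in the 0-40 range, noisy quadratic target
x = rng.uniform(0, 40, size=60).reshape(-1, 1)
y = 50 + 1.2 * x[:, 0] - 0.02 * x[:, 0] ** 2 + rng.normal(0, 3.0, size=60)

# Candidate regularization strengths on a log scale (an assumed grid)
alphas = np.logspace(-3, 3, 50)

model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # x, x^2, x^3
    StandardScaler(),                                   # put the powers on one scale
    RidgeCV(alphas=alphas, cv=10),                      # choose alpha by 10-fold CV
)
model.fit(x, y)

best_alpha = model.named_steps["ridgecv"].alpha_
print("Selected alpha: {:.4g}".format(best_alpha))
```

Whether scaling helps on the real data is something I have not checked; treat it as an option to try rather than a recommendation.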
In [10]:
from sklearn.linear_model import RidgeCV
list_reg_params = np.linspace(0.00001, X_.max().max() * 100, 2000)
model_ridge = linear_model.RidgeCV(alphas=list_reg_params, cv=10)
model_ridge.fit(X_,y)
reg_param = model_ridge.alpha_
print("The regularization parameter that provided the best results was {}.".format(reg_param))
plt.plot(test_x_.iloc[:,0],model_ridge.predict(test_x_))
scatter_plot(X_.iloc[:,0],y,
title="Relationship between percentage of likes and median mark",
xlabel="Percentage of students who liked the announcements",
ylabel="Median mark")
In [11]:
model_ridge_reg = linear_model.Ridge(alpha=reg_param)
In [12]:
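# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score
# The scorer returns negated errors (higher is better), hence the abs()
scores_quadratic = abs(cross_val_score(
    model_linear, X, y, cv=10, scoring="neg_mean_absolute_error"))
scores_cubic = abs(cross_val_score(
    model_ridge_reg, X_, y, cv=10, scoring="neg_mean_absolute_error"))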
print("Average error for quadratic polynomial: {:.2f} (+/- {:.2f})" .format(
scores_quadratic.mean(), scores_quadratic.std() * 2))
print("Average error for cubic polynomial: {:.2f} (+/- {:.2f})" .format(
scores_cubic.mean(), scores_cubic.std() * 2))
Not too bad. An error of less than 7 marks is better than I expected. Now for a comparison test, let's just predict that the median of a module is the mean of the medians of all the other modules (but using K-fold CV).
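As a side note, scikit-learn ships a `DummyRegressor` that implements this kind of "always predict the training mean" baseline, so it can be scored with the same cross-validation call as the real models. The sketch below uses synthetic medians (56 values drawn around 60, purely invented) and is only meant to show the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the median marks of 56 exams (invented values)
medians = rng.normal(60, 9, size=56)
# The baseline ignores its inputs, so a placeholder feature column is enough
X_unused = np.zeros((medians.size, 1))

# Always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean")

# The scorer returns negated errors, so flip the sign to get plain MAE per fold
scores = -cross_val_score(
    baseline, X_unused, medians,
    cv=KFold(n_splits=10), scoring="neg_mean_absolute_error")

print("Baseline MAE: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2))
```

This should give the same per-fold errors as a hand-written loop over the folds, as long as the folds are the same.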
In [13]:
from sklearn.metrics import mean_absolute_error
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
scores_naive = []
for train, test in kf.split(X):
    # Predict the mean of the training medians for every held-out module
    train_mean = data['median'].iloc[train].mean()
    scores_naive.append(mean_absolute_error(data['median'].iloc[test], [train_mean] * len(test)))
print("Average error for naive predictor: {:.2f} (+/- {:.2f})".format(
    np.mean(scores_naive), np.std(scores_naive) * 2))
D'aww, the regression predictor is not much better than the naive one. At least it's not worse! As a last test, let's see how likely it is that the regression results look better just by chance. For this, a paired one-tailed t-test will be performed, with the null hypothesis that the 2 algorithms perform the same, and the alternative hypothesis that the regression one performs better. The significance level I will choose is 0.05.
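A small sketch of what the one-tailed paired test looks like in SciPy, on made-up per-fold errors (the numbers are synthetic and the variable names are hypothetical). It assumes SciPy 1.6 or newer, where `ttest_rel` accepts an `alternative` argument; halving the two-sided p-value gives the same answer only when the test statistic has the sign the alternative hypothesis expects.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Synthetic per-fold errors: the "model" is made slightly better on average
errors_naive = rng.normal(8.0, 1.5, size=10)
errors_model = errors_naive - rng.normal(1.0, 0.8, size=10)

# One-tailed test: is the naive error greater than the model error?
one_sided = ttest_rel(errors_naive, errors_model, alternative="greater")

# Two-sided test, for comparison with the "divide by 2" shortcut
two_sided = ttest_rel(errors_naive, errors_model)

print("One-sided p-value: {:.4f}".format(one_sided.pvalue))
print("Two-sided p-value / 2: {:.4f}".format(two_sided.pvalue / 2))
```

If the statistic came out negative (the naive predictor doing better), the halved two-sided value would understate the one-sided p-value, so it is worth checking the sign before dividing.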
In [14]:
from scipy.stats import ttest_rel
results = ttest_rel(scores_naive, scores_cubic)
print("P-value: {:.4f}" .format(results.pvalue/2)) # Dividing by 2 because it's one-tailed
So it was not all in vain! It seems that the algorithm is likely to perform better than the naive approach.
And this is it. The algorithm appears to predict grades with an average error of approximately 7 marks, and performs better than a naive predictor. Not bad, I'd say, considering we only have the number of likes on Facebook!
I did not put aside a separate testing set because the data was not very big to begin with, and I don't think that a small test set would necessarily give a good indication of the algorithm performance. However, I will nonetheless test it on this year's exams once the results and the statistics are up!