Predicting Average Marks Based on Facebook Likes

Introduction

It is common for students to have a Facebook group in which they post course-relevant discussion and announcements to help each other keep up with the course. Among the different types of posts, it is customary for people to announce when assessment marks are available to check online, in lieu of a formal notification from the administration. The premise of this short study is that the number of likes that this sort of post gets can be used to predict the average mark of the assessment.

The reasoning behind this is that students may be less likely to react positively when they check their mark and it turns out to be low, whereas after a high mark they may feel a need to express their happiness and reward the messenger with a like. Yet the number of likes may be an imperfect measure in several ways:

1. Low-ish marks for an exam may excite students if their expectations were low, while good-ish marks may upset them if they had higher expectations. This suggests a need for a possibly difficult-to-quantify "expectation" variable. One possible way to address this is to introduce a difficulty variable for each exam based on historical results.
2. The number of likes varies with the number of students in a class. This may be addressed by evaluating the percentage of students who liked the post, rather than the raw number.
3. People may express their happiness through comments. This may be addressed by counting the number of comments, together with some form of sentiment analysis to distinguish between happiness and disappointment. In my experience though, people are much more likely to stick with a simple "like" than to make a more emotional public display of enthusiasm.

This is just a one-day project, so only the second point will be addressed.
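Although it is not used anywhere in this study, the "difficulty" idea from the first point could be sketched roughly as below. Everything here is an assumption for illustration: the `module` column (an identifier linking the same exam across years), the module names, and the marks are all made up and are not part of the collected data. The expected mark of an exam is taken to be the average median of the same module in earlier years.

```python
import pandas as pd

# Made-up example rows (NOT the collected data); `module` is a hypothetical column
exams = pd.DataFrame({
    "module": ["algebra", "algebra", "algebra", "compilers", "compilers"],
    "year": [2013, 2014, 2015, 2014, 2015],
    "median": [55.0, 60.0, 70.0, 65.0, 62.0],
})
exams = exams.sort_values(["module", "year"]).reset_index(drop=True)

# Expected mark: mean of the medians from earlier sittings of the same module.
# shift(1) excludes the current year, so the first sitting has no expectation (NaN).
exams["expected"] = exams.groupby("module")["median"].transform(
    lambda s: s.shift(1).expanding().mean()
)

# Positive values mean the exam went better than its history suggested
exams["surprise"] = exams["median"] - exams["expected"]
print(exams)
```

A "surprise" feature of this kind could then sit next to the like counts as an extra input, but that would need several years of history per module.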

Additional information

This is done for a UK BSc course, so if you are not familiar with the academic environment there, some aspects may not make sense. For example, 70 is considered quite a good mark (first class), and one can pass a module with a mark over 30 as long as it is compensated by marks from a different module. Furthermore, when I talk about a "closed assessment" or "closed exam" I refer to an exam typically sat with no internet access in about 2 hours, while "open assessment" typically refers to a project lasting a few weeks or months (report/experiment/coding/coursework). Also, a bachelor's degree in the UK lasts 3 years.

Collecting data

Because there is no standard way of announcing that the marks are up, I collected the data through a tedious process of searching the Facebook groups and recording the relevant data by hand. I had access to 2 groups, so I gathered data from several academic years and connected every exam to its average mark as released by the university department. I will not make the data public because I'm not sure about the privacy policies involved.

Exploring data



In [1]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = 12, 10
plt.rcParams.update({'font.size': 15})



In [2]:

    
data = pd.read_csv('../data/train.csv')



In [3]:

    
data.describe()









    Out[3]:

              id      likes         year     cohort     closed  number_of_students       mean     median
count  56.000000  56.000000    56.000000  56.000000  56.000000           56.000000  56.000000  56.000000
mean   28.500000   6.750000  2014.500000   2.142857   0.660714           80.571429  61.445536  62.273036
std    16.309506   4.806246     0.603023   0.772918   0.477752           36.801609   9.391542  10.718933
min     1.000000   0.000000  2013.000000   1.000000   0.000000           13.000000  37.040000  31.000000
25%    14.750000   2.750000  2014.000000   2.000000   0.000000           42.000000  55.490000  56.875000
50%    28.500000   6.000000  2015.000000   2.000000   1.000000           80.000000  62.050000  63.250000
75%    42.250000   9.000000  2015.000000   3.000000   1.000000          114.000000  68.417500  70.125000
max    56.000000  19.000000  2015.000000   3.000000   1.000000          131.000000  83.660000  87.000000
As shown above, I managed to collect data for 56 exams. The minimum mean mark was 37.04 and the maximum was 83.66. On average, exams tend to end up with a mean mark slightly above 60 (61.45, with a standard deviation of 9.39), about 66% were closed exams, and I collected slightly more data from 3rd years than from 1st years (the mean of the cohort column is 2.14, just above 2). The minimum number of students that an exam had was 13 (I was there!), and the maximum was 131 (I was there too!).

Anyway, let's get to what is likely the most interesting question right now: is there a distinguishable structure in the data?

Let's plot the mean marks against the number of likes.



In [4]:

    
# Create a quick function to allow reusing
def scatter_plot(independent, dependent, title="", xlabel="", ylabel=""):
    plt.scatter(independent,dependent)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()



In [5]:

    
scatter_plot(data['likes'],data['mean'], 
                      title="Relationship between number of likes and mean mark",
                      xlabel="Number of students who liked the announcements",
                      ylabel="Mean mark")

Alright, there is very little structure here, but let's see how it looks when we use the percentage of students.



In [6]:

    
data['likes_normalised'] = (100 * data['likes'])/data['number_of_students']

scatter_plot(data['likes_normalised'],data['mean'], 
                      title="Relationship between percentage of likes and mean mark",
                      xlabel="Percentage of students who liked the announcements",
                      ylabel="Mean mark")

Quite a bit better, but let's do it for the median as it seems like a more intuitive measurement here: half of the students scored below the mark indicated on the y axis and half scored above it. Also, the median is more resilient to outliers, which may be helpful in this case.



In [7]:

    
scatter_plot(data['likes_normalised'],data['median'], 
                      title="Relationship between percentage of likes and median mark",
                      xlabel="Percentage of students who liked the announcements",
                      ylabel="Median mark")

Alright, without starting to remove outliers, this seems to be the best I have. I will stay away from outlier detection, and from the graph there doesn't seem to be any really obvious one (maybe besides that one at 80+ median, but I won't go there).
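The earlier claim that the median is more resilient to outliers than the mean can be checked on a tiny made-up example. The marks below are invented for illustration and have nothing to do with the collected data; the point is only to see how far each statistic moves when one very low mark is added.

```python
import numpy as np

# Invented marks for a small class (NOT the collected data)
marks = np.array([52, 58, 61, 63, 67])
# The same class plus one student who barely attempted the exam
marks_with_outlier = np.append(marks, 5)

mean_shift = np.mean(marks) - np.mean(marks_with_outlier)
median_shift = np.median(marks) - np.median(marks_with_outlier)

print("Mean:   {:.1f} -> {:.1f}".format(np.mean(marks), np.mean(marks_with_outlier)))
print("Median: {:.1f} -> {:.1f}".format(np.median(marks), np.median(marks_with_outlier)))
```

In this toy case the mean drops by about 9 marks while the median drops by 1.5, which is why the median is a more stable target when a few students score far from the rest.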

The structure looks like it can be fitted with a polynomial curve, so without further ado, I'll just go ahead and try some polynomial regression.

Fitting the data



In [8]:

    
from sklearn import linear_model

# Convert series to dataframes to prepare them for fitting
X = pd.DataFrame(data['likes_normalised'])
y = pd.DataFrame(data['median'])

# Add a squared feature so that a linear model can fit a quadratic curve
X = pd.concat([X, pow(X, 2)], axis=1)

# Fit data with linear regression
model_linear = linear_model.LinearRegression()
model_linear.fit(X, y)

# Create some new points to graph the curve; reuse the training column
# names so that predict() sees the same feature names as fit()
new_x = np.linspace(X.iloc[:,0].min(), X.iloc[:,0].max())
test_x = pd.DataFrame(np.column_stack([new_x, pow(new_x, 2)]), columns=X.columns)

# Plot curve
plt.plot(test_x.iloc[:,0], model_linear.predict(test_x))
# Plot points
scatter_plot(X.iloc[:,0], y,
                      title="Relationship between percentage of likes and median mark",
                      xlabel="Percentage of students who liked the announcements",
                      ylabel="Median mark")

Looks about right to me. I don't think stopping here would be a bad decision, but for the sake of it, I will try to fit it with a cubic polynomial.
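As an aside, the power columns in these cells are built by hand, which works because polynomial regression is just linear regression on the features x, x², and so on. A minimal sketch of the same idea using scikit-learn's `PolynomialFeatures` in a pipeline is shown below. It runs on synthetic, noise-free data (not the exam data), chosen so that the recovered coefficients can be checked against the known curve; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free data (NOT the exam data): y = 50 + 2x - 0.05x^2
x_toy = np.linspace(0, 20, 30).reshape(-1, 1)
y_toy = 50 + 2 * x_toy.ravel() - 0.05 * x_toy.ravel() ** 2

# PolynomialFeatures generates the x and x^2 columns; LinearRegression fits them
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quadratic.fit(x_toy, y_toy)

# The fitted weights should match the curve the data was generated from
coefficients = quadratic.named_steps["linearregression"].coef_
intercept = quadratic.named_steps["linearregression"].intercept_
print(coefficients, intercept)
```

Changing `degree=2` to `degree=3` gives the cubic variant without any manual column bookkeeping.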



In [9]:

    
# Add a 3rd power and fit the new data
X_ = pd.concat([X, pow(X.iloc[:,0], 3)], axis=1)
model_linear.fit(X_, y)

# Build the matching cubic feature for the points used to draw the curve
test_x_ = pd.concat([test_x, pow(test_x.iloc[:,0], 3)], axis=1)
plt.plot(test_x_.iloc[:,0], model_linear.predict(test_x_))
# Plot points (positional indexing; X_[[0]] is a label lookup that fails on recent pandas)
scatter_plot(X_.iloc[:,0], y,
                      title="Relationship between percentage of likes and median mark",
                      xlabel="Percentage of students who liked the announcements",
                      ylabel="Median mark")

Looks pretty, but it likely overfits. So let's try to add L2 regularization and find the regularization parameter through 10-fold cross validation.



In [10]:

    
from sklearn.linear_model import RidgeCV

# Candidate regularization strengths, from almost none up to 100x the largest feature value
list_reg_params = np.linspace(0.00001, X_.max().max() * 100, 2000)
model_ridge = RidgeCV(alphas=list_reg_params, cv=10)
model_ridge.fit(X_, y)

reg_param = model_ridge.alpha_
print("The regularization parameter that provided the best results was {}.".format(reg_param))
plt.plot(test_x_.iloc[:,0], model_ridge.predict(test_x_))
# Plot points (positional indexing; X_[[0]] is a label lookup that fails on recent pandas)
scatter_plot(X_.iloc[:,0], y,
                      title="Relationship between percentage of likes and median mark",
                      xlabel="Percentage of students who liked the announcements",
                      ylabel="Median mark")









    



The regularization parameter that provided the best results was 10016.299663688093.

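The regularization parameter introduced only a very subtle difference. It's also a very big number, most likely because I haven't normalised the data: the squared and cubed percentages are large, so their coefficients are tiny and only a large penalty has any effect on them. Now let's see which of the 2 polynomial curves fits the data better.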
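For comparison, here is a minimal sketch of how the same search might look with standardised features, which is not what was done in this notebook. With every polynomial column rescaled to zero mean and unit variance, the penalty acts on comparable scales and a modest logarithmic grid of candidates is usually enough. The data is synthetic (seeded random numbers, not the exam data), and the grid bounds are an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data: 56 "exams" with a noisy quadratic trend
rng = np.random.default_rng(0)
pct_likes = rng.uniform(0, 25, size=56).reshape(-1, 1)
toy_marks = 50 + 2 * pct_likes.ravel() - 0.05 * pct_likes.ravel() ** 2 + rng.normal(0, 5, size=56)

# Log-spaced candidates; the bounds are an assumption, not a recommendation
alphas = np.logspace(-3, 3, 50)

scaled_ridge = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),  # x, x^2, x^3
    StandardScaler(),                                   # put all powers on the same scale
    RidgeCV(alphas=alphas, cv=10),                      # pick alpha by 10-fold CV
)
scaled_ridge.fit(pct_likes, toy_marks)

best_alpha = scaled_ridge.named_steps["ridgecv"].alpha_
print("Selected alpha: {}".format(best_alpha))
```

The selected value depends on the synthetic data, so it says nothing about the real exams; the sketch only illustrates why scaling tends to make the regularization parameter easier to search for and interpret.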

Evaluating the models



In [11]:

    
model_ridge_reg = linear_model.Ridge(alpha=reg_param)



In [12]:

    
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

# The scorer returns the negated MAE (so that higher is better), hence the abs()
scores_quadratic = abs(cross_val_score(
        model_linear, X, y, cv=10, scoring="neg_mean_absolute_error"))
scores_cubic = abs(cross_val_score(
        model_ridge_reg, X_, y, cv=10, scoring="neg_mean_absolute_error"))

print("Average error for quadratic polynomial: {:.2f} (+/- {:.2f})" .format(
        scores_quadratic.mean(), scores_quadratic.std() * 2))
print("Average error  for cubic polynomial: {:.2f} (+/- {:.2f})" .format(
        scores_cubic.mean(), scores_cubic.std() * 2))









    



Average error for quadratic polynomial: 7.05 (+/- 3.87)
Average error  for cubic polynomial: 6.86 (+/- 4.75)

Not too bad. An error of less than 7 marks is better than I expected. Now for a comparison test, let's just predict that the median of a module is the mean of the medians of all the other modules (but using K-fold CV).



In [13]:

    
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Unshuffled 10 folds, the same splitting that cv=10 uses for a regressor
kf = KFold(n_splits=10)
scores_naive = []
for train, test in kf.split(X):
    # Predict the mean of the training medians for every exam in the test fold
    train_mean = data['median'].iloc[train].mean()
    scores_naive.append(mean_absolute_error(data['median'].iloc[test], [train_mean] * len(test)))
    
print("Average error  for naive predictor: {:.2f} (+/- {:.2f})" .format(
        np.mean(scores_naive), np.std(scores_naive) * 2))









    



Average error  for naive predictor: 8.18 (+/- 6.37)
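As a cross-check on the hand-rolled loop, the same kind of baseline can be sketched with scikit-learn's `DummyRegressor`, which always predicts the mean of the training targets. The example below uses seeded random numbers as a stand-in for the median marks (not the real data) and compares the two approaches fold by fold.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the 56 median marks (NOT the collected data)
rng = np.random.default_rng(0)
y_toy = rng.normal(62, 10, size=56)
X_toy = np.zeros((56, 1))  # DummyRegressor ignores the features entirely

cv = KFold(n_splits=10)

# Baseline via scikit-learn: negated MAE per fold, flipped back to positive
dummy_errors = -cross_val_score(
    DummyRegressor(strategy="mean"), X_toy, y_toy,
    cv=cv, scoring="neg_mean_absolute_error")

# Same baseline written out by hand, as in the loop above
manual_errors = np.array([
    np.abs(y_toy[test] - y_toy[train].mean()).mean()
    for train, test in cv.split(X_toy)
])

print(np.allclose(dummy_errors, manual_errors))
```

Using the estimator form has the advantage that the baseline goes through exactly the same cross-validation call as the real models.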

D'aww, the regression predictor is not much better than the naive one. At least it's not worse! As a last test, let's see how likely it is that the regression results look better just by chance. For this, a paired one-tailed t-test will be performed on the per-fold errors, with the null hypothesis that the 2 algorithms perform the same, and the alternative hypothesis that the regression one performs better. The significance level I will choose is 0.05.



In [14]:

    
from scipy.stats import ttest_rel

results = ttest_rel(scores_naive, scores_cubic)
print("P-value: {:.4f}" .format(results.pvalue/2)) # Dividing by 2 because it's one-tailed









    



P-value: 0.0387

It was not all in vain! The p-value is below the 0.05 threshold, so the null hypothesis is rejected: the regularised cubic model (the one compared here) is likely to perform better than the naive approach.
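To make the test less of a black box, the sketch below computes the paired t-statistic by hand and compares it with SciPy. The per-fold errors are invented numbers, not the ones obtained above. It also uses the `alternative="greater"` argument of `ttest_rel`, which assumes SciPy 1.6 or newer and removes the need to halve a two-tailed p-value manually.

```python
import numpy as np
from scipy import stats

# Invented per-fold errors for illustration (NOT the results above)
naive_errors = np.array([8.1, 7.4, 9.0, 6.5, 10.2, 7.9, 8.8, 6.9, 9.5, 7.7])
model_errors = np.array([7.0, 7.6, 8.1, 6.0, 8.9, 7.2, 8.0, 7.1, 8.2, 6.8])

# Paired test: work on the per-fold differences
diff = naive_errors - model_errors
n = len(diff)

# t = mean difference divided by its standard error
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
# One-tailed p-value: probability of a t at least this large under the null
p_one_sided = stats.t.sf(t_stat, df=n - 1)

# Same test via SciPy (the `alternative` argument assumes SciPy >= 1.6)
result = stats.ttest_rel(naive_errors, model_errors, alternative="greater")

print("By hand: t={:.3f}, p={:.4f}".format(t_stat, p_one_sided))
print("SciPy:   t={:.3f}, p={:.4f}".format(result.statistic, result.pvalue))
```

Halving the two-tailed p-value, as done in the cell above, gives the same answer only when the observed difference points in the direction of the alternative hypothesis, that is, when the naive errors are larger on average.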

Conclusion

And this is it. The algorithm appears to predict median marks with an average error of approximately 7 marks, and performs better than a naive predictor. Not bad, I'd say, considering we only have the number of likes on Facebook!

I did not put aside a separate testing set because the data was not very big to begin with, and I don't think that a small test set would necessarily give a good indication of the algorithm performance. However, I will nonetheless test it on this year's exams once the results and the statistics are up!
