In this notebook, I use simple statistical metrics (mean, median, standard deviation, and some quantiles) to analyze different movie rating systems. I also fit a linear regression between two of them and analyze its significance.
In [78]:
import pandas as pd
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
movies = pd.read_csv('fandango_score_comparison.csv')
In [6]:
movies.describe()
Out[6]:
In [57]:
movies.info()
In [8]:
# Check the movie data structure
movies.head()
Out[8]:
In [9]:
movies["Fandango_Stars"].hist(bins=[1,2,3,4,5])
Out[9]:
In [10]:
movies["Metacritic_norm_round"].hist(bins=[1,2,3,4,5])
Out[10]:
In [11]:
# Mean, standard deviation, median and other quantiles
movies[["Metacritic_norm_round","Fandango_Stars"]].describe()
Out[11]:
=> Compared with the Metacritic distribution, the Fandango ratings are clustered at the high end of the scale. The Fandango rating system therefore looks biased and of lesser value than the Metacritic one.
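To make this impression concrete, here is a small sketch (not part of the original notebook) that quantifies the gap between the two systems directly on the movies DataFrame:

# How often, and by how much, Fandango rates a movie higher than Metacritic
rating_gap = movies["Fandango_Stars"] - movies["Metacritic_norm_round"]
print("Mean gap (Fandango - Metacritic):", rating_gap.mean())
print("Share of movies rated higher on Fandango:", (rating_gap > 0).mean())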
In [14]:
movies.plot(x='Fandango_Stars',
y='Metacritic_norm_round',
kind='scatter')
Out[14]:
In [28]:
# Absolute difference between each movie's Metacritic and Fandango ratings
def abs_diff(row):
    return np.abs(row["Metacritic_norm_round"] - row["Fandango_Stars"])

fm_diff = movies[["Metacritic_norm_round", "Fandango_Stars"]].apply(abs_diff, axis=1)
In [30]:
fm_diff = fm_diff.sort_values(ascending=False)
In [45]:
differently_rated_movies = movies.loc[fm_diff.head().index]["FILM"]
differently_rated_movies = '\n'.join(differently_rated_movies.values)
In [46]:
print("""The top 5 differently rated movies are: \n{0}""".format(differently_rated_movies))
In [55]:
x = movies["Metacritic_norm_round"]
y = movies["Fandango_Stars"]
correlation, p_value = sps.pearsonr(x, y)
correlation_message = "The correlation between Metacritic and Fandango ratings is {0} for a p-value of {1}"
print(correlation_message.format(correlation, p_value))
=> The correlation is low but significant at the $\alpha=0.05$ level, so the two rating systems share only a weak linear relationship (low correlation alone does not prove independence, and assuming independence here would generally be wrong). To examine this relationship further, the next part fits a linear regression between the two.
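As a side note, the least-squares slope fitted in the next cell is just the Pearson correlation rescaled by the ratio of the standard deviations, $\text{slope} = r \cdot s_y / s_x$. A quick sanity-check sketch (assuming the x and y defined above), not part of the original notebook:

# The regression slope implied by the correlation should match linregress
r, _ = sps.pearsonr(x, y)
print(r * y.std() / x.std())           # slope implied by the correlation
print(sps.linregress(x, y).slope)      # slope from the least-squares fit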
In [65]:
# Fit Fandango_Stars against Metacritic_norm_round; the two discarded values
# are the r-value and the p-value of the regression
slope, intercept, _, _, stderr = sps.linregress(x, y)
In [75]:
def make_prediction(x):
    return slope * x + intercept

metacritic_rating = 3.0
fandango_predicted_rating = make_prediction(metacritic_rating)
prediction_message = "The predicted Fandango rating for a {0} Metacritic rating is: {1}"
print(prediction_message.format(metacritic_rating,
                                fandango_predicted_rating))
In [88]:
fandango_predicted_ratings = list(map(make_prediction, x))
plt.scatter(x, y)
plt.plot(x, fandango_predicted_ratings, color='red')
Out[88]:
In [153]:
# Residuals of the linear fit (predicted minus observed Fandango ratings)
fandango_ratings_residuals = fandango_predicted_ratings - y

# Construct a Gaussian curve using the mean and std of the Fandango
# residuals
min_res = fandango_ratings_residuals.min()
max_res = fandango_ratings_residuals.max()
linear_space = np.linspace(min_res, max_res,
                           len(fandango_ratings_residuals))
mean = fandango_ratings_residuals.mean()
std = fandango_ratings_residuals.std()
normal_residuals = sps.norm.pdf(linear_space, mean, std)
In [154]:
import seaborn as sns

# Freedman-Diaconis rule for the histogram bin width and bin count
def IQR(x):
    return np.ediff1d(x.quantile([0.25, 0.75]))[0]

def optimal_bins_width(x):
    return 2 * IQR(x) / (len(x) ** (1/3))

def optimal_bins_number(x):
    return (x.max() - x.min()) / optimal_bins_width(x)

bins = int(optimal_bins_number(fandango_ratings_residuals))
# distplot is deprecated in recent seaborn versions (histplot/displot replace it)
sns.distplot(fandango_ratings_residuals, bins=bins, color="r")
plt.plot(linear_space, normal_residuals)
Out[154]:
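The bin-width rule implemented above is the Freedman-Diaconis rule; NumPy exposes the same rule directly, so a quick cross-check sketch (not part of the original notebook, assuming the residuals computed above) is:

# NumPy's 'fd' option applies the same Freedman-Diaconis rule
fd_edges = np.histogram_bin_edges(fandango_ratings_residuals, bins="fd")
print(len(fd_edges) - 1)  # number of bins suggested by the rule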
=> The residuals don't seem to be drawn from a Gaussian distribution. Let's test this assumption.
In [159]:
# D'Agostino-Pearson test; the null hypothesis is that the residuals are normal
_, normality_p_value = sps.normaltest(fandango_ratings_residuals)
In [160]:
normality_p_value
Out[160]:
=> Normality of the Fandango residuals is thus rejected at the $\alpha=0.05$ level but not at the $\alpha=0.01$ level.
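To make the decision explicit (the null hypothesis of normaltest is that the sample comes from a normal distribution), a small sketch using the p-value computed above:

# Reject normality when the p-value falls below the chosen significance level
for alpha in (0.05, 0.01):
    decision = "rejected" if normality_p_value < alpha else "not rejected"
    print("Normality is {0} at alpha = {1}".format(decision, alpha))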
In [167]:
# Q-Q plot of the residuals against the theoretical normal quantiles
sps.probplot(fandango_ratings_residuals, dist="norm", plot=plt)
plt.show()