In this notebook, I use simple statistical metrics (mean, median, standard deviation, and some quantiles) to analyze different movie rating systems. I also fit a linear regression between two of them and analyze its significance.
In [78]:
import pandas as pd
import numpy as np
import scipy.stats as sps
import matplotlib.pyplot as plt
%matplotlib inline
In [5]:
movies = pd.read_csv('fandango_score_comparison.csv')
In [6]:
movies.describe()
Out[6]:
In [57]:
movies.info()
In [8]:
# Check the movie data structure
movies.head()
Out[8]:
In [9]:
movies["Fandango_Stars"].hist(bins=[1,2,3,4,5])
Out[9]:
In [10]:
movies["Metacritic_norm_round"].hist(bins=[1,2,3,4,5])
Out[10]:
In [11]:
# Mean, standard deviation, median and other quantiles
movies[["Metacritic_norm_round","Fandango_Stars"]].describe()
Out[11]:
=> Compared with the Metacritic distribution, the Fandango ratings are clustered at the high end of the scale. The Fandango rating system therefore looks biased and of lesser value than the Metacritic one.
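To make this impression concrete, here is a small sketch (not part of the original notebook) that quantifies the gap between the two systems directly on the movies DataFrame:

# How often, and by how much, Fandango rates a movie higher than Metacritic
rating_gap = movies["Fandango_Stars"] - movies["Metacritic_norm_round"]
print("Mean gap (Fandango - Metacritic):", rating_gap.mean())
print("Share of movies rated higher on Fandango:", (rating_gap > 0).mean())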
In [14]:
movies.plot(x='Fandango_Stars',
y='Metacritic_norm_round',
kind='scatter')
Out[14]:
In [28]:
# Absolute difference between each movie's Metacritic and Fandango ratings
def abs_diff(row):
    return np.abs(row["Metacritic_norm_round"] - row["Fandango_Stars"])

fm_diff = movies[["Metacritic_norm_round", "Fandango_Stars"]].apply(abs_diff, axis=1)
In [30]:
fm_diff = fm_diff.sort_values(ascending=False)
In [45]:
differently_rated_movies = movies.loc[fm_diff.head().index]["FILM"]
differently_rated_movies = '\n'.join(differently_rated_movies.values)
In [46]:
print("""The top 5 differently rated movies are: \n{0}""".format(differently_rated_movies))
In [55]:
x = movies["Metacritic_norm_round"]
y = movies["Fandango_Stars"]
correlation, p_value = sps.pearsonr(x, y)
correlation_message = "The correlation between Metacritic and Fandango ratings is {0} for a p-value of {1}"
print(correlation_message.format(correlation, p_value))
=> The correlation is low but significant at the $\alpha=0.05$ level, so the two rating systems share only a weak linear relationship (low correlation alone does not prove independence, and assuming independence here would generally be wrong). To examine this relationship further, the next part fits a linear regression between the two.
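As a side note, the least-squares slope fitted in the next cell is just the Pearson correlation rescaled by the ratio of the standard deviations, $\text{slope} = r \cdot s_y / s_x$. A quick sanity-check sketch (assuming the x and y defined above), not part of the original notebook:

# The regression slope implied by the correlation should match linregress
r, _ = sps.pearsonr(x, y)
print(r * y.std() / x.std())           # slope implied by the correlation
print(sps.linregress(x, y).slope)      # slope from the least-squares fit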
In [65]:
# Fit Fandango_Stars against Metacritic_norm_round; the two discarded values
# are the r-value and the p-value of the regression
slope, intercept, _, _, stderr = sps.linregress(x, y)
In [75]:
def make_prediction(x):
    return slope * x + intercept

metacritic_rating = 3.0
fandango_predicted_rating = make_prediction(metacritic_rating)
prediction_message = "The predicted Fandango rating for a {0} Metacritic rating is: {1}"
print(prediction_message.format(metacritic_rating,
                                fandango_predicted_rating))
In [88]:
fandango_predicted_ratings = list(map(make_prediction, x))
plt.scatter(x, y)
plt.plot(x, fandango_predicted_ratings, color='red')
Out[88]:
In [153]:
# Residuals of the linear fit (predicted minus observed Fandango ratings)
fandango_ratings_residuals = fandango_predicted_ratings - y

# Construct a Gaussian curve using the mean and std of the Fandango
# residuals
min_res = fandango_ratings_residuals.min()
max_res = fandango_ratings_residuals.max()
linear_space = np.linspace(min_res, max_res,
                           len(fandango_ratings_residuals))
mean = fandango_ratings_residuals.mean()
std = fandango_ratings_residuals.std()
normal_residuals = sps.norm.pdf(linear_space, mean, std)
In [154]:
import seaborn as sns

# Freedman-Diaconis rule for the histogram bin width and bin count
def IQR(x):
    return np.ediff1d(x.quantile([0.25, 0.75]))[0]

def optimal_bins_width(x):
    return 2 * IQR(x) / (len(x) ** (1/3))

def optimal_bins_number(x):
    return (x.max() - x.min()) / optimal_bins_width(x)

bins = int(optimal_bins_number(fandango_ratings_residuals))
# distplot is deprecated in recent seaborn versions (histplot/displot replace it)
sns.distplot(fandango_ratings_residuals, bins=bins, color="r")
plt.plot(linear_space, normal_residuals)
Out[154]:
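The bin-width rule implemented above is the Freedman-Diaconis rule; NumPy exposes the same rule directly, so a quick cross-check sketch (not part of the original notebook, assuming the residuals computed above) is:

# NumPy's 'fd' option applies the same Freedman-Diaconis rule
fd_edges = np.histogram_bin_edges(fandango_ratings_residuals, bins="fd")
print(len(fd_edges) - 1)  # number of bins suggested by the rule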
=> The residuals don't seem to be drawn from a Gaussian distribution. Let's test this assumption.
In [159]:
# D'Agostino-Pearson test; the null hypothesis is that the residuals are normal
_, normality_p_value = sps.normaltest(fandango_ratings_residuals)
In [160]:
normality_p_value
Out[160]:
=> Normality of the Fandango residuals is thus rejected at the $\alpha=0.05$ level but not at the $\alpha=0.01$ level.
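To make the decision explicit (the null hypothesis of normaltest is that the sample comes from a normal distribution), a small sketch using the p-value computed above:

# Reject normality when the p-value falls below the chosen significance level
for alpha in (0.05, 0.01):
    decision = "rejected" if normality_p_value < alpha else "not rejected"
    print("Normality is {0} at alpha = {1}".format(decision, alpha))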
In [167]:
# Q-Q plot of the residuals against the theoretical normal quantiles
sps.probplot(fandango_ratings_residuals, dist="norm", plot=plt)
plt.show()