# Non-Personalized Recommenders Assignment

## Overview

This assignment will explore non-personalized recommendations. You will be given a 20x20 matrix where columns represent movies, rows represent users, and each cell represents a user-movie rating.

## Deliverables

There are 4 deliverables for this assignment. Each deliverable represents a different analysis of the data provided to you. For each deliverable, you will submit a list of the top 5 movies as ranked by a particular metric. The 4 metrics are:

1. Mean Rating: Calculate the mean rating for each movie, order with the highest rating listed first, and submit the top 5.
2. % of ratings 4+: Calculate the percentage of ratings for each movie that are 4 or higher. Order with the highest percentage first, and submit the top 5.
3. Rating Count: Count the number of ratings for each movie, order with the largest count first, and submit the top 5.
4. Top 5 Star Wars: Calculate movies that most often occur with Star Wars: Episode IV - A New Hope (1977) using the (x+y)/x method described in class. In other words, for each movie, calculate the percentage of Star Wars raters who also rated that movie. Order with the highest percentage first, and submit the top 5.

## Importing Libraries

``````

In [114]:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

``````

``````

In [115]:

``````
``````

In [116]:

# Looking at the first 5 rows of the dataframe
movie_data.head()

``````
``````

Out[116]:

Column headers are abbreviated to movie IDs.

   User   260  1210   356   318   593  3578     1  2028   296   ...  2396  2916   780   541  1265  2571   527  2762  1198    34
0   755     1     5     2   NaN     4     4     2     2   NaN   ...     2   NaN     5     2   NaN     4     2     5   NaN   NaN
1  5277     5     3   NaN     2     4     2     1   NaN   NaN   ...     3     2     2   NaN     2   NaN     5     1     3   NaN
2  1577   NaN   NaN   NaN     5     2   NaN     4   NaN   NaN   ...   NaN     1     4     4     1     1     2     3     1     3
3  4388   NaN     3   NaN   NaN   NaN     1     2     3     4   ...   NaN     4     1     3     5   NaN     5     1     1     2
4  1202     4     3     4     1     4     1   NaN     4   NaN   ...     5     1   NaN     4   NaN     3     5     5   NaN   NaN

5 rows × 21 columns

``````
``````

In [117]:

#printing the column names of the dataframe
movie_data.columns

``````
``````

Out[117]:

Index([u'User', u'260: Star Wars: Episode IV - A New Hope (1977)',
u'1210: Star Wars: Episode VI - Return of the Jedi (1983)',
u'356: Forrest Gump (1994)', u'318: Shawshank Redemption, The (1994)',
u'593: Silence of the Lambs, The (1991)', u'3578: Gladiator (2000)',
u'1: Toy Story (1995)', u'2028: Saving Private Ryan (1998)',
u'296: Pulp Fiction (1994)', u'1259: Stand by Me (1986)',
u'2396: Shakespeare in Love (1998)', u'2916: Total Recall (1990)',
u'780: Independence Day (ID4) (1996)', u'541: Blade Runner (1982)',
u'1265: Groundhog Day (1993)', u'2571: Matrix, The (1999)',
u"527: Schindler's List (1993)", u'2762: Sixth Sense, The (1999)',
u'1198: Raiders of the Lost Ark (1981)', u'34: Babe (1995)'],
dtype='object')

``````
``````

In [118]:

# Summarizing the data in the movie_data dataframe
movie_data.describe()

``````
``````

Out[118]:

Column headers are abbreviated to movie IDs.

             User         260        1210         356         318         593        3578           1        2028         296  \
count   20.000000   15.000000   14.000000   10.000000   10.000000    16.00000   12.000000   17.000000   11.000000   11.000000
mean  3658.100000    3.266667    3.000000    2.700000    3.600000     3.06250    2.916667    2.823529    3.000000    3.000000
std   1749.716756    1.387015    1.467599    1.337494    1.646545     1.28938    1.564279    1.131111    1.414214    1.183216
min    139.000000    1.000000    1.000000    1.000000    1.000000     1.00000    1.000000    1.000000    1.000000    1.000000
25%   2558.750000    2.000000    2.000000    2.000000    2.500000     2.00000    1.750000    2.000000    2.000000    2.000000
50%   4252.500000    4.000000    3.000000    2.500000    4.000000     3.00000    3.000000    2.000000    3.000000    3.000000
75%   4916.250000    4.000000    4.000000    3.750000    5.000000     4.00000    4.000000    4.000000    4.000000    4.000000
max   6037.000000    5.000000    5.000000    5.000000    5.000000     5.00000    5.000000    5.000000    5.000000    5.000000

       ...        2396        2916         780         541        1265        2571         527        2762        1198          34
count  ...   11.000000   12.000000   13.000000    9.000000   12.000000   12.000000   12.000000   12.000000   11.000000   10.000000
mean   ...    2.909091    1.916667    2.769231    3.222222    3.166667    2.833333    3.000000    2.833333    2.909091    3.000000
std    ...    1.513575    0.996205    1.235168    1.092906    1.585923    1.527525    1.595448    1.642245    1.578261    1.414214
min    ...    1.000000    1.000000    1.000000    2.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000
25%    ...    2.000000    1.000000    2.000000    2.000000    2.000000    1.750000    2.000000    1.000000    1.500000    2.000000
50%    ...    3.000000    2.000000    3.000000    3.000000    3.000000    2.500000    2.500000    3.000000    3.000000    2.500000
75%    ...    4.000000    2.250000    4.000000    4.000000    5.000000    4.000000    5.000000    4.250000    4.000000    4.000000
max    ...    5.000000    4.000000    5.000000    5.000000    5.000000    5.000000    5.000000    5.000000    5.000000    5.000000

8 rows × 21 columns

``````

## Non-Personalized Recommenders for Raiders of the Lost Ark

``````

In [119]:

# Storing the "1198: Raiders of the Lost Ark (1981)" ratings column as a pandas Series
raid_lost_arc = movie_data["1198: Raiders of the Lost Ark (1981)"]
raid_lost_arc

``````
``````

Out[119]:

0    NaN
1      3
2      1
3      1
4    NaN
5    NaN
6      5
7      5
8    NaN
9    NaN
10     1
11   NaN
12     5
13   NaN
14     3
15     3
16   NaN
17     2
18   NaN
19     3
Name: 1198: Raiders of the Lost Ark (1981), dtype: float64

``````

Mean rating for Raiders of the Lost Ark (1981)

``````

In [120]:

print('%.2f' % raid_lost_arc.mean())

``````
``````

2.91

``````

Number of non-NA ratings for Raiders of the Lost Ark (1981)

``````

In [121]:

raid_lost_arc.count()

``````
``````

Out[121]:

11

``````

Percentage of ratings >=4 for Raiders of the Lost Ark (1981)

``````

In [122]:

print('%.1f' % ((len(raid_lost_arc[raid_lost_arc >= 4]) / float(raid_lost_arc.count())) * 100.0))

``````
``````

27.3

``````

Finding the association of Raiders of the Lost Ark (1981) with Star Wars Episode IV. The association is the number of users who rated BOTH Raiders of the Lost Ark (1981) and Star Wars Episode IV, divided by the number of users who rated Star Wars Episode IV.

``````

In [123]:

# First, storing the Star Wars count
star_wars_count = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count()

``````
``````

In [124]:

# Then multiply the Raiders of the Lost Ark and Star Wars columns element-wise.
# A product is non-NA only when the user rated both movies, so count() on the
# product gives the number of users who rated both.
rad_arc_star_wars_count = (movie_data["1198: Raiders of the Lost Ark (1981)"]*movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]).count()

``````

Printing the Association of Raiders of the Lost Ark (1981) and Star Wars Episode IV

``````

In [125]:

print('%.1f' % ((rad_arc_star_wars_count / float(star_wars_count)) * 100.0))

``````
``````

46.7

``````
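The multiplication trick works because any product involving NaN is itself NaN. The same co-rating count can also be read directly from boolean masks. A minimal sketch, assuming a tiny made-up table (the `toy` frame and the columns `A` and `B` are hypothetical, not the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: one row per user, NaN means "not rated".
toy = pd.DataFrame({
    "A": [5, np.nan, 3, 4, np.nan],
    "B": [4, 2, np.nan, 1, np.nan],
})

# Users who rated both A and B: both cells are non-missing.
both_rated = (toy["A"].notnull() & toy["B"].notnull()).sum()

# Same count via the product trick: NaN times anything stays NaN.
both_rated_product = (toy["A"] * toy["B"]).count()

# Share of A's raters who also rated B.
association = both_rated / float(toy["A"].count())
print('%.1f' % (association * 100.0))
```

Both approaches should agree; the mask version just states the "rated both" condition explicitly.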

## Finding top 5 movies with the highest ratings

Making a pandas Series indexed by movie title, with each entry equal to that movie's mean rating. The column names of the movie_data dataframe are sliced with [1:] because the first column is the user ID.

``````

In [126]:

rating_means = pd.Series([movie_data[col_name].mean() for col_name in movie_data.columns[1:]],
index=movie_data.columns[1:])

``````

Printing the top 5 rated movies

``````

In [127]:

rating_means.sort_values(ascending=False)[0:5]

``````
``````

Out[127]:

318: Shawshank Redemption, The (1994)             3.600000
260: Star Wars: Episode IV - A New Hope (1977)    3.266667
541: Blade Runner (1982)                          3.222222
1265: Groundhog Day (1993)                        3.166667
593: Silence of the Lambs, The (1991)             3.062500
dtype: float64

``````
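The per-column loop can also be written with a column-wise reduction. A sketch on a hypothetical table (`toy` and the movie columns `X`, `Y`, `Z` are made up for illustration); `DataFrame.mean` skips NaN by default, so unrated cells do not pull the average down:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same layout: a user ID column, then movies.
toy = pd.DataFrame({
    "User": [11, 12, 13, 14],
    "X": [5, 4, np.nan, 3],
    "Y": [2, np.nan, np.nan, 4],
    "Z": [1, 2, 3, np.nan],
})

# Drop the user ID column so only rating columns are averaged.
ratings = toy.drop("User", axis=1)

# Mean rating per movie; NaN cells are ignored.
movie_means = ratings.mean()

# Highest mean first, keep the top 2 (top 5 in the assignment).
top = movie_means.sort_values(ascending=False).head(2)
print(top)
```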

## Finding top 5 movies with the most ratings

Making a pandas Series indexed by movie title, with each entry equal to the number of non-NA ratings for that movie. The column names of the movie_data dataframe are sliced with [1:] because the first column is the user ID.

``````

In [128]:

rating_count = pd.Series([movie_data[col_name].count() for col_name in movie_data.columns[1:]],
index=movie_data.columns[1:])

``````

Printing the top 5 movies with the most ratings

``````

In [129]:

rating_count.sort_values(ascending=False)[0:5]

``````
``````

Out[129]:

1: Toy Story (1995)                                        17
593: Silence of the Lambs, The (1991)                      16
260: Star Wars: Episode IV - A New Hope (1977)             15
1210: Star Wars: Episode VI - Return of the Jedi (1983)    14
780: Independence Day (ID4) (1996)                         13
dtype: int64

``````
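`DataFrame.count` tallies the non-missing cells in every column at once, so the counts need no explicit loop. A short sketch, again on a hypothetical table (`toy`, `X`, `Y`, `Z` are illustrative names only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN means the user did not rate the movie.
toy = pd.DataFrame({
    "X": [5, 4, np.nan, 3],
    "Y": [2, np.nan, np.nan, np.nan],
    "Z": [1, 2, 3, 4],
})

# Number of non-NA ratings per movie.
counts = toy.count()

# Most-rated movies first.
ranked = counts.sort_values(ascending=False)
print(ranked)
```

With the default sort algorithm, movies with tied counts may come out in either order, which matters if a tie straddles the top-5 cutoff.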

## Top 5 movies with Percentage of ratings >=4

Making a pandas Series indexed by movie title, with each entry equal to the fraction of that movie's non-NA ratings that are 4 or higher. The column names of the movie_data dataframe are sliced with [1:] because the first column is the user ID.

``````

In [130]:

rating_positive = pd.Series([sum(movie_data[col_name]>=4)/float(movie_data[col_name].count()) for col_name in movie_data.columns[1:]],
index=movie_data.columns[1:])

``````

Printing Top 5 movies with Percentage of ratings >=4

``````

In [131]:

rating_positive.sort_values(ascending=False)[0:5]

``````
``````

Out[131]:

318: Shawshank Redemption, The (1994)             0.700000
260: Star Wars: Episode IV - A New Hope (1977)    0.533333
593: Silence of the Lambs, The (1991)             0.437500
dtype: float64

``````
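The comparison `>= 4` evaluates to False for NaN, so unrated cells never count as positive ratings, while `count()` in the denominator restricts the share to users who actually rated the movie. A sketch of the same calculation done column-wise, on a hypothetical table (`toy`, `X`, `Y`, `Z` are made-up names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; NaN means "not rated".
toy = pd.DataFrame({
    "X": [5, 4, np.nan, 3],
    "Y": [2, np.nan, np.nan, 4],
    "Z": [1, 2, 3, np.nan],
})

# Number of ratings that are 4 or higher, per movie (NaN >= 4 is False).
high = (toy >= 4).sum()

# Fraction of each movie's actual ratings that are 4 or higher.
share_high = high / toy.count()
print(share_high.sort_values(ascending=False))
```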

## Top 5 movies most similar to Star Wars (movie ID = 260)

``````

In [132]:

# First, storing the Star Wars ratings and the count of non-NA Star Wars ratings
star_wars_rat = movie_data["260: Star Wars: Episode IV - A New Hope (1977)"]
star_wars_count = float(movie_data["260: Star Wars: Episode IV - A New Hope (1977)"].count())
print(star_wars_count)

``````
``````

15.0

``````

Finding the association of every other movie with Star Wars Episode IV. The association is the number of users who rated BOTH movie i and Star Wars Episode IV, divided by the number of users who rated Star Wars Episode IV. Below, the loop runs over the columns from [2:], which skips the user ID column and Star Wars Episode IV itself.

``````

In [133]:

sim_val = pd.Series( [ (movie_data[col_name]*star_wars_rat).count()/star_wars_count
for col_name in movie_data.columns[2:] ], index=movie_data.columns[2:] )

``````

Printing the top 5 movies most similar to Star Wars (movie ID = 260)

``````

In [134]:

sim_val.sort_values(ascending=False)[0:5]

``````
``````

Out[134]:

1: Toy Story (1995)                                        0.933333
1210: Star Wars: Episode VI - Return of the Jedi (1983)    0.866667
593: Silence of the Lambs, The (1991)                      0.800000
780: Independence Day (ID4) (1996)                         0.733333
2916: Total Recall (1990)                                  0.666667
dtype: float64

``````
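The association can also be computed for all movies at once from a boolean "was rated" table: keep only the rows of users who rated the reference movie, then take the column means of the remaining flags. A minimal sketch under made-up data (`toy`, the reference column `Ref`, and the columns `X` and `Y` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; "Ref" plays the role of the reference movie.
toy = pd.DataFrame({
    "Ref": [5, 4, np.nan, 3, 2],
    "X": [1, np.nan, 2, 3, 4],
    "Y": [np.nan, 2, 5, np.nan, np.nan],
})

# True where a user rated the movie, False where the cell is NaN.
rated = toy.notnull()

# Keep only users who rated the reference movie.
ref_raters = rated[rated["Ref"]]

# Mean of a True/False column is the fraction of True values, i.e. the
# share of reference raters who also rated each other movie.
association = ref_raters.drop("Ref", axis=1).mean()
print(association.sort_values(ascending=False))
```

Dropping the reference column before taking the mean plays the same role as slicing the columns with [2:] in the loop version.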