Finding Similar Movies

We'll start by loading up the MovieLens dataset. Using Pandas, we can very quickly load the rows of the u.data and u.item files that we care about, and merge them together so we can work with movie names instead of ID's. (In a real production job, you'd stick with ID's and worry about the names at the display layer to make things more efficient. But this lets us understand what's going on better for now.)


In [1]:
import pandas as pd

r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('e:/sundog-consult/udemy/datascience/ml-100k/u.data', sep='\t', names=r_cols, usecols=range(3), encoding="ISO-8859-1")

m_cols = ['movie_id', 'title']
movies = pd.read_csv('e:/sundog-consult/udemy/datascience/ml-100k/u.item', sep='|', names=m_cols, usecols=range(2), encoding="ISO-8859-1")

ratings = pd.merge(movies, ratings)

In [2]:
ratings.head()


Out[2]:
movie_id title user_id rating
0 1 Toy Story (1995) 308 4
1 1 Toy Story (1995) 287 5
2 1 Toy Story (1995) 148 4
3 1 Toy Story (1995) 280 4
4 1 Toy Story (1995) 66 3

Now the amazing pivot_table function on a DataFrame will construct a user / movie rating matrix. Note how NaN indicates missing data - movies that specific users didn't rate.


In [3]:
movieRatings = ratings.pivot_table(index=['user_id'],columns=['title'],values='rating')
movieRatings.head()


Out[3]:
title 'Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996) 12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996) 20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968) 3 Ninjas: High Noon At Mega Mountain (1998) 39 Steps, The (1935) ... Yankee Zulu (1994) Year of the Horse (1997) You So Crazy (1994) Young Frankenstein (1974) Young Guns (1988) Young Guns II (1990) Young Poisoner's Handbook, The (1995) Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 2.0 5.0 NaN NaN 3.0 4.0 NaN NaN ... NaN NaN NaN 5.0 3.0 NaN NaN NaN 4.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1664 columns

Let's extract a Series of users who rated Star Wars:


In [4]:
starWarsRatings = movieRatings['Star Wars (1977)']
starWarsRatings.head()


Out[4]:
user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Pandas' corrwith function makes it really easy to compute the pairwise correlation of Star Wars' vector of user rating with every other movie! After that, we'll drop any results that have no data, and construct a new DataFrame of movies and their correlation score (similarity) to Star Wars:


In [5]:
similarMovies = movieRatings.corrwith(starWarsRatings)
similarMovies = similarMovies.dropna()
df = pd.DataFrame(similarMovies)
df.head(10)


C:\Users\Frank\AppData\Local\Enthought\Canopy\User\lib\site-packages\numpy\lib\function_base.py:2079: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
Out[5]:
0
title
'Til There Was You (1997) 0.872872
1-900 (1994) -0.645497
101 Dalmatians (1996) 0.211132
12 Angry Men (1957) 0.184289
187 (1997) 0.027398
2 Days in the Valley (1996) 0.066654
20,000 Leagues Under the Sea (1954) 0.289768
2001: A Space Odyssey (1968) 0.230884
39 Steps, The (1935) 0.106453
8 1/2 (1963) -0.142977

(That warning is safe to ignore.) Let's sort the results by similarity score, and we should have the movies most similar to Star Wars! Except... we don't. These results make no sense at all! This is why it's important to know your data - clearly we missed something important.


In [6]:
similarMovies.sort_values(ascending=False)


Out[6]:
title
Star Wars (1977)                                                                     1.000000
Man of the Year (1995)                                                               1.000000
Full Speed (1996)                                                                    1.000000
Mondo (1996)                                                                         1.000000
Line King: Al Hirschfeld, The (1996)                                                 1.000000
Outlaw, The (1943)                                                                   1.000000
Hurricane Streets (1998)                                                             1.000000
Hollow Reed (1996)                                                                   1.000000
Scarlet Letter, The (1926)                                                           1.000000
Safe Passage (1994)                                                                  1.000000
Good Man in Africa, A (1994)                                                         1.000000
Golden Earrings (1947)                                                               1.000000
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)    1.000000
No Escape (1994)                                                                     1.000000
Ed's Next Move (1996)                                                                1.000000
Stripes (1981)                                                                       1.000000
Cosi (1996)                                                                          1.000000
Commandments (1997)                                                                  1.000000
Twisted (1996)                                                                       1.000000
Beans of Egypt, Maine, The (1994)                                                    1.000000
Last Time I Saw Paris, The (1954)                                                    1.000000
Maya Lin: A Strong Clear Vision (1994)                                               1.000000
Designated Mourner, The (1997)                                                       0.970725
Albino Alligator (1996)                                                              0.968496
Angel Baby (1995)                                                                    0.962250
Prisoner of the Mountains (Kavkazsky Plennik) (1996)                                 0.927173
Love in the Afternoon (1957)                                                         0.923381
'Til There Was You (1997)                                                            0.872872
A Chef in Love (1996)                                                                0.868599
Quiet Room, The (1996)                                                               0.866025
                                                                                       ...   
Collectionneuse, La (1967)                                                          -1.000000
Bewegte Mann, Der (1994)                                                            -1.000000
Lamerica (1994)                                                                     -1.000000
Frankie Starlight (1995)                                                            -1.000000
To Have, or Not (1995)                                                              -1.000000
Legal Deceit (1997)                                                                 -1.000000
Slingshot, The (1993)                                                               -1.000000
Swept from the Sea (1997)                                                           -1.000000
For Ever Mozart (1996)                                                              -1.000000
Love and Death on Long Island (1997)                                                -1.000000
Glass Shield, The (1994)                                                            -1.000000
Squeeze (1996)                                                                      -1.000000
Crossfire (1947)                                                                    -1.000000
Neon Bible, The (1995)                                                              -1.000000
American Dream (1990)                                                               -1.000000
Theodore Rex (1995)                                                                 -1.000000
Horse Whisperer, The (1998)                                                         -1.000000
Lover's Knot (1996)                                                                 -1.000000
S.F.W. (1994)                                                                       -1.000000
Fille seule, La (A Single Girl) (1995)                                              -1.000000
Sliding Doors (1998)                                                                -1.000000
Nightwatch (1997)                                                                   -1.000000
Show, The (1995)                                                                    -1.000000
Nil By Mouth (1997)                                                                 -1.000000
Fall (1997)                                                                         -1.000000
I Like It Like That (1994)                                                          -1.000000
Sudden Manhattan (1996)                                                             -1.000000
Salut cousin! (1996)                                                                -1.000000
Tough and Deadly (1995)                                                             -1.000000
Dream Man (1995)                                                                    -1.000000
dtype: float64

Our results are probably getting messed up by movies that have only been viewed by a handful of people who also happened to like Star Wars. So we need to get rid of movies that were only watched by a few people that are producing spurious results. Let's construct a new DataFrame that counts up how many ratings exist for each movie, and also the average rating while we're at it - that could also come in handy later.


In [7]:
import numpy as np
movieStats = ratings.groupby('title').agg({'rating': [np.size, np.mean]})
movieStats.head()


Out[7]:
rating
size mean
title
'Til There Was You (1997) 9 2.333333
1-900 (1994) 5 2.600000
101 Dalmatians (1996) 109 2.908257
12 Angry Men (1957) 125 4.344000
187 (1997) 41 3.024390

Let's get rid of any movies rated by fewer than 100 people, and check the top-rated ones that are left:


In [8]:
popularMovies = movieStats['rating']['size'] >= 100
movieStats[popularMovies].sort_values([('rating', 'mean')], ascending=False)[:15]


Out[8]:
rating
size mean
title
Close Shave, A (1995) 112 4.491071
Schindler's List (1993) 298 4.466443
Wrong Trousers, The (1993) 118 4.466102
Casablanca (1942) 243 4.456790
Shawshank Redemption, The (1994) 283 4.445230
Rear Window (1954) 209 4.387560
Usual Suspects, The (1995) 267 4.385768
Star Wars (1977) 584 4.359589
12 Angry Men (1957) 125 4.344000
Citizen Kane (1941) 198 4.292929
To Kill a Mockingbird (1962) 219 4.292237
One Flew Over the Cuckoo's Nest (1975) 264 4.291667
Silence of the Lambs, The (1991) 390 4.289744
North by Northwest (1959) 179 4.284916
Godfather, The (1972) 413 4.283293

100 might still be too low, but these results look pretty good as far as "well rated movies that people have heard of." Let's join this data with our original set of similar movies to Star Wars:


In [9]:
df = movieStats[popularMovies].join(pd.DataFrame(similarMovies, columns=['similarity']))

In [10]:
df.head()


Out[10]:
(rating, size) (rating, mean) similarity
title
101 Dalmatians (1996) 109 2.908257 0.211132
12 Angry Men (1957) 125 4.344000 0.184289
2001: A Space Odyssey (1968) 259 3.969112 0.230884
Absolute Power (1997) 127 3.370079 0.085440
Abyss, The (1989) 151 3.589404 0.203709

And, sort these new results by similarity score. That's more like it!


In [11]:
df.sort_values(['similarity'], ascending=False)[:15]


Out[11]:
(rating, size) (rating, mean) similarity
title
Star Wars (1977) 584 4.359589 1.000000
Empire Strikes Back, The (1980) 368 4.206522 0.748353
Return of the Jedi (1983) 507 4.007890 0.672556
Raiders of the Lost Ark (1981) 420 4.252381 0.536117
Austin Powers: International Man of Mystery (1997) 130 3.246154 0.377433
Sting, The (1973) 241 4.058091 0.367538
Indiana Jones and the Last Crusade (1989) 331 3.930514 0.350107
Pinocchio (1940) 101 3.673267 0.347868
Frighteners, The (1996) 115 3.234783 0.332729
L.A. Confidential (1997) 297 4.161616 0.319065
Wag the Dog (1997) 137 3.510949 0.318645
Dumbo (1941) 123 3.495935 0.317656
Bridge on the River Kwai, The (1957) 165 4.175758 0.316580
Philadelphia Story, The (1940) 104 4.115385 0.314272
Miracle on 34th Street (1994) 101 3.722772 0.310921

Ideally we'd also filter out the movie we started from - of course Star Wars is 100% similar to itself. But otherwise these results aren't bad.

Activity

100 was an arbitrarily chosen cutoff. Try different values - what effect does it have on the end results?


In [ ]: