Analysis of the MovieLens dataset using pandas

I created this notebook for the mini project of the edX course DSE200x - Python for Data Science. The project requires each participant to complete the following steps:
  • Selecting a dataset
  • Exploring the dataset to identify what kinds of questions can be answered using the dataset
  • Identifying one research question
  • Using pandas methods to explore the dataset, including visualization with matplotlib
  • Reporting findings/analyses
  • Presenting the work in the given presentation template

Selecting a dataset

The mini project requires us to choose from among three datasets that were explored earlier in the course. I have selected the MovieLens dataset, also known as the IMDB Movie Dataset.

The dataset is available for download here - https://grouplens.org/datasets/movielens/20m/

The description of the dataset, as shown on the website, is below:

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files, genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.


In [2]:
# The first step is to import the dataset into a pandas dataframe. 

import pandas as pd

#path = 'C:/Users/hrao/Documents/Personal/HK/Python/ml-20m/ml-20m/'
path = '/Users/Harish/Documents/HK_Work/Python/ml-20m/'

movies = pd.read_csv(path+'movies.csv')
movies.shape


Out[2]:
(27278, 3)

In [3]:
tags = pd.read_csv(path+'tags.csv')
tags.shape


Out[3]:
(465564, 4)

In [4]:
ratings = pd.read_csv(path+'ratings.csv')
ratings.shape


Out[4]:
(20000263, 4)

In [5]:
links = pd.read_csv(path+'links.csv')
links.shape


Out[5]:
(27278, 3)

Exploring the dataset

Identifying the questions that can be answered using the dataset


In [6]:
movies.head()


Out[6]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

In [7]:
tags.head()


Out[7]:
userId movieId tag timestamp
0 18 4141 Mark Waters 1240597180
1 65 208 dark hero 1368150078
2 65 353 dark hero 1368150079
3 65 521 noir thriller 1368149983
4 65 592 dark hero 1368150078

In [8]:
ratings.head()


Out[8]:
userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580

In [9]:
links.head()


Out[9]:
movieId imdbId tmdbId
0 1 114709 862.0
1 2 113497 8844.0
2 3 113228 15602.0
3 4 114885 31357.0
4 5 113041 11862.0
Based on the above exploratory commands, I believe that the following questions can be answered using the dataset:
  1. Is there a correlation or a trend between the year of release of a movie and the genre?
  2. Which genres were more dominant in each decade of the range available in the dataset?
  3. Do science fiction movies tend to be rated more highly than other movie genres?
For the mini-project, I have chosen question 3 for further analysis.

Using pandas methods to explore the dataset

Includes matplotlib visualization


In [10]:
# List of genres as a Python list 

genres = ['Action','Adventure','Animation','Children','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']

In [11]:
genres_rating_list = []

In [12]:
# The loop reads each element of the genres list above.
# For each genre:
#   - the movies whose 'genres' string contains that genre are selected from the movies data frame
#   - this selection is merged with the ratings data frame to get the ratings for that genre
#   - the mean of the merged ratings is computed and rounded to 2 decimals
#   - the mean rating is appended to genres_rating_list
# The entire loop takes long on 20 million ratings - it can certainly be optimized for performance

for genre in genres:
    genre_filter = movies['genres'].str.contains(genre, regex=False)
    genre_movies = movies[genre_filter]
    genre_ratings = genre_movies.merge(ratings, on='movieId', how='inner')
    genres_rating_list.append(round(genre_ratings['rating'].mean(), 2))

In [13]:
df = {'Genre':genres, 'Genres Mean Rating':genres_rating_list}

In [14]:
genres_rating = pd.DataFrame(df)

In [15]:
genres_rating


Out[15]:
Genre Genres Mean Rating
0 Action 3.44
1 Adventure 3.50
2 Animation 3.62
3 Children 3.41
4 Comedy 3.43
5 Crime 3.67
6 Documentary 3.74
7 Drama 3.67
8 Fantasy 3.51
9 Film-Noir 3.97
10 Horror 3.28
11 Musical 3.56
12 Mystery 3.66
13 Romance 3.54
14 Sci-Fi 3.44
15 Thriller 3.51
16 War 3.81
17 Western 3.57
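
The per-genre loop above merges the ratings once per genre, which is why it is slow. One possible optimization is to split the genres string into one row per genre, merge with the ratings once, and group by genre. The sketch below runs on a tiny made-up sample with the same column names; mean_rating_by_genre, movies_demo and ratings_demo are illustrative names, and I have not timed this approach on the full 20 million ratings.

```python
import pandas as pd

# Made-up stand-ins for movies.csv and ratings.csv (same column names, invented values)
movies_demo = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (1995)', 'B (1996)', 'C (1997)'],
    'genres': ['Comedy|Sci-Fi', 'Sci-Fi', 'Comedy'],
})
ratings_demo = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'movieId': [1, 2, 1, 3],
    'rating': [4.0, 3.0, 2.0, 5.0],
})

def mean_rating_by_genre(movies_df, ratings_df):
    """Return the mean rating per genre, rounded to 2 decimals."""
    # One row per (movie, genre) pair
    per_genre = (movies_df[['movieId', 'genres']]
                 .assign(genres=movies_df['genres'].str.split('|'))
                 .explode('genres'))
    # A single merge attaches every rating to each genre of its movie
    merged = per_genre.merge(ratings_df[['movieId', 'rating']],
                             on='movieId', how='inner')
    return merged.groupby('genres')['rating'].mean().round(2)

genre_means = mean_rating_by_genre(movies_demo, ratings_demo)
print(genre_means)
```

On the real data this would also produce rows for any genre labels that are not in the genres list above, so the result may need to be filtered with .loc[genres] before comparing it with the table.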

In [16]:
genres_rating['Genres Standard Deviation'] = genres_rating['Genres Mean Rating'].std()

In [17]:
genres_rating['Mean'] = genres_rating['Genres Mean Rating'].mean()
genres_rating['Zero'] = 0

In [18]:
genres_rating


Out[18]:
Genre Genres Mean Rating Genres Standard Deviation Mean Zero
0 Action 3.44 0.163244 3.573889 0
1 Adventure 3.50 0.163244 3.573889 0
2 Animation 3.62 0.163244 3.573889 0
3 Children 3.41 0.163244 3.573889 0
4 Comedy 3.43 0.163244 3.573889 0
5 Crime 3.67 0.163244 3.573889 0
6 Documentary 3.74 0.163244 3.573889 0
7 Drama 3.67 0.163244 3.573889 0
8 Fantasy 3.51 0.163244 3.573889 0
9 Film-Noir 3.97 0.163244 3.573889 0
10 Horror 3.28 0.163244 3.573889 0
11 Musical 3.56 0.163244 3.573889 0
12 Mystery 3.66 0.163244 3.573889 0
13 Romance 3.54 0.163244 3.573889 0
14 Sci-Fi 3.44 0.163244 3.573889 0
15 Thriller 3.51 0.163244 3.573889 0
16 War 3.81 0.163244 3.573889 0
17 Western 3.57 0.163244 3.573889 0

In [19]:
overall_mean = round(genres_rating['Genres Mean Rating'].mean(), 2)
overall_std = round(genres_rating['Genres Mean Rating'].std(),2)
scifi_rating = genres_rating[genres_rating['Genre'] == 'Sci-Fi']['Genres Mean Rating']

In [20]:
print(overall_mean)


3.57

In [21]:
print(overall_std)


0.16

In [22]:
print(scifi_rating)


14    3.44
Name: Genres Mean Rating, dtype: float64

In [23]:
genres_rating['Diff from Mean'] = genres_rating['Genres Mean Rating'] - overall_mean

In [24]:
genres_rating


Out[24]:
Genre Genres Mean Rating Genres Standard Deviation Mean Zero Diff from Mean
0 Action 3.44 0.163244 3.573889 0 -0.13
1 Adventure 3.50 0.163244 3.573889 0 -0.07
2 Animation 3.62 0.163244 3.573889 0 0.05
3 Children 3.41 0.163244 3.573889 0 -0.16
4 Comedy 3.43 0.163244 3.573889 0 -0.14
5 Crime 3.67 0.163244 3.573889 0 0.10
6 Documentary 3.74 0.163244 3.573889 0 0.17
7 Drama 3.67 0.163244 3.573889 0 0.10
8 Fantasy 3.51 0.163244 3.573889 0 -0.06
9 Film-Noir 3.97 0.163244 3.573889 0 0.40
10 Horror 3.28 0.163244 3.573889 0 -0.29
11 Musical 3.56 0.163244 3.573889 0 -0.01
12 Mystery 3.66 0.163244 3.573889 0 0.09
13 Romance 3.54 0.163244 3.573889 0 -0.03
14 Sci-Fi 3.44 0.163244 3.573889 0 -0.13
15 Thriller 3.51 0.163244 3.573889 0 -0.06
16 War 3.81 0.163244 3.573889 0 0.24
17 Western 3.57 0.163244 3.573889 0 0.00
Now that we have a data frame with each genre and its mean rating, we will visualize the data using matplotlib.

In [25]:
genre_list = list(genres_rating['Genre'])

In [26]:
genres_rating_list = list(genres_rating['Genres Mean Rating'])
genres_diff_list = list(genres_rating['Diff from Mean'])

In [27]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))

ax1 = plt.subplot(2,1,1)
x = [x for x in range(0, 18)]
xticks_genre_list = genre_list
y = genres_rating_list
plt.xticks(range(len(x)), xticks_genre_list)
plt.scatter(x,y, color='g')
plt.plot(x, genres_rating['Mean'], color="red")
plt.autoscale(tight=True)
#plt.rcParams["figure.figsize"] = (10,2)
plt.title('Movie ratings by genre')
plt.xlabel('Genre')
plt.ylabel('Rating')
plt.ylim(ymax = 4, ymin = 3)
plt.grid(True)
plt.annotate("Sci-Fi Rating",
            xy=(14.25,3.5), xycoords='data',
            xytext=(14.20, 3.7), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )

for i,j in enumerate( y ):
    ax1.annotate( j, ( x[i] + 0.03, y[i] + 0.02))

# Save only after the annotations are drawn so that they appear in the image
plt.savefig(r'movie-ratings-by-genre.png')

ax2 = plt.subplot(2,1,2)
x = [x for x in range(0, 18)]
xticks_genre_list = genre_list
y = genres_rating['Diff from Mean']
plt.xticks(range(len(x)), xticks_genre_list)
plt.plot(x,y)
plt.plot(x, genres_rating['Zero'])
plt.autoscale(tight=True)
#plt.rcParams["figure.figsize"] = (10,2)
plt.title('Deviation of each genre\'s rating from the overall mean rating')
plt.xlabel('Genre')
plt.ylabel('Deviation from mean rating')
plt.grid(True)
plt.annotate("Sci-Fi Rating",
            xy=(14,-0.13), xycoords='data',
            xytext=(14.00, 0.0), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )

# Save only after the annotation is drawn so that it appears in the image
plt.savefig(r'deviation-from-mean-rating.png')


plt.show()


Reporting findings/analyses

Now that we have a couple of plots, let us revisit the question we want to answer using the dataset.
Again, the question is - Do science fiction movies tend to be rated more highly than other movie genres?
  • The scatter plot shows the mean rating for each genre, and the red line marks the mean of the 18 genre means. Let us now see if the plot is able to help us answer the question above.

  • The mean rating for the Sci-Fi genre is 3.44. Only three of the 18 genres have lower mean ratings than Sci-Fi - Horror (3.28), Children (3.41) and Comedy (3.43). Action ties with Sci-Fi at 3.44, and the remaining 13 genres have higher mean ratings.

  • This gives us enough information to answer the question. Sci-Fi movies do not tend to be rated higher than other genres.

  • The second plot, a line plot, shows how much each genre's mean rating deviates from the overall mean. Sci-Fi is about 0.13 below the overall mean of 3.57, a smaller deviation than Horror (-0.29) at the lower end and Film-Noir (+0.40) at the higher end.

To conclude - no, science fiction movies are not rated higher than other movie genres. The mean rating for science fiction movies sits slightly below the mean across all genres, within one standard deviation (0.16) of it.
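
The counts and the deviation quoted in the findings can be re-derived from the table of genre means instead of being read off the plots. This is a minimal sketch with the values typed in from Out[15]; the variable names are illustrative, and the z-score is only a rough indicator because it treats the 18 genre means as the sample.

```python
import pandas as pd

# Genre means copied from the genres_rating table (Out[15]) in this notebook
genre_means = pd.Series({
    'Action': 3.44, 'Adventure': 3.50, 'Animation': 3.62, 'Children': 3.41,
    'Comedy': 3.43, 'Crime': 3.67, 'Documentary': 3.74, 'Drama': 3.67,
    'Fantasy': 3.51, 'Film-Noir': 3.97, 'Horror': 3.28, 'Musical': 3.56,
    'Mystery': 3.66, 'Romance': 3.54, 'Sci-Fi': 3.44, 'Thriller': 3.51,
    'War': 3.81, 'Western': 3.57,
})

scifi = genre_means['Sci-Fi']

# How many genres rate lower than, equal to, and higher than Sci-Fi
n_lower = int((genre_means < scifi).sum())
n_tied = int((genre_means == scifi).sum()) - 1   # exclude Sci-Fi itself
n_higher = int((genre_means > scifi).sum())

# Distance of Sci-Fi from the mean of genre means, in standard deviations
z_score = (scifi - genre_means.mean()) / genre_means.std()

print(n_lower, n_tied, n_higher, round(z_score, 2))
```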

I have submitted my work to the mini project section of the course. Now, we will explore the dataset further and try to answer the remaining questions I have listed at the beginning of the notebook.

- Is there a correlation or a trend between the year of release of a movie and the genre?
- Which genres were more dominant in each decade of the range available in the dataset?

In [160]:
# extract the year of release of each movie from the title column
# (the extracted values are still strings at this point; they are converted in a later cell)

import numpy as np
import re 

movies['movie_year'] = movies['title'].str.extract(r"\(([0-9]+)\)", expand=False)

# creating a new column with just the movie titles
movies['title_only'] = movies['title'].str.extract(r'(.*?)\s*\(', expand=False)

In [161]:
movies['movie_year'] = movies['movie_year'].fillna(0)

In [162]:
#Drop all rows containing incorrect year values - such as 0, 6, 69, 500 and -2147483648
# (checked both as strings and as numbers, since the column can hold either at this point)
bad_years = ['0', '6', '06', '69', '500', '-2147483648', 0, 6, 69, 500, -2147483648]
movies.drop(movies[movies['movie_year'].isin(bad_years)].index, inplace=True)
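
An alternative to dropping bad values one by one is to extract only a four-digit number in parentheses at the end of the title, so that values such as 6, 69 or 500 never reach the column. The sketch below uses a few sample titles, some of them made up; the stricter pattern is my assumption about how years appear in the titles, not something stated in the dataset documentation.

```python
import pandas as pd

# Sample titles: the last two are invented to show what gets rejected
titles = pd.Series([
    'Toy Story (1995)',
    'Seven (a.k.a. Se7en) (1995)',
    'Hypothetical Title (69)',
    'Title Without Year',
])

# Capture exactly four digits inside the final pair of parentheses
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Titles that did not match come back as NaN and can be dropped in one step
valid_years = years.dropna()

print(years.tolist())
print(len(valid_years))
```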

In [163]:
#convert the year strings to datetime values (January 1 of each year)
movies['movie_year'] = pd.to_datetime(movies['movie_year'], format='%Y')

Now that we have a movie year column, let us list the data types of the columns in the movies data frame.

movie_year is of float64 data type. We must convert the data type of the movie_year column to int64. Before we go ahead and do that, we must replace all NULL and infinite entries in the column with zero. If we skip this step, the conversion fails with an error.
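
The float64-to-int64 step described above can be sketched as follows. This assumes the year strings were parsed with pd.to_numeric (which yields float64 when values are missing) rather than with pd.to_datetime as in the cell above; the sample values and variable names are made up.

```python
import numpy as np
import pandas as pd

# Year strings with one missing value, parsed as numbers -> float64
year_float = pd.to_numeric(pd.Series(['1995', '1996', np.nan]))

# A direct cast fails because NaN has no integer representation
try:
    year_float.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# Replacing the missing values with zero first makes the cast succeed
year_int = year_float.fillna(0).astype('int64')

print(year_float.dtype, cast_failed, year_int.tolist())
```

Rows with a year of 0 would still need to be dropped afterwards, as is done for the other incorrect year values.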


In [164]:
movie_year = pd.DataFrame(movies['title_only'].groupby(movies['movie_year']).count())

In [165]:
movie_year.reset_index(inplace=True)

In [166]:
X=movie_year['movie_year']
Y=movie_year['title_only']

In [167]:
plt.figure(figsize=(15, 5))   # set the size before plotting so that it applies to this figure
plt.plot(X, Y, 'bo-')
plt.grid(True)
plt.title('Number of movies per year')
plt.xlabel('Years')
plt.ylabel('Number of movies')
plt.xlim(pd.Timestamp('1885-01-01'), pd.Timestamp('2020-01-01'))
plt.show()


The above plot provides some interesting insight:

  • The number of movies per year increased steadily from 1930 until 2008.
  • In this dataset, 2009 is the year with the highest number of movies - 1112 in all.
  • The decades between 1970 and 2000 saw the highest year-on-year increase in the number of movies.
  • 2014 saw a sharp drop in the number of movies, from 1011 in 2013 to only 740.
  • The movie count for 2015 is only 120. This is probably because the dataset only covers activity up to March 31, 2015.
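
Figures like the ones in the list above can be read from the per-year counts directly rather than from the plot. This is a rough sketch on a small made-up frame that reuses the movie_year and title_only column names; demo, counts, peak_year and yoy_change are illustrative names.

```python
import pandas as pd

# Made-up sample: two movies in 1995, three in 1996, one in 1997
demo = pd.DataFrame({
    'title_only': ['A', 'B', 'C', 'D', 'E', 'F'],
    'movie_year': pd.to_datetime(
        ['1995', '1995', '1996', '1996', '1996', '1997'], format='%Y'),
})

# Number of movies per calendar year
counts = demo.groupby(demo['movie_year'].dt.year)['title_only'].count()

# Year with the most movies, and the change from each year to the next
peak_year = counts.idxmax()
yoy_change = counts.diff()

print(counts)
print(peak_year, yoy_change.loc[1997])
```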

In [264]:
movies.head()


Out[264]:
movieId title genres movie_year title_only
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy 1995-01-01 Toy Story
1 2 Jumanji (1995) Adventure|Children|Fantasy 1995-01-01 Jumanji
2 3 Grumpier Old Men (1995) Comedy|Romance 1995-01-01 Grumpier Old Men
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance 1995-01-01 Waiting to Exhale
4 5 Father of the Bride Part II (1995) Comedy 1995-01-01 Father of the Bride Part II

In [415]:
list(movies)


Out[415]:
['movieId', 'title', 'genres', 'movie_year', 'title_only']

In [410]:
a = pd.Series(movies.iloc[0])

In [411]:
a


Out[411]:
movieId                                                 1
title                                    Toy Story (1995)
genres        Adventure|Animation|Children|Comedy|Fantasy
movie_year                            1995-01-01 00:00:00
title_only                                      Toy Story
Name: 0, dtype: object

In [418]:
def flat(row):
    # Expand one movie row into one row per genre
    c = pd.DataFrame(columns=list(movies))
    genre_list = row['genres'].split('|')
    for j, genre in enumerate(genre_list):
        c.loc[j] = [row['movieId'], row['title'], genre, row['movie_year'], row['title_only']]
    return c

In [419]:
c = flat(a)

In [420]:
c


Out[420]:
movieId title genres movie_year title_only
0 1.0 Toy Story (1995) Adventure 1995-01-01 Toy Story
1 1.0 Toy Story (1995) Animation 1995-01-01 Toy Story
2 1.0 Toy Story (1995) Children 1995-01-01 Toy Story
3 1.0 Toy Story (1995) Comedy 1995-01-01 Toy Story
4 1.0 Toy Story (1995) Fantasy 1995-01-01 Toy Story
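
The flat function above handles one row at a time. To flatten every movie at once, one option is to split the genres string into a list and use DataFrame.explode (available from pandas 0.25). The sketch below uses a two-row sample taken from the head of the movies data frame; movies_demo and flat_demo are illustrative names.

```python
import pandas as pd

# Two rows from movies.head(), without the derived columns
movies_demo = pd.DataFrame({
    'movieId': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy'],
})

# Turn each genres string into a list, then emit one row per list element
flat_demo = (movies_demo
             .assign(genres=movies_demo['genres'].str.split('|'))
             .explode('genres')
             .reset_index(drop=True))

print(flat_demo)
```

Unlike building the frame row by row, this keeps movieId as an integer column.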

In [ ]: