Pixar Animation Studios is one of the best-known animation studios in the world, and many fans watch every newly released film.
Here we use a dataset on Pixar movies gathered from multiple sources including:
Here are some of the columns in our dataset PixarMovies.csv:
In [1]:
# Set up the environment by importing the libraries we need
import pandas as pd
import matplotlib.pyplot as plt
# Note: Importing seaborn also affects the styling of all matplotlib and pandas plots
import seaborn as sns
# Run the Jupyter magic so that plots are displayed in the notebook
%matplotlib notebook
In [2]:
# Read the dataset into a DataFrame and determine the dimensions
pixar_movies = pd.read_csv('../data/PixarMovies.csv')
pixar_movies.shape
Out[2]:
In [3]:
# Display the entire dataset since it isn't too big
pixar_movies.head(15)
Out[3]:
In [4]:
# Get the datatypes for each column
pixar_movies.dtypes
Out[4]:
In [5]:
# Generate some summary statistics
pixar_movies.describe()
Out[5]:
In [6]:
# Strip the percentage sign (%) from the end of values and convert to float
pixar_movies['Domestic %'] = pixar_movies['Domestic %'].str.rstrip('%').astype(float)
pixar_movies['International %'] = pixar_movies['International %'].str.rstrip('%').astype(float)
pixar_movies[['Domestic %', 'International %']].head()
Out[6]:
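Note that the cell above only works once: after the columns are converted to float, the `.str` accessor is no longer available, so re-running the cell raises an error. A minimal sketch of a re-runnable version follows. The helper name `percent_to_float` and the toy values are hypothetical and are not part of the dataset.

```python
import pandas as pd


def percent_to_float(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn '45.5%'-style strings into floats.

    Values are cast to str first, so the function can be applied again to a
    column that has already been converted. Anything unparseable becomes NaN.
    """
    return pd.to_numeric(column.astype(str).str.rstrip("%"), errors="coerce")


# Made-up values standing in for a 'Domestic %' column
toy_percent = pd.Series(["45.5%", "60%", None], dtype=object)

once = percent_to_float(toy_percent)
twice = percent_to_float(once)  # applying it a second time does not raise
```

On the real data this would be called as `percent_to_float(pixar_movies['Domestic %'])`, assuming the column holds strings such as `'45.5%'`.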
In [7]:
# Multiply the IMDB Score column by 10 to convert it to a 100-point scale
pixar_movies['IMDB Score'] = pixar_movies['IMDB Score'] * 10
pixar_movies.head()
Out[7]:
In [8]:
# Create a new DataFrame with rows containing missing values dropped (here, the last row)
filtered_pixar = pixar_movies.dropna()
filtered_pixar
Out[8]:
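Because `dropna()` removes every row that has at least one missing value, it can help to check which rows will be dropped before filtering. The sketch below uses a made-up three-row DataFrame (the names `toy` and `rows_with_missing` are illustrative) rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the last one has missing values
toy = pd.DataFrame(
    {
        "Movie": ["Film A", "Film B", "Film C"],
        "Oscars Won": [1.0, 0.0, np.nan],
        "Domestic %": [50.0, 40.0, np.nan],
    }
)

# Rows with at least one missing value: exactly the rows dropna() will remove
rows_with_missing = toy[toy.isna().any(axis=1)]

# Same filtering step as in the cell above
filtered_toy = toy.dropna()
```

Running the same check on `pixar_movies` would confirm whether only the last row is affected.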
In [9]:
# Set the Movie column as the index for both DataFrames
pixar_movies.set_index('Movie', inplace=True)
filtered_pixar.set_index('Movie', inplace=True)
pixar_movies.head()
Out[9]:
In [10]:
# Create a new DataFrame containing just the critics' review scores
critics_reviews = pixar_movies[['RT Score', 'IMDB Score', 'Metacritic Score']]
critics_reviews.head()
Out[10]:
In [12]:
# Use the DataFrame plot() method to visualize this new DataFrame
critics_reviews.plot()
Out[12]:
In [13]:
# The resulting plot is a little cramped, so let's tweak the figure size
critics_reviews.plot(figsize=(9,6))
Out[13]:
Note: Not all movie names are listed on the x-axis, and the vertical grid lines on the x-axis exist only for every other movie.
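One way to label every movie is to set the tick positions and labels explicitly on the Axes object that `plot()` returns. This is a sketch on a made-up DataFrame (`toy_reviews` and its scores are hypothetical); it assumes the index holds strings, in which case pandas draws the lines at integer positions 0, 1, 2, and so on.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores standing in for critics_reviews
toy_reviews = pd.DataFrame(
    {"RT Score": [90, 80, 70, 95, 85], "IMDB Score": [80, 75, 72, 88, 79]},
    index=["Film A", "Film B", "Film C", "Film D", "Film E"],
)

ax = toy_reviews.plot(figsize=(9, 6))

# Put one tick at every row position and label it with the movie name
ax.set_xticks(range(len(toy_reviews.index)))
ax.set_xticklabels(toy_reviews.index, rotation=45, ha="right")

tick_labels = [label.get_text() for label in ax.get_xticklabels()]
```

Rotating the labels keeps longer movie titles from overlapping.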
In [14]:
# Box plot
critics_reviews.plot(kind='box', figsize=(9,5))
Out[14]:
In [15]:
# Stacked bar plot
revenue_proportions = filtered_pixar[['Domestic %', 'International %']]
revenue_proportions.plot(kind='bar', stacked=True, figsize=(9,6))
Out[15]:
Create a grouped bar plot to explore if there's any correlation between the number of Oscars a movie was nominated for and the number it actually won.
In [16]:
# Create a grouped bar plot
movie_oscars = filtered_pixar[['Oscars Nominated', 'Oscars Won']]
movie_oscars.plot(kind='bar')
Out[16]:
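The grouped bar plot gives a visual impression; a single number can complement it. The sketch below computes the Pearson correlation between the two columns with `Series.corr()`, using made-up counts (`toy_oscars` is hypothetical) rather than the real data.

```python
import pandas as pd

# Made-up nomination and win counts
toy_oscars = pd.DataFrame(
    {"Oscars Nominated": [1, 2, 3, 4], "Oscars Won": [0, 1, 1, 2]}
)

# Pearson correlation coefficient between nominations and wins
oscars_corr = toy_oscars["Oscars Nominated"].corr(toy_oscars["Oscars Won"])
```

The equivalent call on the notebook's data would be `movie_oscars['Oscars Nominated'].corr(movie_oscars['Oscars Won'])`.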
What plots can you generate to better understand which columns correlate with the Adjusted Domestic Gross revenue column, which describes the total domestic revenue adjusted for economic and ticket price inflation?
In [17]:
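# Generate plots to better understand which columns correlate with the Adjusted Domestic Gross revenue
# Compute pairwise correlation of the numeric columns to see which may have interesting correlations
# (numeric_only=True skips any non-numeric columns, which newer pandas versions would otherwise reject)
pixar_movies.corr(numeric_only=True)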
Out[17]:
Domestic Gross obviously has a strong positive correlation with Adjusted Domestic Gross. It makes sense that all of the critic reviews have a strong positive correlation with money made. What isn't obvious a priori is which review score will correlate most strongly with box office success. For example, it looks like RT has a really strong correlation, but Metacritic less so.
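To compare the review scores directly, one option is to pull out the single column of the correlation matrix for Adjusted Domestic Gross and sort it. The values below are made up for illustration (`toy_movies` is hypothetical), so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up scores and revenue figures
toy_movies = pd.DataFrame(
    {
        "RT Score": [60, 70, 80, 90],
        "Metacritic Score": [70, 60, 90, 80],
        "Adjusted Domestic Gross": [100, 200, 300, 400],
    }
)

target = "Adjusted Domestic Gross"

# Correlation of every numeric column with the target, strongest first
corr_with_target = (
    toy_movies.select_dtypes("number")
    .corr()[target]
    .drop(target)  # the target always correlates perfectly with itself
    .sort_values(ascending=False)
)
```

Applied to `pixar_movies`, the same chain would rank all numeric columns by how strongly they correlate with Adjusted Domestic Gross.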
In [18]:
# Scatter plot of RT Score against Adjusted Domestic Gross
pixar_movies.plot(x='RT Score', y='Adjusted Domestic Gross', kind='scatter')
Out[18]:
In [19]:
# Work on a copy and use Adjusted Domestic Gross as its index
adjusted_gross = pixar_movies.copy()
adjusted_gross.set_index('Adjusted Domestic Gross', inplace=True)
adjusted_gross.head()
Out[19]: