In this challenge, you'll practice creating data visualizations using data on Hollywood movies released between 2007 and 2011. The goal is to better understand the underlying economics of Hollywood and to explore how much of a movie's success is driven by outliers. The dataset was compiled by David McCandless, and you can read about how the data was compiled here. You'll use a version of this dataset prepared by John Goodall, which can be downloaded from his GitHub repo here.
In [32]:
# %sh
# wget https://raw.githubusercontent.com/jgoodall/cinevis/master/data/csvs/moviedata.csv
# ls -l
In [33]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
hollywood_movies = pd.read_csv('moviedata.csv')
print(hollywood_movies.head())
In [34]:
print(hollywood_movies['exclude'].value_counts())
In [35]:
hollywood_movies = hollywood_movies.drop('exclude', axis=1)
In [36]:
fig = plt.figure(figsize=(6, 10))
ax1 = fig.add_subplot(2, 1, 1)
ax1.scatter(hollywood_movies['Profitability'], hollywood_movies['Audience Rating'])
ax1.set(xlabel='Profitability', ylabel='Audience Rating', title='Hollywood Movies, 2007-2011')
ax2 = fig.add_subplot(2, 1, 2)
ax2.scatter(hollywood_movies['Audience Rating'], hollywood_movies['Profitability'])
ax2.set(xlabel='Audience Rating', ylabel='Profitability', title='Hollywood Movies, 2007-2011')
plt.show()
Both scatter plots in the previous step contain a single outlier, which stretches the scale of each plot so far that the remaining points are squeezed together. The movie in question is Paranormal Activity, which is widely known as the most profitable movie ever: it brought in $193.4 million in revenue on a budget of only $15,000. Let's filter out this movie so you can create useful visualizations with the rest of the data.
In [37]:
# scatter_matrix lives in pandas.plotting in current pandas releases
from pandas.plotting import scatter_matrix
normal_movies = hollywood_movies[hollywood_movies['Film'] != 'Paranormal Activity']
scatter_matrix(normal_movies[['Profitability', 'Audience Rating']], figsize=(6,6))
plt.show()
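To put the Paranormal Activity figures quoted earlier in perspective, the short sketch below divides the stated revenue by the stated budget. It assumes a plain revenue-to-budget ratio; the dataset's own Profitability column may be defined or scaled differently, so treat the result as a rough illustration rather than the value stored in the CSV.

```python
# Figures quoted in the text above.
revenue = 193.4e6   # dollars
budget = 15000.0    # dollars

# Assumption: a simple revenue / budget ratio, which may not match
# how the dataset's Profitability column is computed.
ratio = revenue / budget
print("Revenue-to-budget ratio: about {:,.0f}x".format(ratio))
```

A ratio this large, sitting next to movies whose ratios are in the single or low double digits, is why one point can dominate the axis scale of a scatter plot.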
In [42]:
fig = plt.figure()
normal_movies.boxplot(['Critic Rating', 'Audience Rating'])
plt.show()
In [39]:
normal_movies = normal_movies.sort_values(by='Year')
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(1, 2, 1)
sns.boxplot(x=normal_movies['Year'], y=normal_movies['Critic Rating'], ax=ax1)
ax2 = fig.add_subplot(1, 2, 2)
sns.boxplot(x=normal_movies['Year'], y=normal_movies['Audience Rating'], ax=ax2)
plt.show()
Many Hollywood movies aren't profitable, and it's interesting to understand the role ratings play in a movie's profitability. You first need to separate the movies into those that were profitable and those that weren't.
We've created a new Boolean column called Profitable with the following specification:
False if the value for Profitability is less than or equal to 1.0.
True if the value for Profitability is greater than 1.0.
In [40]:
def is_profitable(row):
    if row["Profitability"] <= 1.0:
        return False
    return True
normal_movies["Profitable"] = normal_movies.apply(is_profitable, axis=1)
fig = plt.figure(figsize=(8,6))
ax1 = fig.add_subplot(1, 2, 1)
sns.boxplot(x=normal_movies['Profitable'], y=normal_movies['Audience Rating'], ax=ax1)
ax2 = fig.add_subplot(1, 2, 2)
sns.boxplot(x=normal_movies['Profitable'], y=normal_movies['Critic Rating'], ax=ax2)
plt.show()
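The same Profitable rule can also be expressed as a single column comparison instead of a row-wise apply. The sketch below is one possible alternative, shown on a small made-up DataFrame (the `sample` frame and its values are hypothetical and not taken from moviedata.csv).

```python
import pandas as pd

# Made-up rows for illustration only; not taken from moviedata.csv.
sample = pd.DataFrame({
    "Film": ["A", "B", "C"],
    "Profitability": [0.5, 1.0, 2.3],
})

# The comparison is applied element-wise: values <= 1.0 become False,
# values > 1.0 become True.
sample["Profitable"] = sample["Profitability"] > 1.0
print(sample)
```

The two approaches differ for missing values: with a NaN in Profitability, the row-wise function returns True (because `NaN <= 1.0` is False), while the comparison above yields False. It is worth checking for missing values before choosing either approach.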