Project: Project 2: Luther
Date: 02/03/2017
Name: Prashant Tatineni
For Project Luther, I gathered the set of all films listed under movie franchises on boxofficemojo.com. My goal was to predict the success of a movie sequel (i.e., domestic gross in USD) based on the performance of other sequels, and especially based on previous films in that particular franchise. I saw some linear correlation between certain variables, like number of theaters, and the total domestic gross, but the predictions from my final model were not entirely reasonable. More time could be spent on better addressing the various outliers in the dataset.
In [978]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
import requests
from bs4 import BeautifulSoup
import dateutil.parser
import statsmodels.api as sm
import patsy
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import sys, sklearn
from sklearn import linear_model, preprocessing
from sklearn import metrics
%matplotlib inline
I started with the "Franchises" list on Boxofficemojo.com. Within each franchise page, I scraped each movie's information and enter it into a Python dictionary. If it's already in the dictionary, the entry will be overwritten, except with a different Franchise name. But note below that the url for "Franchises" list was sorted Ascending, so this conveniently rolls "subfranchises" into their "parent" franchise.
E.g., "Fantastic Beasts" and the "Harry Potter" movies have their own separate Franchises, but they will all be tagged as the "JKRowling" franchise, i.e. "./chart/?id=jkrowling.htm"
Also, because I was comparing sequels to their predecessors, I focused on Domestic Gross, adjusted for ticket price inflation.
In [75]:
url = 'http://www.boxofficemojo.com/franchises/?view=Franchise&sort=nummovies&order=ASC&p=.htm'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page,"lxml")
tables = soup.find_all("table")
rows = [row for row in tables[3].find_all('tr')]
rows = rows[1:]
In [900]:
# Initialize empty dictionary of movies
movies = {}
for row in rows:
items = row.find_all('td')
franchise = items[0].find('a')['href']
franchiseurl = 'http://www.boxofficemojo.com/franchises/' + franchise[2:]
response = requests.get(franchiseurl)
franchise_page = response.text
franchise_soup = BeautifulSoup(franchise_page,"lxml")
franchise_tables = franchise_soup.find_all("table")
franchise_gross = [row for row in franchise_tables[4].find_all('tr')]
franchise_gross = franchise_gross[1:len(franchise_gross)-2]
franchise_adjgross = [row for row in franchise_tables[5].find_all('tr')]
franchise_adjgross = franchise_adjgross[1:len(franchise_adjgross)-2]
# Assign movieurl as key
# Add title, franchise, inflation-adjusted gross, release date.
for row in franchise_adjgross:
movie_info = row.find_all('td')
movieurl = movie_info[1].find('a')['href']
title = movie_info[1]
adjgross = movie_info[3]
release = movie_info[5]
movies[movieurl] = [title.text]
movies[movieurl].append(franchise)
movies[movieurl].append(adjgross.text)
movies[movieurl].append(release.text)
# Add number of theaters for the above movies
for row in franchise_gross:
movie_info = row.find_all('td')
movieurl = movie_info[1].find('a')['href']
theaters = movie_info[4]
if movieurl in movies.keys():
movies[movieurl].append(theaters.text)
In [901]:
df = pd.DataFrame(movies.values())
df.columns = ['Title','Franchise', 'AdjGross', 'Release', 'Theaters']
df.head()
Out[901]:
In [902]:
df.shape
Out[902]:
In [903]:
# Remove movies that were re-issues, special editions, or separate 3D or IMAX versions.
df['Ignore'] = df['Title'].apply(lambda x: 're-issue' in x.lower() or 're-release' in x.lower() or 'special edition' in x.lower() or '3d)' in x.lower() or 'imax' in x.lower())
df = df[(df.Ignore == False)]
del df['Ignore']
df.shape
Out[903]:
In [905]:
# Convert Adjusted Gross to a number
df['AdjGross'] = df['AdjGross'].apply(lambda x: int(x.replace('$','').replace(',','')))
# Convert Date string to dateobject. Need to prepend '19' for dates > 17 because Python treats '/60' as year '2060'
df['Release'] = df['Release'].apply(lambda x: (x[:-2] + '19' + x[-2:]) if int(x[-2:]) > 17 else x)
df['Release'] = df['Release'].apply(lambda x: dateutil.parser.parse(x))
The films need to be grouped by franchise so that franchise-related data can be included as featured for each observation.
In [908]:
df = df.sort_values(['Franchise','Release'])
df['CumGross'] = df.groupby(['Franchise'])['AdjGross'].apply(lambda x: x.cumsum())
df['SeriesNum'] = df.groupby(['Franchise'])['Release'].apply(lambda x: x.rank())
df['PrevAvgGross'] = (df['CumGross'] - df['AdjGross'])/(df['SeriesNum'] - 1)
In [910]:
df.Theaters = df.Theaters.replace('-','0')
df['Theaters'] = df['Theaters'].apply(lambda x: int(x.replace(',','')))
In [911]:
df['PrevRelease'] = df['Release'].shift()
# Create a second dataframe with franchise group-related information.
df_group = pd.DataFrame(df.groupby(['Franchise'])['Title'].apply(lambda x: x.count()))
df_group['FirstGross'] = df.groupby(['Franchise'])['AdjGross'].first()
df_group['FirstRelease'] = df.groupby(['Franchise'])['Release'].first()
df_group['SumTheaters'] = df.groupby(['Franchise'])['Theaters'].apply(lambda x: x.sum())
In [912]:
df_group.columns = ['NumOfFilms','FirstGross','FirstRelease','SumTheaters']
df_group['AvgTheaters'] = df_group['SumTheaters']/df_group['NumOfFilms']
In [913]:
df_group['Franchise'] = df.groupby(['Franchise'])['Franchise'].first()
In [914]:
df = df.merge(df_group, on='Franchise')
In [915]:
df.head()
Out[915]:
In [916]:
df['Theaters'] = df.Theaters.replace(0,df.AvgTheaters)
In [917]:
# Drop rows with NaN. Drops all first films, but I've already stored first film information within other features.
df = df.dropna()
df.shape
Out[917]:
In [918]:
df['DaysSinceFirstFilm'] = df.Release - df.FirstRelease
df['DaysSinceFirstFilm'] = df['DaysSinceFirstFilm'].apply(lambda x: x.days)
In [919]:
df['DaysSincePrevFilm'] = df.Release - df.PrevRelease
df['DaysSincePrevFilm'] = df['DaysSincePrevFilm'].apply(lambda x: x.days)
In [922]:
df.sort_values('Release',ascending=False).head()
Out[922]:
For the regression model, I decided to keep data for films released through 2016, but drop the 3 films released this year; because of their recent release date, their gross earnings will not yet be representative.
In [923]:
films17 = df.loc[[530,712,676]]
In [1008]:
# Grabbing columns for regression model and dropping 2017 films
dfreg = df[['AdjGross','Theaters','SeriesNum','PrevAvgGross','FirstGross','DaysSinceFirstFilm','DaysSincePrevFilm']]
dfreg = dfreg.drop([530,712,676])
dfreg.shape
Out[1008]:
In [929]:
dfreg.corr()
Out[929]:
In [930]:
sns.pairplot(dfreg);
In [1031]:
sns.regplot((dfreg.PrevAvgGross), (dfreg.AdjGross));
In [1032]:
sns.regplot(np.log(dfreg.Theaters), np.log(dfreg.AdjGross));
In the pairplot we can see that 'AdjGross' may have some correlation with the variables, particularly 'Theaters' and 'PrevAvgGross'. However, it looks like a polynomial model, or natural log / some other transformation will be required before fitting a linear model.
In [932]:
y, X = patsy.dmatrices('AdjGross ~ Theaters + SeriesNum + PrevAvgGross + FirstGross + DaysSinceFirstFilm + DaysSincePrevFilm', data=dfreg, return_type="dataframe")
In [933]:
model = sm.OLS(y, X)
fit = model.fit()
fit.summary()
Out[933]:
In [935]:
fit.resid.plot(style='o');
In [938]:
polyX=PolynomialFeatures(2).fit_transform(X)
In [941]:
polymodel = sm.OLS(y, polyX)
polyfit = polymodel.fit()
polyfit.rsquared
Out[941]:
In [944]:
polyfit.resid.plot(style='o');
In [945]:
polyfit.rsquared_adj
Out[945]:
In [947]:
hetnames = ['Lagrange multiplier statistic', 'p-val', 'f-val', 'f p-val']
hettest = sm.stats.diagnostic.het_breushpagan(fit.resid, fit.model.exog)
zip(hetnames,hettest)
Out[947]:
In [948]:
hetnames = ['Lagrange multiplier statistic', 'p-val', 'f-val', 'f p-val']
hettest = sm.stats.diagnostic.het_breushpagan(polyfit.resid, fit.model.exog)
zip(hetnames,hettest)
Out[948]:
In [949]:
dfPolyX = pd.DataFrame(polyX)
bcPolyX = pd.DataFrame()
for i in range(dfPolyX.shape[1]):
bcPolyX[i] = scipy.stats.boxcox(dfPolyX[i])[0]
In [950]:
# Transformed data with Box-Cox:
bcPolyX.head()
Out[950]:
In [952]:
# Introduce log(y) for target variable:
y = y.reset_index(drop=True)
logy = np.log(y)
In [965]:
logPolyModel = sm.OLS(logy, bcPolyX)
logPolyFit = logPolyModel.fit()
logPolyFit.rsquared_adj
Out[965]:
In [974]:
X_scaled = preprocessing.scale(bcPolyX)
en_cv = linear_model.ElasticNetCV(cv=10, normalize=False)
en_cv.fit(X_scaled, logy)
Out[974]:
In [985]:
en_cv.coef_
Out[985]:
In [983]:
logy_en = en_cv.predict(X_scaled)
mse = metrics.mean_squared_error(logy, logy_en)
In [984]:
# The mean square error for this model
mse
Out[984]:
In [1069]:
plt.scatter([x for x in range(540)],(pd.DataFrame(logy_en)[0] - logy['AdjGross']));
In [1004]:
films17
Out[1004]:
In [992]:
df17 = films17[['AdjGross','Theaters','SeriesNum','PrevAvgGross','FirstGross','DaysSinceFirstFilm','DaysSincePrevFilm']]
y17, X17 = patsy.dmatrices('AdjGross ~ Theaters + SeriesNum + PrevAvgGross + FirstGross + DaysSinceFirstFilm + DaysSincePrevFilm', data=df17, return_type="dataframe")
polyX17 = PolynomialFeatures(2).fit_transform(X17)
dfPolyX17 = pd.DataFrame(polyX17)
bcPolyX17 = pd.DataFrame()
for i in range(dfPolyX17.shape[1]):
bcPolyX17[i] = scipy.stats.boxcox(dfPolyX17[i])[0]
X17_scaled = preprocessing.scale(bcPolyX17)
In [993]:
# Run the "en_cv" model from above on the 2017 data:
logy_en_2017 = en_cv.predict(X17_scaled)
In [1005]:
# Predicted Adjusted Gross:
pd.DataFrame(np.exp(logy_en_2017))
Out[1005]:
In [1002]:
# Adjusted Gross as of 2/1:
y17
Out[1002]: