In [1]:
%run imports.py
%run helper_functions.py
%matplotlib inline
%run grid.py
%autosave 120
plt.rcParams["xtick.labelsize"] = 20
plt.rcParams["ytick.labelsize"] = 20
plt.rcParams["axes.labelsize"] = 20
The objective of this notebook, and more broadly this project, is to see whether we can discern a linear relationship between metrics found on Rotten Tomatoes and box office performance.
Box office performance is measured in millions of dollars, as is budget.
Because we have used scaling, interpretation of the raw coefficients will be difficult. Luckily, sklearn's StandardScaler has an inverse_transform method, so, if we had to, we could reverse the transformation (via sc_X_train for the hold-out group and sc_X for the non-hold-out group) to get the coefficients back into interpretable units. The same logic applies to the target variable should we use the model for prediction.
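As a rough sketch of that reversal (assuming a fitted linear estimator named model, which is a placeholder, and the sc_X_train / sc_y_train scalers fitted further down in this notebook), converting coefficients and predictions back to the original units could look like this:
import numpy as np

# Standardized coefficient -> original-units coefficient: multiply by
# sigma_y / sigma_x for each of the six scaled (non-dummy) columns.
beta_scaled = np.ravel(model.coef_)
beta_original = beta_scaled[:6] * (sc_y_train.scale_[0] / sc_X_train.scale_)

# Predictions back in millions of dollars via inverse_transform.
y_pred_scaled = model.predict(X_test)
y_pred_millions = sc_y_train.inverse_transform(y_pred_scaled.reshape(-1, 1))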
The year, country, language and month columns will all be made into dummy variables. I will do this with the built-in pd.get_dummies() function, which turns every column of type object into dummies! I will also use the optional drop_first parameter to avoid the dummy variable trap!
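Purely as an illustration (the dummy encoding itself happened in the earlier feature-preparation notebook that produced the pickles loaded below), the call would look something like:
import pandas as pd

# get_dummies encodes every object-dtype column; drop_first=True drops one
# level per categorical variable to avoid the dummy variable trap.
dummy_df = pd.get_dummies(df, drop_first=True)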
I will use sklearn's StandardScaler on all of my variables except the dummies! This is important since we will be using regularized regression (Lasso, Ridge, Elastic Net), whose penalties are sensitive to the scale of the features.
I will shuffle my dataframe before the train/test split. I will use the X_train, y_train, X_test and y_test variables with GridSearchCV. This is an example of combining cross-validation with a hold-out set, which helps guard against overfitting.
To be truly honest, I do not have enough data to justify using a hold-out set; however, I want to implement it as an academic exercise! It also gives me more code to write!
I will then re-implement the models without the hold-out set to compare results!
Let's get to it!
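As a small illustration of the shuffling step (the actual shuffle was done when the pickles loaded below were created; df here stands in for the dataframe loaded in the next cell):
# Shuffle every row (frac=1 keeps all rows) with a fixed seed for
# reproducibility, then reset the index so it runs 0..n-1 again.
shuffled_df = df.sample(frac=1, random_state=0).reset_index(drop=True)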
In [2]:
df = unpickle_object("final_dataframe_for_analysis.pkl") #dataframe we got from webscraping and cleaning!
#see other notebooks for more info.
In [3]:
df.dtypes # these are all our features. Our target variable is Box_office
Out[3]:
In [4]:
df.shape
Out[4]:
Upon further thought, it doesn't make sense to have rank_in_genre as a predictor variable for box office revenue. When a movie is released, it is not ranked immediately; the ranks are often assigned many years after release and so are not related to the amount of money accrued at the box office. We will drop this variable.
Right now, our index is the name of the movie! We don't need the titles as indices; it would be cleaner to have a numeric index.
The month and year columns are currently numeric; however, for our analysis we require them to be of type object!
In [5]:
df['Month'] = df['Month'].astype(object)
df['Year'] = df['Year'].astype(object)
del df['Rank_in_genre']
df.reset_index(inplace=True)
del df['index']
In [6]:
percentage_missing(df)
In [7]:
df.hist(layout=(4,2), figsize=(50,50))
Out[7]:
From the above plots, we see that we have heavy skewness in all of our features and our target variable.
The features will be scaled using StandardScaler.
When splitting the data into training and test sets, I will fit the scaler on the training data only!
There is no sign of multicollinearity (no pairwise correlation $\geq 0.9$) in the correlation matrix below - good to go!
In [8]:
plot_corr_matrix(df)
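helper_functions.py is not reproduced in this notebook, so here is a hypothetical sketch of what plot_corr_matrix might do (the implementation details are assumptions):
import matplotlib.pyplot as plt

def plot_corr_matrix(frame):
    """Hypothetical sketch: heatmap of pairwise correlations for the numeric columns, to eyeball any |r| >= 0.9 pairs."""
    corr = frame.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.matshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_yticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticklabels(corr.columns)
    fig.colorbar(im)
    plt.show()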
In [9]:
X = unpickle_object("X_features_selection.pkl") #all features from the shuffled dataframe. Numpy array
y = unpickle_object("y_variable_selection.pkl") #target variable from shuffled dataframe. Numpy array
final_df = unpickle_object("analysis_dataframe.pkl") #this is the shuffled dataframe!
In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0) #train on 75% of data
In [11]:
sc_X_train = StandardScaler()
sc_y_train = StandardScaler()
sc_X_train.fit(X_train[:, :6]) # only need to learn the fit of the first 6 columns - the rest are dummies
sc_y_train.fit(y_train.reshape(-1, 1)) # StandardScaler expects 2D input, so reshape the target
X_train[:, :6] = sc_X_train.transform(X_train[:, :6]) # only transform the first 6 columns - the rest are dummies
X_test[:, :6] = sc_X_train.transform(X_test[:, :6]) # use the scaler fitted on the training data only
y_train = sc_y_train.transform(y_train.reshape(-1, 1)).ravel() # back to a 1D vector
y_test = sc_y_train.transform(y_test.reshape(-1, 1)).ravel()
As we can see - the baseline model of regular linear regression is dreadful! Let's move on to more sophisticated methods!
In [12]:
baseline_model(X_train, X_test, y_train, y_test)
Out[12]:
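baseline_model also lives in helper_functions.py; a minimal sketch of the idea (the exact signature and return value are assumptions) is plain OLS scored on both splits:
from sklearn.linear_model import LinearRegression

def baseline_model(X_train, X_test, y_train, y_test):
    """Hypothetical sketch: fit unregularized linear regression and report train/test R^2."""
    ols = LinearRegression()
    ols.fit(X_train, y_train)
    print("Train R^2: {:.3f}".format(ols.score(X_train, y_train)))
    print("Test R^2: {:.3f}".format(ols.score(X_test, y_test)))
    return ols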
In [13]:
holdout_results = holdout_grid(["Ridge", "Lasso", "Elastic Net"], X_train, X_test, y_train, y_test)
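holdout_grid is another helper; a minimal sketch of the approach (the parameter grid and the returned structure are assumptions) is a GridSearchCV per model, cross-validated on the training split and then scored once on the hold-out set:
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV

def holdout_grid(model_names, X_train, X_test, y_train, y_test):
    """Hypothetical sketch: grid-search each regularized model with CV, then score the best estimator on the hold-out set."""
    estimators = {"Ridge": Ridge(), "Lasso": Lasso(), "Elastic Net": ElasticNet()}
    param_grid = {"alpha": np.logspace(-3, 3, 13)}  # assumed search space
    results = {}
    for name in model_names:
        grid = GridSearchCV(estimators[name], param_grid, cv=5, scoring="r2")
        grid.fit(X_train, y_train)
        results[name] = {
            "best_params": grid.best_params_,
            "cv_r2": grid.best_score_,
            "holdout_r2": grid.best_estimator_.score(X_test, y_test),
        }
    return results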
In [14]:
pickle_object(holdout_results, "holdout_model_results")
In [15]:
sc_X = StandardScaler()
sc_y = StandardScaler()
In [16]:
sc_X.fit(X[:, :6]) # only need to learn the fit of the first 6 columns - the rest are dummies
sc_y.fit(y.reshape(-1, 1)) # StandardScaler expects 2D input, so reshape the target
X[:, :6] = sc_X.transform(X[:, :6]) # only transform the first 6 columns - the rest are dummies
y = sc_y.transform(y.reshape(-1, 1)).ravel() # back to a 1D vector
In [17]:
no_holdout_results = regular_grid(["Ridge", "Lasso", "Elastic Net"], X, y)
In [18]:
pickle_object(no_holdout_results, "no_holdout_model_results")
In [19]:
extract_model_comparisons(holdout_results, no_holdout_results, "Ridge")
In [20]:
extract_model_comparisons(holdout_results, no_holdout_results, "Lasso")
In [21]:
extract_model_comparisons(holdout_results, no_holdout_results, "Elastic Net")
From the above, we can see that 2 of the 3 models achieved a higher $R^{2}$ without the use of a hold-out set. This fits with the theory that when you train on less data (i.e. set some aside as a hold-out set), your model will tend to perform worse. As such, hold-out sets should only be used when plenty of training data is available!
We also see that the budget and the number of reviews on Rotten Tomatoes were the strongest features for predicting box office revenue!
Please note: the data collection process was a nightmare, with lots of missing data, so various methods of imputation were employed. This is probably most apparent from the fact that our highest $R^{2}$ across all models was only about $0.45$.
While I did not obtain the strongest results, this project was fantastic in exposing me to methods of regularization, standardization and introductory machine learning techniques.
I am especially ecstatic that I got to use the GridSearchCV class! This class will make it very easy for me to take on more advanced machine learning topics in future projects.
Other than the focus on Machine Learning - this project was an excellent exercise in data collection and cleaning!