So you want to win an Oscar? Do successful movies share certain attributes?
Here we explore the factors that correlate with movie success, defined as grossing a hefty chunk of change and/or winning an Oscar. We answer these questions using separate data sets. First, we determine which factors are correlated with a movie's financial success, using movie information spanning 2009-2014 that we scraped from IMDb (using IMDbpy) and financial information from boxofficemojo.com and the-numbers.com. Second, to determine the factors correlated with winning a sought-after Oscar, we use award information for the years 1981-2006, available as a CSV file on ya-shin.com, and supplement it with movie information from IMDb using IMDbpy.
We were inspired to work on this project because we want to win an Oscar. After all, who doesn't? To analyze our data, we drew extensively upon the skills that we developed throughout Harvard's Data Science course, especially regression in lab 4, ensemble methods in lab 7, and the classification methods we used in assignment 3.
Initially, we planned a comprehensive multi-part analysis to find which attributes made a movie more likely to win an Oscar between 1980 and 2015. Our initial goals were as follows:
Classify actors, directors, and producers as A-list, B-list, and C-list, using clustering algorithms.
Using data from IMDb, we will predict which movies will be Oscar nominees. We will perform this prediction on data from both the current year and previous years (1985-2015), so we can see how well our algorithm performs. Determining the metric of success will be an important first step in the project (i.e., nominated vs. winning).
In addition to predicting the movie’s success, we will answer a number of questions regarding the factors that might be associated with success. Some questions we will try to answer are:
Does the time of year that a movie comes out affect the movie’s success?
Does the cast & crew matter?
Does the movie’s budget matter?
Are there certain plot elements that will make the movies more successful? We will determine this using plot keywords.
Does the movie’s genre affect its potential for success?
IMDbpy retrieves data slowly, so this turned out to be a monumental task that our WiFi bandwidth could not complete in the time allowed for this project. We therefore focused our analysis on structured data that was more readily available from other sites, such as boxofficemojo, the-numbers, and ya-shin, and supplemented the movies in these data sets with data from IMDbpy. By narrowing our data set to some of the most popular movies, we also expect our data to be more balanced, allowing us to recover a stronger signal in our analyses.
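To illustrate what supplementing one data set with another can look like, here is a minimal sketch that joins box-office rows to extra movie fields on a normalized title plus release year. It is not the code we used: the function names `movie_key` and `supplement`, the field names, and the sample rows are all hypothetical placeholders, and matching on title and year is an assumption about how records from different sites could be lined up.

```python
import re


def movie_key(title, year):
    """Build a join key: lowercased title without punctuation, plus the year."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return (" ".join(cleaned.split()), int(year))


def supplement(base_rows, extra_rows):
    """Left-join extra fields onto base rows by (normalized title, year)."""
    lookup = {movie_key(r["title"], r["year"]): r for r in extra_rows}
    merged = []
    for row in base_rows:
        extra = lookup.get(movie_key(row["title"], row["year"]), {})
        # Values from the base row win when both sources define a field.
        merged.append({**extra, **row})
    return merged


# Made-up placeholder rows, not real data (hypothetical field names).
box_office = [
    {"title": "Example Movie: Part II", "year": 2012, "gross": 100},
    {"title": "Unmatched Film", "year": 2013, "gross": 50},
]
imdb_info = [
    {"title": "example movie part ii", "year": "2012", "genres": ["Action"]},
]

merged = supplement(box_office, imdb_info)
```

Rows without a match keep only their original fields, so missing supplementary data shows up as absent keys rather than silently dropped movies.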
For our analysis of what features are associated with winning an Oscar, we asked the following questions:
For our analysis of what features are associated with a movie's profitability, we asked the following questions:
We wrote several helper functions to retrieve the IMDbpy data that we needed for the project. We used these helper functions as we determined the features associated with both movie financial success and Academy Award success. We describe these functions in the order they are introduced. As part of this submission, we have included separate notebooks that describe the data scraping and cleaning process. These scraping processes save the required data to pickle files that the later analysis loads.
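The pickle-file pattern described above can be sketched as follows. This is an illustrative example rather than our actual helper: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a slow lookup such as an IMDbpy query. The idea is that each title is fetched at most once and later runs read it back from disk.

```python
import os
import pickle
import tempfile


def load_or_fetch(cache_path, titles, fetch_one):
    """Return {title: record}, fetching only titles missing from the pickle cache."""
    # Start from whatever an earlier run already saved.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            cache = pickle.load(fh)
    else:
        cache = {}

    for title in titles:
        if title not in cache:
            cache[title] = fetch_one(title)  # slow network call in practice

    # Persist the updated cache for the next run.
    with open(cache_path, "wb") as fh:
        pickle.dump(cache, fh)
    return {title: cache[title] for title in titles}


# Hypothetical stand-in for a real lookup; records every call it receives.
calls = []


def fake_fetch(title):
    calls.append(title)
    return {"title": title, "genres": ["Drama"]}


cache_file = os.path.join(tempfile.mkdtemp(), "movies.pkl")
first = load_or_fetch(cache_file, ["Argo", "Gravity"], fake_fetch)
second = load_or_fetch(cache_file, ["Argo", "Gravity", "Birdman"], fake_fetch)
```

On the second call only the new title triggers a fetch, which is what makes a slow scraper tolerable across repeated notebook runs.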
Because our data sets did not span the same time range, our work diverged at this point into two mostly separate analyses, which we performed in separate IPython notebooks.
Our results are presented in palatable form at So you want to win an Oscar?.