Exploring Movie Data


Timothy Helton

Exploring and visualizing movie data scraped from Box Office Mojo. Make sure to format your plots properly with axis labels and graph titles at the very least.



NOTE:
This notebook uses code found in the k2datascience.movies module. To execute all the cells do one of the following items:

  • Install the k2datascience package to the active Python interpreter.
  • Add k2datascience/k2datascience to the PYTHON_PATH system variable.
  • Create a link to the movies.py file in the same directory as this notebook.


Imports


In [ ]:
from k2datascience import movies

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline

Data Prep

Load the dataset and explore its contents, columns and types. Convert any date columns from string to Python's datetime.


In [ ]:
mov = movies.BoxOffice()
print(f'Data Types:\n{mov.data.dtypes}\n\n')
print(f'Data Shape:\n{mov.data.shape}\n\n')
print(f'Missing Data:\n{mov.data.isnull().sum()}\n\n')
mov.data.head()
mov.data.tail()
mov.data.describe()

In [ ]:
mov.distribution_plot()

In [ ]:
mov.kde_plot()

Exercise 1

Plot the Domestic Total Gross by Release Date. Try a scatterplot and a line plot. In what scale are the numbers on the Y axis?


In [ ]:
mov.domestic_gross_vs_release_date_plot()

Y-Axis Labels

The y-axis labels are scaled to millions of dollars.

Exercise 2

Plot the Domestic Total Gross by Runtime. Plot Runtime vs Release Date. Try a scatterplot and a line plot. What are the benefits and liabilities of each type of plot in this case?


In [ ]:
mov.domestic_gross_vs_runtime_plot()
mov.runtime_vs_release_plot()

Benefits and Liabilities

  1. The scatter plot clearly bins the data allowing one to observe the counts of movies at a given run time and the corresponding sales.
  2. The line plot is harder to gain insight.
    • The overall majority of run times reside beween 100-120 minutes in length and there is not a trend over time that the run times are increasing or decreasing.

Exercise 3

The Motion Picture Association of America (MPAA) rating system comprises of G (general audiences: all ages admitted), PG (parental guidance suggested: may not be suitable for children), PG-13 (parents stronly cautioned: may be inappropriate for children under 13), and R (restricted: under 17 requires accompanying parent or adult guardian). Find the average Runtime and Domestic Total Gross at each Rating Level. Plot both by Rating. Do you see any pattern in the data?


In [ ]:
mov.rating_plot()

Findings

  1. Movies marketed towards adults are on average 10-15% longer than those for younger audiences.
  2. Movies marketed towards younger audiences are far more likely to produce larger domestic gross sales.

Exercise 4

Plot the Domestic Total Gross by the Release Date. Segment by Rating - that is, have all 4 groups on the same plot. Can you spot anything out of the ordinary? Now make 4 separate plots (one for each Rating) but part of the same matplotlib figure. What are the benefits and liabilities of each approach?


In [ ]:
mov.domestic_gross_rating_plot()

Findings

  1. The combined plot makes details of most of the data difficult to interpret, but highlights the points that are above average.
  2. The segregated plots require the scales to be set equal on each plot to facilitate a comparision.
  3. By combining all these plots together a composite view of the data is formed resulting in insight.

Exercise 5

Who are the top 10 directors with the highest gross per movie (highest average gross)? How many movies did they release in that time period? Who are the top 10 directors with the highest average gross excluding one-hit wonders (directors that have a single release in the dataset)?


In [ ]:
directors = mov.director_performance()
print('Top 10 Directors')
directors.head(10)

print('Top 10 Directors with more than one release')
directors.query('qty > 1').head(10)

Exercise 6

Group the movies by month of release and plot the mean Domestic Total Gross by month. Calculate the Standard Error and add error bars to the plot. Can you tell any pattern from the data?


In [ ]:
mov.domestic_gross_vs_months()

Findings

  1. The peak periods for average domestic gross sales are summer (May, June).
  2. The holiday season (November, December) appears to be a time for promissing time for releasing a movie, but upon further investigation the market is saturated.