$\textbf{Goal of the project:}$ From the moment a new film is announced, film companies try to build up hype for their future movie to gain attention. However, generating a lot of hype or page views does not necessarily give a good indication of the success of a film. Indeed, in recent years, multiple movies have crashed at the box office despite having reasonably well-known actors and a big budget [1]. The goal of this project is to determine what has the most impact on the success of a movie and, if possible, to evaluate the eventual success of an upcoming film based on multiple features such as its cast or its genre, by comparing them to the features of past successful or unsuccessful films in IMDb [2] (Internet Movie Database).
In [ ]:
%matplotlib inline
import configparser
import os
import requests
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
from sklearn import preprocessing, decomposition
import librosa
import IPython.display as ipd
import json
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from collections import OrderedDict
from pygsp import graphs, filters, plotting
from IPython.display import Image
plt.rcParams['figure.figsize'] = (17, 5)
plotting.BACKEND = 'matplotlib'
Datasets [3] obtained from kaggle.com (https://www.kaggle.com/stephanerappeneau/350-000-movies-from-themoviedborg) were used, containing different features about each movie such as its cast, budget, genre, IMDb rating, writers, popularity, gross, etc. They also provide each movie's id, which can be used to look it up in either IMDb [2] or TMDb [4] for additional information. The data in these datasets come from The Movie Database (TMDb) [4], which is based on crowdsourcing.
There was one dataset containing information about the movies themselves, such as the release date, the budget, etc., and another one containing information about the casting of the movies, such as the main actors, the director, the producer, etc. Each of these datasets contained the same number of movies.
The two initial datasets are the following:
In [2]:
#CSV-specific options (sep, error_bad_lines) do not apply to Excel files
all_movies = pd.read_excel('350000-movies/AllMoviesDetailsCleaned.xlsx')
In [3]:
all_actors = pd.read_excel('350000-movies/AllMoviesCastingRaw.xlsx')
In [4]:
all_movies.head(4)
Out[4]:
In [5]:
all_actors.head(4)
Out[5]:
In [6]:
print('The initial datasets contain {} movies'.format(len(all_movies)))
The two datasets were then merged and cleaned in order to obtain one dataset containing only the useful information.
As can be seen, the dataset includes movies with diverse actors, genres, countries, years, etc. However, certain films, notably those produced in non-English-speaking countries, have budgets of zero and/or missing information. Furthermore, as most of the films produced by these companies are not shown on screens worldwide, they might not have the same impact. Thus, in addition to the films with missing information or irrelevant features, it was decided to remove all non-English movies from our dataset.
Therefore the merged dataset was cleaned by removing the following movies:
Moreover, only the relevant columns of the two datasets were kept, namely: the budget, the genre, the IMDb id, the overview, the production companies, the release date, the revenue, the tagline, the title, the director name and the actor names.
Later in the project, additional information was collected:
The final, merged dataset is then displayed below:
In [7]:
df = pd.read_csv('Saved_Datasets/CleanDataset.csv')
In [8]:
df.head(4)
Out[8]:
In [9]:
print('The final dataset contains {} movies'.format(len(df)))
In [10]:
df = pd.read_csv('Saved_Datasets/NewFeaturesDataset.csv')
In [11]:
df.head(3)
Out[11]:
The goal of the project is to identify which parameters impact the success of a movie and, if possible, to try to predict the success of a movie before its release. Many features can be constructed and explored. Indeed, whereas some features relate to the movie itself, such as its cast or its genre, others relate to its popularity and network effect, such as the Metacritic reviews and the number of views of the trailer.
After cleaning our dataset, multiple aspects of the features were therefore explored to give us more information about our dataset as well as to give us an indication of how we would later be able to exploit our data to reach our goal.
In this work, the following features were studied and explored:
We have access to two metrics to define whether a movie is successful or not: the Metacritic score and the revenue. The former is a qualitative measurement representing the overall quality of a movie and the latter a quantitative number that shows whether a movie "worked" or not.
We decided to focus on the quantitative measurement, because we think that most people interested in knowing whether a movie will work want to know if it is going to generate money rather than whether it is a masterpiece.
However, to be able to compare movies, we need a measurement that takes into account the revenue but also the budget. This is why we computed the Return On Investment, or ROI.
The return on investment or ROI, is defined as: $$ ROI = \frac{revenue-budget}{budget}$$
Hence, when the revenue is smaller than the budget, the ROI is negative. When the ROI is zero, the movie has reached the break-even point, where the revenue is equal to the budget.
We also saturated the data above the third quartile, i.e. the 75th percentile, to avoid a large spread and because we noticed that most of the very high ROI values were in fact due to mistakes in the dataset (for example a very high revenue or a very low budget).
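As an illustration, a minimal sketch of this computation with pandas is given below. It assumes the dataframe has the 'revenue' and 'budget' columns listed earlier; the notebook itself loads the precomputed 'ROI' column from the saved dataset.

# return on investment for every movie
df['ROI'] = (df['revenue'] - df['budget']) / df['budget']

# saturate above the third quartile to limit the spread caused by erroneous entries
roi_cap = df['ROI'].quantile(0.75)
df['ROI'] = df['ROI'].clip(upper=roi_cap)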
In [12]:
plt.hist(df['ROI'],bins='auto');
We determined two ways to evaluate whether a movie is successful. The first is to keep the ROI values in order to later perform a regression.
In [13]:
Image("images/ROI_regression.png")
Out[13]:
Note that the figure seems to have a discrete number of ROI values, which is not the case in our regression algorithm.
The second way to evaluate the success is to divide the ROI into two categories: success and failure. We decided to consider the lowest 25% of ROI values as failures and the upper 75% as successes.
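A minimal sketch of how this binary label could be derived from the saturated ROI, assuming the threshold is taken as the 25th percentile of the ROI column (the notebook itself loads the precomputed 'success' column):

# failure (0) for the lowest 25% of ROI values, success (1) otherwise
threshold = df['ROI'].quantile(0.25)
df['success'] = (df['ROI'] > threshold).astype(int)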
In [14]:
colors = ['red','green']
plt.hist([df[df['success']==0]['ROI'], df[df['success']==1]['ROI']],bins='auto',color=colors,stacked=True,range=(-1,2.7));
plt.xlabel('ROI');
The green peak on the right is due to the saturation.
In [15]:
plt.hist(df['budget'],bins=100);
plt.xlabel('Budget of movies [$]')
plt.ylabel('Number of movies')
plt.savefig('images/movies_budget.png', dpi=300, bbox_inches='tight')
Looking at the distribution of movie budgets in the dataset, it can be seen that many movies have a budget between 1,000 and 100,000,000 $.
In [16]:
print('The average budget of movies in the dataset is {} $'.format(df['budget'].mean()))
In [17]:
Image("images/budget_ROI.png")
Out[17]:
In [18]:
Image("images/budget_Profitability.png")
Out[18]:
Looking at the distribution of the budget as a function of the ROI and as a function of the profitability of movies, there is no apparent correlation between the budget of a movie and either its ROI or its profitability. Note that the ROI of movies in the dataset has been saturated at 2.64.
In this subsection, we are interested in knowing which genres are most present in our dataset. To do this, we simply compute the number of times a movie indicates a certain genre. The results are then shown below.
$\underline{\textbf{Note:}}$ Since some movies have multiple genres, the sum of the number of movies per genre exceeds the total number of movies in our dataset
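A minimal sketch of this counting step, assuming the genres of a movie are stored as a pipe-separated string in a 'genres' column (the exact column format in the saved dataset may differ):

import pandas as pd
from collections import Counter

genre_counts = Counter()
for genres in df['genres'].dropna():
    # a movie with several genres is counted once per genre
    for g in str(genres).split('|'):
        genre_counts[g.strip()] += 1

GenreFreq_sketch = pd.DataFrame(genre_counts.most_common(), columns=['genre', 'count'])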
In [19]:
GenreFreq = pd.read_csv('Saved_Datasets/NbGenre.csv')
In [20]:
GenreFreq
Out[20]:
These results can also be visualized in the histogram plot below:
In [21]:
Image("images/GenreFreq.png")
Out[21]:
From this plot, we can observe that there is a big difference in the number of movies per genre. Indeed, there are very few foreign or TV movies compared to the number of drama or comedy movies. However, as this calculation is only done on the movies in our cleaned dataset, it is not representative of the number of films per genre in reality.
Knowing the label of each film (success or not) defined in the previous section, we can then try to determine whether a certain genre tends to be more successful than another. To do this, we simply count the number of films of a certain genre that are labeled as successful and divide it by the total number of films of that genre.
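A minimal sketch of this success-rate computation, under the same assumption of a pipe-separated 'genres' column and using the binary 'success' label defined earlier:

import pandas as pd
from collections import Counter

genre_total, genre_success = Counter(), Counter()
for _, row in df.iterrows():
    if pd.isna(row['genres']):
        continue
    for g in str(row['genres']).split('|'):
        genre_total[g.strip()] += 1
        genre_success[g.strip()] += int(row['success'])

# success rate = successful movies of a genre / all movies of that genre
success_rate = {g: genre_success[g] / genre_total[g] for g in genre_total}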
The results are then shown in the figure below:
In [22]:
Image("images/GenreSuccessRate.PNG")
Out[22]:
As mentioned above, our computations are based only on the movies contained in our dataset and on our definition of success. For instance, for the genre "TV Movie", of which we only have one movie, the success rate is zero. This therefore does not give a good generalization of the success rate in reality.
Furthermore, as indicated previously, our dataset seems to contain many more successful films than unsuccessful ones, which could influence the computation, as many unsuccessful English films could have been removed during data cleaning due to missing data.
Another interesting thing that can be determined is the genres with which each genre is most commonly associated. This was done by counting the number of times each pair of genres appears together and then sorting the counts in descending order, as sketched below.
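A minimal sketch of this pair counting, again assuming a pipe-separated 'genres' column:

import itertools
from collections import Counter

pair_counts = Counter()
for genres in df['genres'].dropna():
    names = sorted(set(g.strip() for g in str(genres).split('|')))
    # count every unordered pair of genres appearing in the same movie
    for a, b in itertools.combinations(names, 2):
        pair_counts[(a, b)] += 1

# most common associations for one genre, in descending order
def associated_with(genre):
    return sorted(((pair, n) for pair, n in pair_counts.items() if genre in pair),
                  key=lambda item: item[1], reverse=True)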
The result is shown in the dataframe below, where 0 indicates the most frequent association and 18 the least frequent.
In [23]:
GenreRanking = pd.read_csv('Saved_Datasets/GenreRanking.csv')
In [24]:
GenreRanking
Out[24]:
For the same reasons as above, we can see that "TV Movie" often figures amongst the least commonly associated genres. Furthermore, due to the limited size of our dataset, certain genres are never associated with several other genres. As such, the ranking between the genres that have 0 associations with such a genre no longer makes sense.
$\textbf{For example:}$ If genre i is never associated with genres j, k and l, it is impossible to determine a ranking between them for genre i.
However, as can be seen in the association matrix below (where associated genres are indicated in black), the minimum number of associations is 3 (see the first line). This means that, for all genres, the top 3 results are representative of certain trends in our cleaned dataset.
In [25]:
NbGenreAssos = pd.read_csv('Saved_Datasets/NbGenreAssos.csv')
plt.spy(NbGenreAssos)
Out[25]:
As mentioned in section 1, data were collected with the TMDb API in order to build the following dataset, which contains information about the actors who play in the movies of the main dataset.
In [28]:
Actors = pd.read_csv('Saved_Datasets/Actorsv4Dataset.csv')
In [29]:
Actors.head(4)
Out[29]:
In this section, the idea is to evaluate the impact of the choice of actors on the success of a movie. First, the total tenure of each actor was computed. The total tenure of an actor is the time difference, in years, between the release date of the first movie in which the actor ever appeared and that of the last movie in which the actor played. Only movies up to 2016 were taken into account. However, the total tenure is not very relevant: when a specific movie is studied, the total tenure includes movies released after the release date of the movie studied. Therefore, the movie tenure of an actor, which is the tenure between the release date of the first movie in which the actor appeared and the release date of the movie studied, was computed instead.
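A minimal sketch of the movie tenure computation, using hypothetical variables for the relevant years (the actual computation in the notebook works on the full Actors dataset and the release dates of the main dataset):

import numpy as np

def movie_tenure(first_appearance_year, movie_release_year, max_tenure=72):
    # tenure of an actor at the time of a given movie, clipped to a plausible
    # range (the saturation to [0, 72] years is discussed further below)
    return int(np.clip(movie_release_year - first_appearance_year, 0, max_tenure))

movie_tenure(1995, 2010)   # -> 15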
These tenures were then added to the main dataset and can be seen in the column tenure of the following dataset:
In [32]:
df_ten = pd.read_csv('Saved_Datasets/NewFeaturesDataset.csv')
In [33]:
df_ten.head(2)
Out[33]:
Then, the total tenure of a movie, which is the sum of the tenures of all its actors at the date of the movie, and the average tenure of a movie, which is the average of those tenures, were computed and added to the main dataset.
First, the distribution of the total tenures of the actors was observed. Looking at the plot below, it was noticed that some actors have total tenures higher than 80 years, which is not possible. By checking the actors with these tenure values, it was discovered that some data on the TMDb website were wrong. The highest correct tenure value found was 72 years. Therefore, when computing the movie tenures of actors, the tenure values were saturated to a minimum of 0 and a maximum of 72 years.
In [34]:
Image("images/tot_tenures_frequency_distri.png")
Out[34]:
Then, the distributions of the total tenures and of the averaged tenures of the movies were plotted; it can be seen that these distributions are close to Gaussian.
In [35]:
Image("images/sum_tenures_frequency_distri.png")
Out[35]:
In [36]:
Image("images/avg_tenures_frequency_distri.png")
Out[36]:
In [37]:
all_director = list(df['director_name'])
diff_all_director = list(set(all_director))
print('There are {} different directors in the dataset'.format(len(diff_all_director)))
In [38]:
Image("images/nb_movie_per_dir.png")
Out[38]:
As can be seen in the distribution of the number of movies per director, only a small percentage of the directors in the dataset have directed more than four movies. This feature will be used in the data exploitation section to compare movies.
In [39]:
#Compute list of different companies
all_comp = list(df['production_companies'])
diff_all_comp = list(set(all_comp))
print('There are {} different production companies in the dataset'.format(len(diff_all_comp)))
In [40]:
Image("images/nb_movie_per_company.png")
Out[40]:
From this plot, it can be seen that most companies have only produced one or two films. On average, a company in the dataset has produced 3.74 movies.
In this section, we want to explore the words contained in the movies' storylines, for example which words, apart from stop words, are used most often. To do this, the Natural Language Toolkit [5] was used to extract each word from the "overview" feature of the dataset and remove these stop words. Furthermore, to make this process easier, all words were set to lower case.
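A minimal sketch of this extraction step with NLTK, assuming the storyline is in the 'overview' column (the punkt and stopwords corpora need to be downloaded once with nltk.download):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

Words_sketch = []
for overview in df['overview'].dropna():
    for token in word_tokenize(str(overview).lower()):
        # keep only alphabetic tokens that are not stop words
        if token.isalpha() and token not in stop_words:
            Words_sketch.append(token)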
To determine the most common words, all of the remaining extracted words were put in a list and the following two commands were used:
Where "Words" is the list containing all the words. The first line of code uses the nltk library to compute the frequency at which the words from the list "Words" appear and the second one is used to plot the top N (in this case N = 30 for the sake of visualization) most frequent words.
In [41]:
Wordsdf = pd.read_csv('Saved_Datasets/MostCommonWords.csv', encoding='latin')
In [42]:
Words = Wordsdf['0'][:].values.tolist()
Plot the number of times the top N most common words appear
In [43]:
freq = nltk.FreqDist(Words)
freq.plot(30, cumulative=False)
This can also be observed using the following command:
In [44]:
freq.most_common(30)
Out[44]:
The 100 most frequently used words are then used for the similarity between movies and will be explained in the data exploitation.
In [45]:
Image("images/TopSuccessWords.png")
Out[45]:
We notice that the success rate associated with the most common words is not very high, indicating that the storyline does not necessarily impact the success of the movie. This will be explored further in the data exploitation section.
In [46]:
last = nltk.FreqDist(dict(freq.most_common()[-30:]))
last
Out[46]:
As can be seen from this list, certain words or elements, such as 'g.', were not removed from our set of extracted words. Furthermore, certain inflected verbs can be observed. To handle these better, we could have used stemming or lemmatization. However, as these verbs did not seem to appear in the top 100 words that we use later on, this was not done.
This section shows the distribution of the Metacritic ratings. It follows a Gaussian distribution:
In [47]:
Image('images/Metacritic_distribution.png')
Out[47]:
The zeros are due to errors in the ratings and are ignored during the analysis
We tried to find a correlation between the financial success of a movie, represented by the ROI, and its qualitative success, represented by the Metacritic ratings.
Here we plot the ROI as a function of the Metacritic rating:
In [48]:
Image('images/roi_vs_metacritic.png')
Out[48]:
As we can see, there appears to be no correlation between the qualitative success and the financial success of a movie. This can be explained by the fact that the revenue is highly influenced by the marketing and the visibility of a movie: many people go to the theater to see movies they have heard about, even if they are not very good.
We collected trailer videos from eight different channels; however, three of them covered less than 15% of the movies in our dataset. We kept the remaining five, which each cover around 35% of the movies in our dataset:
By combining these sources, we managed to collect trailers for almost 75% of the movies in our dataset. However, some movies have one trailer on each channel while others are found only once.
In [49]:
error = df.loc[df['YouTube_Mean']=='Error']
print("Number of trailers missing: "+str(len(error))+" ("+str(len(error)/len(df)*100)[:4]+"%)")
To make the view counts comparable between channels, we saturated them at the value of the third quartile for each channel. Then, we normalized this number by the maximum number of views of each respective channel (i.e. the value of the third quartile, since the counts are saturated).
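A minimal sketch of this per-channel normalization, using a hypothetical dataframe views_df with one row per trailer and the columns 'channel' and 'views':

def normalize_channel(views):
    # saturate at the third quartile of the channel, then divide by that value
    cap = views.quantile(0.75)
    return views.clip(upper=cap) / cap

views_df['norm_views'] = views_df.groupby('channel')['views'].transform(normalize_channel)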
In [50]:
df_yt = df.drop(df[df['YouTube_Mean'] == 'Error'].index)
df_yt['YouTube_Mean'] = df_yt['YouTube_Mean'].astype(float)
plt.hist(df_yt['YouTube_Mean'],bins='auto');
plt.xlabel('Normalized Trailer views');
To first determine whether each feature had any impact on the success of a movie, similarity graphs were created between movies based on their genre, their storyline, their actors, etc. Graph embedding using Laplacian eigenmaps was then used in the hope of observing some separability in the data according to their labels.
In this section, the goal is to observe whether the budget of a movie impacts its success. A weight matrix was built by comparing the difference in budget between pairs of movies.
The normalization of the weights was done as follows:
This gives the following matrix:
In [51]:
DiffNormBudgW = pd.read_csv('Saved_Datasets/DiffNormBudgW.csv')
DiffNormBudgW = DiffNormBudgW.as_matrix()
In [52]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffNormBudgW)
axes[1].hist(DiffNormBudgW.reshape(-1),bins=50);
axes[1].set_xlabel('Weights')
axes[1].set_ylabel('Number of weights')
Out[52]:
In [53]:
G_budg = graphs.Graph(DiffNormBudgW)
G_budg.compute_laplacian('normalized')
G_budg.compute_fourier_basis(recompute=True)
plt.plot(G_budg.e[0:10]);
From the plot of the eigenvalues, it can be seen that the first eigenvector explains 90% of the data.
In [54]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_budg.set_coordinates(G_budg.U[:,1:3])
In [55]:
G_budg.plot_signal(labels, vertex_size=20,limits=[0,3])
In [56]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI'])
G_budg.plot_signal(labels_reg, vertex_size=20)
It can be seen in the plot above that the movies cannot be separated. Therefore, the matrix was sparsified.
In [57]:
DiffSparsBudgW = pd.read_csv('Saved_Datasets/DiffNormSparsBudgW.csv')
DiffSparsBudgW = DiffSparsBudgW.as_matrix()
In [58]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffSparsBudgW)
axes[1].hist(DiffSparsBudgW.reshape(-1),bins=50);
axes[1].set_xlabel('Weights')
axes[1].set_ylabel('Number of Weights')
Out[58]:
In [59]:
G_budg_sp = graphs.Graph(DiffSparsBudgW)
G_budg_sp.compute_laplacian('normalized')
G_budg_sp.compute_fourier_basis(recompute=True)
plt.plot(G_budg_sp.e[0:10]);
In [60]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_budg_sp.set_coordinates(G_budg_sp.U[:,1:3])
In [61]:
G_budg_sp.plot_signal(labels, vertex_size=20)
In [62]:
G_budg_sp.plot_signal(labels_reg, vertex_size=20)
Despite sparsifying the weight matrix, the data still cannot be separated.
A similarity graph between movies was created based on whether a pair of movies share the same genres. Initially, for each pair, we determined how many genres the two movies had in common and divided the resulting value by the number of genres of the movie with the most genres in the pair:
For example, with film i and j, we have:
$W_{ij} = \frac{Number \ of \ similar \ genres \ between \ i \ and \ j}{Highest \ number \ of \ genres \ between \ i \ and \ j} \in [0; 1]$
A value of 1 would indicate complete similarity in terms of genre and a value of 0 would therefore indicate zero similarity in terms of genre.
However, as can be seen in the plot below, graph embedding with Laplacian eigenmaps on the second and third eigenvectors did not give a good representation of the data. Indeed, despite the first two eigenvalues seemingly representing most of the variability of the data, the plot shows that this is mostly due to a dozen data points which are far from most of the others.
In [63]:
GenreW = pd.read_csv('Saved_Datasets/NormalizedGenreW.csv')
In [64]:
plt.spy(GenreW)
Out[64]:
Computation of the normalized Laplacian of this weighted graph.
In [65]:
Ggenre = graphs.Graph(GenreW)
Ggenre.compute_laplacian('normalized')
Display of its eigenvalues
In [66]:
Ggenre.compute_fourier_basis(recompute=True)
plt.plot(Ggenre.e[0:10]);
In [67]:
Ggenre.set_coordinates(Ggenre.U[:, 1:3])
Ggenre.plot()
In [68]:
genres = preprocessing.LabelEncoder().fit_transform(df['success'])
Ggenre.plot_signal(genres, vertex_size=20)
We could of course embed this graph on another pair of eigenvectors; however, we later wish to create a similarity graph by combining all of these subgraphs. Since the variability should always be greatest along the first two (sorted) eigenvectors, it is best to find a normalization that gives a better representation of the data when embedded on them.
We therefore decided to apply another normalization, based on the 75th percentile of the weights. The 75th percentile indicated that most pairs of movies had a similarity between 0 and 1 in terms of genre. However, as can be seen in the histogram of the number of movies per genre, almost half of the movies have the genre "drama". It was thus decided that all pairs of movies with 2 or more genres in common would have a weight of 1, pairs with 1 genre in common a weight of 0.5, and pairs with no genre in common a weight of 0:
$$W_{ij} = \begin{cases}1 & Nb \ similar \ genres \geq 2\\0.5 & Nb \ similar \ genres = 1 \\ 0 & Nb \ similar \ genres = 0 \end{cases}$$The graph was then sparsified such that only the 300 most important neighbours of each node were kept.
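A minimal sketch of this thresholded normalization, assuming a precomputed matrix nb_common where entry (i, j) is the number of genres movies i and j share (the 300-neighbour sparsification itself is not shown here):

import numpy as np

def genre_weights(nb_common):
    # 1 for two or more common genres, 0.5 for exactly one, 0 otherwise
    W = np.where(nb_common >= 2, 1.0, np.where(nb_common == 1, 0.5, 0.0))
    np.fill_diagonal(W, 0)   # no self-loops
    return W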
In [69]:
GenreW = pd.read_csv('Saved_Datasets/NormalizedGenreWSparse.csv')
In [70]:
plt.spy(GenreW)
Out[70]:
In [71]:
Ggenre = graphs.Graph(GenreW)
Ggenre.compute_laplacian('normalized')
Display of the Laplacian's eigenvalues
In [72]:
Ggenre.compute_fourier_basis(recompute=True)
plt.plot(Ggenre.e[0:10]);
Embed the graph on the first two eigenvectors
In [73]:
Ggenre.set_coordinates(Ggenre.U[:, 1:3])
Ggenre.plot()
In [74]:
genres = preprocessing.LabelEncoder().fit_transform(df['success'])
Ggenre.plot_signal(genres, vertex_size=20)
From this plot, we observe that the two classes do not seem to be separable along any of these eigenvectors. Unsurprisingly, this would indicate that the genres of a movie do not seem to greatly impact the success of a movie.
In [75]:
genres_reg = preprocessing.LabelEncoder().fit_transform(df['ROI']/2.64)
Ggenre.plot_signal(genres_reg, vertex_size=20)
In this section, the actors in common between movies are studied. The actor names of each pair of movies were compared, and the weight of the edge between two movies is equal to the number of actors they have in common. Since only the five principal actors of each movie are in the dataset, two movies can have at most five actors in common.
The normalization of the weight matrix was then done as follows:
It is very rare that two movies have more than 3 actors in common, and if it is the case, the two movies are considered very similar. This way of normalizing the weights gives more weight to movies that have only one actor in common, which is already not very common.
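A minimal sketch of the underlying count of shared actors, assuming a hypothetical list actor_lists where actor_lists[i] holds the five principal actors of movie i (the normalization itself is not reproduced here):

import numpy as np

def actors_in_common(actor_lists):
    n = len(actor_lists)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # number of shared principal actors, between 0 and 5
            common = len(set(actor_lists[i]) & set(actor_lists[j]))
            W[i, j] = W[j, i] = common
    return W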
In [76]:
NormActW = pd.read_csv('Saved_Datasets/Normalized2ActorW.csv')
NormActW = NormActW.as_matrix()
In [77]:
#Compute degree distribution
deg_Act = np.zeros(len(NormActW))
for i in range(0, len(NormActW)):
deg_Act[i] = sum(NormActW[i])
fix, axes = plt.subplots(1, 2)
axes[1].hist(deg_Act.reshape(-1), bins=50);
axes[1].set_xlabel('Degrees of nodes')
axes[1].set_ylabel('Number of nodes')
axes[0].spy(NormActW)
Out[77]:
As can be seen in the degree distribution, some nodes have a degree of 0, which means that some movies are not connected to the graph because they are not similar to any other movie. However, it is impossible to compute the Laplacian if some nodes have a degree of 0.
As explained in the data exploration section, the total tenures and the averaged tenures of the actors in the dataset were computed. The idea is now to study whether the fact that actors have had a long career impacts the success of a movie. Therefore, the difference between the total tenures of pairs of movies, the total tenure of a movie being the sum of the tenures of all the actors present in it, was computed. This difference is then used to compute the weight matrix.
Several normalization kernels were tried, such as a Gaussian kernel, normalization by the maximum weight, and normalization by exp(-x), but none of them produced the expected behavior. Normalizing the weights by dividing them by the third quartile of the weights appeared to be the best solution.
The weights were then normalized as follows:
In this case, the third quartile of the weight matrix was 55, which means that 75% of the differences of total tenure between pairs of movies lie between 0 and 55.
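The exact normalization formula is not reproduced in the text; a plausible sketch, assuming the weight decreases linearly from 1 (identical total tenures) to 0 at or beyond the third quartile of the differences, is:

import numpy as np

def tenure_weights(diff):
    # diff[i, j] = |total tenure of movie i - total tenure of movie j|
    q3 = np.percentile(diff, 75)              # third quartile of the differences (55 here)
    W = 1 - np.minimum(diff / q3, 1.0)        # 1 for identical tenures, 0 beyond the quartile
    np.fill_diagonal(W, 0)
    return W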
In [78]:
NormActTenW = pd.read_csv('Saved_Datasets/DiffNorm75ActTenW.csv')
NormActTenW = NormActTenW.as_matrix()
In [79]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(NormActTenW)
axes[1].hist(NormActTenW.reshape(-1), bins=50);
axes[1].set_xlabel('Weights of edges')
axes[1].set_ylabel('Number of weights')
Out[79]:
In [80]:
G_act_ten = graphs.Graph(NormActTenW)
G_act_ten.compute_laplacian('normalized')
G_act_ten.compute_fourier_basis(recompute=True)
plt.plot(G_act_ten.e[0:10]);
In [81]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_act_ten.set_coordinates(G_act_ten.U[:,1:3])
G_act_ten.plot_signal(labels, vertex_size=20)
It can be seen in the plot above that the data are not well separated between successful and failed movies.
In order to try to improve the results, the matrix was sparsified by keeping, for each node, only the 200 neighbors with the highest weights. This keeps the weights of pairs of movies that are very similar at one.
In [82]:
NormSparsActTenW = pd.read_csv('Saved_Datasets/DiffNorm75SparsActTenW.csv')
NormSparsActTenW = NormSparsActTenW.as_matrix()
In [83]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(NormSparsActTenW)
axes[1].hist(NormSparsActTenW.reshape(-1), bins=50);
axes[1].set_xlabel('Weights of edges')
axes[1].set_ylabel('Number of weights')
Out[83]:
In [84]:
G_act_ten_spars = graphs.Graph(NormSparsActTenW)
G_act_ten_spars.compute_laplacian('normalized')
G_act_ten_spars.compute_fourier_basis(recompute=True)
plt.plot(G_act_ten_spars.e[0:10]);
Looking at the eigenvalues, it can be seen that when the weight matrix is sparsified, the first eigenvalue no longer describes the data well.
In [85]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_act_ten_spars.set_coordinates(G_act_ten_spars.U[:,1:3])
G_act_ten_spars.plot_signal(labels, vertex_size=20)
It can be seen that sparsifying the weight matrix does not allow a better separation of the data.
In this section, the goal was to study whether the fact that the actors in a movie have played in movies that generated a lot of profit (revenue minus budget) impacts the success of a movie. Figures on how much profit actors have generated in their careers are available on the website Box Office Mojo: http://www.boxofficemojo.com/people/. However, no API was available to collect the data; only web scraping was possible. Since collecting these data would have taken too much time, it was decided to compute the total actor profitability of each movie in the dataset by summing, for each of its actors, the profitability of the movies of the dataset in which that actor has played. The profitability of each actor can be seen in the Actors dataset.
To compute the weight matrix, the difference in total actor profitability between pairs of movies was computed. The weight matrix was then normalized as follows:
where the value of the third quartile is 5037883028.0.
File too large to upload on Moodle; see the sparse matrix instead.
In [86]:
#DiffNormActProfW = pd.read_csv('Saved_Datasets/DiffNormActProfW.csv')
#DiffNormActProfW = DiffNormActProfW.as_matrix()
In [87]:
#fix, axes = plt.subplots(1, 2)
#axes[0].spy(DiffNormActProfW)
#axes[1].hist(DiffNormActProfW.reshape(-1), bins=50);
#axes[1].set_xlabel('Weights of edges')
#axes[1].set_ylabel('Number of weights')
Out[87]:
In [88]:
#G_act_prof = graphs.Graph(DiffNormActProfW)
#G_act_prof.compute_laplacian('normalized')
#G_act_prof.compute_fourier_basis(recompute=True)
#plt.plot(G_act_prof.e[0:10]);
In [89]:
#labels = preprocessing.LabelEncoder().fit_transform(df['success'])
#G_act_prof.set_coordinates(G_act_prof.U[:,1:3])
#G_act_prof.plot_signal(labels, vertex_size=20)
In [90]:
#labels = preprocessing.LabelEncoder().fit_transform(df['ROI'])
#G_act_prof.set_coordinates(G_act_prof.U[:,1:3])
#G_act_prof.plot_signal(labels, vertex_size=20)
In [91]:
NormActProfSparsW = pd.read_csv('Saved_Datasets/DiffNormActProfSparsW.csv')
NormActProfSparsW = NormActProfSparsW.as_matrix()
In [92]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(NormActProfSparsW)
axes[1].hist(NormActProfSparsW.reshape(-1), bins=50);
axes[1].set_xlabel('Weights of edges')
axes[1].set_ylabel('Number of weights')
Out[92]:
In [93]:
G_act_prof_s = graphs.Graph(NormActProfSparsW)
G_act_prof_s.compute_laplacian('normalized')
G_act_prof_s.compute_fourier_basis(recompute=True)
plt.plot(G_act_prof_s.e[0:10]);
In [94]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_act_prof_s.set_coordinates(G_act_prof_s.U[:,1:3])
G_act_prof_s.plot_signal(labels, vertex_size=20)
In [95]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI'])
G_act_prof_s.plot_signal(labels_reg, vertex_size=20)
In this section, the directors in common between movies are studied. In fact, a lot of people go to see a movie because its director is famous. The weight matrix was built by putting 1 on the edge between two movies if they have the same director and 0 otherwise.
In [96]:
NormDicW = pd.read_csv('Saved_Datasets/NormalizedDirectorW.csv')
NormDicW = NormDicW .as_matrix()
In [97]:
#Compute degree distribution
degrees = np.zeros(len(NormDicW))
for i in range(0, len(NormDicW)):
degrees[i] = sum(NormDicW[i])
fix, axes = plt.subplots(1, 2)
axes[0].spy(NormDicW)
axes[1].hist(degrees, bins=50);
axes[1].set_xlabel('Degrees of nodes')
axes[1].set_ylabel('Number of nodes')
Out[97]:
As mentioned in the Actors section, with this type of similarity graph the Laplacian matrix cannot be computed because some degrees are equal to zero.
Another type of similarity graph was built by comparing, between pairs of movies, the number of movies that their directors have directed in our dataset. Even though this number of movies per director was computed using only the movies of our dataset, it still gives a good indication of which directors have experience. For a true representation of a director's experience, data on all movies directed by all directors would have to be collected.
A weight was given to each movie, corresponding to the number of movies in the dataset directed by its director. The weight matrix was then built by computing the difference between the weights of each pair of movies, as sketched below.
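A minimal sketch of the per-movie weight and of the pairwise difference used before normalization (the normalization formula itself is not reproduced in the text):

import numpy as np

# number of movies each director has in the dataset
dir_counts = df['director_name'].value_counts()

# weight of each movie = number of dataset movies directed by its director
movie_weight = df['director_name'].map(dir_counts).values.astype(float)

# pairwise absolute difference of these weights
diff_dir = np.abs(movie_weight[:, None] - movie_weight[None, :])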
The weights matrix was then normalized as follows:
This gives the following normalized weight matrix:
In [98]:
DiffNormDirW = pd.read_csv('Saved_Datasets/DiffNormDirW.csv')
DiffNormDirW= DiffNormDirW.as_matrix()
In [99]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffNormDirW)
axes[1].hist(DiffNormDirW.reshape(-1),bins=50);
axes[1].set_xlabel('Weights of edges')
axes[1].set_ylabel('Number of weights')
Out[99]:
In [100]:
G_diff_dir = graphs.Graph(DiffNormDirW)
G_diff_dir .compute_laplacian('normalized')
G_diff_dir.compute_fourier_basis(recompute=True)
plt.plot(G_diff_dir.e[0:10]);
In [101]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_diff_dir.set_coordinates(G_diff_dir.U[:,1:3])
G_diff_dir.plot_signal(labels, vertex_size=20)
In [102]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI'])
G_diff_dir.plot_signal(labels_reg, vertex_size=20)
The matrix is already quite sparse, so we did not try to sparsify it further.
As for the actors and directors, the companies in common between movies were studied. The weight matrix was built by putting 1 on the edge between two movies if they were produced by the same company and 0 otherwise.
In [103]:
NormCompW = pd.read_csv('Saved_Datasets/NormalizedCompaniesW.csv')
NormCompW = NormCompW.as_matrix()
In [104]:
#Compute degree distribution
degrees_comp = np.zeros(len(NormCompW))
for i in range(0, len(NormCompW)):
degrees_comp[i] = sum(NormCompW[i])
fix, axes = plt.subplots(1, 2)
axes[0].spy(NormCompW)
axes[1].hist(degrees_comp, bins=50);
axes[1].set_xlabel('Degrees of nodes')
axes[1].set_ylabel('Number of nodes')
Out[104]:
As for the actors and directors, it was observed that the production company of some movies appears in only one movie, which leads to unconnected nodes, so the Laplacian of this weight matrix cannot be computed.
As for the directors, the number of movies that each company has produced in our dataset was compared between pairs of movies. As stated in the directors section, even if the number of movies produced per company is computed using only the data from our dataset, it gives an idea of the distribution between movies.
First, a weight was given to each movie, corresponding to the number of movies in the dataset produced by its company. The weight matrix was then built by computing the difference between the weights of each pair of movies.
The weights matrix was then normalized as follows:
This gives the following normalized weight matrix:
In [105]:
DiffNormCompW = pd.read_csv('Saved_Datasets/DiffNormCompW.csv')
DiffNormCompW = DiffNormCompW.as_matrix()
In [106]:
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffNormCompW)
axes[1].hist(DiffNormCompW.reshape(-1),bins=50);
axes[1].set_xlabel('Weights of edges')
axes[1].set_ylabel('Number of weights')
Out[106]:
In [107]:
G_diff_comp = graphs.Graph(DiffNormCompW)
G_diff_comp .compute_laplacian('normalized')
G_diff_comp.compute_fourier_basis(recompute=True)
plt.plot(G_diff_comp.e[0:10]);
In [108]:
labels = preprocessing.LabelEncoder().fit_transform(df['success'])
G_diff_comp.set_coordinates(G_diff_comp.U[:,1:3])
G_diff_comp.plot_signal(labels, vertex_size=20)
In [109]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI']/2.64)
G_diff_comp.plot_signal(labels_reg, vertex_size=20)
A similarity graph between movies was created based on the storyline of each movie. This was done by first determining, for each storyline, which of the 100 previously determined most common words it contains, to avoid comparing strange characters or familiar words that were not removed by the procedure described previously. The similarity between each pair of films was then computed based on whether they contained the same most common words.
$$W_{ij} = Number \ of \ similar \ common \ words \ between \ i \ and \ j$$$\textbf{For example, with films i and j:}$
List of words amongst 100 most common for film i = ['life', 'love', 'death']
List of words amongst 100 most common for film j = ['love', 'death', 'assassin', 'kill']
Number of similar common words between i and j = 2
In terms of normalization, we initially tried to do this similarly to the first method shown in section 3.1 for the genres:
$$W_{ij} = \frac{Number \ of \ similar \ common \ words \ between \ i \ and \ j}{Highest \ number \ of \ common \ words \ between \ i \ and \ j} \in [0; 1]$$such that in the example shown above: $W_{ij} = \frac{2}{4} = 0.5$.
However, this penalizes a pair of films for having many words amongst the list of 100 even if they have many of them in common.
As such, we decided to observe the distribution of the number of similar common words between pairs of films. From the distribution, we observed that the greatest number of similar words is 8, while the average is 0.24 and the 75th percentile is zero. This means that having even 1 word in common is relatively rare.
We therefore decided to apply a binary weight as defined below:
$$ W_{ij} = \begin{cases}0, & Nb \ of \ similar \ common \ words \ between \ i \ and \ j =0 \\ 1, & Nb \ of \ similar \ common \ words \ between \ i \ and \ j \geq 1 \end{cases}$$
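A minimal sketch of this binary weighting, assuming a hypothetical list movie_words where movie_words[i] is the set of words of movie i's storyline that belong to the 100 most common words:

import numpy as np

def text_weights(movie_words):
    n = len(movie_words)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # 1 as soon as the two storylines share at least one common word
            if movie_words[i] & movie_words[j]:
                W[i, j] = W[j, i] = 1.0
    return W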
In [110]:
NormTextW = pd.read_csv('Saved_Datasets/TextWSparsePerc.csv')
In [111]:
plt.spy(NormTextW)
Out[111]:
In [112]:
GText = graphs.Graph(NormTextW)
GText.compute_laplacian('normalized')
Display of the Laplacian's eigenvalues
In [113]:
GText.compute_fourier_basis(recompute=True)
plt.plot(GText.e[0:10]);
From this plot, we can see that in addition to the first eigenvalue, the second eigenvalue seems to be zero which would indicate the presence of two giant components! Let us now observe the actual value of the second eigenvalue:
In [114]:
print('The value of the 2nd eigenvalue is: {}'.format(GText.e[1]))
We then embed the data on the second and third eigenvectors.
In [115]:
GText.set_coordinates(GText.U[:, 1:3])
GText.plot()
In [116]:
GText.plot_signal(genres, vertex_size=20)
From this plot, we can observe that the data do not seem to be separable into the two desired classes. Indeed, movies with similar features have both successful and unsuccessful labels.
In [117]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI']/2.64)
GText.plot_signal(labels_reg, vertex_size=20)
Similarly to the classification case of the previous subsection, we notice that movies that are similar in terms of storyline do not necessarily generate the same ROI values.
In [118]:
DiffMetaW = pd.read_csv('Saved_Datasets/NormalizedMetacriticW.csv')
DiffMetaW = DiffMetaW.as_matrix()
We then created a similarity weight matrix between movies, in which the movies with a Metacritic rating equal to zero are ignored. We simply took the difference of ratings between two movies.
$$ W(i,j) = \begin{cases} 0 & \text{if } Metacritic(i) = 0 \text{ or } Metacritic(j) = 0\\ 1-\frac{\lvert Metacritic(i) - Metacritic(j)\rvert}{100} & \text{otherwise} \end{cases}$$With this formula, the weights lie between 0 (opposite ratings) and 1 (identical ratings).
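A minimal sketch of this weight matrix, assuming ratings is the vector of Metacritic scores with 0 marking a missing rating:

import numpy as np

def metacritic_weights(ratings):
    ratings = np.asarray(ratings, dtype=float)
    W = 1 - np.abs(ratings[:, None] - ratings[None, :]) / 100
    # ignore the movies without a rating
    missing = (ratings[:, None] == 0) | (ratings[None, :] == 0)
    W[missing] = 0
    np.fill_diagonal(W, 0)
    return W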
As the ratings follow a Gaussian distribution, a lot of movies have similar ratings:
In [119]:
#Compute degree distribution
degrees = np.zeros(len(DiffMetaW))
for i in range(0, len(DiffMetaW)):
degrees[i] = sum(DiffMetaW[i])
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffMetaW)
axes[1].hist(degrees.reshape(-1),bins=50);
axes[1].set_xlabel('Weights of edges');
axes[1].set_ylabel('Number of weights');
As we can see in the figure above, the weight matrix contains some zeros. Hence, it is impossible to compute the Laplacian directly; we must wait and include these weights in the total weight matrix.
As we have a different dataset for the trailer views (because of the missing data), we tried to see whether we could explain the success of a movie with the trailer views alone. The similarity matrix is simply built from the difference between two normalized view counts.
In [120]:
DiffTrailerW = pd.read_csv('Saved_Datasets/NormalizedTrailerW.csv')
DiffTrailerW = DiffTrailerW.as_matrix()
In [121]:
#Plot the weight matrix and the weight distribution
fix, axes = plt.subplots(1, 2)
axes[0].spy(DiffTrailerW)
axes[1].hist(DiffTrailerW.reshape(-1),bins=50);
axes[1].set_xlabel('Weights of edges');
axes[1].set_ylabel('Number of weights');
In [122]:
G = graphs.Graph(DiffTrailerW)
G.compute_laplacian('normalized')
G.compute_fourier_basis(recompute=True)
plt.plot(G.e[0:10]);
G.set_coordinates(G.U[:, 1:3])
G.plot()
In [123]:
df_noerror = df.drop(df[df['YouTube_Mean'] == 'Error'].index)
labels = preprocessing.LabelEncoder().fit_transform(df_noerror['success'])
G.set_coordinates(G.U[:,1:3])
In [124]:
G.plot_signal(labels, vertex_size=20)
In this section, we will study the combination of all the weight matrices from above. As their values are already normalized between 0 and 1, we can simply add all of them and divide the result by the number of weight matrices:
$$ W_{Tot} = \frac{1}{N}\sum_{i=1}^{N} W_i $$where $N$ is the number of retained weight matrices and $W_{i}$ is one of the weight matrices computed above.
$\underline{\textbf{Note:}}$ Only some of the weight matrices were kept (for example the sparsified ones and those with the appropriate normalization).
In [125]:
WTot = NormActProfSparsW + DiffSparsBudgW + GenreW.as_matrix(columns=None) + NormSparsActTenW + DiffNormDirW + DiffNormCompW + NormTextW.as_matrix(columns=None)
WTot = WTot/7
We can then plot the weight matrix to observe its connectivity as well as the weight and degree distributions.
In [126]:
plt.spy(WTot)
Out[126]:
In [127]:
plt.hist(WTot.reshape(-1), bins=50);
plt.title('Weight distribution of the fully connected matrix');
print('The mean value is: {}'.format(WTot.mean()))
print('The max value is: {}'.format(WTot.max()))
print('The min value is: {}'.format(WTot.min()))
In [128]:
degrees = np.zeros(len(WTot))
#reminder: the degrees of a node for a weighted graph are the sum of its weights
for i in range(0, len(WTot)):
degrees[i] = sum(WTot[i])
plt.hist(degrees, bins=50);
plt.title('Degree distribution of the fully connected matrix');
print('The mean value is: {}'.format(degrees.mean()))
print('The max value is: {}'.format(degrees.max()))
print('The min value is: {}'.format(degrees.min()))
Let us first compute the normalized Laplacian of the graph and its eigenvalues:
In [129]:
G = graphs.Graph(WTot)
G.compute_laplacian('normalized')
In [130]:
G.compute_fourier_basis(recompute=True)
plt.plot(G.e[0:10]);
From this plot, we observe that the second and third eigenvalues seem to explain most of the variability of the data.
Let us now embed the data along their respective eigenvectors.
In [131]:
G.set_coordinates(G.U[:,1:3])
G.plot_signal(genres, vertex_size=20)
From this plot, we observe that despite a tendency to have more "successful" movies at negative values along the 2nd eigenvector (x-axis), the data do not appear to be separable, as many "successful" and "unsuccessful" movies are regrouped together on the right side of the plot.
In [132]:
G.plot_signal(labels_reg, vertex_size=20)
We observe that there is no gradient of the signal along any axis such that we are unable to predict the ROI of a movie based solely on the features used to construct our graph.
In [133]:
NEIGHBORS = 300
#sort the order of the weights
sort_order = np.argsort(WTot, axis = 1)
#declaration of a sorted weight matrix
sorted_weights = np.zeros((len(WTot), len(WTot)))
for i in range (0, len(WTot)):
for j in range(0, len(WTot)):
if (j >= len(WTot) - NEIGHBORS):
#copy the k strongest edges for each node
sorted_weights[i, sort_order[i,j]] = WTot[i,sort_order[i,j]]
else:
#set the other edges to zero
sorted_weights[i, sort_order[i,j]] = 0
#ensure the matrix is symmetric
bigger = sorted_weights.transpose() > sorted_weights
sorted_weights = sorted_weights - sorted_weights*bigger + sorted_weights.transpose()*bigger
In [134]:
WTot = sorted_weights
plt.spy(WTot)
Out[134]:
In [135]:
plt.hist(WTot.reshape(-1), bins=50);
plt.title('Weight distribution of the sparsed matrix');
print('The mean value of the weight matrix is: {}'.format(WTot.mean()))
print('The max value of the weight matrix is: {}'.format(WTot.max()))
print('The min value of the weight matrix is: {}'.format(WTot.min()))
In [136]:
degrees = np.zeros(len(WTot))
#reminder: the degrees of a node for a weighted graph are the sum of its weights
for i in range(0, len(WTot)):
degrees[i] = sum(WTot[i])
plt.hist(degrees, bins=50);
plt.title('Degree distribution of the sparsed matrix');
print('The mean value is: {}'.format(degrees.mean()))
print('The max value is: {}'.format(degrees.max()))
print('The min value is: {}'.format(degrees.min()))
In [137]:
G = graphs.Graph(WTot)
G.compute_laplacian('normalized')
In [138]:
G.compute_fourier_basis(recompute=True)
plt.plot(G.e[0:10]);
In [139]:
G.set_coordinates(G.U[:, 1:3])
G.plot_signal(genres, vertex_size=20)
In [140]:
labels_reg = preprocessing.LabelEncoder().fit_transform(df['ROI']/2.64)
G.plot_signal(labels_reg, vertex_size=20)
Similarly to the case of the unsparsified graph, the plot shown above indicates that we are unable to predict the ROI of a movie based solely on these features.
The results obtained above suggest that none of the observed features seem to directly impact the success of a movie. Indeed, the data did not appear to be separable in our embeddings with Laplacian eigenmaps on any of the obtained graphs. However, this is not necessarily representative of reality, as only a certain number of movies were collected and used, due to missing or erroneous values in our dataset. Ideally, it would be necessary to obtain data about all English movies for both our data exploration and exploitation to be representative of reality.
Furthermore, the dataset sometimes contained different actors and movies with the same names, which impacted our results due to the way the data is processed, and it was also noticed that some of the data featured on TMDb did not agree with the values on IMDb.
Multiple aspects of our project could therefore be improved to potentially better understand what impacts the success of a movie.
To further improve this project, multiple factors in our analysis could be revised, such as more work on the text analysis and the collection of additional data outside of our dataset. Indeed, to fully analyse the text and observe whether two storylines are similar, it would be necessary to consider both the synonyms and antonyms of each word of the storyline when comparing it with that of another film such that we could get a better notion of similarity or dissimilarity. Additionally, lemmatization and stemming could be done since it could impact the computation of the most frequent words.
It should also be noted that most of our data come from the dataset collected from Kaggle. As such, it might have been wiser to correct our data (if necessary) with the help of other sources and/or to collect more data from other websites, so that more features or movies could be obtained.
Furthermore, additional factors could potentially impact the success of a movie and could somehow be taken into account, such as the trends of the year during which the film was made, the effectiveness of its publicity, as well as the economy and events that could have happened throughout that year.
Aside from the possible improvements and problems discussed above, other problems were encountered that prevented us from exploiting the data further. Indeed, as mentioned above, an attempt was made to observe whether the success of a film was due to a "network effect", i.e. the publicity a film generates before its release. This was done by collecting the number of views of a movie's trailer on YouTube. However, for this to be unbiased, the data should be collected from one channel, or from all YouTube channels that publish trailers. The problem is that no single YouTube channel seemed to contain all the trailers of the movies in our dataset.
A weight could potentially have been given to each channel based on its number of subscribers, so that the trailer views of all the movies in our dataset could have been taken into account in our analysis. However, the number of subscribers is not necessarily representative of the number of people who follow a channel without being subscribed to it.
Due to lack of time, this subject was not elaborated further.
In this project, the influence of each feature was studied to try to determine its impact on and contribution to the success of a movie. This was done by first defining the notion of success for movies and then observing the similarities between movies for each of the obtained features.
The obtained results would seem to indicate the inability to predict the success of a movie based purely on the features of a movie such as its cast, storyline, budget, production company, etc. However, as discussed previously, this could be due to the size of our dataset as not all movies are considered during our analysis such that our dataset might not be sufficiently representative of reality. Furthermore, certain aspects could potentially highly impact the success of a movie such as the trends, economy and effectiveness of the publicity through media. However, due to lack of time, these aspects could not further be elaborated.
1) http://ew.com/movies/2017/09/04/summer-box-office-fail/
2) https://www.imdb.com/
3) https://www.kaggle.com/rounakbanik/the-movies-dataset/
4) https://www.themoviedb.org/?language=en
5) Bird, Steven, Edward Loper and Ewan Klein (2009), "Natural Language Processing with Python". O’Reilly Media Inc. (http://www.nltk.org/#natural-language-toolkit)