Extract Box Office


In [1]:
import csv
import pandas as pd
from webcrawl import WebCrawl

Reading filtered dataset


In [2]:
movie = pd.read_csv('datasetWithoutBoxOffice.csv') 

movie.head(3)


Out[2]:
| | TMDB ID | IMDB ID | TITLE | YEAR | GENRE | RATING | RELEASED | ACTORS | AWARDS | COUNTRY | LANGUAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 862 | tt0114709 | Toy Story | 1995 | Animation, Adventure, Comedy | 8.3 | 22 Nov 1995 | Tom Hanks, Tim Allen, Don Rickles, Jim Varney | Nominated for 3 Oscars. Another 23 wins & 18 n... | USA | English |
| 1 | 8844 | tt0113497 | Jumanji | 1995 | Action, Adventure, Family | 6.9 | 15 Dec 1995 | Robin Williams, Jonathan Hyde, Kirsten Dunst, ... | 4 wins & 9 nominations. | USA | English, French |
| 2 | 15602 | tt0113228 | Grumpier Old Men | 1995 | Comedy, Romance | 6.6 | 22 Dec 1995 | Walter Matthau, Jack Lemmon, Sophia Loren, Ann... | 2 wins & 2 nominations. | USA | English |

listWithBoxOffice holds two parallel lists:

  • First index (0) - IMDB IDs
  • Second index (1) - Box office values

In [3]:
listWithBoxOffice = [[],[]]

Function to extract Box Office by Web Crawling


In [4]:
def extract_boxOffice(tmdbid, imdbid):

    # Extract box office by crawling the IMDB page using the IMDB ID
    boxOffice = WebCrawl().extractBoxOfficeByIMDB(imdbid)

    if boxOffice == 'N/A':
        # If IMDB reports 'N/A', crawl the TMDB page using the TMDB ID
        boxOffice = WebCrawl().extractBoxOfficeByTMDB(str(tmdbid))

    # The IMDB ID is recorded in both cases so the two lists stay aligned
    listWithBoxOffice[0].append(imdbid)

    if boxOffice != 'N/A':
        listWithBoxOffice[1].append(boxOffice)
    else:
        # Appending NaN to box office so the column will have a uniform data type (float)
        listWithBoxOffice[1].append(float('nan'))
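The fallback order used above (IMDB first, then TMDB, then NaN) can be checked without network access by swapping in a stand-in crawler. The sketch below is only an illustration: `FakeCrawl`, `lookup_box_office`, and the canned IDs and values are hypothetical and made up for the demonstration. `FakeCrawl` merely mimics the two method names this notebook calls on `WebCrawl` and says nothing about how the real class works.

```python
import math


class FakeCrawl:
    """Hypothetical stand-in for WebCrawl; returns canned values instead of crawling."""
    IMDB = {'tt0000001': '$100'}   # made-up placeholder data
    TMDB = {'2': '$200'}           # made-up placeholder data

    def extractBoxOfficeByIMDB(self, imdbid):
        return self.IMDB.get(imdbid, 'N/A')

    def extractBoxOfficeByTMDB(self, tmdbid):
        return self.TMDB.get(tmdbid, 'N/A')


def lookup_box_office(tmdbid, imdbid, crawler):
    """Try IMDB first, then TMDB; return NaN when both report 'N/A'."""
    value = crawler.extractBoxOfficeByIMDB(imdbid)
    if value == 'N/A':
        value = crawler.extractBoxOfficeByTMDB(str(tmdbid))
    return float('nan') if value == 'N/A' else value


crawler = FakeCrawl()
found_on_imdb = lookup_box_office(1, 'tt0000001', crawler)   # IMDB has a value
found_on_tmdb = lookup_box_office(2, 'tt0000002', crawler)   # falls back to TMDB
not_found = lookup_box_office(3, 'tt0000003', crawler)       # neither source has it
```

Returning the value instead of appending to a global list is a design choice made here only to keep the sketch easy to test; the notebook's own function appends to `listWithBoxOffice`.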

Optional (but suggested):

**Duplicate this notebook, crawl a different batch of 1000 entries in each copy, and then merge the resulting values.**


In [6]:
for movieID in movie.values[:]:
    # parameters TMDB ID, IMDB ID
    extract_boxOffice(movieID[0], movieID[1])
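If the crawl is split across notebook copies as suggested, each copy would loop over only its own slice of the dataset, and the per-batch results would be stacked afterwards. The following is a minimal sketch of that idea, assuming batches are taken by row position; `batch_rows`, `merge_batches`, and the synthetic `demo` frame are hypothetical names introduced for illustration.

```python
import pandas as pd

BATCH_SIZE = 1000  # entries per notebook copy, as suggested above


def batch_rows(frame, batch_index, batch_size=BATCH_SIZE):
    """Return the rows belonging to one batch (batch_index is 0-based)."""
    start = batch_index * batch_size
    return frame.iloc[start:start + batch_size]


def merge_batches(frames):
    """Stack the per-batch results back into a single table."""
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration with a synthetic frame and a batch size of 2
demo = pd.DataFrame({'TMDB ID': [1, 2, 3, 4, 5],
                     'IMDB ID': ['a', 'b', 'c', 'd', 'e']})
parts = [batch_rows(demo, i, batch_size=2) for i in range(3)]
combined = merge_batches(parts)
```

In a notebook copy, the loop would then run over something like `batch_rows(movie, 0).values` for the first copy, `batch_rows(movie, 1).values` for the second, and so on.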

Creating a csv file for all entries


In [ ]:
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('boxoffice.csv', 'w', newline='') as csvfile:
    fieldnames = ['IMDB ID','BOX OFFICE']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for i in range(len(listWithBoxOffice[0])):
        writer.writerow({'IMDB ID': listWithBoxOffice[0][i], 'BOX OFFICE': listWithBoxOffice[1][i]})
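The NaN placeholders survive this round trip: the csv module writes `float('nan')` as the text `nan`, and `pd.read_csv` treats that text as a missing value by default. A small sketch that checks this in memory, using an `io.StringIO` buffer instead of a file and made-up placeholder rows:

```python
import csv
import io

import pandas as pd

# Write two illustrative rows to an in-memory buffer instead of boxoffice.csv
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['IMDB ID', 'BOX OFFICE'])
writer.writeheader()
writer.writerow({'IMDB ID': 'tt0000001', 'BOX OFFICE': 12.5})
writer.writerow({'IMDB ID': 'tt0000002', 'BOX OFFICE': float('nan')})

# Read the same text back with pandas and check which entries are missing
buffer.seek(0)
roundtrip = pd.read_csv(buffer)
missing = roundtrip['BOX OFFICE'].isna().tolist()
```

The column is parsed as float here only because the non-missing value is numeric; if the crawler returns formatted strings, the column would be read as text with NaN gaps.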

Reading the CSV files into DataFrames


In [ ]:
boxoffice = pd.read_csv('boxoffice.csv')
datasetWithouBoxOffice = pd.read_csv('datasetWithoutBoxOffice.csv')

Merge using IMDB ID as key


In [ ]:
result = datasetWithouBoxOffice.merge(boxoffice, on='IMDB ID', how='inner')
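An inner merge silently drops any movie whose IMDB ID is missing from the crawled file, and repeated keys would multiply rows. The sketch below shows one way to audit a merge like this with pandas' `validate` and `indicator` options; the `left` and `right` frames and their values are made-up stand-ins for the two CSV files.

```python
import pandas as pd

# Synthetic stand-ins for the dataset and the crawled box office table
left = pd.DataFrame({'IMDB ID': ['tt0000001', 'tt0000002', 'tt0000003'],
                     'TITLE': ['A', 'B', 'C']})
right = pd.DataFrame({'IMDB ID': ['tt0000001', 'tt0000002'],
                      'BOX OFFICE': [10.0, float('nan')]})

# validate='one_to_one' raises MergeError if either side repeats a key
inner = left.merge(right, on='IMDB ID', how='inner', validate='one_to_one')

# An outer merge with indicator=True adds a _merge column naming each row's origin
audit = left.merge(right, on='IMDB ID', how='outer', indicator=True)
unmatched = audit.loc[audit['_merge'] != 'both', 'IMDB ID'].tolist()
```

Here `unmatched` lists the IDs that the inner merge would leave out, which is useful for spotting batches that were never crawled.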

Writing the merged DataFrame to CSV


In [ ]:
result.to_csv('datasetWithBoxoffice.csv', index=False)