Filtering Dataset


In [1]:
import pandas as pd

Reading the dataset without Box Office data


In [3]:
movie = pd.read_csv('dataset02.csv')

movie.head(3)


Out[3]:
TMDB ID IMDB ID TITLE YEAR GENRE RATING RELEASED ACTORS AWARDS COUNTRY LANGUAGE
0 862 tt0114709 Toy Story 1995 Animation, Adventure, Comedy 8.3 22 Nov 1995 Tom Hanks, Tim Allen, Don Rickles, Jim Varney Nominated for 3 Oscars. Another 23 wins & 18 n... USA English
1 8844 tt0113497 Jumanji 1995 Action, Adventure, Family 6.9 15 Dec 1995 Robin Williams, Jonathan Hyde, Kirsten Dunst, ... 4 wins & 9 nominations. USA English, French
2 15602 tt0113228 Grumpier Old Men 1995 Comedy, Romance 6.6 22 Dec 1995 Walter Matthau, Jack Lemmon, Sophia Loren, Ann... 2 wins & 2 nominations. USA English

Filtering rows where YEAR is between 1990 and 2014 (inclusive), COUNTRY contains 'USA', and LANGUAGE contains 'English'


In [4]:
# na=False treats missing COUNTRY/LANGUAGE values as non-matches
filteringDataset = movie[(movie.YEAR >= 1990) & (movie.YEAR <= 2014) &
                         movie.COUNTRY.str.contains('USA', na=False) &
                         movie.LANGUAGE.str.contains('English', na=False)]
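
The behaviour of this filter can be sketched on a small frame. The toy rows below are invented for illustration and do not come from dataset02.csv. Note that `str.contains` is a substring match, so a co-production listed as 'USA, UK' passes the COUNTRY test, and `na=False` makes missing values count as non-matches.

```python
import pandas as pd

# Hypothetical toy rows (not from dataset02.csv) to illustrate the filter
toy = pd.DataFrame({
    'TITLE': ['A', 'B', 'C', 'D', 'E'],
    'YEAR': [1989, 1995, 2003, 2014, 2010],
    'COUNTRY': ['USA', 'USA, UK', 'France', 'USA', None],
    'LANGUAGE': ['English', 'English, French', 'French', 'English', 'English'],
})

in_range = toy['YEAR'].between(1990, 2014)               # inclusive on both ends
is_usa = toy['COUNTRY'].str.contains('USA', na=False)    # substring match
is_english = toy['LANGUAGE'].str.contains('English', na=False)

toy_filtered = toy[in_range & is_usa & is_english]
kept_titles = toy_filtered['TITLE'].tolist()
print(kept_titles)
```

If only single-country, single-language rows are wanted, an exact comparison such as `toy['COUNTRY'] == 'USA'` would be the stricter alternative; the notebook itself uses the substring form.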

Creating a DataFrame and checking for duplicate IMDB IDs


In [14]:
datasetDataframe = pd.DataFrame(filteringDataset)

datasetDataframe['IMDB ID'].value_counts()[0:3]


Out[14]:
tt2279864    2
tt0427969    1
tt0463034    1
Name: IMDB ID, dtype: int64
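
As an alternative to scanning `value_counts()`, duplicated IDs can be listed directly with `Series.duplicated(keep=False)`, which flags every row whose ID occurs more than once. This is a minimal sketch on a toy frame; the 'tt2279864' rows mirror the duplicate found in this notebook, and the frame itself is hypothetical.

```python
import pandas as pd

# Toy frame: two rows share an IMDB ID, one row is unique
toy = pd.DataFrame({
    'TMDB ID': [0, 133790, 862],
    'IMDB ID': ['tt2279864', 'tt2279864', 'tt0114709'],
    'TITLE': ['Clear History', 'Clear History', 'Toy Story'],
})

# keep=False marks all members of a duplicate group, not just the later ones
dup_mask = toy['IMDB ID'].duplicated(keep=False)
dup_rows = toy[dup_mask]
dup_ids = dup_rows['IMDB ID'].unique().tolist()
print(dup_ids)
```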

One IMDB ID, 'tt2279864', has multiple entries


In [15]:
datasetDataframe[datasetDataframe['IMDB ID'] == 'tt2279864']


Out[15]:
TMDB ID IMDB ID TITLE YEAR GENRE RATING RELEASED ACTORS AWARDS COUNTRY LANGUAGE
21296 0 tt2279864 Clear History 2013 Comedy 6.5 10 Aug 2013 Larry David, Bill Hader, Philip Baker Hall, Jo... NaN USA English
25630 133790 tt2279864 Clear History 2013 Comedy 6.5 10 Aug 2013 Larry David, Bill Hader, Philip Baker Hall, Jo... NaN USA English

Getting the correct row for IMDB ID 'tt2279864' (index 25630, with TMDB ID 133790; the other row has TMDB ID 0)


In [18]:
idtt2279864 = datasetDataframe.loc[25630]
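
Because `.loc` is label-based, 25630 here is the index label carried over from the original file, not a row position. The sketch below, on a hypothetical two-row frame, shows that a single label returns a Series while a list of labels returns a one-row DataFrame that keeps the column dtypes.

```python
import pandas as pd

# Toy frame reusing the two index labels shown in the output above
toy = pd.DataFrame(
    {'TMDB ID': [0, 133790], 'IMDB ID': ['tt2279864', 'tt2279864']},
    index=[21296, 25630],
)

as_series = toy.loc[25630]     # single label -> Series (one value per column)
as_frame = toy.loc[[25630]]    # list of labels -> one-row DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```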

DataFrame with all rows for IMDB ID 'tt2279864' removed


In [19]:
without_tt2279864 = datasetDataframe[datasetDataframe['IMDB ID'] != 'tt2279864']

Appending the single correct 'tt2279864' row to the DataFrame that excludes it


In [20]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
finalDataset = pd.concat([without_tt2279864, idtt2279864.to_frame().T], ignore_index=True)
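
The three manual steps above (pick the good row, drop both, add one back) could be condensed into a single mask. This is only a sketch under one loudly stated assumption: a TMDB ID of 0 is a placeholder, so among rows sharing an IMDB ID the placeholder is the one to discard. The notebook does not state this rule; it is inferred from the two rows shown earlier. The toy frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'TMDB ID': [862, 0, 133790],
    'IMDB ID': ['tt0114709', 'tt2279864', 'tt2279864'],
    'TITLE': ['Toy Story', 'Clear History', 'Clear History'],
})

# Assumption: TMDB ID == 0 is a placeholder value, not a real identifier
is_dup = toy['IMDB ID'].duplicated(keep=False)
is_placeholder = toy['TMDB ID'] == 0

# Drop only rows that are both duplicated and placeholders
deduped = toy[~(is_dup & is_placeholder)].reset_index(drop=True)
print(deduped)
```

Unlike the append-based approach, this keeps the surviving row in its original position rather than moving it to the end. Duplicate groups in which every row has a nonzero TMDB ID would be left untouched and would still need manual review.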

No duplicate IMDB IDs remain


In [21]:
finalDataset['IMDB ID'].value_counts()[0:3]


Out[21]:
tt0427969    1
tt0463034    1
tt0119098    1
Name: IMDB ID, dtype: int64

Writing the DataFrame to CSV


In [22]:
finalDataset.to_csv('datasetWithoutBoxOffice.csv', index=False)
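
A quick round-trip check can confirm that the written file reads back with the same columns and unique IDs. The sketch below uses a hypothetical toy frame and a temporary file rather than the real output file, so it can run without dataset02.csv.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'IMDB ID': ['tt0114709', 'tt0113497'],
    'TITLE': ['Toy Story', 'Jumanji'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')   # hypothetical file name
    toy.to_csv(path, index=False)         # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

columns_match = list(reloaded.columns) == list(toy.columns)
ids_unique = reloaded['IMDB ID'].is_unique
print(columns_match, ids_unique)
```

Without `index=False`, the file would gain an extra unnamed index column, and the column comparison above would fail on reload.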
