In [1]:
%matplotlib inline
import pandas as pd

In [2]:
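# Python 2 only: forces UTF-8 as the default string encoding.
# Python 3 strings are already Unicode, so this cell is not needed there.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')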

First step: load the data from storage into a DataFrame. A simple way to do this is pd.read_csv (older pandas releases also offered pd.DataFrame.from_csv, which has since been deprecated and removed).

  • CSV: Comma-Separated Values, a plain-text format used by spreadsheets. Popular for importing and exporting smaller datasets.
  • Arguments: the path to the data file, and index_col, the column to use as the row labels of the DataFrame (None keeps the default integer index).

In [4]:
imdb = pd.read_csv('imdb.csv', index_col=None)
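
To make the index_col argument concrete, here is a minimal sketch using pd.read_csv, the general-purpose CSV reader in pandas. The three-row CSV text and its column names are made up for illustration and are read from memory instead of from imdb.csv.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for a file on disk.
csv_text = """title,year,rating
Film A,1950,7.5
Film B,1975,8.1
Film C,1999,6.9
"""

# index_col=None: every column stays a data column; rows are labelled 0, 1, 2.
default_index = pd.read_csv(io.StringIO(csv_text), index_col=None)

# index_col="title": the 'title' column becomes the row labels instead.
title_index = pd.read_csv(io.StringIO(csv_text), index_col="title")

print(default_index.shape)       # (3, 3)
print(title_index.shape)         # (3, 2)
print(list(title_index.index))   # ['Film A', 'Film B', 'Film C']
```

With index_col=None the title remains an ordinary column, which is what the filtering examples in this notebook rely on.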

Now we have the data stored as a DataFrame named "imdb". As a simple first step, we'd like to see the structure of this DataFrame. There are a couple of ways to do this ("df" is the name of an imaginary DataFrame):

  • Using the len function on the DataFrame: "len(df)" returns the number of rows.
  • Using the .shape property of DataFrames: "df.shape" returns a tuple with the dimensions of df (number of rows, number of columns).

In [5]:
len(imdb)


Out[5]:
249

In [6]:
imdb.shape


Out[6]:
(249, 9)

But I want to look at the data... not just funny numbers!

  • Use the head method: "df.head(n)" displays the first n rows of the DataFrame. This is the easiest way to see the structure, the names of the columns, etc. (Highly recommended)
  • Use the tail method: "df.tail(n)" displays the last n rows of the DataFrame.
  • Enter the DataFrame's name. An excerpt of the data will be rendered in the notebook. (Not recommended)

In [7]:
imdb.head(5)


Out[7]:
title rank top_250_rank year kind imdb_id rating genre director
0 The Wizard of Oz 1 121 1939 movie 32138 8.3 Adventure, Family, Fantasy, Musical Victor Fleming
1 Star Wars 2 14 1977 movie 76759 8.8 Action, Adventure, Fantasy, Sci-Fi George Lucas
2 Cabiria 3 NaN 1914 movie 3740 6.6 Adventure, Drama, War Giovanni Pastrone
3 Psycho 4 24 1960 movie 54215 8.7 Horror, Mystery, Thriller Alfred Hitchcock
4 King Kong 5 244 1933 movie 24216 8.0 Adventure, Fantasy, Horror Merian C. Cooper

Please use the tail method to display the last 3 rows of the imdb DataFrame.


In [13]:
#Enter your code here

Individual columns of the dataframe can be accessed by df.column_name or df['column_name']. However, the result is not a DataFrame, but a Series structure.


In [8]:
imdb_top5=imdb.head(5) #smaller dataframe with only top 5 rows.
imdb_top5['title']     #The 'title' column of this smaller dataframe


Out[8]:
0    The Wizard of Oz
1           Star Wars
2             Cabiria
3              Psycho
4           King Kong
Name: title, dtype: object

In [9]:
a=imdb_top5.title

In [10]:
type(a)


Out[10]:
pandas.core.series.Series
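
The two access styles are not fully interchangeable. The sketch below, on a made-up two-column frame, shows that single brackets return a Series, a list of names returns a DataFrame, and attribute access can be shadowed by an existing DataFrame method. This matters here because the imdb data has a column called rank, which is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame; the values are illustrative only.
df = pd.DataFrame({"title": ["Film A", "Film B"], "rank": [1, 2]})

as_series = df["title"]      # single brackets: a Series
as_frame = df[["title"]]     # list of column names: a one-column DataFrame

# df.rank resolves to the DataFrame.rank method, not the 'rank' column,
# so bracket access is the safe choice for such names.
rank_column = df["rank"]

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(callable(df.rank))          # True
print(list(rank_column))          # [1, 2]
```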

Slicing/Filtering data

Filtering means keeping only the rows that satisfy some condition. Running a relational query (<, >, ==, ...) on a column returns a boolean vector (a Series of True/False values). This boolean vector can then be used to filter the data.


In [11]:
top_years=imdb_top5.year
top_years


Out[11]:
0    1939
1    1977
2    1914
3    1960
4    1933
Name: year, dtype: int64

In [12]:
top_years>1950 #get boolean vector. So, only 2 movies in the top 5 were made after 1950
#smh


Out[12]:
0    False
1     True
2    False
3     True
4    False
Name: year, dtype: bool

In [13]:
imdb_top5[imdb_top5.year>1950] #passing this boolean vector to the dataframe filters the data
#rows which meet the condition (have a "True") are retained.
#rows not meeting the condition (have a "False") are removed.


Out[13]:
title rank top_250_rank year kind imdb_id rating genre director
1 Star Wars 2 14 1977 movie 76759 8.8 Action, Adventure, Fantasy, Sci-Fi George Lucas
3 Psycho 4 24 1960 movie 54215 8.7 Horror, Mystery, Thriller Alfred Hitchcock

Exercise 1

Find all the movies in the data made after the year 1950. Bonus question: how many such movies are there?


In [14]:
#type your code here
len(imdb[imdb.year>1950])


Out[14]:
169

When using multiple conditions in a filter, wrap each condition in parentheses and combine them with the logical operators:

  • & (AND)
  • | (OR)

In [39]:
imdb[(imdb.year>1950) & (imdb.rating>8.8)]
#Filters all movies/shows made after 1950 AND having a rating of over 8.8


Out[39]:
title rank top_250_rank year kind imdb_id rating genre director
10 The Godfather 11 2 1972 movie 68646 9.2 Crime, Drama Francis Ford Coppola
42 Il buono, il brutto, il cattivo. 43 4 1966 movie 60196 9.0 Adventure, Western Sergio Leone
50 The Twilight Zone 51 NaN 1959 tv series 52520 9.5 Drama, Fantasy, Mystery, Sci-Fi, Thriller John Brahm
60 The Simpsons 61 NaN 1989 tv series 96697 9.0 Animation, Comedy Mark Kirkland
83 Pulp Fiction 84 5 1994 movie 110912 9.0 Crime, Drama Quentin Tarantino
90 The X Files 91 NaN 1993 tv series 106179 9.0 Drama, Mystery, Sci-Fi, Thriller Kim Manners
108 I Love Lucy 109 NaN 1951 tv series 43208 9.0 Comedy, Family William Asher
159 One Flew Over the Cuckoo's Nest 160 9 1975 movie 73486 8.9 Drama Milos Forman
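
The cell above combines conditions with &. As a companion, here is a small sketch of | next to & on made-up data (the titles and numbers are hypothetical). The parentheses matter because & and | bind more tightly than comparisons such as >, so leaving them out changes how the expression is parsed.

```python
import pandas as pd

# Hypothetical toy data, not taken from imdb.csv.
df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "year": [1940, 1945, 1980, 2000],
    "rating": [9.1, 7.0, 8.9, 6.5],
})

# OR: keep rows that satisfy at least one of the conditions.
either = df[(df.year > 1950) | (df.rating > 9.0)]

# AND: keep rows that satisfy both conditions.
both = df[(df.year > 1950) & (df.rating > 8.8)]

print(list(either.title))   # ['Film A', 'Film C', 'Film D']
print(list(both.title))     # ['Film C']
```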

This output can be sorted using the sort_values method: df.sort_values(column_name). (Very old pandas releases used df.sort for this; it has since been removed.)


In [41]:
imdb[(imdb.year>1950) & (imdb.rating>8.8)].sort_values('year')
#Filters all movies/shows made after 1950 AND having a rating of over 8.8,
#then sorts this DataFrame by year


Out[41]:
title rank top_250_rank year kind imdb_id rating genre director
108 I Love Lucy 109 NaN 1951 tv series 43208 9.0 Comedy, Family William Asher
50 The Twilight Zone 51 NaN 1959 tv series 52520 9.5 Drama, Fantasy, Mystery, Sci-Fi, Thriller John Brahm
42 Il buono, il brutto, il cattivo. 43 4 1966 movie 60196 9.0 Adventure, Western Sergio Leone
10 The Godfather 11 2 1972 movie 68646 9.2 Crime, Drama Francis Ford Coppola
159 One Flew Over the Cuckoo's Nest 160 9 1975 movie 73486 8.9 Drama Milos Forman
60 The Simpsons 61 NaN 1989 tv series 96697 9.0 Animation, Comedy Mark Kirkland
90 The X Files 91 NaN 1993 tv series 106179 9.0 Drama, Mystery, Sci-Fi, Thriller Kim Manners
83 Pulp Fiction 84 5 1994 movie 110912 9.0 Crime, Drama Quentin Tarantino
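
A minimal sketch of sorting with DataFrame.sort_values, the column-sorting method in current pandas, on made-up data. It shows sorting by one column, sorting by several columns in descending order, and that the original DataFrame keeps its order because a new one is returned.

```python
import pandas as pd

# Hypothetical toy data, not taken from imdb.csv.
df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [1980, 1960, 1980],
    "rating": [8.9, 7.0, 9.2],
})

# Ascending by a single column (the default direction).
by_year = df.sort_values("year")

# Newest first, and highest rating first within the same year.
newest_best = df.sort_values(["year", "rating"], ascending=[False, False])

print(list(by_year.year))         # [1960, 1980, 1980]
print(list(newest_best.title))    # ['Film C', 'Film A', 'Film B']
print(list(df.title))             # ['Film A', 'Film B', 'Film C'] (unchanged)
```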

Exercise 2

How many movies are there in the top 250 list?


In [ ]:
#type your code here

Exercise 3

What are the 3 top rated tv series in the top 250 list?


In [ ]:
#type your code here

Exercise 4

How many movies or tv series from the 80's are there in the list?


In [ ]:
#type your code here

Dealing with nulls/NAs/NANs

Real data is almost always full of missing or bad entries. On a Series a, we can use two methods:

  • a.isnull(): returns a boolean vector that is True where the entry is null/NA
  • a.notnull(): returns a boolean vector that is True where the entry is not null/NA

In [46]:
imdb_top5=imdb.head(5)
a=imdb_top5.top_250_rank
a


Out[46]:
0    121
1     14
2    NaN
3     24
4    244
Name: top_250_rank, dtype: float64

In [47]:
a.isnull()


Out[47]:
0    False
1    False
2     True
3    False
4    False
Name: top_250_rank, dtype: bool
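
The notnull counterpart works the same way and can be used directly as a filter. The sketch below uses a small made-up Series; it also shows why these methods are needed at all: NaN is not equal to anything, including itself, so an == comparison cannot find missing values.

```python
import pandas as pd

# Hypothetical Series with one missing entry.
s = pd.Series([121.0, 14.0, float("nan"), 24.0], name="top_250_rank")

has_value = s.notnull()     # True where a value is present
present = s[has_value]      # keep only the non-missing entries

print(list(has_value))               # [True, True, False, True]
print(list(present.index))           # [0, 1, 3]
print((s == float("nan")).any())     # False: == never matches NaN
```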

Exercise 5

How many movies or tv series do not have a properly entered "top 250 rank" attribute?


In [ ]:
#type your code here

In [55]:
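temp=imdb          #note: temp is a second name for the same DataFrame, not a copy
temp.fillna(0)     #returns a new DataFrame with every NaN replaced by 0; imdb itself is not modified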


Out[55]:
title rank top_250_rank year kind imdb_id rating genre director
0 The Wizard of Oz 1 121 1939 movie 32138 8.3 Adventure, Family, Fantasy, Musical Victor Fleming
1 Star Wars 2 14 1977 movie 76759 8.8 Action, Adventure, Fantasy, Sci-Fi George Lucas
2 Cabiria 3 0 1914 movie 3740 6.6 Adventure, Drama, War Giovanni Pastrone
3 Psycho 4 24 1960 movie 54215 8.7 Horror, Mystery, Thriller Alfred Hitchcock
4 King Kong 5 244 1933 movie 24216 8.0 Adventure, Fantasy, Horror Merian C. Cooper
5 Metropolis 6 95 1927 movie 17136 8.4 Adventure, Drama, Sci-Fi Fritz Lang
6 Star Trek 7 0 1966 tv series 60028 8.6 Adventure, Sci-Fi Marc Daniels
7 Casablanca 8 16 1942 movie 34583 8.8 Drama, Romance, War Michael Curtiz
8 Snow White and the Seven Dwarfs 9 0 1937 movie 29583 7.9 Animation, Family, Fantasy, Musical, Romance William Cottrell
9 2001: A Space Odyssey 10 86 1968 movie 62622 8.4 Adventure, Mystery, Sci-Fi Stanley Kubrick
10 The Godfather 11 2 1972 movie 68646 9.2 Crime, Drama Francis Ford Coppola
11 The Birth of a Nation 12 0 1915 movie 4972 7.1 Drama, History, Romance, War, Western D.W. Griffith
12 Shadow of a Doubt 13 0 1943 movie 36342 8.1 Crime, Film-Noir, Mystery, Thriller Alfred Hitchcock
13 Jaws 14 112 1975 movie 73195 8.3 Thriller Steven Spielberg
14 Snow White 15 0 1916 movie 7361 6.4 Fantasy, Romance J. Searle Dawley
15 Apocalypse Now 16 37 1979 movie 78788 8.6 Drama, War Francis Ford Coppola
16 Gone with the Wind 17 151 1939 movie 31381 8.2 Drama, Romance, War Victor Fleming
17 The Merry Widow 18 0 1934 movie 25493 7.6 Musical, Comedy, Romance Ernst Lubitsch
18 The Searchers 19 0 1956 movie 49730 8.1 Adventure, Drama, Western John Ford
19 Vertigo 20 43 1958 movie 52357 8.6 Crime, Mystery, Romance, Thriller Alfred Hitchcock
20 Dr. No 21 0 1962 movie 55928 7.3 Action, Adventure, Thriller Terence Young
21 Touch of Evil 22 122 1958 movie 52311 8.3 Crime, Film-Noir, Thriller Orson Welles
22 Chang: A Drama of the Wilderness 23 0 1927 movie 17743 7.4 Adventure, Documentary Merian C. Cooper
23 The Exorcist 24 201 1973 movie 70047 8.1 Horror William Friedkin
24 Citizen Kane 25 36 1941 movie 33467 8.6 Drama, Mystery Orson Welles
25 The Terminator 26 167 1984 movie 88247 8.1 Action, Sci-Fi, Thriller James Cameron
26 Rosemary's Baby 27 225 1968 movie 63522 8.1 Drama, Horror, Mystery Roman Polanski
27 Bronenosets Potyomkin 28 0 1925 movie 15648 8.1 Drama, History, War Sergei M. Eisenstein
28 Star Wars: Episode V - The Empire Strikes Back 29 11 1980 movie 80684 8.8 Action, Adventure, Sci-Fi Irvin Kershner
29 The Lost World 30 0 1925 movie 16039 7.1 Adventure, Fantasy, Horror, Sci-Fi, Thriller Harry O. Hoyt
... ... ... ... ... ... ... ... ... ...
220 El 221 0 1953 movie 45361 7.8 Drama, Romance Luis Buñuel
221 Full Metal Jacket 222 85 1987 movie 93058 8.4 Drama, War Stanley Kubrick
222 The Pride and the Passion 223 0 1957 movie 50858 5.5 Action, Adventure, Drama, Romance, War Stanley Kramer
223 This Is Spinal Tap 224 0 1984 movie 88258 8.0 Comedy, Music Rob Reiner
224 Winchester '73 225 0 1950 movie 43137 7.8 Western Anthony Mann
225 Star Trek: The Next Generation 226 0 1987 tv series 92455 8.8 Action, Adventure, Sci-Fi Cliff Bole
226 Kojak 227 0 1973 tv series 69599 7.3 Crime, Drama, Mystery Charles S. Dubin
227 Gryozy 228 0 1915 movie 5414 7.4 Drama, Short Yevgeni Bauer
228 Baywatch 229 0 1989 tv series 96542 4.9 Drama, Action, Adventure Gregory J. Bonann
229 Heavy Metal 230 0 1981 movie 82509 6.4 Animation, Action, Adventure, Comedy, Crime, F... Gerald Potterton
230 The Day the Earth Stood Still 231 0 1951 movie 43456 8.0 Drama, Sci-Fi, Thriller Robert Wise
231 Three Little Pigs 232 0 1933 movie 24660 7.7 Animation, Musical, Family, Comedy, Short Burt Gillett
232 The Public Enemy 233 0 1931 movie 22286 7.8 Action, Crime, Drama William A. Wellman
233 The Rocky Horror Picture Show 234 0 1975 movie 73629 7.1 Comedy, Musical Jim Sharman
234 Rain Man 235 0 1988 movie 95953 8.0 Drama Barry Levinson
235 L'atalante 236 0 1934 movie 24844 8.0 Drama, Romance Jean Vigo
236 Persona 237 205 1966 movie 60827 8.2 Drama, Fantasy Ingmar Bergman
237 Dawn of the Dead 238 0 1978 movie 77402 8.0 Action, Horror George A. Romero
238 Bonanza 239 0 1959 tv series 52451 7.4 Action, Comedy, Drama, Romance, War, Western William F. Claxton
239 Fatal Attraction 240 0 1987 movie 93010 6.8 Drama, Thriller Adrian Lyne
240 It! The Terror from Beyond Space 241 0 1958 movie 51786 6.0 Horror, Sci-Fi Edward L. Cahn
241 Toast of the Town 242 0 1948 tv series 40053 8.3 Comedy, Music John Moffitt
242 The French Connection 243 0 1971 movie 67116 7.9 Action, Crime, Thriller William Friedkin
243 Peter Pan 244 0 1953 movie 46183 7.3 Animation, Adventure, Family, Fantasy, Music Clyde Geronimi
244 Miami Vice 245 0 1984 tv series 86759 7.8 Action, Crime, Drama, Thriller John Nicolella
245 Star Wars: Episode I - The Phantom Menace 246 0 1999 movie 120915 6.4 Action, Adventure, Fantasy, Sci-Fi George Lucas
246 Goodfellas 247 15 1990 movie 99685 8.8 Crime, Drama, Thriller Martin Scorsese
247 The Egyptian 248 0 1954 movie 46949 6.2 Drama, History Michael Curtiz
248 Hawaii Five-O 249 0 1968 tv series 62568 7.6 Crime, Drama, Mystery Michael O'Herlihy
249 Charlie's Angels 250 0 1976 tv series 73972 6.6 Action, Adventure, Crime, Drama, Mystery Dennis Donnelly

250 rows × 9 columns
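
Two details of the fillna cell above are easy to miss, so here is a small sketch on made-up data: plain assignment does not copy a DataFrame, and fillna returns a new object and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical frame with one missing rank.
df = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "top_250_rank": [14.0, float("nan")],
})

alias = df                # a second name for the same object
independent = df.copy()   # a separate copy

filled = df.fillna(0)     # new DataFrame; df still contains the NaN

# Filling a single column only, with the result kept in a new Series.
ranks_filled = df["top_250_rank"].fillna(0)

print(alias is df)                            # True
print(independent is df)                      # False
print(int(df.top_250_rank.isnull().sum()))    # 1
print(list(filled.top_250_rank))              # [14.0, 0.0]
```

Whether 0 is a sensible replacement depends on the column: for a rank it could be mistaken for a real value, so it is worth choosing the fill value with the later analysis in mind.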

Text Mining using Dataframes

The string methods from Python can be applied to a Series through the .str accessor. So, for a Series a, we can query:

  • a.str.contains("target")
  • a.str.startswith("target")

In [67]:
imdb[imdb.title.str.contains("Files")]


Out[67]:
title rank top_250_rank year kind imdb_id rating genre director
90 The X Files 91 NaN 1993 tv series 106179 9 Drama, Mystery, Sci-Fi, Thriller Kim Manners
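
A few options of str.contains are worth knowing when searching real text. The sketch below uses a made-up Series of titles with one missing entry: case=False ignores letter case, na=False treats missing entries as non-matches so the result can be used as a filter, and regex=False treats the pattern as literal text instead of a regular expression.

```python
import pandas as pd

# Hypothetical titles, including one missing value.
titles = pd.Series(["The X Files", "Dr. No", None, "x-men"], name="title")

# Case-insensitive search; missing titles count as "no match".
loose = titles.str.contains("x", case=False, na=False)

# By default the pattern is a regular expression, where "." matches any character.
any_char = titles.str.contains(".", na=False)

# regex=False searches for a literal period instead.
literal = titles.str.contains(".", regex=False, na=False)

print(list(loose))      # [True, False, False, True]
print(list(any_char))   # [True, True, False, True]
print(list(literal))    # [False, True, False, False]
```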

Exercise 6

How many movies or tv series have names that start with "The"?


In [ ]:
#Enter your code here
