Web Scraping & Data Analysis with Selenium and Python

Vinay Babu

Github: https://github.com/min2bro/WebScrapingwithSelenium

Twitter: @min2bro

Data Science ToolBox

IPython Notebook

Write, edit, and replay Python scripts
Interactive data visualization and report presentation
Notebooks can be saved and shared
Run Selenium Python scripts

Pandas

Python Data Analysis Library

Matplotlib

Plotting library for Python

Analysis of the Filmfare Awards for Best Picture from 1955-2015

Web Scraping: Extracting Data from the Web

Imports



In [31]:

    
%matplotlib inline 
from selenium import webdriver
import os,time,json
import pandas as pd
from collections import defaultdict,Counter
import matplotlib.pyplot as plt

Initial setup: launch the browser and open the URL



In [32]:

    
url = "http://www.imdb.com/list/ls061683439/"
with open('./img/filmfare.json',encoding="utf-8") as f:
    datatbl = json.load(f)
driver = webdriver.Chrome(datatbl['data']['chromedriver'])
driver.get(url)
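
The setup cell reads every XPath and the chromedriver location from `./img/filmfare.json`, but the file itself is not shown. The sketch below is only a guess at its shape: the key names are the ones the notebook looks up, while every value is a hypothetical placeholder, and treating `Genre` as a list of strings is an assumption based on it being passed to `Counter` later.

```python
import json

# Hypothetical contents of ./img/filmfare.json; every value is a placeholder.
config_text = """
{
  "data": {
    "chromedriver": "/path/to/chromedriver",
    "listview": "//placeholder-xpath-for-list-view-button",
    "Movies_Name_Xpath": "//placeholder-xpath-for-names",
    "Movies_Rate_Xpath": "//placeholder-xpath-for-ratings",
    "Movies_Runtime_Xpath": "//placeholder-xpath-for-runtimes",
    "Movies_Votes_Xpath": "//placeholder-xpath-for-votes",
    "Genre": ["Drama", "Musical", "Drama"]
  }
}
"""

example_config = json.loads(config_text)

# Keys that the notebook reads from datatbl['data']
required_keys = [
    "chromedriver", "listview", "Movies_Name_Xpath", "Movies_Rate_Xpath",
    "Movies_Runtime_Xpath", "Movies_Votes_Xpath", "Genre",
]
missing = [key for key in required_keys if key not in example_config["data"]]
print(missing)
```

Checking for missing keys up front gives a clearer error than a `KeyError` raised halfway through a scraping run.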

Getting Data

Function to extract data from the web page using Selenium



In [33]:

    
def ExtractText(Xpath):
    # Look up the XPath expression stored under this key in the JSON config
    elements = driver.find_elements_by_xpath(datatbl['data'][Xpath])
    if Xpath == "Movies_Runtime_Xpath":
        # Keep only the three characters expected to hold the run time in minutes
        return [item.text[-10:-7] for item in elements]
    return [item.text for item in elements]
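
Two things about this function are worth a hedged sketch. First, the `find_elements_by_xpath` helper was removed in Selenium 4.3; the replacement is `find_elements(By.XPATH, ...)`, where `By.XPATH` is simply the string `"xpath"`. Second, slicing a fixed character range only works when every element has the same text layout. The example below assumes the run time appears as a number followed by "min", which is a guess about the page text, and it uses small stand-in classes (`FakeElement`, `FakeDriver`, both hypothetical) so that it runs without a browser.

```python
import re

# Same value as selenium.webdriver.common.by.By.XPATH
BY_XPATH = "xpath"

def extract_text(driver, xpath):
    """Return the visible text of every element matching the XPath."""
    return [element.text for element in driver.find_elements(BY_XPATH, xpath)]

def parse_runtime(text):
    """Return the number that precedes 'min' in text, or None if absent."""
    match = re.search(r"(\d+)\s*min", text)
    return int(match.group(1)) if match else None

# Hypothetical stand-ins for a Selenium driver and its elements
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeDriver:
    def __init__(self, texts):
        self._texts = texts

    def find_elements(self, by, value):
        return [FakeElement(text) for text in self._texts]

fake_driver = FakeDriver(["Drama | 158 mins.", "Family, Drama"])
texts = extract_text(fake_driver, "//hypothetical/xpath")
runtimes = [parse_runtime(text) for text in texts]
print(runtimes)  # [158, None]
```

Returning `None` for a missing run time keeps the gap visible, instead of letting an unrelated three-character fragment slip into the data.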

Let's extract the required data for the Best Picture winners: name, rating, run time, and votes



In [34]:

    
# Extracting data from the web
Xpath_list = ['Movies_Name_Xpath','Movies_Rate_Xpath','Movies_Runtime_Xpath','Movies_Votes_Xpath']
datarepo = [[] for _ in Xpath_list]  # one list per attribute, in Xpath_list order

for i in range(4):
    if(i==3):
        # Switch the page to its list view before reading the votes
        driver.find_element_by_xpath(datatbl['data']['listview']).click()
    datarepo[i] = ExtractText(Xpath_list[i])

driver.quit()

How does the data look now?

Is this data in the correct format for data manipulation?



In [35]:

    
# Movie names and votes
print(datarepo[0][:5])
print(datarepo[3][:5])









    



['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag', 'Barfi!', 'Zindagi Na Milegi Dobara']
['17,283', '39,478', '39,688', '52,270', '41,706']

Store Data in a Python Dictionary

The data is now stored in a Python dictionary, which is a more structured container: every attribute is linked to its respective movie.



In [36]:

    
# Result in a Python dictionary keyed by list position
Years=range(2015,1954,-1)  # award years, newest first, to match the list order
result = defaultdict(dict)
for i in range(0,len(datarepo[0])):
    result[i]['Movie Name']= datarepo[0][i]
    result[i]['Year']= Years[i]
    result[i]['Rating']= datarepo[1][i]
    result[i]['Votes']= datarepo[3][i]
    result[i]['RunTime']= datarepo[2][i]
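
The index-based loop above can also be written with `zip`, which pairs up the parallel lists directly. This is only an alternative sketch: the three sample rows are copied from the scraped output, and the variable names are chosen for illustration.

```python
# Sample values copied from the first three scraped movies
names = ["Bajirao Mastani", "Queen", "Bhaag Milkha Bhaag"]
ratings = ["7.2", "8.4", "8.3"]
runtimes = ["158", "146", "186"]
votes = ["17,283", "39,478", "39,688"]
years = range(2015, 2012, -1)

# zip() walks the lists in lockstep, so each record gets matching attributes
records = {
    index: {"Movie Name": name, "Year": year, "Rating": rating,
            "Votes": vote, "RunTime": runtime}
    for index, (name, year, rating, vote, runtime)
    in enumerate(zip(names, years, ratings, votes, runtimes))
}
print(records[1])
```

One design difference: `zip` stops at the shortest input, so a list that came back short from the page silently truncates the result instead of raising an `IndexError`. Comparing the list lengths first makes such a mismatch visible.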

Let's see what the data in the dictionary looks like



In [37]:

    
result









    Out[37]:





defaultdict(dict,
            {0: {'Movie Name': 'Bajirao Mastani',
              'Rating': '7.2',
              'RunTime': '158',
              'Votes': '17,283',
              'Year': 2015},
             1: {'Movie Name': 'Queen',
              'Rating': '8.4',
              'RunTime': '146',
              'Votes': '39,478',
              'Year': 2014},
             2: {'Movie Name': 'Bhaag Milkha Bhaag',
              'Rating': '8.3',
              'RunTime': '186',
              'Votes': '39,688',
              'Year': 2013},
             3: {'Movie Name': 'Barfi!',
              'Rating': '8.2',
              'RunTime': '151',
              'Votes': '52,270',
              'Year': 2012},
             4: {'Movie Name': 'Zindagi Na Milegi Dobara',
              'Rating': '8.1',
              'RunTime': '155',
              'Votes': '41,706',
              'Year': 2011},
             5: {'Movie Name': 'Dabangg',
              'Rating': '6.3',
              'RunTime': '126',
              'Votes': '19,768',
              'Year': 2010},
             6: {'Movie Name': '3 Idiots',
              'Rating': '8.4',
              'RunTime': '170',
              'Votes': '200,956',
              'Year': 2009},
             7: {'Movie Name': 'Jodhaa Akbar',
              'Rating': '7.6',
              'RunTime': '213',
              'Votes': '17,902',
              'Year': 2008},
             8: {'Movie Name': 'Like Stars on Earth',
              'Rating': '8.5',
              'RunTime': '165',
              'Votes': '82,648',
              'Year': 2007},
             9: {'Movie Name': 'Rang De Basanti',
              'Rating': '8.4',
              'RunTime': '157',
              'Votes': '68,550',
              'Year': 2006},
             10: {'Movie Name': 'Black',
              'Rating': '8.3',
              'RunTime': '122',
              'Votes': '22,851',
              'Year': 2005},
             11: {'Movie Name': 'Veer-Zaara',
              'Rating': '7.9',
              'RunTime': '192',
              'Votes': '33,468',
              'Year': 2004},
             12: {'Movie Name': 'Koi... Mil Gaya',
              'Rating': '7.1',
              'RunTime': '171',
              'Votes': '12,003',
              'Year': 2003},
             13: {'Movie Name': 'Devdas',
              'Rating': '7.6',
              'RunTime': '185',
              'Votes': '25,162',
              'Year': 2002},
             14: {'Movie Name': 'Lagaan: Once Upon a Time in India',
              'Rating': '8.2',
              'RunTime': '224',
              'Votes': '68,811',
              'Year': 2001},
             15: {'Movie Name': 'Kaho Naa... Pyaar Hai',
              'Rating': '6.9',
              'RunTime': '172',
              'Votes': '7,871',
              'Year': 2000},
             16: {'Movie Name': 'Straight from the Heart',
              'Rating': '7.6',
              'RunTime': '188',
              'Votes': '9,793',
              'Year': 1999},
             17: {'Movie Name': 'Kuch Kuch Hota Hai',
              'Rating': '7.8',
              'RunTime': '177',
              'Votes': '31,483',
              'Year': 1998},
             18: {'Movie Name': 'Dil To Pagal Hai',
              'Rating': '7.1',
              'RunTime': '179',
              'Votes': '13,890',
              'Year': 1997},
             19: {'Movie Name': 'Raja Hindustani',
              'Rating': '6.1',
              'RunTime': '165',
              'Votes': '4,802',
              'Year': 1996},
             20: {'Movie Name': 'Dilwale Dulhania Le Jayenge',
              'Rating': '8.3',
              'RunTime': '189',
              'Votes': '42,056',
              'Year': 1995},
             21: {'Movie Name': 'Hum Aapke Hain Koun...!',
              'Rating': '7.7',
              'RunTime': '206',
              'Votes': '10,945',
              'Year': 1994},
             22: {'Movie Name': 'Hum Hain Rahi Pyar Ke',
              'Rating': '7.5',
              'RunTime': '163',
              'Votes': '3,443',
              'Year': 1993},
             23: {'Movie Name': 'Jo Jeeta Wohi Sikandar',
              'Rating': '8.3',
              'RunTime': '174',
              'Votes': '12,286',
              'Year': 1992},
             24: {'Movie Name': 'Lamhe',
              'Rating': '7.4',
              'RunTime': '187',
              'Votes': '1,928',
              'Year': 1991},
             25: {'Movie Name': 'Ghayal',
              'Rating': '7.6',
              'RunTime': '163',
              'Votes': '2,631',
              'Year': 1990},
             26: {'Movie Name': 'Maine Pyar Kiya',
              'Rating': '7.5',
              'RunTime': '192',
              'Votes': '5,799',
              'Year': 1989},
             27: {'Movie Name': 'Qayamat Se Qayamat Tak',
              'Rating': '7.6',
              'RunTime': '162',
              'Votes': '6,161',
              'Year': 1988},
             28: {'Movie Name': 'Ram Teri Ganga Maili',
              'Rating': '6.8',
              'RunTime': '178',
              'Votes': '686',
              'Year': 1987},
             29: {'Movie Name': 'Sparsh',
              'Rating': '8.1',
              'RunTime': '145',
              'Votes': '380',
              'Year': 1986},
             30: {'Movie Name': 'Ardh Satya',
              'Rating': '8.2',
              'RunTime': '130',
              'Votes': '1,080',
              'Year': 1985},
             31: {'Movie Name': 'Shakti',
              'Rating': '7.9',
              'RunTime': '166',
              'Votes': '1,347',
              'Year': 1984},
             32: {'Movie Name': 'Kalyug',
              'Rating': '7.8',
              'RunTime': '152',
              'Votes': '369',
              'Year': 1983},
             33: {'Movie Name': 'Khubsoorat',
              'Rating': '7.8',
              'RunTime': '126',
              'Votes': '938',
              'Year': 1982},
             34: {'Movie Name': 'Junoon',
              'Rating': '7.6',
              'RunTime': '141',
              'Votes': '341',
              'Year': 1981},
             35: {'Movie Name': 'Main Tulsi Tere Aangan Ki',
              'Rating': '7.4',
              'RunTime': '151',
              'Votes': '75',
              'Year': 1980},
             36: {'Movie Name': 'Bhumika',
              'Rating': '7.6',
              'RunTime': '142',
              'Votes': '311',
              'Year': 1979},
             37: {'Movie Name': 'Mausam',
              'Rating': '8.1',
              'RunTime': '156',
              'Votes': '563',
              'Year': 1978},
             38: {'Movie Name': 'Deewaar',
              'Rating': '8.2',
              'RunTime': '174',
              'Votes': '5,567',
              'Year': 1977},
             39: {'Movie Name': 'Rajnigandha',
              'Rating': '7.5',
              'RunTime': '110',
              'Votes': '315',
              'Year': 1976},
             40: {'Movie Name': 'Anuraag',
              'Rating': '7.4',
              'RunTime': 'nes',
              'Votes': '45',
              'Year': 1975},
             41: {'Movie Name': 'Be-Imaan',
              'Rating': '7.4',
              'RunTime': '133',
              'Votes': '57',
              'Year': 1974},
             42: {'Movie Name': 'Anand',
              'Rating': '8.9',
              'RunTime': '122',
              'Votes': '10,718',
              'Year': 1973},
             43: {'Movie Name': 'Toy',
              'Rating': '7.3',
              'RunTime': '160',
              'Votes': '229',
              'Year': 1972},
             44: {'Movie Name': 'Aradhana',
              'Rating': '7.7',
              'RunTime': '169',
              'Votes': '1,091',
              'Year': 1971},
             45: {'Movie Name': 'Brahmachari',
              'Rating': '6.8',
              'RunTime': '157',
              'Votes': '239',
              'Year': 1970},
             46: {'Movie Name': 'Upkar',
              'Rating': '7.7',
              'RunTime': '175',
              'Votes': '364',
              'Year': 1969},
             47: {'Movie Name': 'Guide',
              'Rating': '8.6',
              'RunTime': '183',
              'Votes': '3,950',
              'Year': 1968},
             48: {'Movie Name': 'Himalay Ki Godmein',
              'Rating': '7.2',
              'RunTime': 'stu',
              'Votes': '57',
              'Year': 1967},
             49: {'Movie Name': 'Dosti',
              'Rating': '8.4',
              'RunTime': '163',
              'Votes': '1,010',
              'Year': 1966},
             50: {'Movie Name': 'Bandini',
              'Rating': '7.8',
              'RunTime': '157',
              'Votes': '546',
              'Year': 1965},
             51: {'Movie Name': 'Sahib Bibi Aur Ghulam',
              'Rating': '8.4',
              'RunTime': '152',
              'Votes': '1,076',
              'Year': 1964},
             52: {'Movie Name': 'Jis Desh Men Ganga Behti Hai',
              'Rating': '7.3',
              'RunTime': '167',
              'Votes': '301',
              'Year': 1963},
             53: {'Movie Name': 'Mughal-E-Azam',
              'Rating': '8.4',
              'RunTime': '197',
              'Votes': '3,868',
              'Year': 1962},
             54: {'Movie Name': 'Sujata',
              'Rating': '7.5',
              'RunTime': '161',
              'Votes': '191',
              'Year': 1961},
             55: {'Movie Name': 'Madhumati',
              'Rating': '8.1',
              'RunTime': '110',
              'Votes': '793',
              'Year': 1960},
             56: {'Movie Name': 'Mother India',
              'Rating': '8.1',
              'RunTime': '172',
              'Votes': '4,842',
              'Year': 1959},
             57: {'Movie Name': 'Jhanak Jhanak Payal Baaje',
              'Rating': '7.3',
              'RunTime': '143',
              'Votes': '68',
              'Year': 1958},
             58: {'Movie Name': 'Jagriti',
              'Rating': '7.8',
              'RunTime': 'sti',
              'Votes': '82',
              'Year': 1957},
             59: {'Movie Name': 'Boot Polish',
              'Rating': '8.1',
              'RunTime': '149',
              'Votes': '498',
              'Year': 1956},
             60: {'Movie Name': 'Do Bigha Zamin',
              'Rating': '8.4',
              'RunTime': '131',
              'Votes': '1,104',
              'Year': 1955}})



In [38]:

    
print(json.dumps(result[58], indent=2))









    



{
  "Movie Name": "Jagriti",
  "Rating": "7.8",
  "Year": 1957,
  "RunTime": "sti",
  "Votes": "82"
}

Oh! Something is wrong with the data: the RunTime for this movie is 'sti' rather than a number, and every value is still a string. The data set is not yet in the right shape for analysis.

Let's clean the data

Remove the comma (,) from the Votes value and change its data type to int
Change the data type of Rating to float and of RunTime to int



In [39]:

    
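for values in result.values():
    # Drop the thousands separator before converting the vote count
    values['Votes'] = int(values['Votes'].replace(",",""))
    values['Rating']= float(values['Rating'])
    try:
        values['RunTime'] = int(values['RunTime'])
    except ValueError:
        # Non-numeric run times such as 'sti' fall back to a fixed 154 minutes
        values['RunTime'] = 154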
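
The cleaning step hard-codes 154 minutes for run times that cannot be parsed. As a hedged alternative, the fallback can be computed from the run times that did parse. The sketch below uses three records copied from the scraped output; `clean_record` is a hypothetical helper, not part of the notebook.

```python
from statistics import mean

def clean_record(raw):
    """Return a copy of raw with numeric types; RunTime is None if unparseable."""
    cleaned = dict(raw)
    cleaned["Votes"] = int(raw["Votes"].replace(",", ""))
    cleaned["Rating"] = float(raw["Rating"])
    cleaned["RunTime"] = int(raw["RunTime"]) if raw["RunTime"].isdigit() else None
    return cleaned

# Three raw records copied from the scraped dictionary
sample = {
    58: {"Movie Name": "Jagriti", "Rating": "7.8", "RunTime": "sti",
         "Votes": "82", "Year": 1957},
    59: {"Movie Name": "Boot Polish", "Rating": "8.1", "RunTime": "149",
         "Votes": "498", "Year": 1956},
    60: {"Movie Name": "Do Bigha Zamin", "Rating": "8.4", "RunTime": "131",
         "Votes": "1,104", "Year": 1955},
}

cleaned = {key: clean_record(raw) for key, raw in sample.items()}

# Use the rounded mean of the known run times as the fallback value
known = [rec["RunTime"] for rec in cleaned.values() if rec["RunTime"] is not None]
fallback = round(mean(known))
for rec in cleaned.values():
    if rec["RunTime"] is None:
        rec["RunTime"] = fallback

print(cleaned[58])
```

Either way the substituted values are estimates, so the affected rows (indexes 40, 48, and 58 in the full data set) deserve caution in any run-time ranking.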

Now let's look at the data again



In [40]:

    
result[58]









    Out[40]:





{'Movie Name': 'Jagriti',
 'Rating': 7.8,
 'RunTime': 154,
 'Votes': 82,
 'Year': 1957}

Data in Pandas Dataframe

The data is loaded into a Pandas DataFrame, which is a more convenient structure for data analysis, manipulation, and aggregation



In [55]:

    
# create dataframe
df = pd.DataFrame.from_dict(result,orient='index')
df = df[['Year', 'Movie Name', 'Rating', 'Votes','RunTime']]
df









    Out[55]:






  
    
      
	Year	Movie Name	Rating	Votes	RunTime
0	2015	Bajirao Mastani	7.2	17283	158
1	2014	Queen	8.4	39478	146
2	2013	Bhaag Milkha Bhaag	8.3	39688	186
3	2012	Barfi!	8.2	52270	151
4	2011	Zindagi Na Milegi Dobara	8.1	41706	155
5	2010	Dabangg	6.3	19768	126
6	2009	3 Idiots	8.4	200956	170
7	2008	Jodhaa Akbar	7.6	17902	213
8	2007	Like Stars on Earth	8.5	82648	165
9	2006	Rang De Basanti	8.4	68550	157
10	2005	Black	8.3	22851	122
11	2004	Veer-Zaara	7.9	33468	192
12	2003	Koi... Mil Gaya	7.1	12003	171
13	2002	Devdas	7.6	25162	185
14	2001	Lagaan: Once Upon a Time in India	8.2	68811	224
15	2000	Kaho Naa... Pyaar Hai	6.9	7871	172
16	1999	Straight from the Heart	7.6	9793	188
17	1998	Kuch Kuch Hota Hai	7.8	31483	177
18	1997	Dil To Pagal Hai	7.1	13890	179
19	1996	Raja Hindustani	6.1	4802	165
20	1995	Dilwale Dulhania Le Jayenge	8.3	42056	189
21	1994	Hum Aapke Hain Koun...!	7.7	10945	206
22	1993	Hum Hain Rahi Pyar Ke	7.5	3443	163
23	1992	Jo Jeeta Wohi Sikandar	8.3	12286	174
24	1991	Lamhe	7.4	1928	187
25	1990	Ghayal	7.6	2631	163
26	1989	Maine Pyar Kiya	7.5	5799	192
27	1988	Qayamat Se Qayamat Tak	7.6	6161	162
28	1987	Ram Teri Ganga Maili	6.8	686	178
29	1986	Sparsh	8.1	380	145
...	...	...	...	...	...
31	1984	Shakti	7.9	1347	166
32	1983	Kalyug	7.8	369	152
33	1982	Khubsoorat	7.8	938	126
34	1981	Junoon	7.6	341	141
35	1980	Main Tulsi Tere Aangan Ki	7.4	75	151
36	1979	Bhumika	7.6	311	142
37	1978	Mausam	8.1	563	156
38	1977	Deewaar	8.2	5567	174
39	1976	Rajnigandha	7.5	315	110
40	1975	Anuraag	7.4	45	154
41	1974	Be-Imaan	7.4	57	133
42	1973	Anand	8.9	10718	122
43	1972	Toy	7.3	229	160
44	1971	Aradhana	7.7	1091	169
45	1970	Brahmachari	6.8	239	157
46	1969	Upkar	7.7	364	175
47	1968	Guide	8.6	3950	183
48	1967	Himalay Ki Godmein	7.2	57	154
49	1966	Dosti	8.4	1010	163
50	1965	Bandini	7.8	546	157
51	1964	Sahib Bibi Aur Ghulam	8.4	1076	152
52	1963	Jis Desh Men Ganga Behti Hai	7.3	301	167
53	1962	Mughal-E-Azam	8.4	3868	197
54	1961	Sujata	7.5	191	161
55	1960	Madhumati	8.1	793	110
56	1959	Mother India	8.1	4842	172
57	1958	Jhanak Jhanak Payal Baaje	7.3	68	143
58	1957	Jagriti	7.8	82	154
59	1956	Boot Polish	8.1	498	149
60	1955	Do Bigha Zamin	8.4	1104	131
    
  

61 rows × 5 columns

Let's use some Pandas functions to start the analysis



In [74]:

    
df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 0 to 60
Data columns (total 6 columns):
Year           61 non-null int64
Movie Name     61 non-null object
Rating         61 non-null float64
Votes          61 non-null int64
RunTime        61 non-null int64
Ratingbyten    61 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 3.3+ KB

Movies with Highest Ratings

The five highest-rated movies since 1955



In [48]:

    
# Highest-rated movies
df.sort_values('Rating',ascending=False).head(5)









    Out[48]:






  
    
      
	Year	Movie Name	Rating	Votes	RunTime
42	1973	Anand	8.9	10718	122
47	1968	Guide	8.6	3950	183
8	2007	Like Stars on Earth	8.5	82648	165
60	1955	Do Bigha Zamin	8.4	1104	131
53	1962	Mughal-E-Azam	8.4	3868	197

Movies with Maximum Run time

The 10 movies with the longest run time



In [49]:

    
# Movies with the longest run time
df.sort_values('RunTime',ascending=False).head(10)









    Out[49]:






  
    
      
	Year	Movie Name	Rating	Votes	RunTime
14	2001	Lagaan: Once Upon a Time in India	8.2	68811	224
7	2008	Jodhaa Akbar	7.6	17902	213
21	1994	Hum Aapke Hain Koun...!	7.7	10945	206
53	1962	Mughal-E-Azam	8.4	3868	197
26	1989	Maine Pyar Kiya	7.5	5799	192
11	2004	Veer-Zaara	7.9	33468	192
20	1995	Dilwale Dulhania Le Jayenge	8.3	42056	189
16	1999	Straight from the Heart	7.6	9793	188
24	1991	Lamhe	7.4	1928	187
2	2013	Bhaag Milkha Bhaag	8.3	39688	186

Best Movie Run time

Let's plot a graph to see the trend in movie run time from 1955 through 2015



In [75]:

    
df.plot(x='Year',y='RunTime');

Mean of the Movie Run Time



In [48]:

    
df['RunTime'].mean()









    Out[48]:





154.2622950819672

Best Movie Ratings

Let's analyze the ratings of all the Best Picture winners

Number of movies with an IMDb rating of 7 or higher



In [76]:

    
df[(df['Rating']>=7)]['Rating'].count()









    Out[76]:





56

Movie Ratings Visualization using Bar Graph



In [77]:

    
# Count the movies that fall in each rating band
Rating_Histdic = {}

Rating_Histdic['Btwn 6&7'] = df[(df['Rating']>=6)&(df['Rating']<7)]['Rating'].count()
Rating_Histdic['GTEQ 8'] = df[(df['Rating']>=8)]['Rating'].count()
Rating_Histdic['Btwn 7 & 8'] = df[(df['Rating']>=7)&(df['Rating']<8)]['Rating'].count()

# Wrap the dict views in list() so Matplotlib receives plain sequences
plt.bar(range(len(Rating_Histdic)), list(Rating_Histdic.values()), align='center',color='brown',width=0.4)
plt.xticks(range(len(Rating_Histdic)), list(Rating_Histdic.keys()), rotation=25);
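
The three hand-written filters can also be expressed with `pd.cut`, which assigns each rating to a band in one call. This is a sketch of an alternative, not what the notebook does: the sample ratings are made up for illustration, and the band labels are reused from the cell above.

```python
import pandas as pd

# Illustrative ratings only; in the notebook this would be df['Rating']
ratings = pd.Series([7.2, 8.4, 8.3, 6.3, 7.6, 6.9, 8.0])

# right=False makes each band include its lower edge: [6, 7), [7, 8), [8, inf)
bands = pd.cut(
    ratings,
    bins=[6, 7, 8, float("inf")],
    right=False,
    labels=["Btwn 6&7", "Btwn 7 & 8", "GTEQ 8"],
)

# Count the ratings per band, keeping the bands in their defined order
counts = bands.value_counts(sort=False)
print(counts)
```

Because the band edges live in a single list, adding a band (for example, ratings below 6) means changing one argument instead of writing another filter.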

Percentage distribution of the Ratings in a Pie-Chart



In [78]:

    
# Collect the counts in the same order as the labels
labels = ['Btwn 6&7', 'GTEQ 8', 'Btwn 7 & 8']
Rating_Hist = [Rating_Histdic[label] for label in labels]
colors = ['red', 'orange', 'green']

plt.pie(Rating_Hist,labels=labels, colors=colors,autopct='%1.1f%%', shadow=True, startangle=90);

Best Picture by Genre

Let's analyze the genres of the Best Picture winners



In [54]:

    
Category=Counter(datatbl['data']['Genre'])
df1 = pd.DataFrame.from_dict(Category,orient='index')
df1 = df1.sort_values([0],ascending=[False]).head(5)
df1.plot(kind='barh',color=['g','c','m']);
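
For a quick look at the most frequent genres without building a DataFrame, `Counter.most_common` returns the top entries directly. The genre list below is a made-up sample; the notebook's real list comes from the `Genre` entry of the JSON file.

```python
from collections import Counter

# Hypothetical genre list, one entry per winning movie
genres = ["Drama", "Musical", "Drama", "Romance", "Drama", "Musical", "Comedy"]

# most_common(n) returns the n most frequent items as (value, count) pairs
top_genres = Counter(genres).most_common(2)
print(top_genres)  # [('Drama', 3), ('Musical', 2)]
```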

Conclusion:

Movies with the following traits are the most likely to be selected for Best Picture:

A rating greater than 7
A run time of more than 2 hours
A genre of Drama or Musical


Note on the raw scraped output: each attribute is stored in its own Python list, which makes it hard to correlate the attributes with their respective movies.
