Web Scraping & Data Analysis with Selenium and Python

Vinay Babu

Github: https://github.com/min2bro/WebScrapingwithSelenium

Twitter: @min2bro

Data Science ToolBox

IPython Notebook

  • Write, edit, and replay Python scripts
  • Interactive data visualization and report presentation
  • Notebooks can be saved and shared
  • Run Selenium Python scripts

Pandas

  • Python Data Analysis Library

Matplotlib

  • Plotting library for Python

Analysis of the Filmfare Awards for Best Picture from 1955-2015

Web Scraping: Extracting Data from the Web

Imports


In [31]:
%matplotlib inline 
from selenium import webdriver
import os,time,json
import pandas as pd
from collections import defaultdict,Counter
import matplotlib.pyplot as plt

Initial Setup and Launch the browser to open the URL


In [32]:
url = "http://www.imdb.com/list/ls061683439/"
with open('./img/filmfare.json',encoding="utf-8") as f:
    datatbl = json.load(f)
driver = webdriver.Chrome(datatbl['data']['chromedriver'])
driver.get(url)
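
The notebook reads its ChromeDriver path and XPath expressions from ./img/filmfare.json, which is not shown here. As a rough sketch, the file might be shaped like the dictionary below. The key names are the ones the notebook looks up; every value (including the assumption that 'Genre' is a flat list of strings) is a hypothetical placeholder, not the author's actual path, XPath, or data.

```python
import json

# Key names come from the notebook's lookups; all values are
# hypothetical placeholders, not the real contents of filmfare.json.
sample_config = {
    "data": {
        "chromedriver": "/path/to/chromedriver",
        "listview": "//placeholder/listview",
        "Movies_Name_Xpath": "//placeholder/name",
        "Movies_Rate_Xpath": "//placeholder/rating",
        "Movies_Runtime_Xpath": "//placeholder/runtime",
        "Movies_Votes_Xpath": "//placeholder/votes",
        "Genre": ["Drama", "Musical", "Drama"],
    }
}

# Round-trip through JSON text to mimic what json.load(f) would return.
datatbl_sample = json.loads(json.dumps(sample_config))
print(sorted(datatbl_sample["data"].keys()))
```

Keeping the locators in a separate file means a change in the page layout only requires editing the JSON, not the scraping code.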

Getting Data

A function to extract the data from the web using Selenium


In [33]:
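def ExtractText(Xpath):
    # Look up the XPath stored under this key and collect the matching elements.
    # Note: Selenium 4 replaces this call with driver.find_elements(By.XPATH, ...)
    elements = driver.find_elements_by_xpath(datatbl['data'][Xpath])
    if Xpath == "Movies_Runtime_Xpath":
        # The runtime is embedded in a longer string; slice out the three-character value
        return [item.text[-10:-7] for item in elements]
    return [item.text for item in elements]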
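
The fixed slice text[-10:-7] only works when the runtime sits at exactly that offset in the element text. A more tolerant alternative, sketched below, searches for digits followed by "min". The helper name parse_runtime and the sample strings are hypothetical; the real element text on the page may be formatted differently.

```python
import re

def parse_runtime(text):
    """Return the runtime in minutes found in text, or None if absent."""
    # Assumed pattern: one or more digits, optional whitespace, then "min".
    match = re.search(r"(\d+)\s*min", text)
    return int(match.group(1)) if match else None

# Made-up sample strings, not actual page content.
print(parse_runtime("Drama | 158 mins."))
print(parse_runtime("Drama | Musical"))
```

Returning None for a missing runtime makes the gap explicit, so it can be handled deliberately in the cleaning step.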

Let's extract the required data (name, rating, run time, and votes) for the Best Picture winners


In [34]:
#Extracting Data from Web
# One result list per attribute; a comprehension avoids the shared-list aliasing of [[]]*4
datarepo = [[] for _ in range(4)]
Xpath_list = ['Movies_Name_Xpath','Movies_Rate_Xpath','Movies_Runtime_Xpath','Movies_Votes_Xpath']

for i in range(4):
    if(i==3):
        # The vote counts are read after clicking the 'listview' element
        driver.find_element_by_xpath(datatbl['data']['listview']).click()
    datarepo[i] = ExtractText(Xpath_list[i])

driver.quit()

How does the data look now?

Is the data in the correct format for data manipulation?


In [35]:
# Movie names and vote counts
print(datarepo[0][:5])
print(datarepo[3][:5])


['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag', 'Barfi!', 'Zindagi Na Milegi Dobara']
['17,283', '39,478', '39,688', '52,270', '41,706']

Store Data in a Python Dictionary

The data is stored in a Python dictionary, a more structured format in which every attribute is linked to its movie.


In [36]:
# Result in a Python Dictionary
Years=range(2015,1954,-1)
result = defaultdict(dict)
for i in range(0,len(datarepo[0])):
    result[i]['Movie Name']= datarepo[0][i]
    result[i]['Year']= Years[i]
    result[i]['Rating']= datarepo[1][i]
    result[i]['Votes']= datarepo[3][i]
    result[i]['RunTime']= datarepo[2][i]
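
The index-based loop above pairs the i-th entry of each list. The same pairing can be sketched with zip, which also stops at the shortest list instead of raising an IndexError if one scrape returned fewer items. The variable names are illustrative; the sample values are the first three rows of the scraped data.

```python
# First three scraped rows, copied from the notebook output.
names = ['Bajirao Mastani', 'Queen', 'Bhaag Milkha Bhaag']
ratings = ['7.2', '8.4', '8.3']
runtimes = ['158', '146', '186']
votes = ['17,283', '39,478', '39,688']
years = range(2015, 2012, -1)

# zip walks the lists in lockstep; enumerate supplies the row index.
records = {
    i: {'Movie Name': n, 'Year': y, 'Rating': r, 'Votes': v, 'RunTime': t}
    for i, (n, y, r, v, t) in enumerate(zip(names, years, ratings, votes, runtimes))
}
print(records[1])
```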

Let's see what the data in the dictionary looks like.


In [37]:
result


Out[37]:
defaultdict(dict,
            {0: {'Movie Name': 'Bajirao Mastani',
              'Rating': '7.2',
              'RunTime': '158',
              'Votes': '17,283',
              'Year': 2015},
             1: {'Movie Name': 'Queen',
              'Rating': '8.4',
              'RunTime': '146',
              'Votes': '39,478',
              'Year': 2014},
             2: {'Movie Name': 'Bhaag Milkha Bhaag',
              'Rating': '8.3',
              'RunTime': '186',
              'Votes': '39,688',
              'Year': 2013},
             3: {'Movie Name': 'Barfi!',
              'Rating': '8.2',
              'RunTime': '151',
              'Votes': '52,270',
              'Year': 2012},
             4: {'Movie Name': 'Zindagi Na Milegi Dobara',
              'Rating': '8.1',
              'RunTime': '155',
              'Votes': '41,706',
              'Year': 2011},
             5: {'Movie Name': 'Dabangg',
              'Rating': '6.3',
              'RunTime': '126',
              'Votes': '19,768',
              'Year': 2010},
             6: {'Movie Name': '3 Idiots',
              'Rating': '8.4',
              'RunTime': '170',
              'Votes': '200,956',
              'Year': 2009},
             7: {'Movie Name': 'Jodhaa Akbar',
              'Rating': '7.6',
              'RunTime': '213',
              'Votes': '17,902',
              'Year': 2008},
             8: {'Movie Name': 'Like Stars on Earth',
              'Rating': '8.5',
              'RunTime': '165',
              'Votes': '82,648',
              'Year': 2007},
             9: {'Movie Name': 'Rang De Basanti',
              'Rating': '8.4',
              'RunTime': '157',
              'Votes': '68,550',
              'Year': 2006},
             10: {'Movie Name': 'Black',
              'Rating': '8.3',
              'RunTime': '122',
              'Votes': '22,851',
              'Year': 2005},
             11: {'Movie Name': 'Veer-Zaara',
              'Rating': '7.9',
              'RunTime': '192',
              'Votes': '33,468',
              'Year': 2004},
             12: {'Movie Name': 'Koi... Mil Gaya',
              'Rating': '7.1',
              'RunTime': '171',
              'Votes': '12,003',
              'Year': 2003},
             13: {'Movie Name': 'Devdas',
              'Rating': '7.6',
              'RunTime': '185',
              'Votes': '25,162',
              'Year': 2002},
             14: {'Movie Name': 'Lagaan: Once Upon a Time in India',
              'Rating': '8.2',
              'RunTime': '224',
              'Votes': '68,811',
              'Year': 2001},
             15: {'Movie Name': 'Kaho Naa... Pyaar Hai',
              'Rating': '6.9',
              'RunTime': '172',
              'Votes': '7,871',
              'Year': 2000},
             16: {'Movie Name': 'Straight from the Heart',
              'Rating': '7.6',
              'RunTime': '188',
              'Votes': '9,793',
              'Year': 1999},
             17: {'Movie Name': 'Kuch Kuch Hota Hai',
              'Rating': '7.8',
              'RunTime': '177',
              'Votes': '31,483',
              'Year': 1998},
             18: {'Movie Name': 'Dil To Pagal Hai',
              'Rating': '7.1',
              'RunTime': '179',
              'Votes': '13,890',
              'Year': 1997},
             19: {'Movie Name': 'Raja Hindustani',
              'Rating': '6.1',
              'RunTime': '165',
              'Votes': '4,802',
              'Year': 1996},
             20: {'Movie Name': 'Dilwale Dulhania Le Jayenge',
              'Rating': '8.3',
              'RunTime': '189',
              'Votes': '42,056',
              'Year': 1995},
             21: {'Movie Name': 'Hum Aapke Hain Koun...!',
              'Rating': '7.7',
              'RunTime': '206',
              'Votes': '10,945',
              'Year': 1994},
             22: {'Movie Name': 'Hum Hain Rahi Pyar Ke',
              'Rating': '7.5',
              'RunTime': '163',
              'Votes': '3,443',
              'Year': 1993},
             23: {'Movie Name': 'Jo Jeeta Wohi Sikandar',
              'Rating': '8.3',
              'RunTime': '174',
              'Votes': '12,286',
              'Year': 1992},
             24: {'Movie Name': 'Lamhe',
              'Rating': '7.4',
              'RunTime': '187',
              'Votes': '1,928',
              'Year': 1991},
             25: {'Movie Name': 'Ghayal',
              'Rating': '7.6',
              'RunTime': '163',
              'Votes': '2,631',
              'Year': 1990},
             26: {'Movie Name': 'Maine Pyar Kiya',
              'Rating': '7.5',
              'RunTime': '192',
              'Votes': '5,799',
              'Year': 1989},
             27: {'Movie Name': 'Qayamat Se Qayamat Tak',
              'Rating': '7.6',
              'RunTime': '162',
              'Votes': '6,161',
              'Year': 1988},
             28: {'Movie Name': 'Ram Teri Ganga Maili',
              'Rating': '6.8',
              'RunTime': '178',
              'Votes': '686',
              'Year': 1987},
             29: {'Movie Name': 'Sparsh',
              'Rating': '8.1',
              'RunTime': '145',
              'Votes': '380',
              'Year': 1986},
             30: {'Movie Name': 'Ardh Satya',
              'Rating': '8.2',
              'RunTime': '130',
              'Votes': '1,080',
              'Year': 1985},
             31: {'Movie Name': 'Shakti',
              'Rating': '7.9',
              'RunTime': '166',
              'Votes': '1,347',
              'Year': 1984},
             32: {'Movie Name': 'Kalyug',
              'Rating': '7.8',
              'RunTime': '152',
              'Votes': '369',
              'Year': 1983},
             33: {'Movie Name': 'Khubsoorat',
              'Rating': '7.8',
              'RunTime': '126',
              'Votes': '938',
              'Year': 1982},
             34: {'Movie Name': 'Junoon',
              'Rating': '7.6',
              'RunTime': '141',
              'Votes': '341',
              'Year': 1981},
             35: {'Movie Name': 'Main Tulsi Tere Aangan Ki',
              'Rating': '7.4',
              'RunTime': '151',
              'Votes': '75',
              'Year': 1980},
             36: {'Movie Name': 'Bhumika',
              'Rating': '7.6',
              'RunTime': '142',
              'Votes': '311',
              'Year': 1979},
             37: {'Movie Name': 'Mausam',
              'Rating': '8.1',
              'RunTime': '156',
              'Votes': '563',
              'Year': 1978},
             38: {'Movie Name': 'Deewaar',
              'Rating': '8.2',
              'RunTime': '174',
              'Votes': '5,567',
              'Year': 1977},
             39: {'Movie Name': 'Rajnigandha',
              'Rating': '7.5',
              'RunTime': '110',
              'Votes': '315',
              'Year': 1976},
             40: {'Movie Name': 'Anuraag',
              'Rating': '7.4',
              'RunTime': 'nes',
              'Votes': '45',
              'Year': 1975},
             41: {'Movie Name': 'Be-Imaan',
              'Rating': '7.4',
              'RunTime': '133',
              'Votes': '57',
              'Year': 1974},
             42: {'Movie Name': 'Anand',
              'Rating': '8.9',
              'RunTime': '122',
              'Votes': '10,718',
              'Year': 1973},
             43: {'Movie Name': 'Toy',
              'Rating': '7.3',
              'RunTime': '160',
              'Votes': '229',
              'Year': 1972},
             44: {'Movie Name': 'Aradhana',
              'Rating': '7.7',
              'RunTime': '169',
              'Votes': '1,091',
              'Year': 1971},
             45: {'Movie Name': 'Brahmachari',
              'Rating': '6.8',
              'RunTime': '157',
              'Votes': '239',
              'Year': 1970},
             46: {'Movie Name': 'Upkar',
              'Rating': '7.7',
              'RunTime': '175',
              'Votes': '364',
              'Year': 1969},
             47: {'Movie Name': 'Guide',
              'Rating': '8.6',
              'RunTime': '183',
              'Votes': '3,950',
              'Year': 1968},
             48: {'Movie Name': 'Himalay Ki Godmein',
              'Rating': '7.2',
              'RunTime': 'stu',
              'Votes': '57',
              'Year': 1967},
             49: {'Movie Name': 'Dosti',
              'Rating': '8.4',
              'RunTime': '163',
              'Votes': '1,010',
              'Year': 1966},
             50: {'Movie Name': 'Bandini',
              'Rating': '7.8',
              'RunTime': '157',
              'Votes': '546',
              'Year': 1965},
             51: {'Movie Name': 'Sahib Bibi Aur Ghulam',
              'Rating': '8.4',
              'RunTime': '152',
              'Votes': '1,076',
              'Year': 1964},
             52: {'Movie Name': 'Jis Desh Men Ganga Behti Hai',
              'Rating': '7.3',
              'RunTime': '167',
              'Votes': '301',
              'Year': 1963},
             53: {'Movie Name': 'Mughal-E-Azam',
              'Rating': '8.4',
              'RunTime': '197',
              'Votes': '3,868',
              'Year': 1962},
             54: {'Movie Name': 'Sujata',
              'Rating': '7.5',
              'RunTime': '161',
              'Votes': '191',
              'Year': 1961},
             55: {'Movie Name': 'Madhumati',
              'Rating': '8.1',
              'RunTime': '110',
              'Votes': '793',
              'Year': 1960},
             56: {'Movie Name': 'Mother India',
              'Rating': '8.1',
              'RunTime': '172',
              'Votes': '4,842',
              'Year': 1959},
             57: {'Movie Name': 'Jhanak Jhanak Payal Baaje',
              'Rating': '7.3',
              'RunTime': '143',
              'Votes': '68',
              'Year': 1958},
             58: {'Movie Name': 'Jagriti',
              'Rating': '7.8',
              'RunTime': 'sti',
              'Votes': '82',
              'Year': 1957},
             59: {'Movie Name': 'Boot Polish',
              'Rating': '8.1',
              'RunTime': '149',
              'Votes': '498',
              'Year': 1956},
             60: {'Movie Name': 'Do Bigha Zamin',
              'Rating': '8.4',
              'RunTime': '131',
              'Votes': '1,104',
              'Year': 1955}})

In [38]:
print(json.dumps(result[58], indent=2))


{
  "Movie Name": "Jagriti",
  "Rating": "7.8",
  "Year": 1957,
  "RunTime": "sti",
  "Votes": "82"
}

Oh! Something is wrong with the data: the RunTime here is 'sti' rather than a number, and every value is still a string. The data set is not yet in the right shape for analysis.

Let's clean the data

  • Remove the comma (,) from the Votes value and convert it to int
  • Convert Rating to float and RunTime to int, substituting 154 minutes when the scraped run time is not a number

In [39]:
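for key,values in result.items():
    # Strip the thousands separator before converting the vote count
    values['Votes'] = int(values['Votes'].replace(",",""))
    values['Rating']= float(values['Rating'])
    try:
        values['RunTime'] = int(values['RunTime'])
    except ValueError:
        # The scraped run time was not numeric (e.g. 'sti'); fall back to 154 minutes
        values['RunTime'] = 154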
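
The cell above hard-codes 154 as the replacement for unparseable run times. One possible variation, sketched here under the assumption that a mean-based fill is acceptable, derives the fallback from the values that did parse. This is a swapped-in technique rather than the notebook's method; the helper name to_int_or_none is hypothetical, and the sample values are rows 57 to 60 of the scraped data.

```python
from statistics import mean

# RunTime strings for rows 57-60, copied from the scraped output.
raw_runtimes = ['143', 'sti', '149', '131']

def to_int_or_none(value):
    """Convert value to int, or return None when it is not numeric."""
    try:
        return int(value)
    except ValueError:
        return None

parsed = [to_int_or_none(v) for v in raw_runtimes]
valid = [v for v in parsed if v is not None]

# Use the rounded mean of the valid values as the fallback.
fallback = round(mean(valid))
cleaned = [fallback if v is None else v for v in parsed]
print(cleaned)
```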

Now let's look at the cleaned data


In [40]:
result[58]


Out[40]:
{'Movie Name': 'Jagriti',
 'Rating': 7.8,
 'RunTime': 154,
 'Votes': 82,
 'Year': 1957}

Data in Pandas Dataframe

The data is loaded into a Pandas DataFrame, which is a more convenient way to perform data analysis, manipulation, or aggregation.


In [55]:
# create dataframe
df = pd.DataFrame.from_dict(result,orient='index')
df = df[['Year', 'Movie Name', 'Rating', 'Votes','RunTime']]
df


Out[55]:
    Year  Movie Name                         Rating  Votes   RunTime
0   2015  Bajirao Mastani                    7.2     17283   158
1   2014  Queen                              8.4     39478   146
2   2013  Bhaag Milkha Bhaag                 8.3     39688   186
3   2012  Barfi!                             8.2     52270   151
4   2011  Zindagi Na Milegi Dobara           8.1     41706   155
5   2010  Dabangg                            6.3     19768   126
6   2009  3 Idiots                           8.4     200956  170
7   2008  Jodhaa Akbar                       7.6     17902   213
8   2007  Like Stars on Earth                8.5     82648   165
9   2006  Rang De Basanti                    8.4     68550   157
10  2005  Black                              8.3     22851   122
11  2004  Veer-Zaara                         7.9     33468   192
12  2003  Koi... Mil Gaya                    7.1     12003   171
13  2002  Devdas                             7.6     25162   185
14  2001  Lagaan: Once Upon a Time in India  8.2     68811   224
15  2000  Kaho Naa... Pyaar Hai              6.9     7871    172
16  1999  Straight from the Heart            7.6     9793    188
17  1998  Kuch Kuch Hota Hai                 7.8     31483   177
18  1997  Dil To Pagal Hai                   7.1     13890   179
19  1996  Raja Hindustani                    6.1     4802    165
20  1995  Dilwale Dulhania Le Jayenge        8.3     42056   189
21  1994  Hum Aapke Hain Koun...!            7.7     10945   206
22  1993  Hum Hain Rahi Pyar Ke              7.5     3443    163
23  1992  Jo Jeeta Wohi Sikandar             8.3     12286   174
24  1991  Lamhe                              7.4     1928    187
25  1990  Ghayal                             7.6     2631    163
26  1989  Maine Pyar Kiya                    7.5     5799    192
27  1988  Qayamat Se Qayamat Tak             7.6     6161    162
28  1987  Ram Teri Ganga Maili               6.8     686     178
29  1986  Sparsh                             8.1     380     145
... ...   ...                                ...     ...     ...
31  1984  Shakti                             7.9     1347    166
32  1983  Kalyug                             7.8     369     152
33  1982  Khubsoorat                         7.8     938     126
34  1981  Junoon                             7.6     341     141
35  1980  Main Tulsi Tere Aangan Ki          7.4     75      151
36  1979  Bhumika                            7.6     311     142
37  1978  Mausam                             8.1     563     156
38  1977  Deewaar                            8.2     5567    174
39  1976  Rajnigandha                        7.5     315     110
40  1975  Anuraag                            7.4     45      154
41  1974  Be-Imaan                           7.4     57      133
42  1973  Anand                              8.9     10718   122
43  1972  Toy                                7.3     229     160
44  1971  Aradhana                           7.7     1091    169
45  1970  Brahmachari                        6.8     239     157
46  1969  Upkar                              7.7     364     175
47  1968  Guide                              8.6     3950    183
48  1967  Himalay Ki Godmein                 7.2     57      154
49  1966  Dosti                              8.4     1010    163
50  1965  Bandini                            7.8     546     157
51  1964  Sahib Bibi Aur Ghulam              8.4     1076    152
52  1963  Jis Desh Men Ganga Behti Hai       7.3     301     167
53  1962  Mughal-E-Azam                      8.4     3868    197
54  1961  Sujata                             7.5     191     161
55  1960  Madhumati                          8.1     793     110
56  1959  Mother India                       8.1     4842    172
57  1958  Jhanak Jhanak Payal Baaje          7.3     68      143
58  1957  Jagriti                            7.8     82      154
59  1956  Boot Polish                        8.1     498     149
60  1955  Do Bigha Zamin                     8.4     1104    131

61 rows × 5 columns

Let's use some of the Pandas functions now and start the Analysis


In [74]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 61 entries, 0 to 60
Data columns (total 6 columns):
Year           61 non-null int64
Movie Name     61 non-null object
Rating         61 non-null float64
Votes          61 non-null int64
RunTime        61 non-null int64
Ratingbyten    61 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 3.3+ KB

Movies with Highest Ratings

The five highest-rated movies since 1955


In [48]:
#Highest Rating Movies
df.sort_values('Rating',ascending=[False]).head(5)


Out[48]:
    Year  Movie Name           Rating  Votes   RunTime
42  1973  Anand                8.9     10718   122
47  1968  Guide                8.6     3950    183
8   2007  Like Stars on Earth  8.5     82648   165
60  1955  Do Bigha Zamin       8.4     1104    131
53  1962  Mughal-E-Azam        8.4     3868    197
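
Sorting the whole frame and taking head(5) works, and pandas also offers DataFrame.nlargest for the same "top N by column" question. The sketch below uses a four-row sample built from values in the table above; the variable names are illustrative.

```python
import pandas as pd

# Small sample built from rows shown in the notebook output.
sample = pd.DataFrame({
    'Movie Name': ['Dabangg', 'Guide', 'Anand', 'Like Stars on Earth'],
    'Rating': [6.3, 8.6, 8.9, 8.5],
})

# nlargest returns the n rows with the largest values in the given column.
top_two = sample.nlargest(2, 'Rating')
print(top_two)
```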

Movies with Maximum Run time

Top 10 movies with maximum Run time


In [49]:
#Movies with maximum Run Time
df.sort_values(['RunTime'],ascending=[False]).head(10)


Out[49]:
    Year  Movie Name                         Rating  Votes   RunTime
14  2001  Lagaan: Once Upon a Time in India  8.2     68811   224
7   2008  Jodhaa Akbar                       7.6     17902   213
21  1994  Hum Aapke Hain Koun...!            7.7     10945   206
53  1962  Mughal-E-Azam                      8.4     3868    197
26  1989  Maine Pyar Kiya                    7.5     5799    192
11  2004  Veer-Zaara                         7.9     33468   192
20  1995  Dilwale Dulhania Le Jayenge        8.3     42056   189
16  1999  Straight from the Heart            7.6     9793    188
24  1991  Lamhe                              7.4     1928    187
2   2013  Bhaag Milkha Bhaag                 8.3     39688   186

Best Movie Run time

Let's plot a graph to see the movie run time trend from 1955 through 2015


In [75]:
# Pass the column name for x rather than the Series itself
df.plot(x='Year',y=['RunTime']);


Mean of the Movie Run Time


In [48]:
df['RunTime'].mean()


Out[48]:
154.2622950819672

Best Movie Ratings

Let's analyze the ratings of all the Best Picture winners

  • Number of movies with an IMDb rating of 7 or higher

In [76]:
df[(df['Rating']>=7)]['Rating'].count()


Out[76]:
56
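
To put that count in context, it can be expressed as a share of the 61 movies in the DataFrame. A small worked calculation, using only the two numbers reported above:

```python
total_movies = 61    # rows in the DataFrame
rated_7_plus = 56    # movies with Rating >= 7, from the cell above

# Share of winners rated 7 or higher, as a percentage.
share = rated_7_plus / total_movies * 100
print(f"{share:.1f}%")
```

So roughly nine out of ten Best Picture winners carry an IMDb rating of at least 7.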

Movie Ratings Visualization using Bar Graph


In [77]:
# Count the movies in each rating band
Rating_Histdic = {}

Rating_Histdic['Btwn 6&7'] = df[(df['Rating']>=6)&(df['Rating']<7)]['Rating'].count()
Rating_Histdic['GTEQ 8'] = df[(df['Rating']>=8)]['Rating'].count()
Rating_Histdic['Btwn 7 & 8'] = df[(df['Rating']>=7)&(df['Rating']<8)]['Rating'].count()


# Convert the dict views to lists so matplotlib receives plain sequences
plt.bar(range(len(Rating_Histdic)), list(Rating_Histdic.values()), align='center',color='brown',width=0.4)
plt.xticks(range(len(Rating_Histdic)), list(Rating_Histdic.keys()), rotation=25);
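
The three boolean filters above can also be expressed as a single binning step with pandas.cut. The sketch below is an alternative rather than the notebook's approach; it reuses the same band labels and takes its ratings from the first six rows of the DataFrame. The open-ended upper bin edge is an assumption made so that any rating of 8 or more lands in the last band.

```python
import pandas as pd

# Ratings of the first six rows in the DataFrame.
ratings = pd.Series([7.2, 8.4, 8.3, 8.2, 8.1, 6.3])

labels = ['Btwn 6&7', 'Btwn 7 & 8', 'GTEQ 8']
# right=False makes each interval closed on the left: [6, 7), [7, 8), [8, inf)
bands = pd.cut(ratings, bins=[6, 7, 8, float('inf')], labels=labels, right=False)

# Count how many ratings fall into each band.
counts = bands.value_counts().to_dict()
print(counts)
```

Because the bin edges live in one list, adding or moving a band means changing a single line instead of writing another filter.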


Percentage distribution of the Ratings in a Pie-Chart


In [78]:
labels = ['Btwn 6&7', 'GTEQ 8', 'Btwn 7 & 8']
colors = ['red', 'orange', 'green']

# Pull the counts in the same order as the labels
Rating_Hist = [Rating_Histdic[label] for label in labels]

plt.pie(Rating_Hist,labels=labels, colors=colors,autopct='%1.1f%%', shadow=True, startangle=90);


Best Picture by Genre

Let's analyze the genres of the Best Picture winners


In [54]:
Category=Counter(datatbl['data']['Genre'])
df1 = pd.DataFrame.from_dict(Category,orient='index')
df1 = df1.sort_values([0],ascending=[False]).head(5)
df1.plot(kind='barh',color=['g','c','m']);
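
Counter tallies how often each genre appears, and the DataFrame sort then keeps the five most frequent. Counter can do that ranking by itself through most_common, as this minimal sketch shows. The genre list is hypothetical sample data, not the contents of the notebook's JSON file.

```python
from collections import Counter

# Hypothetical genre labels, one entry per movie-genre pairing.
genres = ['Drama', 'Musical', 'Drama', 'Romance', 'Drama', 'Musical']

# most_common(n) returns the n most frequent items with their counts.
top_two = Counter(genres).most_common(2)
print(top_two)
```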


Conclusion:

Movies with the following traits are most likely to be selected for Best Picture:

  • An IMDb rating greater than 7
  • A run time of more than 2 hours
  • The Drama or Musical genre