When the data you want lives on a JavaScript-heavy website that requires user interaction, BeautifulSoup alone will not be enough. This is when you need a webdriver. One of the most popular webdrivers is Selenium. Selenium is commonly used in industry to automate testing of the user experience, but it can also interact with page content to collect data that are difficult to get otherwise.
This lesson is a short introduction to the Selenium webdriver.
Let's first import the necessary Python libraries:
In [ ]:
from selenium import webdriver # powers the browser interaction
from selenium.webdriver.support.ui import Select # selects menu options
from pyvirtualdisplay import Display # for JHub environment
from bs4 import BeautifulSoup # to parse HTML
import csv # to write CSV
import pandas # to see CSV
Selenium drives an actual web browser, and since the JupyterHub doesn't come with Firefox, we'll download the binaries:
In [ ]:
# download firefox binaries
!wget http://ftp.mozilla.org/pub/firefox/releases/54.0/linux-x86_64/en-US/firefox-54.0.tar.bz2
# untar binaries
!tar xvjf firefox-54.0.tar.bz2
We also need the driver that allows Selenium to control Firefox directly through the code we write. We can download the geckodriver for Firefox from its GitHub releases page:
In [ ]:
# download geckodriver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.17.0/geckodriver-v0.17.0-linux64.tar.gz
# untar geckodriver
!tar xzvf geckodriver-v0.17.0-linux64.tar.gz
Since we are in a server environment and can't use our regular graphical desktop, we need to tell Python to start a virtual display, onto which Selenium can project the Firefox web browser (though we won't actually see it).
In [ ]:
display = Display(visible=0, size=(1024, 768))
display.start()
Now we can initialize the Selenium webdriver, giving it the paths to the Firefox binary and the geckodriver:
In [ ]:
# setup driver
driver = webdriver.Firefox(firefox_binary='./firefox/firefox', executable_path="./geckodriver")
You can navigate Selenium to a URL by using the get method, exactly the same way we used requests.get before:
In [ ]:
driver.get("http://www.google.com")
print(driver.page_source)
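Since the browser is running on a virtual display, we can't literally watch it, but we can confirm that the page loaded by checking a couple of the driver's attributes:
In [ ]:
# quick sanity check: the driver's title and current_url reflect the loaded page
print(driver.title)
print(driver.current_url)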
Cool, right? Now let's go look at some West Bengal State election results.
To follow along as Selenium navigates the website, try opening the site in another browser tab. You'll notice that selecting options from the menus calls a script that generates a custom table. The URL doesn't change, so we can't simply request the HTML of a results page; it has to be generated by interacting with the page. That's where Selenium shines: it can choose these menu options and wait for the generated table before grabbing the new HTML for the data.
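On a slow connection the generated table may take a moment to appear, so rather than grabbing the HTML immediately it can be safer to wait explicitly. Here is a minimal sketch using Selenium's WebDriverWait; it assumes the table's id is DataGrid1 (the selector we use later in this lesson) and would be run after making the menu selections below:
In [ ]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to ten seconds for the results table to be present in the DOM
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "DataGrid1")))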
In [ ]:
# go to the results page
driver.get("http://wbsec.gov.in/(S(eoxjutirydhdvx550untivvu))/DetailedResult/Detailed_gp_2013.aspx")
Similar to BeautifulSoup, Selenium has methods to find elements on a webpage. We can use the method find_element_by_name to find an element on the page by its name.
In [ ]:
# find "district" drop down menu
district = driver.find_element_by_name("ddldistrict")
In [ ]:
district
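find_element_by_name is only one of several locator methods Selenium offers. For instance, the same dropdown could be located with a CSS attribute selector (a sketch using the same Selenium 3 style API as the rest of this lesson):
In [ ]:
# equivalent lookup using a CSS attribute selector instead of the name shortcut
district = driver.find_element_by_css_selector('select[name="ddldistrict"]')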
Now if we want to get the different options in this dropdown, we can do the same. You'll notice that each option name is associated with a unique value. Since we're getting multiple elements here, we'll use find_elements_by_tag_name.
In [ ]:
# find options in "district" dropdown
district_options = district.find_elements_by_tag_name("option")
print(district_options[1].get_attribute("value"))
print(district_options[1].text)
Now we'll make a dictionary associating each name with its value.
In [ ]:
d_options = {option.text.strip(): option.get_attribute("value") for option in district_options if option.get_attribute("value").isdigit()}
print(d_options)
We can then select a district by using its name and our dictionary. First we'll wrap the dropdown element in Selenium's Select class, and then use it to select "Bankura".
In [ ]:
district_select = Select(district)
district_select.select_by_value(d_options["Bankura"])
Running the previous cell selects 'Bankura' in the dropdown (we can't watch it happen on the virtual display, but the selection is made in the browser).
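Because we can't see the menu change, we can ask Select for the current choice and confirm the selection programmatically:
In [ ]:
# confirm which option the dropdown currently has selected
print(district_select.first_selected_option.text)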
We can do the same as we did above to find the different blocks.
In [ ]:
# find the "block" dropdown
block = driver.find_element_by_name("ddlblock")
In [ ]:
# get options
block_options = block.find_elements_by_tag_name("option")
print(block_options[1].get_attribute("value"))
print(block_options[1].text)
In [ ]:
b_options = {option.text.strip(): option.get_attribute("value") for option in block_options if option.get_attribute("value").isdigit()}
print(b_options)
In [ ]:
block_select = Select(block)
block_select.select_by_value(b_options["BANKURA-I"])
Great! One dropdown menu to go.
In [ ]:
# find the "gram panchayat" dropdown and get its options
gp = driver.find_element_by_name("ddlgp")
gp_options = gp.find_elements_by_tag_name("option")
print(gp_options[1].get_attribute("value"))
print(gp_options[1].text)
In [ ]:
g_options = {option.text.strip(): option.get_attribute("value") for option in gp_options if option.get_attribute("value").isdigit()}
print(g_options)
In [ ]:
gram_select = Select(gp)
gram_select.select_by_value(g_options["ANCHURI"])
Once we select the last dropdown parameter, the website automatically generates a table below. This table could not have been called up with a URL alone; as you can see, the URL in the browser did not change. This is why Selenium is so helpful.
Now that the table has been rendered, it exists as HTML in our page source. We can pass driver.page_source to BeautifulSoup and parse it there, which is what we do below: we'll identify the table by its CSS selector and then pull the data out of its rows. Selenium also has its own element-finding methods and a get_attribute method that we could use instead.
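For reference, that Selenium-only version might look like the sketch below; it is an alternative to the BeautifulSoup approach in the next cells, not a required step:
In [ ]:
# alternative: let Selenium locate the table and hand back just its HTML
table_element = driver.find_element_by_css_selector("#DataGrid1")
table_html = table_element.get_attribute("outerHTML")
print(table_html[:500])  # peek at the first 500 characters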
In [ ]:
soup = BeautifulSoup(driver.page_source, 'html5lib')
In [ ]:
# get the html for the table
table = soup.select('#DataGrid1')[0]
First we'll get all the rows of the table using the tr selector.
In [ ]:
# get list of rows
rows = [row for row in table.select("tr")]
But the first row is the header, so we don't want it.
In [ ]:
rows = rows[1:]
Each cell in the row corresponds to the data we want.
In [ ]:
rows[0].select('td')
Now it's just a matter of looping through the rows and getting the information we want from each one.
In [ ]:
data = []
for row in rows:
    d = {}
    seat_names = row.select('td')[0].find_all("span")
    d['seat'] = ' '.join([x.text for x in seat_names])
    d['electors'] = row.select('td')[1].text.strip()
    d['polled'] = row.select('td')[2].text.strip()
    d['rejected'] = row.select('td')[3].text.strip()
    d['osn'] = row.select('td')[4].text.strip()
    d['candidate'] = row.select('td')[5].text.strip()
    d['party'] = row.select('td')[6].text.strip()
    d['secured'] = row.select('td')[7].text.strip()
    data.append(d)
In [ ]:
print(data[1])
You'll notice that some of the information, such as the total electors, is not supplied for each candidate. This code will add that information for the candidates who don't have it.
In [ ]:
i = 0
while i < len(data):
    if data[i]['seat']:
        seat = data[i]['seat']
        electors = data[i]['electors']
        polled = data[i]['polled']
        rejected = data[i]['rejected']
        i = i + 1
    else:
        data[i]['seat'] = seat
        data[i]['electors'] = electors
        data[i]['polled'] = polled
        data[i]['rejected'] = rejected
        i = i + 1
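If you prefer pandas, a forward fill would accomplish the same thing as the loop above. A sketch, assuming the missing values are the empty strings produced by the scraping loop:
In [ ]:
import numpy as np

df = pandas.DataFrame(data)
cols = ['seat', 'electors', 'polled', 'rejected']
# treat empty strings as missing, then carry the last seat-level values forward
df[cols] = df[cols].replace('', np.nan).ffill()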
In [ ]:
data
In [ ]:
header = data[0].keys()
with open('WBS-table.csv', 'w') as output_file:
    dict_writer = csv.DictWriter(output_file, header)
    dict_writer.writeheader()
    dict_writer.writerows(data)
pandas.read_csv('WBS-table.csv')
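When you're finished, it's good practice to close the browser and stop the virtual display so they don't keep running in the background:
In [ ]:
# shut down the Firefox instance and the virtual display
driver.quit()
display.stop()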