Web Scraping with Selenium


When the data you want lives on a JavaScript-heavy website and requires interaction from the user, BeautifulSoup will not be enough. This is when you need a webdriver. One of the most popular webdrivers is Selenium. Selenium is commonly used in industry to automate testing of the user experience, but it can also interact with page content to collect data that is difficult to get otherwise.

This lesson is a short introduction to the Selenium webdriver. It includes:

  1. Launching the webdriver
  2. Navigating the browser
  3. Collecting generated data
  4. Exporting data to CSV

Let's first import the necessary Python libraries:


In [ ]:
from selenium import webdriver  # powers the browser interaction
from selenium.webdriver.support.ui import Select  # selects menu options
from pyvirtualdisplay import Display  # for JHub environment
from bs4 import BeautifulSoup  # to parse HTML
import csv  # to write CSV
import pandas  # to see CSV

Selenium drives a real web browser, and since the JupyterHub doesn't come with Firefox, we'll download the binaries:


In [ ]:
# download firefox binaries
!wget http://ftp.mozilla.org/pub/firefox/releases/54.0/linux-x86_64/en-US/firefox-54.0.tar.bz2
    
# untar binaries
!tar xvjf firefox-54.0.tar.bz2

We also need the driver that allows Selenium to control the browser through the code we write. For Firefox this is geckodriver, which we can download from its GitHub releases page:


In [ ]:
# download geckodriver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.17.0/geckodriver-v0.17.0-linux64.tar.gz
    
# untar geckodriver
!tar xzvf geckodriver-v0.17.0-linux64.tar.gz

1. Launching the webdriver

Since we are in a different environment and can't use our regular graphical desktop, we need to tell Python to start a virtual display, onto which Selenium can project the Firefox web browser (though we won't actually see it).


In [ ]:
display = Display(visible=0, size=(1024, 768))
display.start()

Now we can initialize the Selenium web driver, giving it the paths to the Firefox binary and the geckodriver:


In [ ]:
# setup driver
driver = webdriver.Firefox(firefox_binary='./firefox/firefox', executable_path="./geckodriver")

You can navigate Selenium to a URL by using the get method, much like we used requests.get before:


In [ ]:
driver.get("http://www.google.com")
print(driver.page_source)

Cool, right? Selenium has loaded Google in its (virtual) browser. Let's go look at some West Bengal State election results:

2. Navigating the browser

To follow along as Selenium navigates the website, try opening the site in another tab. You'll notice that if you select options from the menus, the page calls a script to generate a custom table. The URL doesn't change, so we can't simply request the page's HTML; it has to be generated first. That's where Selenium shines: it can choose the menu options and wait for the generated table before grabbing the new HTML.
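
One note on waiting before we navigate: this page happens to respond quickly, but a robust scraper should wait explicitly for generated content. Here is a minimal sketch using Selenium's WebDriverWait and expected_conditions; DataGrid1 is the id of the results table we'll meet later in this lesson, and wait_for_table is our own helper name:


In [ ]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def wait_for_table(driver, timeout=10):
    # poll until the results table is present in the DOM,
    # or raise a TimeoutException after `timeout` seconds
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.ID, "DataGrid1"))
    )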


In [ ]:
# go results page
driver.get("http://wbsec.gov.in/(S(eoxjutirydhdvx550untivvu))/DetailedResult/Detailed_gp_2013.aspx")

Zilla Parishad

Similar to BeautifulSoup, Selenium has methods to find elements on a webpage. We can use find_element_by_name to find an element by its name attribute.


In [ ]:
# find "district" drop down menu
district = driver.find_element_by_name("ddldistrict")

In [ ]:
district

Now if we want to get the different options in this drop down, we can do the same. You'll notice that each option's name is associated with a unique value. Since we're getting multiple elements here, we'll use find_elements_by_tag_name.


In [ ]:
# find options in "district" drop down
district_options = district.find_elements_by_tag_name("option")

print(district_options[1].get_attribute("value"))
print(district_options[1].text)

Now we'll make a dictionary associating each district name with its value, keeping only the options whose value is a number.


In [ ]:
d_options = {option.text.strip(): option.get_attribute("value") for option in district_options if option.get_attribute("value").isdigit()}
print(d_options)

We can then select a district by using its name and our dictionary. We'll wrap the element in Selenium's Select class and then select "Bankura" by its value.


In [ ]:
district_select = Select(district)
district_select.select_by_value(d_options["Bankura"])

Running the previous cell selects 'Bankura' in the dropdown menu.
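
Since the browser is running on a virtual display, we can't literally watch the menu change. One way to check the selection is to read it back from the Select object:


In [ ]:
# confirm the selection by reading it back from the dropdown
print(district_select.first_selected_option.text)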

Panchayat Samity

We can do the same as we did above to find the different blocks.


In [ ]:
# find the "block" drop down
block = driver.find_element_by_name("ddlblock")

In [ ]:
# get options
block_options = block.find_elements_by_tag_name("option")

print(block_options[1].get_attribute("value"))
print(block_options[1].text)

In [ ]:
b_options = {option.text.strip(): option.get_attribute("value") for option in block_options if option.get_attribute("value").isdigit()}
print(b_options)

In [ ]:
panchayat_select = Select(block)
panchayat_select.select_by_value(b_options["BANKURA-I"])

Great! One dropdown menu to go.

Gram Panchayat


In [ ]:
# get options
gp = driver.find_element_by_name("ddlgp")
gp_options = gp.find_elements_by_tag_name("option")

print(gp_options[1].get_attribute("value"))
print(gp_options[1].text)

In [ ]:
g_options = {option.text.strip(): option.get_attribute("value") for option in gp_options if option.get_attribute("value").isdigit()}
print(g_options)

In [ ]:
gram_select = Select(gp)
gram_select.select_by_value(g_options["ANCHURI"])

Once we've selected the last dropdown option, the website automatically generates a table below. This table could not have been called up by a URL; as you can see, the URL in the browser did not change. This is why Selenium is so helpful.

3. Collecting generated data

Now that the table has been rendered, it exists as HTML in our page source, which we can read through the driver.page_source attribute. We could grab the table directly with Selenium's own element-finding methods, or hand the page source to BeautifulSoup and use the parsing methods we already know.

First, the Selenium route: we identify the table by its CSS selector and pull out its HTML with the get_attribute method.
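A minimal sketch (DataGrid1 is the table's id, visible in the page source; the 500-character slice just keeps the output short):


In [ ]:
# Selenium-only route: locate the table element and pull its raw HTML
table_element = driver.find_element_by_css_selector("#DataGrid1")
print(table_element.get_attribute("outerHTML")[:500])

From here on, though, we'll use BeautifulSoup, whose parsing methods make it easier to walk the table: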


In [ ]:
soup = BeautifulSoup(driver.page_source, 'html5lib')

In [ ]:
# get the html for the table
table = soup.select('#DataGrid1')[0]

First we'll get all the rows of the table using the tr selector.


In [ ]:
# get list of rows
rows = table.select("tr")

But the first row is the header, so we'll drop it.


In [ ]:
rows = rows[1:]

Each td cell in a row holds one field of the data we want.


In [ ]:
rows[0].select('td')

Now it's just a matter of looping through the rows and getting the information we want from each one.


In [ ]:
data = []
for row in rows:
    cells = row.select('td')
    d = {}
    # the seat name is split across several span tags
    seat_names = cells[0].find_all("span")
    d['seat'] = ' '.join(x.text for x in seat_names)
    d['electors'] = cells[1].text.strip()
    d['polled'] = cells[2].text.strip()
    d['rejected'] = cells[3].text.strip()
    d['osn'] = cells[4].text.strip()
    d['candidate'] = cells[5].text.strip()
    d['party'] = cells[6].text.strip()
    d['secured'] = cells[7].text.strip()
    data.append(d)

In [ ]:
print(data[1])

You'll notice that some of the information, such as total electors, is supplied only on the first row for each seat, not for every candidate. This code forward-fills that information for the candidates who don't have it.


In [ ]:
i = 0
while i < len(data):
    if data[i]['seat']:
        # first row of a seat: remember its seat-level values
        seat = data[i]['seat']
        electors = data[i]['electors']
        polled = data[i]['polled']
        rejected = data[i]['rejected']
    else:
        # later rows of the same seat: fill in the remembered values
        data[i]['seat'] = seat
        data[i]['electors'] = electors
        data[i]['polled'] = polled
        data[i]['rejected'] = rejected
    i = i + 1

In [ ]:
data

4. Exporting data to CSV

We can then loop through all the combinations of the dropdown menus we want (sketched below), collect the information from each generated table, and append it to the data list. Once we're done, we can write it all to a CSV.
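
Here is a minimal sketch of that outer loop, under a couple of assumptions: scrape_table is a hypothetical helper standing in for the row-parsing code from section 3, wrapped in a function that returns a list of dicts, and we re-find each dropdown after every selection because the page re-renders (old element references go stale). The fixed sleeps are a simplification; the explicit wait shown earlier is more robust.


In [ ]:
import time

def dropdown_options(name):
    # return {text: value} for the numeric options of a named dropdown
    menu = driver.find_element_by_name(name)
    return {o.text.strip(): o.get_attribute("value")
            for o in menu.find_elements_by_tag_name("option")
            if o.get_attribute("value").isdigit()}

def select_option(name, value):
    # re-find the dropdown (the page re-renders) and select a value
    Select(driver.find_element_by_name(name)).select_by_value(value)
    time.sleep(2)  # crude wait for the page to update

all_data = []
for d_value in dropdown_options("ddldistrict").values():
    select_option("ddldistrict", d_value)
    for b_value in dropdown_options("ddlblock").values():
        select_option("ddlblock", b_value)
        for g_value in dropdown_options("ddlgp").values():
            select_option("ddlgp", g_value)
            all_data.extend(scrape_table())  # hypothetical: section 3's parsing, as a function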


In [ ]:
header = data[0].keys()

with open('WBS-table.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, header)
    dict_writer.writeheader()
    dict_writer.writerows(data)
    
pandas.read_csv('WBS-table.csv')