Introduction to Web Scraping

Often we are interested in getting data from a website. Many modern websites expose a REST Application Programming Interface (API), which lets us make HTTP requests and retrieve structured data in the form of JSON or XML.

However, when no clear API is available, we may need to perform web scraping by grabbing the data directly ourselves.


In [ ]:
try:
    import requests
except ImportError:
    !pip install requests
    import requests
try:
    from bs4 import BeautifulSoup
except ImportError:
    !pip install bs4
    from bs4 import BeautifulSoup

Getting data using requests

Requests is an excellent library for performing HTTP requests.

In this simple example, we will scrape data from the PBS faculty webpage.


In [ ]:
page = requests.get("http://pbs.dartmouth.edu/people")
print(page)

Here the response '200' indicates that the GET request was successful. Now let's look at the actual text that was downloaded from the webpage.


In [ ]:
print(page.content)

Here you can see that we have downloaded all of the data from the PBS faculty page and that it is in the form of HTML. HTML is a markup language that tells a browser how to lay out content. HTML documents are made up of elements called tags. Most tags come in matched pairs that mark where an element begins and ends. Here are a few examples:

  • <a></a> - a hyperlink
  • <p></p> - a paragraph
  • <div></div> - a division, or area, of the page
  • <b></b> - bolds any text inside
  • <i></i> - italicizes any text inside
  • <h1></h1> - a header
  • <table></table> - a table
  • <ol></ol> - an ordered list
  • <ul></ul> - an unordered list
  • <li></li> - a list item

Parsing HTML using Beautiful Soup

There are many libraries that can be helpful for quickly parsing structured text such as HTML. We will be using Beautiful Soup as an example.


In [ ]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Here we are going to find the unordered list tagged with the id 'faculty-container'. We are then going to look for any nested 'h4' header tags. This should give us a list of all of the lines containing faculty names.


In [ ]:
names_html = soup.find_all('ul',id='faculty-container')[0].find_all('h4')
names = [x.text for x in names_html]
print(names)
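Because the live page can change over time, it can help to verify the same parsing pattern on a small self-contained fragment. Here is a sketch using made-up names and a structure assumed to mimic the faculty page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the assumed structure of the faculty page
html = """
<ul id="faculty-container">
  <li><h4>Jane A. Doe</h4><span class="contact">jane.a.doe@dartmouth.edu</span></li>
  <li><h4>John B. Smith</h4><span class="contact">john.b.smith@dartmouth.edu</span></li>
</ul>
"""

# Same pattern as above: grab the list by id, then pull out the nested h4 tags
soup = BeautifulSoup(html, 'html.parser')
names = [h4.text for h4 in soup.find_all('ul', id='faculty-container')[0].find_all('h4')]
print(names)  # ['Jane A. Doe', 'John B. Smith']
```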

What if we wanted to get all of the faculty email addresses?


In [ ]:
email_html = soup.find_all('ul',id='faculty-container')[0].find_all('span',{'class' : 'contact'})
email = [x.text for x in email_html]
print(email)

Parsing string data

What if we wanted to grab just the name portion of each email address?


In [ ]:
print([x.split('@')[0] for x in email])

One thing we might do with this data is create a dictionary with names and emails of all of the professors in the department. This could be useful if we wanted to send a bulk email to them.


In [ ]:
email_dict = dict([(x.split('@')[0],x) for x in email])
print(email_dict)

You can see that every name also includes a middle initial. Let's pull out just the first and last name.


In [ ]:
for x in list(email_dict.keys()):  # snapshot the keys so we can modify the dict while looping
    old = x.split('.')
    # Keep only name parts longer than two characters (drops single-letter initials)
    email_dict[" ".join([i for i in old if len(i) > 2])] = email_dict.pop(x)

print(email_dict)
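The same cleanup can also be written as a single dictionary comprehension. Here is a sketch on a couple of made-up addresses in the same first.initial.last format:

```python
# Hypothetical addresses in the first.initial.last@domain format
emails = ['jane.a.doe@dartmouth.edu', 'john.b.smith@dartmouth.edu']

# Keep only name parts longer than two characters (drops single-letter initials)
email_dict = {
    " ".join(part for part in e.split('@')[0].split('.') if len(part) > 2): e
    for e in emails
}
print(email_dict)
# {'jane doe': 'jane.a.doe@dartmouth.edu', 'john smith': 'john.b.smith@dartmouth.edu'}
```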

Interacting with web pages using Selenium

Sometimes we need to interact directly with parts of a webpage. Maybe a form needs to be submitted, or a JavaScript button needs to be pressed. Selenium is a useful tool for these types of tasks.


In [ ]: