Often we are interested in getting data from a website. Modern websites are often built on a REST framework that exposes an Application Programming Interface (API), which lets us make HTTP requests and retrieve structured data in the form of JSON or XML.
However, when there is no clear API, we may need to perform web scraping by directly grabbing the data ourselves.
In [ ]:
try:
    import requests
except ImportError:
    !pip install requests
    import requests
try:
    from bs4 import BeautifulSoup
except ImportError:
    !pip install bs4
    from bs4 import BeautifulSoup
Requests is an excellent library for performing HTTP requests.
In this simple example, we will scrape data from the PBS faculty webpage.
In [ ]:
page = requests.get("http://pbs.dartmouth.edu/people")
print(page)
Here the response '200' indicates that the GET request was successful. Now let's look at the actual text that was downloaded from the webpage.
In [ ]:
print(page.content)
Here you can see that we have downloaded all of the data from the PBS faculty page and that it is in the form of HTML. HTML is a markup language that tells a browser how to lay out content. HTML consists of elements delimited by tags; each element has an opening tag and a matching closing tag. Here are a few examples:
<a></a> - indicates a hyperlink
<p></p> - indicates a paragraph
<div></div> - indicates a division, or area, of the page
<b></b> - bolds any text inside
<i></i> - italicizes any text inside
<h1></h1> - indicates a header
<table></table> - creates a table
<ol></ol> - ordered list
<ul></ul> - unordered list
<li></li> - list item
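To see how these tags map to parsed elements, here is a minimal sketch; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# A made-up snippet using a few of the tags above
html = ("<div><h1>Faculty</h1><p>Welcome to the <b>PBS</b> department.</p>"
        "<ul><li>Teaching</li><li>Research</li></ul></div>")

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)             # text inside the header tag
print(soup.find('b').text)      # text inside the bold tag
print([li.text for li in soup.find_all('li')])  # all list items
```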
In [ ]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Here we are going to find the unordered list tagged with the id 'faculty-container'. We are then going to look for any nested tags that use the 'h4' header tag. This should give us all of the lines with the faculty names as a list.
In [ ]:
names_html = soup.find_all('ul',id='faculty-container')[0].find_all('h4')
names = [x.text for x in names_html]
print(names)
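The same find_all pattern can be tried on a small fragment shaped like the faculty page; the names and structure below are invented for illustration, not taken from the real page:

```python
from bs4 import BeautifulSoup

# Invented fragment mimicking the faculty page's structure
html = """
<ul id="faculty-container">
  <li><h4>Ada Lovelace</h4><span class="contact">ada.lovelace@example.edu</span></li>
  <li><h4>Alan Turing</h4><span class="contact">alan.turing@example.edu</span></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Narrow to the list with the 'faculty-container' id, then grab nested h4 tags
container = soup.find_all('ul', id='faculty-container')[0]
names = [x.text for x in container.find_all('h4')]
print(names)  # ['Ada Lovelace', 'Alan Turing']
```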
What if we wanted to get all of the faculty email addresses?
In [ ]:
email_html = soup.find_all('ul',id='faculty-container')[0].find_all('span',{'class' : 'contact'})
email = [x.text for x in email_html]
print(email)
In [ ]:
print([x.split('@')[0] for x in email])
One thing we might do with this data is create a dictionary with names and emails of all of the professors in the department. This could be useful if we wanted to send a bulk email to them.
In [ ]:
email_dict = dict([(x.split('@')[0],x) for x in email])
print(email_dict)
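If the names list and email list come back in the same order, a name-to-email mapping could also be built with zip; a sketch with invented sample data:

```python
# Sample data standing in for the scraped lists
names = ['Ada Lovelace', 'Alan Turing']
email = ['ada.lovelace@example.edu', 'alan.turing@example.edu']

# zip pairs the i-th name with the i-th address
name_to_email = dict(zip(names, email))
print(name_to_email)
```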
You can see that every key also includes a middle initial. Let's try to pull out just the first and last name.
In [ ]:
for x in list(email_dict.keys()):
    old = x.split('.')
    # Keep only the pieces longer than two characters (this drops initials)
    email_dict[" ".join([i for i in old if len(i) > 2])] = email_dict[x]
    del email_dict[x]
print(email_dict)
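Because deleting keys while looping over a dictionary is easy to get wrong, the same renaming can also be done in one pass with a dictionary comprehension; a sketch with invented sample data:

```python
# Sample mapping standing in for email_dict
email_dict = {'ada.x.lovelace': 'ada.x.lovelace@example.edu'}

# Drop short pieces (initials) from each key and rejoin with spaces
cleaned = {" ".join(p for p in k.split('.') if len(p) > 2): v
           for k, v in email_dict.items()}
print(cleaned)  # {'ada lovelace': 'ada.x.lovelace@example.edu'}
```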
Sometimes we need to directly interact with aspects of the webpage. Maybe there is a form that needs to be submitted or a JavaScript button that needs to be pressed. Selenium is a useful tool for these types of tasks.
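A minimal Selenium sketch is below. The element id is a hypothetical placeholder (inspect the page to find a real one), and running this requires a browser driver such as chromedriver on your PATH; it is not a definitive implementation:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes chromedriver is installed and on your PATH
driver = webdriver.Chrome()
driver.get("http://pbs.dartmouth.edu/people")

# Hypothetical element id -- inspect the page to find the real one
button = driver.find_element(By.ID, "load-more")
button.click()

# The rendered page source can then be handed to BeautifulSoup
html = driver.page_source
driver.quit()
```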
In [ ]: