In [1]:
#!conda install requests
#!conda install beautifulsoup4
#!conda install selenium
In [2]:
import requests
from bs4 import BeautifulSoup
import selenium.webdriver
import pandas as pd
Acknowledgements: The code below is very much inspired by Chris Bail's "Screen-Scraping in R". Thanks Chris!
Web scraping (also sometimes called "screen-scraping") is a method for extracting data from the web. There are many techniques that can be used for web scraping, ranging from those requiring human involvement ("human copy-paste") to fully automated systems. For research questions where you need to visit many webpages and collect essentially the same kind of information from each, web scraping can be a great tool.
The typical web scraping program: 1) loads the address of a webpage, 2) downloads its HTML (or XML), 3) extracts the desired information from that markup, and 4) saves the information in a structured format (e.g. a CSV file). A skeleton of such a program is sketched below.
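In skeleton form, that program might look something like this (a minimal sketch: the URL and the 'item' class are placeholders, not a real site):
In [ ]:
# A minimal sketch of the typical web scraping program.
# The URL and the 'item' class below are placeholders, not a real site.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-listing"      # 1. the address of the page
html = requests.get(url).text                 # 2. download its HTML
soup = BeautifulSoup(html, 'html.parser')     # 3. extract the data...
items = [div.get_text(strip=True)
         for div in soup.find_all('div', {'class': 'item'})]
with open("results.csv", "w") as f:           # 4. save it in structured form
    f.write("\n".join(items))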
When the internet was young, web scraping was a common and legally acceptable practice for collecting data on the web. But with the rise of online platforms, some of which rely heavily on user-created content (e.g. Craigslist), the data made available on these sites has come to be recognized by the companies behind them as highly valuable. Furthermore, from a website developer's perspective, web crawlers are able to request many pages from a site in rapid succession, increasing server load and generally being a nuisance.
Thus many websites, especially large sites (e.g. Yelp, AllRecipes, Instagram, The New York Times, etc.), have now forbidden "crawlers" / "robots" / "spiders" from harvesting their data in their "Terms of Service" (TOS). From Yelp's Terms of Service:
Before embarking on a research project that will involve web scraping, it is important to understand the TOS of the site you plan on collecting data from.
If the site does allow web scraping (and you've consulted your legal professional), check whether it has a robots.txt file. Many websites use this file to tell search engines and web scrapers, written by researchers like you, how to interact with the site "politely" (i.e. the number of requests that can be made, pages to avoid, etc.).
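For example, Python's built-in urllib.robotparser module can read a robots.txt file for you. Here's a quick sketch, pointed at the Boulder Humane Society site we scrape below:
In [ ]:
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.boulderhumane.org/robots.txt")
rp.read()

# Is a generic crawler ("*") allowed to fetch the dogs listing page?
print(rp.can_fetch("*", "https://www.boulderhumane.org/animals/adoption/dogs"))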
When you visit a webpage, your web browser renders an HTML document with CSS and Javascript to produce a visually appealing page. For example, to us, the Boulder Humane Society's listing of dogs available for adoption looks something like what's displayed at the top of the browser below:
But to your web browser, the page actually looks like the HTML source code (basically instructions for what text and images to show and how to do so) shown at the bottom of the page. To see the source code of a webpage in Safari, go to Develop > Show Page Source; in Chrome, go to View > Developer > View Source.
To request the HTML for a page in Python, you can use the requests package, like so:
In [3]:
pet_pages = ["https://www.boulderhumane.org/animals/adoption/dogs",
             "https://www.boulderhumane.org/animals/adoption/cats",
             "https://www.boulderhumane.org/animals/adoption/adopt_other"]
r = requests.get(pet_pages[0])
html = r.text
print(html[:500]) # Print the first 500 characters of the HTML. Notice how it's the same as the screenshot above.
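Since we'll eventually want all three listing pages, it's polite to pause between requests rather than firing them off back-to-back. A short sketch (the one-second delay is an arbitrary choice; the site's robots.txt may specify a crawl delay):
In [ ]:
import time

pages_html = {}
for url in pet_pages:
    pages_html[url] = requests.get(url).text  # requests was imported above
    time.sleep(1)  # Pause so we don't hammer the server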
Now that we've downloaded the HTML of the page, we next need to parse it. Let's say we want to extract all of the names, ages, and breeds of the dogs, cats, and small animals currently up for adoption at the Boulder Humane Society.
Actually, navigating to the location of those attributes in the page can be somewhat tricky. Luckily HTML has a tree structure, as shown below, where tags fit inside other tags. For example, the title of the document is located within its head, which in turn is located within the larger html document (<html> <head> <title> </title> ... </head> ... </html>).
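To see this tree structure in miniature, here is a toy document (not the Humane Society page) parsed with BeautifulSoup, the library we introduce properly just below:
In [ ]:
# A toy example of HTML's tree structure: title nests inside head,
# which nests inside html.
toy_html = "<html><head><title>Adoptable Dogs</title></head><body><div>Fido</div></body></html>"
toy = BeautifulSoup(toy_html, 'html.parser')
print(toy.html.head.title)           # <title>Adoptable Dogs</title>
print(toy.html.body.div.get_text())  # Fido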
To find the first pet on the page, we'll find that HTML element's "CSS selector". Within Safari, hover your mouse over the image of the first pet and then control+click on the image. This should highlight the section of HTML where the object you are trying to parse is found. Sometimes you may need to move your mouse through the HTML to find the exact location of the object (see GIF).
(You can also go to 'Develop > Show Page Source' and then click 'Elements'. Hover your mouse over sections of the HTML until the object you are trying to find is highlighted within your browser.)
BeautifulSoup is a Python library for parsing HTML. We'll pass the CSS selector that we just copied to BeautifulSoup to grab that object. Notice below how select-ing on that pet shows us all of its attributes.
In [5]:
soup = BeautifulSoup(html, 'html.parser')
pet = soup.select("#block-system-main > div > div > div.view-content > div.views-row.views-row-1.views-row-odd.views-row-first.On.Hold")
print(pet)
Furthermore, we can select the name, breeds, age, and gender of the pet by find-ing the div tags which contain this information. Notice how the div tag has the attribute (attrs) class with value "views-field views-field-field-pp-animalname".
In [6]:
name = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-animalname'})
primary_breed = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-primarybreed'})
secondary_breed = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-secondarybreed'})
age = pet[0].find('div', attrs = {'class': 'views-field views-field-field-pp-age'})
In [7]:
# We can call `get_text()` on those objects to print them nicely.
print({
"name": name.get_text(strip = True),
"primary_breed": primary_breed.get_text(strip = True),
"secondary_breed": secondary_breed.get_text(strip = True),
"age": age.get_text(strip=True)
})
Now, to get the HTML object for each pet, we could find each one's CSS selector individually. Or we can exploit the fact that every pet lives in the same HTML structure: each sits in a div tag whose class attribute contains the string views-row. We'll find all of those tags and print out their attributes like we just did.
In [8]:
all_pets = soup.find_all('div', {'class': 'views-row'})
In [9]:
for pet in all_pets:
    name = pet.find('div', {'class': 'views-field views-field-field-pp-animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': 'views-field views-field-field-pp-primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': 'views-field views-field-field-pp-secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': 'views-field views-field-field-pp-age'}).get_text(strip=True)
    print([name, primary_breed, secondary_breed, age])
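Rather than just printing each pet, we could collect the records into a pandas DataFrame, using the pd imported at the top (a sketch; the CSV filename is just an example):
In [ ]:
records = []
for pet in all_pets:
    records.append({
        'name': pet.find('div', {'class': 'views-field views-field-field-pp-animalname'}).get_text(strip=True),
        'primary_breed': pet.find('div', {'class': 'views-field views-field-field-pp-primarybreed'}).get_text(strip=True),
        'secondary_breed': pet.find('div', {'class': 'views-field views-field-field-pp-secondarybreed'}).get_text(strip=True),
        'age': pet.find('div', {'class': 'views-field views-field-field-pp-age'}).get_text(strip=True),
    })
pets_df = pd.DataFrame(records)  # One row per pet, one column per attribute
# pets_df.to_csv("pets.csv", index=False)  # Uncomment to write to disk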
This may seem like a fairly silly example of web scraping, but one could imagine several research questions using this data. For example, if we collected this data over time (e.g. using the Wayback Machine), could we identify what features of pets -- names, breeds, ages -- make them more likely to be adopted? Are there certain names that are more common for certain breeds? Or maybe your research question is something even wackier.
Pandas has really neat functionality in read_html, which downloads an HTML table directly from a webpage and loads it into a DataFrame.
In [10]:
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches", header=0)[0]
#table.to_csv("filenamehere.csv") # Write table to CSV
In [11]:
table.head(20)
Out[11]:
Sometimes our interactions with webpages involve rendering Javascript. For example, think of visiting a webpage with a search box, typing in a query, pressing search, and viewing the result. Or visiting a webpage that requires a login, or clicking through pages in a list. To handle pages like these we'll use a package in Python called Selenium.
Installing Selenium can be a little tricky. You'll want to follow the directions as best you can here. Requirements (one of the below, matching the driver options in the next cell): Safari on macOS (safaridriver is built in), Chrome plus the chromedriver executable, or Firefox plus the geckodriver executable.
First a fairly simple example: let's visit xkcd and click through the comics.
In [12]:
driver = selenium.webdriver.Safari() # This command opens a window in Safari
# driver = selenium.webdriver.Chrome(executable_path = "<path to chromedriver>") # This command opens a window in Chrome
# driver = selenium.webdriver.Firefox(executable_path = "<path to geckodriver>") # This command opens a window in Firefox
# Get the xkcd website
driver.get("https://xkcd.com/")
In [13]:
# Let's find the 'Random' button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a')
element.click()
In [14]:
# Find an attribute of this page - the title of the comic.
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")
Out[14]:
In [15]:
# Continue clicking through the comics
driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a').click()
In [16]:
driver.quit() # Always remember to close your browser!
We'll now walk through how we can use Selenium to navigate a box office data website called "Box Office Mojo".
In [17]:
driver = selenium.webdriver.Safari() # This command opens a window in Safari
# driver = selenium.webdriver.Chrome(executable_path = "<path to chromedriver>") # This command opens a window in Chrome
# driver = selenium.webdriver.Firefox(executable_path = "<path to geckodriver>") # This command opens a window in Firefox
driver.get('https://www.boxofficemojo.com')
Let's say I wanted to know which movie has been more lucrative: 'Wonder Woman', 'Black Panther', or 'Avengers: Infinity War'. I could type 'Avengers: Infinity War' into the search bar on the upper left.
In [18]:
# Type in the search bar, and click 'Search'
driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[1]').send_keys('Avengers: Infinity War')
driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[2]').click()
Now, I can grab the table of results with Selenium and parse it with pandas' read_html.
In [19]:
# This is what the table looks like
table = driver.find_element_by_xpath('//*[@id="body"]/table[2]')
# table.get_attribute('innerHTML').strip()
In [20]:
pd.read_html(table.get_attribute('innerHTML').strip(), header=0)[2]
Out[20]:
In [21]:
# Find the link to more details about the Avengers movie and click it
driver.find_element_by_xpath('//*[@id="body"]/table[2]/tbody/tr/td/table[2]/tbody/tr[2]/td[1]/b/font/a').click()
Now, we can do the same for the remaining movies, 'Wonder Woman' and 'Black Panther', as sketched below ...
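For instance, we could wrap the same search steps in a loop. This is a sketch: it assumes the XPaths above work for every search and that the driver from In [17] is still open.
In [ ]:
for title in ['Wonder Woman', 'Black Panther']:
    driver.get('https://www.boxofficemojo.com')  # Return to the home page
    search_box = driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[1]')
    search_box.send_keys(title)
    driver.find_element_by_xpath('//*[@id="leftnav"]/li[2]/form/input[2]').click()
    results = driver.find_element_by_xpath('//*[@id="body"]/table[2]')
    print(pd.read_html(results.get_attribute('innerHTML').strip(), header=0)[2].head())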
In [22]:
driver.quit() # Always remember to close your browser!
Again, this might seem like a fairly simple example of web scraping, but there are some fun data science questions you could answer with this new technique. For example, for how many movie franchises were the sequels more successful than the originals? Furious 7 (see 'Adjusted for Ticket Price Inflation'), for instance, was the most lucrative Fast and the Furious movie.
These two examples hopefully show you how fun web scraping can be, but as I hinted at earlier, in some cases web scraping is against a site's Terms of Service (or even illegal), and in others it is simply tedious. So when should you use this new tool? Here are some pointers:
In this notebook, we learned how to scrape data from the web using the Python packages requests, BeautifulSoup, and Selenium. Some of you have already had experience with web scraping, but for others, this may have been your first time collecting digital trace data.
This group exercise is designed to strike a balance between practicing rudimentary skills (for those of you with little or no experience in this area) and exploring cutting-edge techniques (for those of you with extensive expertise in this area). As an added bonus, this exercise not only challenges you to practice your coding skills, but also to think about how to ask questions that contribute new knowledge to sociological theory.
There is only one requirement: the group member with the least amount of experience coding should be responsible for typing the code into a computer. After 45 minutes, we will share our work with the group. Let us know if you'd like to present your group's potential project. Remember that these daily exercises are a way for you to explore new possible topics and to get to know each other better.
Here is a short list of open source data and websites which may be useful for the group activity above. If you have any to add, please tell Allie, or better yet, start an issue or submit a pull request.
In [ ]: