This activity allows you to practice using the Beautiful Soup library to scrape data from the web. It also allows you to practice using a Jupyter Notebook to both document and perform your work. As you can see, you can write Markdown as well as Python.
Quick tips: press esc, then m, then enter to start writing Markdown rather than code; press shift and enter to run a code cell.
In order to use the Python libraries, we'll need to ensure they're installed on your machine. You can do this easily by running the following command(s) in your terminal:
# Install beautifulsoup using pip on the terminal
pip install beautifulsoup4
pip install pygeocoder
pip install plotly
You should now be able to import the library inside this notebook by running the following line of Python code:
In [6]:
from bs4 import BeautifulSoup as bs, SoupStrainer as ss
We'll also need to import a few other libraries, such as pandas to manage our data and requests to make HTTP requests:
In [7]:
import requests as r
import pandas as p
import re
from pygeocoder import Geocoder
import plotly.offline as py
py.init_notebook_mode() # For offline plotting
Our first task is to use Python to identify the links to institution pages on the website. We'll begin by requesting the page content. Due to peculiarities of how the page is built on the client side, we'll read a local copy of the page using the codecs package.
In [8]:
import codecs
file = codecs.open("college-site.html", 'r', encoding="utf8")
page_content = file.read()
soup = bs(page_content, 'html.parser')
In [75]:
# Find the TuitionGrid table (or extract information as you see fit)
In [76]:
# Extract each row from the table
In [1]:
# Look at a single row of your table, and figure out how to extract the address from it
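One possible approach for the three cells above is sketched below. Since the real markup of college-site.html isn't shown here, the id `TuitionGrid` and the table structure in `sample_html` are assumptions; inspect the actual page (e.g. with your browser's developer tools) to find the right selector.

```python
from bs4 import BeautifulSoup as bs

# Stand-in for the real page content; the actual markup of
# college-site.html may differ (the id "TuitionGrid" is an assumption)
sample_html = """
<table id="TuitionGrid">
  <tr><th>Institution</th></tr>
  <tr><td><a href="https://example.edu/a">College A</a></td></tr>
  <tr><td><a href="https://example.edu/b">College B</a></td></tr>
</table>
"""
soup = bs(sample_html, 'html.parser')

# Find the TuitionGrid table
table = soup.find('table', id='TuitionGrid')

# Extract each row from the table
rows = table.find_all('tr')

# Look at a single row to see what it contains
print(rows[1])
```

With the real page, you would call `soup.find(...)` on the soup built from `page_content` instead of `sample_html`.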
In [9]:
# Write a simple function to extract the link from each row
def extract_url(row):
    # Write code here
    pass
In [71]:
# List to store links
links = []
# Iterate through table rows and use the `extract_url` function to get the URL and store it in `links`
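A sketch of what `extract_url` and the collection loop might look like, again using stand-in markup (the real rows may nest the link differently). It assumes each data row contains an `<a>` tag whose `href` is the institution link, and skips header rows that have no link.

```python
from bs4 import BeautifulSoup as bs

# Stand-in rows; the real table's structure is an assumption
sample_html = """
<table id="TuitionGrid">
  <tr><th>Institution</th></tr>
  <tr><td><a href="https://example.edu/a">College A</a></td></tr>
  <tr><td><a href="https://example.edu/b">College B</a></td></tr>
</table>
"""
rows = bs(sample_html, 'html.parser').find_all('tr')

def extract_url(row):
    """Return the href of the first link in a row, or None for header rows."""
    link = row.find('a')
    return link['href'] if link else None

# Iterate through the rows, keeping only rows that actually have a link
links = []
for row in rows:
    url = extract_url(row)
    if url:
        links.append(url)

print(links)  # ['https://example.edu/a', 'https://example.edu/b']
```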
In [26]:
# Write a function to retrieve the address of an institution given its URL (go to the URL, extract the address)
def get_address(url):
    # Write code here
    pass
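One way to structure `get_address` is to separate fetching from parsing, so the parsing logic can be checked without network access. The selector below (`class_='address'`) is an assumption; inspect a real institution page to find where the address actually lives.

```python
from bs4 import BeautifulSoup as bs

def parse_address(html):
    """Pull the address out of an institution page.

    Assumes the address sits in an element with class "address";
    check a real institution page for the actual selector.
    """
    soup = bs(html, 'html.parser')
    tag = soup.find(class_='address')
    return tag.get_text(strip=True) if tag else None

def get_address(url, fetch=None):
    """Fetch a page and parse its address.

    `fetch` is injectable so this sketch can run offline; in the
    notebook you would omit it and let requests fetch the live page.
    """
    if fetch is None:
        import requests
        html = requests.get(url).text
    else:
        html = fetch(url)
    return parse_address(html)

# Offline check with a stand-in page
sample_page = '<div class="address">123 College Ave, Townsville, ST 00000</div>'
print(get_address('https://example.edu/a', fetch=lambda url: sample_page))
```

The collection loop then mirrors the one for links: iterate over `links`, call `get_address(url)`, and append each result to `addresses`.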
In [29]:
# List to store addresses
addresses = []
# Iterate through links and use your `get_address` function to get the address and store it in `addresses`
In [50]:
# List to store coordinates
coordinates = []
# Iterate through the addresses and use the `Geocoder.geocode` function to get the lat/long
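With pygeocoder, the live call for each address is `Geocoder.geocode(address).coordinates`, which yields a (lat, lng) pair. So the sketch below runs without network access or an API key, `fake_geocode` stands in for that call; only the loop pattern carries over to the notebook.

```python
# Live version (requires network access):
#   from pygeocoder import Geocoder
#   coordinates.append(Geocoder.geocode(address).coordinates)

# Hypothetical addresses from the previous step
addresses = [
    '123 College Ave, Townsville, ST 00000',
    '456 University Dr, Cityburg, ST 11111',
]

def fake_geocode(address):
    """Offline stand-in for Geocoder.geocode(...).coordinates."""
    table = {
        '123 College Ave, Townsville, ST 00000': (40.0, -75.0),
        '456 University Dr, Cityburg, ST 11111': (41.5, -81.7),
    }
    return table[address]

# Iterate through the addresses, collecting (lat, lng) pairs
coordinates = []
for address in addresses:
    coordinates.append(fake_geocode(address))

print(coordinates)  # [(40.0, -75.0), (41.5, -81.7)]
```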
In [ ]:
# Define coordinates as a dataframe
# Plot with plotly, from example: https://plot.ly/python/scatter-plots-on-maps/
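A sketch of the final step, using hypothetical coordinates. The trace dictionary follows the structure of the linked scatter-plots-on-maps example; the `scope='usa'` setting assumes the institutions are US-based. Rendering with `py.iplot` is left commented out since it only displays inside the notebook.

```python
import pandas as p

# Hypothetical coordinates from the geocoding step
coordinates = [(40.0, -75.0), (41.5, -81.7)]

# Define coordinates as a dataframe
df = p.DataFrame(coordinates, columns=['lat', 'lon'])

# Trace structure following https://plot.ly/python/scatter-plots-on-maps/
data = [dict(
    type='scattergeo',
    locationmode='USA-states',
    lat=df['lat'],
    lon=df['lon'],
    mode='markers',
)]
layout = dict(title='Institution locations', geo=dict(scope='usa'))

# In the notebook, render the map with:
#   py.iplot(dict(data=data, layout=layout))
print(df)
```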