Education Locations

Overview

This activity allows you to practice using the Beautiful Soup library to scrape data from the web. It also lets you practice using a Jupyter Notebook to both document and perform your work. As you can see, you can write Markdown as well as Python.

Quick tips

  • Press esc, then m, then enter to switch a cell to Markdown rather than code
  • Press shift+enter to run a code cell

Set up

To use the Python libraries, we'll need to ensure they're installed on your machine. You can do this by running the following commands in your terminal:

# Install beautifulsoup using pip on the terminal
pip install beautifulsoup4
pip install pygeocoder
pip install plotly

You should now be able to import the library inside this notebook by running the following line of Python code:


In [6]:
from bs4 import BeautifulSoup as bs, SoupStrainer as ss



We'll also need to import a few other libraries, such as pandas to manage our data and requests to make HTTP requests:


In [7]:
import requests
import pandas as pd
import re
from pygeocoder import Geocoder
import plotly.offline as py
py.init_notebook_mode() # For offline plotting


Our first task is to use Python to identify the links to institution pages on their website. We'll begin by requesting the page content. Due to peculiarities of how the page is built on the client side, we'll read a local version of the page using the codecs package.


In [8]:
import codecs
with codecs.open("college-site.html", 'r', encoding="utf8") as file:
    page_content = file.read()
soup = bs(page_content, 'html.parser')

Now that we have all the page content, you should open up the website in your browser to identify the part of the DOM where the relevant information is.


In [75]:
# Find the TuitionGrid table (or extract information as you see fit)
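As a sketch of this step, the snippet below finds a table by id and pulls out its rows. It runs against a miniature stand-in for the real page; the actual table's id and structure may differ, so inspect the DOM in your browser to confirm the right selector.

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the real page -- the real markup may differ
sample_html = """
<table id="TuitionGrid">
  <tr><th>Institution</th></tr>
  <tr><td><a href="https://example.edu/college-a">College A</a></td></tr>
  <tr><td><a href="https://example.edu/college-b">College B</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Find the table by its id, then pull out every row
table = soup.find('table', id='TuitionGrid')
rows = table.find_all('tr')
print(len(rows))  # header row + 2 data rows
```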

In [76]:
# Extract each row from the table

In [1]:
# Look at a single row of your table, and figure out how to extract the address from it

In this section, we'll iterate through the table rows and extract the links from each one.


In [9]:
# Write a simple function to extract the link from each row
def extract_url(row):
    # Write code here
    pass
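One possible implementation, assuming each row's first `<a>` tag holds the institution link (check a real row to confirm):

```python
from bs4 import BeautifulSoup

def extract_url(row):
    """Return the href of the first link in a table row, or None."""
    link = row.find('a')
    return link['href'] if link else None

# Demonstrate on a made-up row -- the real markup may differ
row = BeautifulSoup(
    '<tr><td><a href="https://example.edu/a">A</a></td></tr>',
    'html.parser').tr
print(extract_url(row))
```

With a function like this, the next cell reduces to something like `links = [extract_url(r) for r in rows if extract_url(r)]`.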

In [71]:
# List to store links
links = []

# Iterate through table rows and use the `extract_url` function to get the URL and store it in `links`

In [26]:
# Write a function to retrieve the address of an institution given its URL (go to the URL, extract the address)
def get_address(url):
    # Write code here
    pass
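A sketch of one approach: split the work into a fetch step and a parse step so the parsing can be tested without the network. The `class_='address'` selector here is an assumption; inspect a real institution page to find where the address actually lives.

```python
from bs4 import BeautifulSoup
import requests

def parse_address(html):
    """Pull the address text out of an institution page.
    Assumes the address sits in an element with class "address" --
    adjust the selector after inspecting a real page."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find(class_='address')
    return tag.get_text(strip=True) if tag else None

def get_address(url):
    """Fetch an institution page and extract its address."""
    response = requests.get(url)
    response.raise_for_status()
    return parse_address(response.text)

# Parsing demo on a stub page (no network needed)
print(parse_address('<div class="address">123 Main St, Springfield, IL</div>'))
```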

In [29]:
# List to store addresses
addresses = []

# Iterate through links and use your `get_address` function to get the address and store it in `addresses`

In [50]:
# List to store coordinates
coordinates = []

# Iterate through the addresses and use the `Geocoder.geocode` function to get the lat/long
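A possible shape for this loop, assuming the `addresses` list built above. `Geocoder.geocode` calls Google's geocoding service over the network, so failures are caught to keep `coordinates` aligned with `addresses` (note that newer versions of the Google API may require an API key).

```python
from pygeocoder import Geocoder

# Sample input -- replace with the `addresses` list built above
addresses = ["1600 Pennsylvania Ave NW, Washington, DC"]

coordinates = []
for address in addresses:
    try:
        result = Geocoder.geocode(address)       # network call to Google
        coordinates.append(result.coordinates)   # (latitude, longitude)
    except Exception:
        coordinates.append(None)  # keep list aligned with addresses
```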

Mapping with Plotly


In [ ]:
# Define coordinates as a dataframe
# Plot with plotly, from example: https://plot.ly/python/scatter-plots-on-maps/
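Following the style of the linked scatter-maps example, a sketch of the final plot, using made-up sample coordinates in place of the ones you geocoded:

```python
import pandas as pd
import plotly.offline as py

# Sample coordinates -- replace with the ones you geocoded
df = pd.DataFrame({
    'lat': [47.61, 40.71],
    'lon': [-122.33, -74.01],
    'name': ['College A', 'College B'],
})

data = [dict(
    type='scattergeo',
    locationmode='USA-states',
    lat=df['lat'],
    lon=df['lon'],
    text=df['name'],
    mode='markers',
)]
layout = dict(title='Institution Locations', geo=dict(scope='usa'))
py.iplot(dict(data=data, layout=layout))
```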