This is a personal project I started to help me tie together general Python usage, web scraping with BeautifulSoup, data extraction with regular expressions, data manipulation with pandas, writing functions to perform hypothesis testing, building linear regression models from scratch, basic machine learning, and more. I greatly appreciate any feedback!
In my previous work running the education startup Newton China, I frequently ended up using the website CollegeData.com as a reference - it lists a ton of data for a lot of schools.
As an example, searching my alma mater brings up Penn State's page, which is actually one of six separate CollegeData pages of information: Overview, Admission, Money Matters, Academics, Campus Life, and Students.
Scrolling down, you'll find a bunch of numbers organized somewhat regularly into rows. Some table rows, like Average GPA, are easy enough, containing only one value. But others, like the row labeled Students Enrolled, contain three pieces of data.
The first number in 'Students Enrolled' is the straightforward number of students enrolled, presumably the 9,800 students attending school in the fall as incoming freshmen. The third number - 29,878 - is the number of students who were offered admission. The second number - 33% - is called the yield rate, and it is the number that raised the questions that inspired me to collect and analyze this data.
The yield rate is the percentage of offered students who actually enrolled, and it is of considerable importance to many college admissions departments. If the admissions department enrolls too few students, the school will not receive enough tuition to cover its costs. If it enrolls too many, the extra students will push dorms and classrooms beyond capacity.
Not every student who is given an offer letter - a letter of acceptance - will accept it. In fact, it seems only 33% of Penn State's offer letters were accepted. Presumably, the other 67% of students who were offered admission elected to attend another school. If this 33% yield rate is relatively consistent year to year, which it normally is, the school will have a good idea of how many offer letters to give out so that it ends up with the right number of enrolled students.
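To make the relationship concrete, here's a quick back-of-the-envelope check using the Penn State figures quoted above:
In [ ]:
enrolled = 9800    # students who accepted their offers and enrolled
admitted = 29878   # students who were offered admission
print('Yield rate: {:.0%}'.format(enrolled / admitted))  # ~33%, matching the CollegeData row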
Curiously, the yield rate at Penn State for men (37%) is a good deal higher than that for women (29%). Men seem more likely to accept Penn State's offers than women are. After looking at a few other schools' yield rates, I became curious whether there actually is a significant disparity in yield rate among schools - and if so, how this disparity is distributed and what factors might predict it.
The collegedata.com URL for Penn State's Admission information page looks like this:
https://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId=59
The 59 at the end of the URL is the schoolId. The 02 part of the URL indicates you are on the 'Admission' information page. Not every schoolId corresponds to a school, but most do, especially in the lower range of numbers. Attempting to access a page for a number with no school loads a default page with the title "Retrieve a Saved Search". After poking around a bit, I found that at larger schoolId values, especially over 1000, real school pages became sparser, and I'm fairly confident there are no schools listed with a schoolId over 2500.
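Assuming the same pgNN template naming holds for all six information pages - that is, that pg01 through pg06 map in order to Overview, Admission, Money Matters, Academics, Campus Life, and Students (an assumption; only the Admission page is confirmed above) - the full set of URLs for a given school could be generated like this:
In [ ]:
# Assumption: pg01-pg06 map, in order, to the six CollegeData information pages.
PAGES = ['Overview', 'Admission', 'Money Matters', 'Academics', 'Campus Life', 'Students']

def page_urls(schoolId):
    template = 'https://www.collegedata.com/cs/data/college/college_pg{:02d}_tmpl.jhtml?schoolId={}'
    return {name: template.format(i, schoolId) for i, name in enumerate(PAGES, start=1)}

page_urls(59)['Admission']  # Penn State's Admission URL, as shown above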
In [29]:
BASE_URL = "http://www.collegedata.com/cs/data/college/college_pg02_tmpl.jhtml?schoolId="
START_ID = 0
END_ID = 2500
In order to determine if the yield disparity is real, I only need to scrape the yield rate from each page. But grabbing as much other data as possible from the six collegedata.com pages of information for each school at the same time could prove useful later when trying to find predictors for yield rate.
I decided a good balance would be to grab only the data in the <td> tags under the heading 'Profile of Fall Admission', identified as <div id='section8'> in the HTML, in addition to the school's name and its location. I can always come back to scrape more if I need it in the future.
I used the wonderful BeautifulSoup library to help me get what I need from CollegeData's pages. It parses the HTML into a tree that can be traversed down (and back up) and side to side to access the elements and strings you want.
The string 'schoolname' is conveniently located in the first <h1> tag. The 'location' is a bit more difficult to grab, as it isn't nested inside a unique tag, but it can be found easily by using the search feature to find the text 'City, State, Zip' and going from there.
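To illustrate the pattern on a toy snippet (hypothetical HTML standing in for CollegeData's actual markup):
In [ ]:
from bs4 import BeautifulSoup

# Hypothetical HTML mimicking the relevant structure of a CollegeData page
html = ('<h1>Penn State University Park</h1>'
        '<table><tr><th>City, State, Zip</th><td>University Park, PA 16802</td></tr></table>')
page = BeautifulSoup(html, 'lxml')
print(page.h1.string)  # school name from the first <h1>
print(page.find('th', text='City, State, Zip').next_sibling.string)  # the <td> next to the matched <th>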
Finding 'section8' is no problem; from there, a relatively simple for loop runs through all of its <tr> descendants, each of which may have a <th> child to be used as a key and a <td> child to be used as a value. Though most of the <th> values are unique, some are not, including 'Women', 'Men', and the SAT score information further down the section.
In [30]:
from bs4 import BeautifulSoup

def scrape(response, REJECT_PAGE_TITLE="Retrieve a Saved Search"):
    # response is from requests.get()
    page = BeautifulSoup(response.text, "lxml")
    if page.h1 is None or page.h1.string == REJECT_PAGE_TITLE:
        return None
    else:
        scraped = {}
        scraped['schoolname'] = page.h1.string
        scraped['location'] = page.find('th', text="City, State, Zip").next_sibling.string
        for tag in page.find(id="section8").descendants:
            if tag.name == 'tr' and tag.th is not None and tag.td is not None:
                key = " ".join(tag.th.stripped_strings)
                val = " ".join(tag.td.stripped_strings)
                while key in scraped:  # temporarily mark duplicate table header cells
                    key += "*"
                scraped[key] = val
        return scraped
An empty dictionary, scraped, is created to hold our scraped values. We add 'schoolname' and 'location' before looping through id=section8, saving every non-empty pair of <th> and <td>. To avoid overwriting duplicate keys, we append a '*' to the key as a temporary marker until we can clean up the key names later. Finally, the scraped dictionary is returned to the caller.
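Before launching the full loop, the function can be sanity-checked against a single known page (using the Penn State schoolId of 59 noted earlier):
In [ ]:
from requests import get

test = scrape(get(BASE_URL + '59'))  # Penn State's Admission page
print(test['schoolname'], '|', test['location'])
list(test.items())[:5]  # a peek at the first few scraped key/value pairs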
In [33]:
from requests import get
from IPython.core.display import clear_output
from warnings import warn

scraped_pages = {}

for schoolId in range(START_ID, END_ID + 1):
    url = BASE_URL + str(schoolId)
    print(url)  # a status update
    response = get(url)
    if response.status_code != 200:
        warn('Request {} caused status code {}.'.format(schoolId, response.status_code))
    scraped = scrape(response)
    if scraped:
        scraped_pages[schoolId] = scraped
    clear_output(wait=True)  # clear the status update

print('Requested schoolId {} to {}, scraping {} pages.'.format(START_ID, END_ID, len(scraped_pages)))
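One caveat: as written, this fires 2,501 requests back to back. If you run it yourself, it may be worth throttling the loop to be polite to the server - for example, with a small hypothetical wrapper like this, swapped in for get in the loop above:
In [ ]:
from time import sleep
import requests

def polite_get(url, delay=0.5):
    # Hypothetical helper: pause before each request to throttle the crawl
    sleep(delay)
    return requests.get(url)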
We now have a dictionary, scraped_pages, whose schoolId keys map to scraped dictionaries, each filled with scraped <th> keys and <td> values in addition to 'location' and 'schoolname'. Pandas' DataFrame.from_dict method will quickly convert this to a DataFrame, which we then back up to .csv before moving forward.
In [40]:
import pandas as pd

college = pd.DataFrame.from_dict(scraped_pages, orient='index')  # convert scraped pages to a pandas DataFrame, one row per schoolId
college.info()
I scraped 53 separate columns for 1813 schools. According to the US Dept. of Education's National Center for Education Statistics, there were 2870 4-year colleges as of 2011. It would be very helpful to know what criteria the people behind CollegeData used to add schools to their site. Are CollegeData's schools only the largest? Only those that filled out a survey? Only those that paid?
I'm not sure how accurately this sample represents the overall population of 4-year colleges - or how well insights drawn from analyzing it will generalize to all schools - so more investigation into CollegeData's data acquisition process is warranted. Be warned!
In [ ]:
college.to_csv('collegedata_scraped_backup.csv') # back up the raw scrape to disk
With that said, I will move on to cleaning the data. Perhaps after some exploration we will have a better idea of the quality of this sample and of the inferences we can make about the yield rate disparity.