Job Register web scraping

I. Scraping the web

Michael Gully-Santiago, October 2, 2014

Basically I'm scraping the AAS Job Register website to compile all the job ads into a single Excel file.

I am following this example from Greg Reda.


In [1]:
from bs4 import BeautifulSoup
from urllib2 import urlopen

In [2]:
import re
#import time
#import pandas as pd
from astropy.table import Table, Column
import numpy as np
#import copy

In [3]:
# Grab the Job Register front page and drill down to the table of listings
BASE_URL = "https://jobregister.aas.org/"
html = urlopen(BASE_URL).read()
soup = BeautifulSoup(html, "lxml")
pppcp2 = soup.find("div", "panel-pane pane-custom pane-2")
paneContent = pppcp2.find("div", "pane-content")
pcTab = paneContent.find("table")
allRows = pcTab.findAll("tr")

In [4]:
lordList = []
for row in allRows:
    td = row.find("td")
    if td is not None:
        # The first cell in each row links to the full job ad
        link = td.a["href"]
        lordList.append(BASE_URL + link)

print 'There are ', len(lordList), ' jobs listed on the AAS job register.'


There are  213  jobs listed on the AAS job register.

woohoo, it works!!

The next step is to now go to each page and scrape the desired information.

By the way, you'll notice that the listing is not sorted by job category (postdocs, faculty, etc.). That's OK, because each ad has a "Job Category" field that we can use to sort things out later (see the sketch after the table is built below).

Let's define the strategy for extracting each element. We are sticking to the DRY ("Don't Repeat Yourself") principle, which is the right way to do things.


In [5]:
def extract_and_format_AAS_sibling_entry(cup_of_soup, sub_tag_name):
    """Return the text following a field's inline label, or '---' if the field is absent."""
    entry = "---"
    thisTag = cup_of_soup.find('div', sub_tag_name)
    if thisTag is not None:
        # The value lives in the node right after the inline field label
        thisLabel = thisTag.find("div", "field-label-inline-first")
        sibling = thisLabel.next_sibling
        # Collapse runs of spaces and strip Windows line endings
        formatted_content = re.sub(' +', ' ', sibling).encode('utf-8', 'ignore').replace("\r\n", "")
        # Decode back to unicode, silently dropping any non-ASCII bytes
        entry = unicode(formatted_content, errors='ignore')
    return entry
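
A quick spot-check of the helper on a single ad (this cell is a sketch, not part of the original run; the class string is copied from the field definitions below, and the choice of lordList[0] is arbitrary):

test_soup = BeautifulSoup(urlopen(lordList[0]).read(), "lxml")
gjd = test_soup.find("fieldset", "fieldgroup group-job-details")
print extract_and_format_AAS_sibling_entry(gjd, "field field-type-text field-field-job-category")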

We have to define the fields

Unfortunately, not every job ad has every field; this is a missing data problem. Because of the missing data, we can't merely say "find all elements in this tag", since we wouldn't necessarily know which one is which. So we have to check for each tag individually, and then enter the value, or a placeholder ("---") if the tag is missing.

Unfortunately I have not exploited clever Python looping, zipping, vectorization, etc. I just repeated the same calls over and over (a sketch of a more compact approach follows the field definitions below).


In [6]:
# Job Details
institute_field = "field field-type-text field-field-institution-name"
jobCat_field = "field field-type-text field-field-job-category"

# Submission Address for Resumes/CVs
attn_to_field = 'field field-type-text field-field-attention-to'
attn_to_title_field = 'field field-type-text field-field-attention-to-title'
attn_to_org_field = 'field field-type-text field-field-attention-to-rganization' #[sic]
attn_to_address_field = 'field field-type-text field-field-attention-to-street-addres' #[sic]
attn_to_city_field = 'field field-type-text field-field-attention-to-city'   
attn_to_state_field = 'field field-type-text field-field-attention-state-province'
attn_to_zip_field = 'field field-type-text field-field-zip-postal-code'         
attn_to_country_field = 'field field-type-text field-field-attention-to-country' 
attn_to_email_field =   'field field-type-text field-field-attention-to-email'

# Inquiries
inquiry_email_field = "field field-type-text field-field-inquirie-email" #[sic]

# Desired columns:
PostDate = []
Deadline = []
JobCategory = []
Institution = []
attn_to = []
attn_to_title = []
attn_to_org = []
attn_to_address = []
attn_to_city = []
attn_to_state = []
attn_to_zip = []
attn_to_country = []
attn_to_email = []
inquiry_email = []

# n.b. when this list later becomes a fixed-width Table column, every entry takes up as much memory as the longest string:
announce = []
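
Here is a hedged sketch of what a more DRY version could look like (it is not used in the loop below; the names cv_fields and cv_columns are mine): drive the submission-address columns from a single mapping of column name to field class.

from collections import OrderedDict

# Hypothetical mapping from output column name to AAS field class (defined above)
cv_fields = OrderedDict([
    ('attn_to',         attn_to_field),
    ('attn_to_title',   attn_to_title_field),
    ('attn_to_org',     attn_to_org_field),
    ('attn_to_address', attn_to_address_field),
    ('attn_to_city',    attn_to_city_field),
    ('attn_to_state',   attn_to_state_field),
    ('attn_to_zip',     attn_to_zip_field),
    ('attn_to_country', attn_to_country_field),
    ('attn_to_email',   attn_to_email_field),
])
cv_columns = {name: [] for name in cv_fields}

# Inside the big loop below, two lines would then replace the nine appends:
#     for name, field in cv_fields.items():
#         cv_columns[name].append(extract_and_format_AAS_sibling_entry(gsa, field))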

Big for loop for all the jobs


In [7]:
# Originally I tested on a subset; now this is the full list
subLordList = lordList

i = 0

for webLink in subLordList:
    i+=1
    #print i
    if ((i % 10) == 0):
        print i
    thisHtml = urlopen(webLink).read()
    soup = BeautifulSoup(thisHtml, "lxml")
    #time.sleep(1)
    
    # ---Submission Dates---
    # n.b. the dates need a slightly different extraction strategy than the labeled fields.
    gsd = soup.find("fieldset", "fieldgroup group-submission-dates")
    dds = gsd.findAll("span","date-display-single")
    
    PostDate.append(str(dds[0].contents[0]))
    Deadline.append(str(dds[2].contents[0]))
    
    # ---Job Details---
    gjd = soup.find("fieldset", "fieldgroup group-job-details")

    JobCategory.append(extract_and_format_AAS_sibling_entry(gjd,jobCat_field))
    Institution.append(extract_and_format_AAS_sibling_entry(gjd,institute_field))
    
    # ---Submission Address for Resumes/CVs---
    gsa = soup.find("fieldset", "fieldgroup group-submission-address")

    attn_to.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_field))
    attn_to_title.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_title_field))
    attn_to_org.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_org_field))
    attn_to_address.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_address_field))
    attn_to_city.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_city_field))
    attn_to_state.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_state_field))
    attn_to_zip.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_zip_field))
    attn_to_country.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_country_field))
    attn_to_email.append(extract_and_format_AAS_sibling_entry(gsa, attn_to_email_field))
    
    # ---Contact Information For Inquiries about the Job---
    gin = soup.find("fieldset", "fieldgroup group-inquiries")
    
    if (gin != None):
        inquiry_email.append(extract_and_format_AAS_sibling_entry(gin, inquiry_email_field))
    else:
        inquiry_email.append(unicode('---'))
    
    # Announcement 
    # nb. Slightly different parsing than the others above
    gga = soup.find("fieldset", "fieldgroup group-announcement")
    ann_tag = gga.find('div', 'field-items')
    announce_raw = ann_tag.getText().encode('utf-8', 'ignore')
    thisAnnounce = unicode(announce_raw, errors='ignore')

    announce.append(thisAnnounce)


10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
190
200
210

In [21]:
out_arr  = [PostDate,
            Deadline, 
            JobCategory,
            Institution,
            lordList,
            attn_to,
            attn_to_title,
            attn_to_org,
            attn_to_address,
            attn_to_city,
            attn_to_state,
            attn_to_zip,
            attn_to_country,
            attn_to_email,
            inquiry_email,
            announce]
out_names = ('PostDate',
            'Deadline',
            'JobCategory',
            'Institution',
            'webURL',
            'attn_to',
            'attn_to_title',
            'attn_to_org',
            'attn_to_address',
            'attn_to_city',
            'attn_to_state',
            'attn_to_zip',
            'attn_to_country',
            'attn_to_email',
            'inquiry_email',
            'announce')

Let's make an abbreviated table by cutting the announcement column; t.remove_column() actually deletes the column from the table. We'll instead write the announcements to a separate file, with one line per announcement.


In [22]:
t = Table(out_arr, names = out_names)
t.remove_column('announce')
t.show_in_browser(jsviewer = True)


Out[22]:
<open file '<fdopen>', mode 'w+b' at 0x109635420>
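
As promised earlier, the 'JobCategory' column is what lets us sort things out. Here is a minimal sketch of counting ads per category (not run above, and it assumes an astropy recent enough to have Table.group_by):

by_cat = t.group_by('JobCategory')
for key, group in zip(by_cat.groups.keys, by_cat.groups):
    print key['JobCategory'], ':', len(group), 'ads'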

Save it to a semicolon-delimited text file (commas appear within the strings, so a comma delimiter would break the parsing).


In [23]:
t.write('data/AllAASjobReg_abbreviated.dat', format='ascii', delimiter=';')

And that's it! I can load the data into an Excel file and make notes.
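
For the Excel step itself, here is one possible sketch, assuming pandas plus an Excel writer (e.g. openpyxl) is installed; the .xlsx filename is made up for illustration:

import pandas as pd

jobs = pd.read_csv('data/AllAASjobReg_abbreviated.dat', sep=';')
jobs.to_excel('data/AllAASjobReg_abbreviated.xlsx', index=False)  # hypothetical output name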

Let's save the announcement text to a document with one line per job. This will make it easy to read in again later.


In [ ]:
# Strip newlines and tabs so that each announcement occupies exactly one line
with open('data/ItemizedAnnouncements.txt', 'w') as f:
    for item in announce:
        cleanedNewLines = item.replace("\n", "")
        jobAnnouncement = cleanedNewLines.replace("\t", "") + "\n"
        f.write(jobAnnouncement)
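
And a sketch of reading the announcements back in later: since there is one line per job, the list index lines up with the rows of the abbreviated table.

with open('data/ItemizedAnnouncements.txt', 'r') as f:
    announcements = [line.rstrip('\n') for line in f]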

The end!