Job register web scraping,

III. Natural Language Processing

Michael Gully-Santiago, October 5, 2014

Extract keyword frequencies from the job announcements


In [1]:
%pylab inline
from astropy.table import Table, Column
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Read the announcements, one per line; the file is closed automatically
with open('data/ItemizedAnnouncements.txt', 'r') as announceFile:
    announce = announceFile.readlines()

# The lord list of jobs, separate from the announcements
dat = pd.read_csv('data/AllAASjobReg_abbreviated.dat', sep=';')

Questions we want to ask:

Easy:

Does the job ad include these stems?

  1. "instrumenta" (-l or -tion)
  2. "theor" (-y or -etical)
  3. "observation"

Hard(er):

Ideally, we'd also like to find out:

  • What are the most unique words in each job ad?
  • What does the entire corpus of job ads look like?
  • Are there significant differences between the Faculty job ads and the Postdoc job ads?

Let's attack the first question since it is relatively easy.


In [3]:
thisWord = "jenga"

searchWords = ['instrumenta', 'theor', 'observation', 'data-intensive', 'computational', 
               'galax', 'exoplanet', 'star formation', 'SDSS','GRB', 'hardware', 'software', 'data science',
               'nano', 'optics', 'x-ray', 'brown dwarf', 'LSST', 'HST', 'statistics', 'GAIA',
               'LBT', 'adaptive optics', 'Kepler', 'Keck', 'ALMA', 'VLT', 'Spitzer', 'simulations', 
               'interdisciplinary', 'MHD', 'high performance computing', 'planetary', 'quantum', 'infrared']

totHits = []

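for thisWord in searchWords:
    # Case-insensitive substring count of this stem in every ad
    occurrences = []
    for advert in announce:
        thisAdvert = advert.lower()
        thisCount = thisAdvert.count(thisWord.lower())
        occurrences.append(thisCount)
        # Uncomment to inspect the context around the first match:
        #loc = thisAdvert.find(thisWord.lower())
        #if (thisCount > 0):
            #print advert[loc-40:loc+80]

    occurrences = np.asarray(occurrences)
    dat[thisWord] = occurrences   # per-ad counts become a new column
    hits = (occurrences > 0)      # True where the ad mentions the stem at all
    totHits.append(sum(hits))
    # %-formatting prints the same line under Python 2 and Python 3
    print("%d %s" % (sum(hits), thisWord))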


62 instrumenta
108 theor
113 observation
3 data-intensive
34 computational
79 galax
35 exoplanet
22 star formation
20 SDSS
2 GRB
10 hardware
26 software
3 data science
3 nano
21 optics
18 x-ray
2 brown dwarf
13 LSST
9 HST
4 statistics
6 GAIA
8 LBT
11 adaptive optics
3 Kepler
11 Keck
18 ALMA
15 VLT
10 Spitzer
26 simulations
14 interdisciplinary
8 MHD
9 high performance computing
43 planetary
7 quantum
25 infrared
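
The counts above come from plain substring matching on lower-cased text, so a short term such as 'ALMA' would also match inside "alma mater", and 'nano' inside any longer word that contains it. A minimal sketch of a stricter matcher follows. The function name `count_term`, the rule that all-uppercase terms are case-sensitive acronyms, and the sample ad text are assumptions made for illustration; they are not part of the original analysis.

```python
import re

def count_term(text, term):
    """Count matches of `term` in `text` that start at a word boundary.

    Assumption (not from the notebook): all-uppercase terms such as 'ALMA'
    are treated as case-sensitive acronyms that must match a whole word;
    everything else is matched case-insensitively as a word-initial stem.
    """
    is_acronym = term.isupper()
    flags = 0 if is_acronym else re.IGNORECASE
    # \b anchors the match at a word boundary; acronyms must also end at one.
    pattern = r'\b' + re.escape(term) + (r'\b' if is_acronym else '')
    return len(re.findall(pattern, text, flags))

# Hypothetical ad text, for illustration only.
advert = "Our alma mater operates ALMA receivers; theoretical and Theory groups welcome."

print(count_term(advert, 'ALMA'))
print(count_term(advert, 'theor'))
print(advert.lower().count('alma'))   # the plain substring count, for comparison
```

The trade-off is that a case-sensitive acronym rule would miss spellings like "Gaia" for the search term 'GAIA', so the rule would need tuning against the real announcements.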

In [4]:
dct = {'word':searchWords, 'counts':totHits}
tdf = pd.DataFrame(dct)
sortedvi = [x for (y,x) in sorted(zip(totHits,searchWords))]

In [5]:
sns.set()
sns.set_context("paper", font_scale=1.5, rc={"lines.linewidth": 2.5})

In [18]:
sns.factorplot('word', 'counts', data=tdf, x_order=sortedvi)
plt.xticks(rotation=90)
plt.savefig('AAS_wordFreq.png')



In [7]:
dat.head()


Out[7]:
  | PostDate | Deadline | JobCategory | Institution | webURL | attn_to | attn_to_title | attn_to_org | attn_to_address | attn_to_city | ... | ALMA | VLT | Spitzer | simulations | interdisciplinary | MHD | high performance computing | planetary | quantum | infrared
0 | October 1, 2014 | November 30, 2014 | Faculty Positions (visiting and non-tenure) | Johns Hopkins University | https://jobregister.aas.org/job_view?JobID=48914 | Margaret Gier | Administrative Manager | Johns Hopkins University | 3400 N. Charles Street | Baltimore | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
1 | October 1, 2014 | November 15, 2014 | Faculty Positions (visiting and non-tenure) | Niels Bohr Institute | https://jobregister.aas.org/job_view?JobID=49155 | Martin Pessah | --- | Niels Bohr International Academy | Niels Bohr Institute | Copenhagen | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0
2 | October 1, 2014 | November 15, 2014 | Faculty Positions (visiting and non-tenure) | Niels Bohr Institute | https://jobregister.aas.org/job_view?JobID=49157 | Martin Pessah | --- | Niels Bohr International Academy | Niels Bohr Institute | Copenhagen | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 | 0
3 | October 1, 2014 | November 10, 2014 | Faculty Positions (visiting and non-tenure) | Florida Gulf Coast University | https://jobregister.aas.org/job_view?JobID=49320 | Florida Gulf Coast Universuty | --- | --- | --- | --- | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | October 1, 2014 | January 15, 2015 | Faculty Positions (visiting and non-tenure) | University of South Florida | https://jobregister.aas.org/job_view?JobID=49327 | Dr. Gerald Woods | Chair, Search Committee | University of South Florida | USF/Physics Department | Tampa | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

5 rows × 50 columns
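
With per-ad keyword counts sitting next to JobCategory, the Faculty versus Postdoc question can be approached by comparing, per category, the fraction of ads that mention a keyword. The sketch below uses only the standard library and hypothetical rows. The postdoc category label is an assumption, since only the faculty label appears in the table above, and `fraction_mentioning` is an illustrative helper name.

```python
from collections import defaultdict

# Hypothetical rows standing in for the real table:
# (JobCategory, count of one keyword, e.g. 'instrumenta').
# The postdoc label below is assumed, not taken from the data.
rows = [
    ("Faculty Positions (visiting and non-tenure)", 0),
    ("Faculty Positions (visiting and non-tenure)", 2),
    ("Post-doctoral Positions and Fellowships", 1),
    ("Post-doctoral Positions and Fellowships", 3),
    ("Post-doctoral Positions and Fellowships", 0),
]

def fraction_mentioning(rows):
    """Return {category: fraction of ads whose keyword count is > 0}."""
    n_ads = defaultdict(int)
    n_hits = defaultdict(int)
    for category, count in rows:
        n_ads[category] += 1
        if count > 0:
            n_hits[category] += 1
    # float() keeps true division under Python 2 as well as Python 3
    return {cat: float(n_hits[cat]) / n_ads[cat] for cat in n_ads}

fractions = fraction_mentioning(rows)
for category in sorted(fractions):
    print("%.2f  %s" % (fractions[category], category))
```

Comparing fractions rather than raw counts matters when the categories contain different numbers of ads. On the real table, a pandas equivalent would likely be a group-by on JobCategory over the boolean column `dat['instrumenta'] > 0`, followed by a mean.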


In [8]:
dat.to_excel('AASjobRegExcel_basicNLP.xls')

A few key comparisons, starting with the frequency of hardware relative to software:


In [9]:
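# np.nonzero returns a tuple of index arrays, so len() of it is always 1 for a
# single column; count the ads with a nonzero 'software' count directly.
np.count_nonzero(dat['software'])


Out[9]:
26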
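
To make the hardware-to-software comparison explicit, one option is to count the ads that mention each term, plus the ads that mention both. This is a minimal sketch on hypothetical per-ad counts; the two lists are made up for illustration. For reference, the keyword listing earlier reports 10 ads for hardware and 26 for software.

```python
# Hypothetical per-ad keyword counts, one entry per ad (illustration only).
hardware = [0, 1, 0, 0, 2, 0]
software = [1, 1, 0, 3, 0, 0]

# Number of ads that mention each term at least once
n_hardware = len([c for c in hardware if c > 0])
n_software = len([c for c in software if c > 0])

# Number of ads that mention both terms
n_both = len([1 for h, s in zip(hardware, software) if h > 0 and s > 0])

print("hardware: %d, software: %d, both: %d" % (n_hardware, n_software, n_both))
```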

Notes from meeting with Karl P (UTexas CS grad)

  1. TF-IDF
  2. compare to PMI (pointwise mutual information)
  3. Vowpal Wabbit
  4. Burr Settles, "Active Learning Literature Survey"
  5. mailing list for CS talks
  6. Blei paper on literature impact
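
The first item, TF-IDF, is one common way to approach the "most unique words in each job ad" question: a word scores highly in an ad when it is frequent there but rare across the rest of the corpus. The sketch below is a bare-bones version using term frequency times log(N / document frequency). The tokenizer, the helper names, and the three sample documents are assumptions for illustration; other weighting variants exist, and this has not been run on the real announcements.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(docs):
    """Return one {word: score} dict per document.

    score = (count of word in doc / tokens in doc) * log(N / n_docs_with_word)
    """
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = float(len(tokens))
        scores.append({
            word: (count / total) * math.log(float(n_docs) / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical mini-corpus standing in for the job announcements.
docs = [
    "postdoc in observational galaxy surveys",
    "postdoc in theoretical star formation",
    "faculty position in instrumentation",
]

scores = tfidf(docs)
for doc_scores in scores:
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
    print(ranked[:2])
```

A word that appears in every document, like "in" here, gets a score of zero, which is why no separate stop-word list is needed in this toy case.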
