Job register web scraping,

III. Natural Language Processing

Michael Gully-Santiago, October 5, 2014

Extract keyword frequencies from the job announcement



In [1]:

    
%pylab inline
from astropy.table import Table, Column
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt









    



Populating the interactive namespace from numpy and matplotlib



In [2]:

    
# Each line of the file is treated as one job announcement
with open('data/ItemizedAnnouncements.txt', 'r') as announceFile:
    announce = announceFile.readlines()

# The master list of jobs, stored separately from the announcement text
dat = pd.read_csv('data/AllAASjobReg_abbreviated.dat', sep=';')

Questions we want to ask:

Easy:

Does the job ad include these stems?

"instrumenta" (-l or -tion)
"theor" (-y or -etical)
"observation"

Hard(er):

Ideally, we'd also like to find out:

What are the most unique words in each job ad?
What does the entire corpus of job ads look like?
Are there significant differences between the Faculty job ads and the Postdoc job ads?

Let's attack the first question since it is relatively easy.



In [3]:

    
thisWord = "jenga"

searchWords = ['instrumenta', 'theor', 'observation', 'data-intensive', 'computational', 
               'galax', 'exoplanet', 'star formation', 'SDSS','GRB', 'hardware', 'software', 'data science',
               'nano', 'optics', 'x-ray', 'brown dwarf', 'LSST', 'HST', 'statistics', 'GAIA',
               'LBT', 'adaptive optics', 'Kepler', 'Keck', 'ALMA', 'VLT', 'Spitzer', 'simulations', 
               'interdisciplinary', 'MHD', 'high performance computing', 'planetary', 'quantum', 'infrared']

totHits = []

for thisWord in searchWords:
    occurences = []
    for advert in announce:
        thisAdvert = advert.lower()
        thisCount = thisAdvert.count(thisWord.lower())
        occurences.append(thisCount)
        #loc = thisAdvert.find(thisWord.lower())
        #if (thisCount > 0):
            #print advert[loc-40:loc+80]
            
    occurences = np.asarray(occurences)
    dat[thisWord] = occurences
    hits = (occurences > 0)
    totHits.append(sum(hits))
    print(sum(hits), thisWord)









    



62 instrumenta
108 theor
113 observation
3 data-intensive
34 computational
79 galax
35 exoplanet
22 star formation
20 SDSS
2 GRB
10 hardware
26 software
3 data science
3 nano
21 optics
18 x-ray
2 brown dwarf
13 LSST
9 HST
4 statistics
6 GAIA
8 LBT
11 adaptive optics
3 Kepler
11 Keck
18 ALMA
15 VLT
10 Spitzer
26 simulations
14 interdisciplinary
8 MHD
9 high performance computing
43 planetary
7 quantum
25 infrared
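
One caveat with the counts above: `str.count` on lowercased text matches substrings anywhere, so a short acronym such as 'ALMA' would also be counted inside "alma mater". Below is a minimal sketch of a stricter variant, assuming stems should match only at the start of a word and acronyms should match as whole, case-sensitive words. The helper name `count_term` and the sample sentence are hypothetical and not part of the original analysis.

```python
import re

def count_term(text, term, whole_word=False):
    """Count occurrences of term in text.

    whole_word=False: case-insensitive match anchored at a word start (for stems).
    whole_word=True:  case-sensitive match of the complete word (for acronyms).
    """
    if whole_word:
        pattern = r'\b' + re.escape(term) + r'\b'
        return len(re.findall(pattern, text))
    pattern = r'\b' + re.escape(term)
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sentence, made up for illustration
sample = "Our alma mater seeks a theorist for theoretical work with ALMA data."

print(count_term(sample, 'theor'))                  # stem: 'theorist' and 'theoretical'
print(count_term(sample, 'ALMA', whole_word=True))  # acronym: ignores 'alma mater'
print(sample.lower().count('alma'))                 # plain substring count, as in the loop above
```

Whether the stricter matching changes any of the totals would have to be checked against the real announcements.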



In [4]:

    
dct = {'word':searchWords, 'counts':totHits}
tdf = pd.DataFrame(dct)
sortedvi = [x for (y,x) in sorted(zip(totHits,searchWords))]



In [5]:

    
sns.set()
sns.set_context("paper", font_scale=1.5, rc={"lines.linewidth": 2.5})



In [18]:

    
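# factorplot was renamed catplot in seaborn 0.9; kind='point' keeps the point-plot style
sns.catplot(x='word', y='counts', data=tdf, order=sortedvi, kind='point')
plt.xticks(rotation=90)
plt.savefig('AAS_wordFreq.png')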



In [7]:

    
dat.head()









    Out[7]:

   PostDate         Deadline           JobCategory                                  Institution
0  October 1, 2014  November 30, 2014  Faculty Positions (visiting and non-tenure)  Johns Hopkins University
1  October 1, 2014  November 15, 2014  Faculty Positions (visiting and non-tenure)  Niels Bohr Institute
2  October 1, 2014  November 15, 2014  Faculty Positions (visiting and non-tenure)  Niels Bohr Institute
3  October 1, 2014  November 10, 2014  Faculty Positions (visiting and non-tenure)  Florida Gulf Coast University
4  October 1, 2014  January 15, 2015   Faculty Positions (visiting and non-tenure)  University of South Florida

   webURL                                            attn_to                        attn_to_title
0  https://jobregister.aas.org/job_view?JobID=48914  Margaret Gier                  Administrative Manager
1  https://jobregister.aas.org/job_view?JobID=49155  Martin Pessah                  ---
2  https://jobregister.aas.org/job_view?JobID=49157  Martin Pessah                  ---
3  https://jobregister.aas.org/job_view?JobID=49320  Florida Gulf Coast Universuty  ---
4  https://jobregister.aas.org/job_view?JobID=49327  Dr. Gerald Woods               Chair, Search Committee

   attn_to_org                       attn_to_address         attn_to_city  ...
0  Johns Hopkins University          3400 N. Charles Street  Baltimore     ...
1  Niels Bohr International Academy  Niels Bohr Institute    Copenhagen    ...
2  Niels Bohr International Academy  Niels Bohr Institute    Copenhagen    ...
3  ---                               ---                     ---           ...
4  University of South Florida       USF/Physics Department  Tampa         ...

   ALMA  VLT  Spitzer  simulations  interdisciplinary  MHD  high performance computing  planetary  quantum  infrared
0  0     0    0        1            0                  0    0                           0          0        0
1  0     0    0        0            0                  1    1                           2          0        0
2  0     0    0        0            0                  1    1                           2          0        0
3  0     0    0        0            0                  0    0                           0          0        0
4  0     0    0        0            0                  0    0                           0          0        0

5 rows × 50 columns
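
With the keyword counts stored as columns of `dat`, the Faculty versus Postdoc question from the list above could be approached by grouping on `JobCategory`. Below is a minimal sketch on a toy table. The rows and the shortened category labels are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for dat: one row per ad, keyword columns hold occurrence counts
toy = pd.DataFrame({
    'JobCategory': ['Faculty', 'Faculty', 'Postdoc', 'Postdoc', 'Postdoc'],
    'theor':       [1, 0, 2, 0, 1],
    'software':    [0, 0, 1, 1, 0],
})

keywords = ['theor', 'software']

# 1 if the ad mentions the keyword at least once, else 0
mentioned = (toy[keywords] > 0).astype(int)

# Fraction of ads in each category that mention each keyword
mention_rate = mentioned.groupby(toy['JobCategory']).mean()
print(mention_rate)
```

Comparing fractions rather than raw counts keeps the categories comparable when one of them has many more ads than the other.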



In [8]:

    
dat.to_excel('AASjobRegExcel_basicNLP.xls')

A few key comparisons: Frequency of hardware to software:



In [9]:

    
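# np.nonzero returns a tuple of index arrays, so len() of it is always 1 here;
# count_nonzero gives the number of ads that mention 'software'
np.count_nonzero(dat['software'])



    Out[9]:



26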
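
The hardware-to-software comparison can be made explicit by counting, for each keyword, the ads with at least one mention and taking the ratio. Below is a small sketch on made-up counts; the toy columns stand in for `dat['hardware']` and `dat['software']`.

```python
import numpy as np
import pandas as pd

# Toy occurrence counts, one row per ad (values are invented)
toy = pd.DataFrame({'hardware': [0, 1, 0, 0, 2],
                    'software': [1, 3, 0, 1, 0]})

n_hardware = np.count_nonzero(toy['hardware'].values)  # ads mentioning 'hardware'
n_software = np.count_nonzero(toy['software'].values)  # ads mentioning 'software'
ratio = float(n_hardware) / n_software

print(n_hardware, n_software, ratio)
```

On the real table, the per-keyword totals printed earlier (10 ads for 'hardware', 26 for 'software') would give a ratio of roughly 0.38.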

Notes from meeting with Karl P (UTexas CS grad)

TF-IDF
Compare to PMI (pointwise mutual information)
Vowpal Wabbit
Burr Settles, active learning literature survey
Mailing list for CS talks
Blei paper on literature impact
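
As a starting point for the "most unique words" question, the TF-IDF idea from these notes can be sketched in a few lines of plain Python. This is a minimal sketch assuming whitespace tokenization and the common tf × log(N/df) weighting. The function name `tfidf_scores` and the three toy ads are hypothetical, and a real analysis would more likely rely on a library implementation.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per document.

    tf  = count of the word in the document / number of words in the document
    idf = log(number of documents / number of documents containing the word)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = float(len(tokens))
        scores.append({word: (count / total) * math.log(float(n_docs) / doc_freq[word])
                       for word, count in counts.items()})
    return scores

# Invented one-line "ads", for illustration only
toy_ads = ["postdoc position in exoplanet theory",
           "faculty position in observational astronomy",
           "postdoc position in instrumentation"]

scores = tfidf_scores(toy_ads)
for ad, score in zip(toy_ads, scores):
    top_word = max(score, key=score.get)
    print(top_word, '<-', ad)
```

Words that appear in every ad (here "position" and "in") get a weight of zero, which is what makes the remaining high-scoring words candidates for "most unique".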


