Job register web scraping,

II. Deadline and category distribution

Michael Gully-Santiago, October 3, 2014

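This notebook examines two properties of the job adverts scraped from the AAS Job Register: how the postings are distributed across job categories, and when their application deadlines fall.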


In [1]:
%pylab inline
import pandas as pd
from astropy.table import Table, Column
import numpy as np
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from dateutil import parser


Populating the interactive namespace from numpy and matplotlib

In [2]:
# Must use unconventional delimiter because there are commas in the corpus
t = Table.read('data/AASjobReg.dat', format='ascii', delimiter=';')
t


Out[2]:
postDate          | deadline          | jobCat                                      | institute
------------------+-------------------+---------------------------------------------+----------
October 1, 2014   | November 30, 2014 | Faculty Positions (visiting and non-tenure) | Johns Hopkins University
October 1, 2014   | November 15, 2014 | Faculty Positions (visiting and non-tenure) | Niels Bohr Institute
October 1, 2014   | November 15, 2014 | Faculty Positions (visiting and non-tenure) | Niels Bohr Institute
October 1, 2014   | November 10, 2014 | Faculty Positions (visiting and non-tenure) | Florida Gulf Coast University
October 1, 2014   | January 15, 2015  | Faculty Positions (visiting and non-tenure) | University of South Florida
October 1, 2014   | November 15, 2014 | Faculty Positions (tenure and tenure-track) | PONT UNIVERSIDAD CATOLICA DE CHILE
October 1, 2014   | June 15, 2015     | Faculty Positions (tenure and tenure-track) | University of Washington, Department of Physics
October 1, 2014   | December 5, 2014  | Faculty Positions (tenure and tenure-track) | Brown University - Department of Physics
October 1, 2014   | March 1, 2015     | Faculty Positions (tenure and tenure-track) | University of Nevada, Las Vegas - UNLV
October 1, 2014   | November 1, 2014  | Faculty Positions (tenure and tenure-track) | Baylor University
October 1, 2014   | November 21, 2014 | Faculty Positions (tenure and tenure-track) | Massachusetts Institute of Technology
...               | ...               | ...                                         | ...
October 1, 2014   | October 31, 2014  | Science Management                          | Canada-France-Hawaii Telescope Corp.
October 1, 2014   | May 31, 2015      | Science Management                          | University of California Office of the President
August 1, 2014    | December 1, 2014  | Science Management                          | National Science Foundation
October 1, 2014   | October 31, 2014  | Scientific/Technical Staff                  | Universities Space Research Association
October 1, 2014   | November 1, 2014  | Scientific/Technical Staff                  | Stanford University / SLAC National Accelerator Laboratory
October 1, 2014   | November 1, 2014  | Scientific/Technical Staff                  | Smithsonian Astrophysical Observatory
October 1, 2014   | November 7, 2014  | Scientific/Technical Staff                  | Bay Area Environmental Research Institute
October 1, 2014   | October 31, 2014  | Scientific/Technical Staff                  | The Research Corporation of the University of Hawaii
October 1, 2014   | November 1, 2014  | Scientific/Technical Staff                  | AURA/National Optical Astronomy Observatory
October 1, 2014   | October 31, 2014  | Scientific/Technical Staff                  | Carnegie Institution for Science
September 1, 2014 | October 15, 2014  | Scientific/Technical Staff                  | SKA Organisation
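
The table above shows why a comma cannot serve as the field delimiter: both the dates and some institute names contain commas. The sketch below illustrates the same idea with the standard-library `csv` module rather than `astropy`. The two-row sample and its header line are assumptions modelled on the output above, not the actual contents of `data/AASjobReg.dat`, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample in the assumed layout of data/AASjobReg.dat:
# one header line, then semicolon-separated fields.
sample = (
    "postDate;deadline;jobCat;institute\n"
    "October 1, 2014;November 30, 2014;"
    "Faculty Positions (visiting and non-tenure);Johns Hopkins University\n"
    "October 1, 2014;June 15, 2015;"
    "Faculty Positions (tenure and tenure-track);"
    "University of Washington, Department of Physics\n"
)


def read_rows(text, delimiter=";"):
    """Parse delimited text into a list of dicts keyed by the header line."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# With ';' the commas inside dates and institute names stay in their fields.
rows = read_rows(sample)

# With ',' the header is read as a single column name and the fields are
# split in the wrong places, so no 'deadline' column exists.
bad_rows = read_rows(sample, delimiter=",")

print(rows[1]["institute"])
```

Any delimiter that never occurs in the scraped text would work equally well; the semicolon is simply the one used when the data file was written.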

In [3]:
N_jobs = len(t['jobCat'])
cats = set(t['jobCat'])
N_cats = len(cats)

In [4]:
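# Count the adverts in each category
vi, vc = np.unique(t['jobCat'].data.tolist(), return_counts=True)
dct = {'name':vi, 'counts':vc}
tdf = pd.DataFrame(dct)
# Category names ordered by ascending count, used as the plot's x-axis order
sortedvi = [x for (y,x) in sorted(zip(vc,vi))]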
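
The counting and ordering step in this cell can also be sketched without NumPy, using `collections.Counter`. This is only an illustration of the same logic: the `job_cats` list is a made-up stand-in for `t['jobCat']`, and `counts_ascending` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical category labels; in the notebook these come from t['jobCat'].
job_cats = [
    "Science Management",
    "Scientific/Technical Staff",
    "Scientific/Technical Staff",
    "Faculty Positions (tenure and tenure-track)",
    "Scientific/Technical Staff",
    "Faculty Positions (tenure and tenure-track)",
]


def counts_ascending(labels):
    """Return (name, count) pairs sorted by count, with ties broken by name.

    This mirrors sorted(zip(vc, vi)), which also sorts on count first
    and on the category name second.
    """
    counts = Counter(labels)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


ordered = counts_ascending(job_cats)
names_in_order = [name for name, _ in ordered]
print(ordered)
```

Sorting on the `(count, name)` pair keeps the order deterministic when two categories have the same number of adverts.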

In [5]:
sns.set_context("talk")
sns.factorplot('name', 'counts', data=tdf,
               x_order = sortedvi)
plt.xticks(rotation=90)


Out[5]:
(array([0, 1, 2, 3, 4, 5, 6, 7]), <a list of 8 Text xticklabel objects>)

In [6]:
dateArr = []
for i in range(N_jobs):
    dt = parser.parse(t['deadline'].data[i])
    dateArr.append(dt)
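
`parser.parse` infers the date format from each string. If every deadline follows the "Month D, YYYY" pattern seen in the table (an assumption about the full data set), an explicit format string is a stricter alternative that raises an error on anything unexpected. The sketch below swaps in the standard library's `datetime.strptime` for `dateutil`; `parse_deadline` and `DEADLINE_FORMAT` are hypothetical names, and `%B` assumes English month names in the active locale.

```python
from datetime import datetime

# Assumed format of every deadline string, e.g. "November 30, 2014".
DEADLINE_FORMAT = "%B %d, %Y"


def parse_deadline(text):
    """Parse a 'Month D, YYYY' string; raises ValueError on any other format."""
    return datetime.strptime(text.strip(), DEADLINE_FORMAT)


# Example strings taken from the deadline column shown above.
deadlines = [parse_deadline(s) for s in ["November 30, 2014", "March 1, 2015"]]
print(deadlines)
```

The stricter parser fails loudly on an entry such as a free-text deadline, whereas a format-guessing parser may return a misleading date or raise a less specific error.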

In [7]:
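import datetime

# Day of year for each deadline. The loop variable is named d so that it
# does not shadow the table t (list-comprehension variables leak in Python 2).
days = np.asarray([d.timetuple().tm_yday for d in dateArr])

# Day of year on the date the notebook is run
today = int(datetime.datetime.now().timetuple().tm_yday)

# Deadlines more than 90 days "behind" today are assumed to fall in the next year
nextYear = (days < today-90)
days[nextYear] += 365
days = days - today

# Reference dates, in days from today
nov1 = datetime.datetime(2014, 11, 1).timetuple().tm_yday - today
dec1 = datetime.datetime(2014, 12, 1).timetuple().tm_yday - today
jan1 = 365 + 1 - today  # January 1, 2015 is day 1 of the following year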
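
The day-of-year arithmetic in this cell needs a manual 365-day shift for deadlines in the following year. Subtracting full dates gives the same "days from today" offsets without that wraparound. The sketch below is one way to do it, assuming a fixed reference date of October 3, 2014 (the date in the notebook header) instead of the run date; `REFERENCE` and `days_from` are hypothetical names.

```python
from datetime import date

# Assumed reference date: the notebook's header date, not the run date.
REFERENCE = date(2014, 10, 3)


def days_from(reference, deadline):
    """Signed number of days from reference to deadline (negative if past)."""
    return (deadline - reference).days


# Example deadlines taken from the table above, plus the two marker dates.
offsets = {
    "November 30, 2014": days_from(REFERENCE, date(2014, 11, 30)),
    "January 15, 2015": days_from(REFERENCE, date(2015, 1, 15)),
    "December 1, 2014": days_from(REFERENCE, date(2014, 12, 1)),
    "January 1, 2015": days_from(REFERENCE, date(2015, 1, 1)),
}
print(offsets)
```

Because the year is part of each date, deadlines in 2015 need no special handling, and leap years are accounted for automatically.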

In [8]:
ax = plt.axes()
sns.set_context('talk')
b, g, r, p = sns.color_palette("muted", 4)

# Histogram of deadlines, in days from today
sns.distplot(days, kde=False, color=r)
# Dashed vertical lines mark today, January 1, and December 1
ax.plot([0, 0], [0, 100], '--', color=g)
ax.set_xlabel('Days from today')
ax.set_title('Job Deadlines from the AAS Job Register')
ax.text(jan1-8, 35, 'January 1', color=b, rotation=90, fontsize=28)
ax.plot([jan1, jan1], [0, 100], '--', color=b)
ax.text(dec1-8, 35, 'December 1', color=p, rotation=90, fontsize=28)
ax.plot([dec1, dec1], [0, 100], '--', color=p)
ax.set_xlim([-30, 120])
ax.set_ylim([0, 65])


Out[8]:
(0, 65)

The End!