Looking at Payscale's Job Satisfaction Data


In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd

Get the data

Let us use requests and BeautifulSoup to look at the data available on the Payscale website.


In [3]:
import requests
from bs4 import BeautifulSoup
# soup = BeautifulSoup(html_doc)

In [4]:
r = requests.get('http://www.payscale.com/data-packages/most-and-least-meaningful-jobs/full-list')

In [5]:
r.status_code


Out[5]:
200

In [6]:
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

In [8]:
# print(soup.prettify())
# soup.find_all('table')
soup.table
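
To check how many tables a page holds before choosing an import method, one option is to count the `<table>` elements and peek at their cells. This is a minimal sketch on a small inline HTML string; `sample_html` and its contents are hypothetical stand-ins, not the actual Payscale markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text; the real Payscale markup is not reproduced here.
sample_html = """
<html><body>
  <table>
    <tr><th>Detailed Occupation</th><th>Median Pay</th></tr>
    <tr><td>Clergy</td><td>$45,400</td></tr>
    <tr><td>Surgeons</td><td>$299,600</td></tr>
  </table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns every <table> element, so its length is the table count.
tables = sample_soup.find_all('table')
n_tables = len(tables)

# Pull the header cells and the data rows out of the first table.
header = [th.get_text(strip=True) for th in tables[0].find_all('th')]
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in tables[0].find_all('tr') if tr.find_all('td')]

print(n_tables, header, rows)
```

On the live page the same calls would run against `soup` instead of `sample_soup`.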

Using pandas.io

Inspecting the page shows that it contains only one table, so we can load it directly with pandas' read_html.


In [3]:
url = 'http://www.payscale.com/data-packages/most-and-least-meaningful-jobs/full-list'

In [4]:
pd.read_html?

In [5]:
# infer_types and tupleize_cols are no longer accepted by read_html; the other dropped arguments were defaults
dflist = pd.read_html(url, match='.+', header=0, thousands=',')

In [6]:
df = dflist[0]

In [7]:
df.dtypes


Out[7]:
Detailed Occupation        object
Median Pay                 object
% High Meaning             object
% High Satisfaction        object
% High Stress              object
Typical Education Level    object
% Female                   object
% Male                     object
Job Level                  object
dtype: object

In [7]:
df.head()


Out[7]:
Detailed Occupation Median Pay % High Meaning % High Satisfaction % High Stress Typical Education Level % Female % Male Job Level
0 Clergy $45,400 97% 89% 68% Masters (non-MBA) 14% 86% Mid-Career
1 Directors, Religious Activities & Education $35,900 97% 88% 61% Bachelors 50% 50% Mid-Career
2 Surgeons $299,600 94% 82% 79% Doctors of Medicine (MD) 18% 82% Mid-Career
3 Education Administrators, Elementary/Secondary $75,900 93% 87% 85% Masters (non-MBA) 55% 45% Senior
4 Chiropractors $58,700 93% 65% 52% Doctorate (Ph.D.) 27% 73% Mid-Career

Data Munging


In [8]:
for c in df.columns:
    df[c] = df[c].str.replace('%','')

In [9]:
df[df.columns[1]] = df[df.columns[1]].str.replace('$', '', regex=False)  # '$' is a regex anchor, so match it literally

In [10]:
df[df.columns[1]] = df[df.columns[1]].str.replace(',','')
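
The three replacements above can also be folded into a single pass with a regular-expression character class. The sketch below uses a hypothetical three-row `sample` frame and a hypothetical helper `strip_symbols`; neither is part of the original notebook. Restricting the cleanup to named columns leaves commas in occupation titles untouched.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the scraped table.
sample = pd.DataFrame({
    'Detailed Occupation': ['Clergy', 'Surgeons', 'Chiropractors'],
    'Median Pay': ['$45,400', '$299,600', '$58,700'],
    '% High Meaning': ['97%', '94%', '93%'],
})

def strip_symbols(frame, columns):
    """Return a copy with '$', ',' and '%' removed from the given string columns."""
    out = frame.copy()
    for col in columns:
        # Inside a character class '$' is literal; regex=True is spelled out
        # because the default for this argument differs between pandas versions.
        out[col] = out[col].str.replace(r'[$,%]', '', regex=True)
    return out

cleaned = strip_symbols(sample, ['Median Pay', '% High Meaning'])
print(cleaned)
```

The values are still strings at this point; converting them to numbers is a separate step.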

In [11]:
df.head()


Out[11]:
Detailed Occupation Median Pay % High Meaning % High Satisfaction % High Stress Typical Education Level % Female % Male Job Level
0 Clergy 45400 97 89 68 Masters (non-MBA) 14 86 Mid-Career
1 Directors, Religious Activities & Education 35900 97 88 61 Bachelors 50 50 Mid-Career
2 Surgeons 299600 94 82 79 Doctors of Medicine (MD) 18 82 Mid-Career
3 Education Administrators, Elementary/Secondary 75900 93 87 85 Masters (non-MBA) 55 45 Senior
4 Chiropractors 58700 93 65 52 Doctorate (Ph.D.) 27 73 Mid-Career

In [12]:
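num_cols = ['Median Pay', '% High Meaning', '% High Satisfaction', '% High Stress', '% Female', '% Male']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')  # convert_objects was removed in pandas 1.0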

In [13]:
df.dtypes


Out[13]:
Detailed Occupation         object
Median Pay                   int64
% High Meaning               int64
% High Satisfaction        float64
% High Stress                int64
Typical Education Level     object
% Female                     int64
% Male                       int64
Job Level                   object
dtype: object
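
In the dtypes above, % High Satisfaction comes out as float64 while the other percentage columns are int64. A plausible reading, stated here as an assumption rather than something the output confirms, is that at least one satisfaction value was missing or unparseable: an integer column cannot hold NaN, so pandas falls back to float. The sketch below shows the effect with `pd.to_numeric` on hypothetical values.

```python
import pandas as pd

# Hypothetical values; 'N/A' stands in for a missing or unparseable entry.
complete = pd.Series(['97', '94', '93'])
with_gap = pd.Series(['89', 'N/A', '65'])

# errors='coerce' turns anything that cannot be parsed into NaN.
complete_num = pd.to_numeric(complete, errors='coerce')
with_gap_num = pd.to_numeric(with_gap, errors='coerce')

print(complete_num.dtype)
print(with_gap_num.dtype)
print(int(with_gap_num.isna().sum()))
```

Running `df['% High Satisfaction'].isna().sum()` on the real frame would show whether this explanation applies.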

Exploring the data

EDA


In [14]:
plt.style.use('ggplot')  # pd.options.display.mpl_style was removed from pandas; 'ggplot' gives a similar look

In [15]:
plt.rcParams['figure.figsize'] = (10, 10)  # same effect as pylab's figsize(10, 10), without relying on %pylab

In [20]:
plt.scatter(x=df['% High Stress'],y=df['% High Satisfaction'], s=df['Median Pay']*.001, alpha=0.5)
plt.gca().set_xlabel('% High Stress')
plt.gca().set_ylabel('% High Satisfaction')
plt.gca().set_title('High Satisfaction vs High Stress Jobs', fontsize=16)
plt.show()