In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.
We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful: while most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.
Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.
The first part of the assignment will therefore focus on scraping Indeed.com. In the second, we'll use listings with salary information to build a model and predict additional salaries.
We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.
First, look at the source of an Indeed.com results page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
Notice that each job listing sits under a div tag with a class name of result. We can use BeautifulSoup to extract those.
The URL here has many query parameters
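As a reference (not part of the original notebook), here is a minimal sketch of how that URL could be assembled from its query parameters with the standard library; the parameter values come from the URL above:

import urllib

# Sketch: build the Indeed search URL from its query parameters.
params = {
    'q': 'data scientist $20,000',  # search query; urlencode turns "$20,000" into %2420%2C000
    'l': 'New York',                # location
    'start': 10,                    # result offset (results come 10 per page)
}
url = 'http://www.indeed.com/jobs?' + urllib.urlencode(params)
print url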
In [2]:
# Using a random forest classifier, plus one other classifier (KNN).
In [3]:
url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
In [4]:
import requests
import bs4
from bs4 import BeautifulSoup
import urllib
html = urllib.urlopen(url).read()
In [5]:
b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
In [6]:
#http://stackoverflow.com/questions/9907492/how-to-get-firefox-working-with-selenium-webdriver-on-mac-osx
In [7]:
## YOUR CODE HERE
b.find_all('span', {'class': 'summary'})
Out[7]:
In [8]:
# List of summaries for the New York, $20,000+ query.
for entry in b.find_all('span', {'class': 'summary'}):
    print entry.text
Let's look at one result more closely. A single result looks like
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&campaignid=serp-linkcompanyname&fromjk=2480d203f7e97210&jcid=b374f2a780e04789" target="_blank">
AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
While some of the more verbose elements have been removed, we can see that there is some structure to the above:
Example
def extract_location_from_result(result):
return result.find ...
- Remember to check whether a field is empty or None before attempting to call methods on it.
- Remember to use try/except if you anticipate errors (see the sketch below).
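Here is a minimal sketch filling in the Example above, following the two reminders; this version takes a single result div (the page-level functions in the cells below instead take a whole results URL), and the selectors are the ones identified earlier:

def extract_location_from_result(result):
    # 'result' is a single <div class="row result"> tag already parsed by BeautifulSoup.
    try:
        location = result.find('span', {'class': 'location'})
        if location is None:  # some listings omit the field entirely
            return 'NA'
        return location.text.strip()
    except AttributeError:
        return 'NA'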
In [9]:
def extract_job_from_result(result):
    # Fetch a results page URL and return all job titles found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('h2', {'class': 'jobtitle'})]
In [10]:
extract_job_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')
In [11]:
def extract_location_from_result(result):
    # Fetch a results page URL and return all listed locations found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('span', {'class': 'location'})]
In [12]:
extract_location_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')
In [13]:
def extract_company_from_result(result):
    # Fetch a results page URL and return all company names found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('span', {'class': 'company'})]
In [14]:
extract_company_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')
In [15]:
# The salary is available in a nobr element inside of a td element with class='snip'.
def extract_salary_from_result(result):
    # Fetch a results page URL and return the salary for each snippet, or 'NONE LISTED'.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    salaries = []
    for entry in b.find_all('td', {'class': 'snip'}):
        try:
            salaries.append(entry.find('nobr').text)
        except AttributeError:
            salaries.append('NONE LISTED')
    return salaries
In [16]:
extract_salary_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')
Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.
There are two query parameters here we can alter to collect more results: l=New+York and start=10. The first controls the location of the results (so we can try a different city). The second controls the offset into the results, which come back 10 per page (so we can keep incrementing by 10 to go further down the list). A sketch of paging through several cities this way follows.
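As a quick illustration (not part of the original notebook), the two parameters can be varied in a nested loop; this sketch only prints the URLs rather than fetching them, and the time.sleep pause is an assumption about polite scraping:

import time

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

# Sketch: iterate over cities and result offsets (10 results per page).
for city in ['New+York', 'Chicago']:
    for start in range(0, 30, 10):
        page_url = url_template.format(city, start)
        print page_url
        time.sleep(1)  # keep a pause like this between requests when actually fetching pages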
In [17]:
YOUR_CITY = 'Boston'
In [22]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 10000 # Set this to a high value to generate more results.
# Crawling more results will also take much longer. First test your code on a small number of results and then expand.
results = []
ny = []
chic = []
sf = []
aus = []
sea = []
la = []
phil = []
atl = []
dal = []
pitt = []
port = []
ph = []
den = []
hou = []
mi = []
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle',
                 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh',
                 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results-page URL for this city and offset (as above)
        url = url_template.format(city, start)
        # Make a list for each city
        if city == 'New+York':
            ny.append(url)
        if city == 'Chicago':
            chic.append(url)
        if city == 'San+Francisco':
            sf.append(url)
        if city == 'Austin':
            aus.append(url)
        if city == 'Seattle':
            sea.append(url)
        if city == 'Los+Angeles':
            la.append(url)
        if city == 'Philadelphia':
            phil.append(url)
        if city == 'Atlanta':
            atl.append(url)
        if city == 'Dallas':
            dal.append(url)
        if city == 'Pittsburgh':
            pitt.append(url)
        if city == 'Portland':
            port.append(url)
        if city == 'Phoenix':
            ph.append(url)
        if city == 'Denver':
            den.append(url)
        if city == 'Houston':
            hou.append(url)
        if city == 'Miami':
            mi.append(url)
        # Keep a full set of results as well
        results.append(url)
In [34]:
import pandas as pd
job_details = pd.DataFrame(columns=['location','title','company', 'salary'])
In [39]:
# Take each results page, then extract the fields from every job posting on it.
for result in results:
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    for entry in b.find_all('div', {'class': 'result'}):
        try:
            location = entry.find('span', {'class': 'location'}).text
        except AttributeError:
            location = 'NA'
        try:
            title = entry.find('h2', {'class': 'jobtitle'}).text
        except AttributeError:
            title = 'NA'
        try:
            company = entry.find('span', {'class': 'company'}).text
        except AttributeError:
            company = 'NA'
        try:
            salary = entry.find('td', {'class': 'snip'}).find('nobr').text
        except AttributeError:
            salary = 'NONE LISTED'
        job_details.loc[len(job_details)] = [location, title, company, salary]
Lastly, we need to clean up the salary data.
In [51]:
job_details = job_details[job_details.salary != 'NONE LISTED']
In [53]:
job_details = job_details.reset_index()
In [56]:
job_details = job_details.drop('index', 1)
In [59]:
job_details = job_details.drop('level_0', 1)
In [61]:
## YOUR CODE HERE
job_details = job_details[job_details.salary.str.contains("a month") == False]
job_details = job_details[job_details.salary.str.contains("an hour") == False]
job_details = job_details[job_details.salary.str.contains("a week") == False]
job_details = job_details[job_details.salary.str.contains("a day") == False]
In [147]:
job_details
Out[147]:
In [63]:
## YOUR CODE HERE
# These character classes strip the letters and spaces of " a year", then "$" signs and commas, then stray newlines.
job_details['salary'] = job_details['salary'].replace('[\a year,)]', '', regex=True)
job_details['salary'] = job_details['salary'].replace('[\$,)]', '', regex=True)
job_details['company'] = job_details['company'].replace('[\\n,)]', '', regex=True)
job_details['title'] = job_details['title'].replace('[\\n,)]', '', regex=True)
In [388]:
#Checkpoint.
job_details = all_jobs
In [390]:
job_details_2 = job_details.drop_duplicates()
# 533 results before dropping duplicates; left with 34 in total.
In [392]:
# Need to convert the salary ranges into a single number (the midpoint of each range).
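Before applying this to job_details_2, here is a small worked example of the conversion on a toy Series (the toy values are hypothetical):

import pandas as pd

# Toy example: after the cleaning above, salaries look like "117500-127500" or "85000".
toy = pd.Series(['117500-127500', '85000'])
toy = toy.replace('[\-,)]', ' ', regex=True)             # "117500 127500", "85000"
parts = toy.str.split(' ', expand=True).astype(float)    # two numeric columns (NaN when there is no range)
parts[1] = parts[1].fillna(parts[0])                      # single salaries fill both columns
midpoint = parts.median(axis=1)                           # midpoint of each range
print midpoint  # 122500.0 and 85000.0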
In [393]:
job_details_2['salary'] = (job_details_2['salary'].replace( '[\-,)]',' ', regex=True))
In [394]:
job_details_2.reset_index()
Out[394]:
In [395]:
salaries = job_details_2.salary.str.split(' ', expand=True)
In [397]:
salaries = salaries.astype(float)
In [398]:
salaries.dtypes
Out[398]:
In [399]:
salaries = salaries.rename(columns = {0:'salary_1', 1:'salary_2'})
In [401]:
salaries.salary_2 = salaries.salary_2.fillna(salaries.salary_1)
In [403]:
final_salary = salaries.median(axis=1)
In [404]:
final_salary = pd.DataFrame(final_salary)
In [412]:
final_salary = final_salary.rename(columns = {0:'final_salary'})
In [413]:
final_salary.head()
Out[413]:
In [414]:
jobs = pd.concat([job_details_2, final_salary], axis=1)
In [415]:
jobs = jobs.drop('salary', axis=1)
In [416]:
jobs
Out[416]:
In [417]:
jobs.dtypes
Out[417]:
In [418]:
# Export to csv
job_details_csv = jobs.to_csv()  # with no path argument, to_csv() returns the CSV text as a string
In [419]:
job_details_csv
Out[419]:
In [616]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
In [617]:
rfc = RandomForestClassifier()
knn = KNeighborsClassifier()
We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a binary classification problem, by predicting two classes, HIGH vs LOW salary.
While performing regression may be better, performing classification may help remove some of the noise from the extreme salaries. We don't have to choose the median as the splitting point; we could also split on the 75th percentile or any other reasonable breaking point.
In fact, the ideal scenario may be to predict many levels of salaries.
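A minimal sketch of those alternatives, using the final_salary column built earlier (the variable names and quartile labels are illustrative and are not used elsewhere in the notebook):

# Split at the median (the default choice)
median_cut = jobs['final_salary'].median()
high_vs_median = (jobs['final_salary'] > median_cut).astype(int)

# Split at the 75th percentile instead
q75 = jobs['final_salary'].quantile(0.75)
high_vs_q75 = (jobs['final_salary'] > q75).astype(int)

# Or bin into several salary levels at once (qcut may complain if many salaries are identical)
salary_level = pd.qcut(jobs['final_salary'], 4, labels=['low', 'mid_low', 'mid_high', 'high'])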
In [424]:
# Median salary is 72,500 per year.
# Consider splitting at the mean or a higher percentile instead, as there aren't that many salaries above 72,500.
In [427]:
jobs.final_salary.describe()
Out[427]:
In [428]:
# Label the upper half (salaries above 108,750) as high (1) and the rest as low (0).
jobs['high_or_low'] = jobs['final_salary'].map(lambda x: 1 if x > 108750 else 0)
In [429]:
jobs
Out[429]:
In [448]:
jobs_with_locations = pd.concat([jobs, pd.get_dummies(jobs.location)], axis=1)
In [450]:
jobs_with_locations.head(3)
Out[450]:
This gives us a measure of how well our selected features predict a high or low salary; it is worth comparing against the baseline accuracy of always guessing the majority class (see the sketch below).
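A minimal sketch of that baseline, using the high_or_low label built above (not part of the original notebook):

# Baseline: accuracy obtained by always predicting the most common class.
baseline = jobs['high_or_low'].value_counts(normalize=True).max()
print baseline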
In [612]:
## YOUR CODE HERE
X_1 = jobs_with_locations.drop(jobs_with_locations.columns[[0, 1, 2, 3, 4]], axis=1)  # keep only the location dummies
y_1 = jobs_with_locations.high_or_low
In [ ]:
# Train/test split on the location features, then fit the random forest.
In [490]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)
rfc.fit(X_train, y_train)
Out[490]:
In [491]:
# Accuracy score on the training set.
rfc.score(X_train,y_train)
Out[491]:
In [532]:
from sklearn.cross_validation import cross_val_score
In [493]:
cross_val_score(rfc, X_train, y_train)
Out[493]:
In [ ]:
# The training accuracy is high, but the cross-validation score is substantially lower.
# This may be due to the small sample size.
In [614]:
#Try with KNN
In [668]:
knn.fit(X_train, y_train)
Out[668]:
In [669]:
knn.score(X_train,y_train)
Out[669]:
In [621]:
#Less accurate than with rfc.
In [622]:
cross_val_score(knn, X_train, y_train)
Out[622]:
In [ ]:
#Same cross val score as with rfc.
In [520]:
## YOUR CODE HERE
senior_variable = jobs['title'].map(lambda x: 1 if 'Senior' in x else 0)
In [521]:
senior_variable = pd.DataFrame(senior_variable)
In [523]:
senior_variable = senior_variable.rename(columns = {'title':'senior_variable'})
In [525]:
jobs_with_seniors = pd.concat([jobs, senior_variable], axis=1)
In [526]:
jobs_with_seniors
Out[526]:
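The same pattern generalizes to any keyword in the title; a minimal sketch building several indicator columns at once (the keyword list and the jobs_with_keywords name are illustrative; only 'Senior' and, later, 'Quantitative' are used in the original):

# Build one 0/1 indicator column per title keyword.
keywords = ['Senior', 'Quantitative', 'Manager']  # 'Manager' is just an illustrative extra
title_flags = pd.DataFrame({
    kw.lower() + '_in_title': jobs['title'].str.contains(kw).astype(int)
    for kw in keywords
})
jobs_with_keywords = pd.concat([jobs, title_flags], axis=1)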
In [606]:
X_2 = jobs_with_seniors.drop(jobs_with_seniors.columns[[0, 1, 2, 3, 4]], axis=1)  # keep only the senior indicator
y_2 = jobs_with_seniors.high_or_low
In [624]:
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.5, random_state=50)
rfc.fit(X_train, y_train)
Out[624]:
In [625]:
rfc.score(X_train,y_train)
Out[625]:
In [ ]:
# Not very accurate using "Senior" in the job title as the only feature, but that could be due to the small sample size.
In [626]:
cross_val_score(rfc, X_train, y_train)
Out[626]:
In [ ]:
# "Senior" not highly predictive of high salaries, at least from this dataset
In [671]:
knn.fit(X_train,y_train)
Out[671]:
In [672]:
knn.score(X_train,y_train)
Out[672]:
In [630]:
cross_val_score(knn, X_train, y_train)
Out[630]:
In [597]:
# Saw a few high salaries with "Quantitative" in the title; worth testing for that.
In [598]:
quant_variable = jobs['title'].map(lambda x: 1 if 'Quantitative' in x else 0)
In [600]:
quant_variable = pd.DataFrame(quant_variable)
quant_variable = quant_variable.rename(columns = {'title':'quant_variable'})
In [601]:
jobs_with_quant = pd.concat([jobs, quant_variable], axis=1)
In [31]:
X_3 = jobs_with_quant[['quant_variable']]  # keep it as a DataFrame so sklearn receives a 2-D X
y_3 = jobs_with_quant.high_or_low
In [673]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)
In [674]:
rfc.fit(X_train,y_train)
Out[674]:
In [609]:
rfc.score(X_train,y_train)
Out[609]:
In [610]:
#Accuracy score likely lower than usual because of the limited sample from the scraping.
In [611]:
cross_val_score(rfc, X_train,y_train)
Out[611]:
In [ ]:
# Not predictive according to cross val score.
In [675]:
knn.fit(X_train,y_train)
Out[675]:
In [676]:
knn.score(X_train,y_train)
Out[676]:
In [677]:
cross_val_score(knn,X_train,y_train)
Out[677]:
In [659]:
# The features are already 0/1 dummies, so I'm not sure scaling would be beneficial.
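Scaling indeed does little for 0/1 dummies, but if continuous features were added, a standardization step could be slotted in before a distance-based model like KNN. A minimal sketch, not run as part of the original workflow (knn_scaled is a fresh, hypothetical estimator):

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Sketch: standardize features to zero mean / unit variance before fitting KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the same transform on the test split
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
print knn_scaled.score(X_test_scaled, y_test)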
In [476]:
## YOUR CODE HERE
In [ ]:
## YOUR CODE HERE
In [635]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
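As the earlier note points out, we could also regress on the continuous salary itself rather than the binary label. A minimal sketch using the location dummies and final_salary built above (the _reg and _cont names are illustrative and not used elsewhere):

# Sketch: predict the continuous final_salary from the location dummies.
X_reg = jobs_with_locations.drop(jobs_with_locations.columns[[0, 1, 2, 3, 4]], axis=1)
y_reg = jobs_with_locations['final_salary']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.5, random_state=50)
rfr_cont = RandomForestRegressor()
rfr_cont.fit(Xr_train, yr_train)
print rfr_cont.score(Xr_test, yr_test)  # R^2 on the held-out half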
In [ ]:
#X_1, y_1: location; X_2, y_2: senior; X_3, y_3: quant
In [636]:
#First for location
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)
In [637]:
rfr.fit(X_train,y_train)
Out[637]:
In [665]:
import matplotlib.pyplot as plt
In [638]:
rfr.score(X_train,y_train)
Out[638]:
In [639]:
# A slightly improved score compared to the others.
In [642]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[642]:
In [ ]:
# All negative scores: sklearn's mean_squared_error scoring follows a greater-is-better convention, so it reports negated MSE.
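If a positive, interpretable error is wanted, the sign can simply be flipped (and a square root taken for RMSE); a minimal sketch:

import numpy as np

# cross_val_score negates MSE so that higher is better; undo that for reporting.
neg_mse = cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
print np.sqrt(-neg_mse).mean()  # average RMSE across the five folds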
In [645]:
#Now for "seniors"
In [646]:
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.5, random_state=50)
In [647]:
rfr.fit(X_train,y_train)
Out[647]:
In [648]:
rfr.score(X_train,y_train)
Out[648]:
In [649]:
#Not accurate at all.
In [650]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[650]:
In [ ]:
#Again, negative scores.
In [651]:
#Maybe Quantitative could work.
In [653]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)
In [654]:
rfr.fit(X_train,y_train)
Out[654]:
In [655]:
rfr.score(X_train, y_train)
Out[655]:
In [656]:
#Also very low.
In [657]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[657]: