In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.
We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful: while most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.
Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.
The first part of the assignment will therefore focus on scraping Indeed.com. In the second, we'll use listings with salary information to build a model and predict additional salaries.
We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.
First, look at the source of an Indeed.com results page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10
Notice that each job listing sits under a div tag with a class name of result. We can use BeautifulSoup to extract those.
The URL here has many query parameters
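As a reference (not part of the original notebook), here is a minimal sketch of how that URL could be assembled from its query parameters with the standard library; the parameter values come from the URL above:

import urllib

# Sketch: build the Indeed search URL from its query parameters.
params = {
    'q': 'data scientist $20,000',  # search query; urlencode turns "$20,000" into %2420%2C000
    'l': 'New York',                # location
    'start': 10,                    # result offset (results come 10 per page)
}
url = 'http://www.indeed.com/jobs?' + urllib.urlencode(params)
print url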
In [2]:
# Using a random forest classifier, plus one other classifier (KNN).
In [3]:
url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"
In [4]:
import requests
import bs4
from bs4 import BeautifulSoup
import urllib
html = urllib.urlopen(url).read()
In [5]:
b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
In [6]:
#http://stackoverflow.com/questions/9907492/how-to-get-firefox-working-with-selenium-webdriver-on-mac-osx
In [7]:
## YOUR CODE HERE
b.find_all('span', {'class': 'summary'})
Out[7]:
In [8]:
# List of summaries for the New York, $20,000+ query.
for entry in b.find_all('span', {'class': 'summary'}):
    print entry.text
Let's look at one result more closely. A single result looks like
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&campaignid=serp-linkcompanyname&fromjk=2480d203f7e97210&jcid=b374f2a780e04789" target="_blank">
AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
While some of the more verbose elements have been removed, we can see that there is some structure to the above:
Example
def extract_location_from_result(result):
return result.find ...
- Remember to check whether a field is empty or None before attempting to call methods on it.
- Remember to use try/except if you anticipate errors (see the sketch below).
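Here is a minimal sketch filling in the Example above, following the two reminders; this version takes a single result div (the page-level functions in the cells below instead take a whole results URL), and the selectors are the ones identified earlier:

def extract_location_from_result(result):
    # 'result' is a single <div class="row result"> tag already parsed by BeautifulSoup.
    try:
        location = result.find('span', {'class': 'location'})
        if location is None:  # some listings omit the field entirely
            return 'NA'
        return location.text.strip()
    except AttributeError:
        return 'NA'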
In [9]:
def extract_job_from_result(result):
    # Fetch a results page URL and return all job titles found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('h2', {'class': 'jobtitle'})]
In [10]:
extract_job_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')
In [11]:
def extract_location_from_result(result):
    # Fetch a results page URL and return all listed locations found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('span', {'class': 'location'})]
In [12]:
extract_location_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')
In [13]:
def extract_company_from_result(result):
    # Fetch a results page URL and return all company names found on it.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    return [entry.text for entry in b.find_all('span', {'class': 'company'})]
In [14]:
extract_company_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')
In [15]:
# The salary is available in a nobr element inside of a td element with class='snip'.
def extract_salary_from_result(result):
    # Fetch a results page URL and return the salary for each snippet, or 'NONE LISTED'.
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    salaries = []
    for entry in b.find_all('td', {'class': 'snip'}):
        try:
            salaries.append(entry.find('nobr').text)
        except AttributeError:
            salaries.append('NONE LISTED')
    return salaries
In [16]:
extract_salary_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')
Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.
There are two query parameters here we can alter to collect more results: l=New+York and start=10. The first controls the location of the results (so we can try a different city). The second controls the offset into the results, which come back 10 per page (so we can keep incrementing by 10 to go further down the list). A sketch of paging through several cities this way follows.
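As a quick illustration (not part of the original notebook), the two parameters can be varied in a nested loop; this sketch only prints the URLs rather than fetching them, and the time.sleep pause is an assumption about polite scraping:

import time

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

# Sketch: iterate over cities and result offsets (10 results per page).
for city in ['New+York', 'Chicago']:
    for start in range(0, 30, 10):
        page_url = url_template.format(city, start)
        print page_url
        time.sleep(1)  # keep a pause like this between requests when actually fetching pages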
In [17]:
YOUR_CITY = 'Boston'
In [22]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 10000 # Set this to a high value to generate more results.
# Crawling more results will also take much longer. First test your code on a small number of results and then expand.
results = []
ny = []
chic = []
sf = []
aus = []
sea = []
la = []
phil = []
atl = []
dal = []
pitt = []
port = []
ph = []
den = []
hou = []
mi = []
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle',
                 'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh',
                 'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results-page URL for this city and offset (as above)
        url = url_template.format(city, start)
        # Make a list for each city
        if city == 'New+York':
            ny.append(url)
        if city == 'Chicago':
            chic.append(url)
        if city == 'San+Francisco':
            sf.append(url)
        if city == 'Austin':
            aus.append(url)
        if city == 'Seattle':
            sea.append(url)
        if city == 'Los+Angeles':
            la.append(url)
        if city == 'Philadelphia':
            phil.append(url)
        if city == 'Atlanta':
            atl.append(url)
        if city == 'Dallas':
            dal.append(url)
        if city == 'Pittsburgh':
            pitt.append(url)
        if city == 'Portland':
            port.append(url)
        if city == 'Phoenix':
            ph.append(url)
        if city == 'Denver':
            den.append(url)
        if city == 'Houston':
            hou.append(url)
        if city == 'Miami':
            mi.append(url)
        # Keep a full set of results as well
        results.append(url)
In [34]:
import pandas as pd
job_details = pd.DataFrame(columns=['location','title','company', 'salary'])
In [39]:
# Take each results page, then extract the fields from every job posting on it.
for result in results:
    html = urllib.urlopen(result).read()
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    for entry in b.find_all('div', {'class': 'result'}):
        try:
            location = entry.find('span', {'class': 'location'}).text
        except AttributeError:
            location = 'NA'
        try:
            title = entry.find('h2', {'class': 'jobtitle'}).text
        except AttributeError:
            title = 'NA'
        try:
            company = entry.find('span', {'class': 'company'}).text
        except AttributeError:
            company = 'NA'
        try:
            salary = entry.find('td', {'class': 'snip'}).find('nobr').text
        except AttributeError:
            salary = 'NONE LISTED'
        job_details.loc[len(job_details)] = [location, title, company, salary]
Lastly, we need to clean up the salary data.
In [51]:
job_details = job_details[job_details.salary != 'NONE LISTED']
In [53]:
job_details = job_details.reset_index()
In [56]:
job_details = job_details.drop('index', 1)
In [59]:
job_details = job_details.drop('level_0', 1)
In [61]:
## YOUR CODE HERE
job_details = job_details[job_details.salary.str.contains("a month") == False]
job_details = job_details[job_details.salary.str.contains("an hour") == False]
job_details = job_details[job_details.salary.str.contains("a week") == False]
job_details = job_details[job_details.salary.str.contains("a day") == False]
In [147]:
job_details
Out[147]:
In [63]:
## YOUR CODE HERE
# These character classes strip the letters and spaces of " a year", then "$" signs and commas, then stray newlines.
job_details['salary'] = job_details['salary'].replace('[\a year,)]', '', regex=True)
job_details['salary'] = job_details['salary'].replace('[\$,)]', '', regex=True)
job_details['company'] = job_details['company'].replace('[\\n,)]', '', regex=True)
job_details['title'] = job_details['title'].replace('[\\n,)]', '', regex=True)
In [388]:
#Checkpoint.
job_details = all_jobs
In [390]:
job_details_2 = job_details.drop_duplicates()
# 533 results before dropping duplicates; left with 34 in total.
In [392]:
# Need to convert the salary ranges into a single number (the midpoint of each range).
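Before applying this to job_details_2, here is a small worked example of the conversion on a toy Series (the toy values are hypothetical):

import pandas as pd

# Toy example: after the cleaning above, salaries look like "117500-127500" or "85000".
toy = pd.Series(['117500-127500', '85000'])
toy = toy.replace('[\-,)]', ' ', regex=True)             # "117500 127500", "85000"
parts = toy.str.split(' ', expand=True).astype(float)    # two numeric columns (NaN when there is no range)
parts[1] = parts[1].fillna(parts[0])                      # single salaries fill both columns
midpoint = parts.median(axis=1)                           # midpoint of each range
print midpoint  # 122500.0 and 85000.0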
In [393]:
job_details_2['salary'] = (job_details_2['salary'].replace( '[\-,)]',' ', regex=True))
In [394]:
job_details_2.reset_index()
Out[394]:
In [395]:
salaries = job_details_2.salary.str.split(' ', expand=True)
In [397]:
salaries = salaries.astype(float)
In [398]:
salaries.dtypes
Out[398]:
In [399]:
salaries = salaries.rename(columns = {0:'salary_1', 1:'salary_2'})
In [401]:
salaries.salary_2 = salaries.salary_2.fillna(salaries.salary_1)
In [403]:
final_salary = salaries.median(axis=1)
In [404]:
final_salary = pd.DataFrame(final_salary)
In [412]:
final_salary = final_salary.rename(columns = {0:'final_salary'})
In [413]:
final_salary.head()
Out[413]:
In [414]:
jobs = pd.concat([job_details_2, final_salary], axis=1)
In [415]:
jobs = jobs.drop('salary', axis=1)
In [416]:
jobs
Out[416]:
In [417]:
jobs.dtypes
Out[417]:
In [418]:
# Export to csv
job_details_csv = jobs.to_csv()  # with no path argument, to_csv() returns the CSV text as a string
In [419]:
job_details_csv
Out[419]:
In [616]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split
In [617]:
rfc = RandomForestClassifier()
knn = KNeighborsClassifier()
We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a binary classification problem, by predicting two classes, HIGH vs LOW salary.
While performing regression may be better, performing classification may help remove some of the noise from the extreme salaries. We don't have to choose the median as the splitting point; we could also split on the 75th percentile or any other reasonable breaking point.
In fact, the ideal scenario may be to predict many levels of salaries.
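A minimal sketch of those alternatives, using the final_salary column built earlier (the variable names and quartile labels are illustrative and are not used elsewhere in the notebook):

# Split at the median (the default choice)
median_cut = jobs['final_salary'].median()
high_vs_median = (jobs['final_salary'] > median_cut).astype(int)

# Split at the 75th percentile instead
q75 = jobs['final_salary'].quantile(0.75)
high_vs_q75 = (jobs['final_salary'] > q75).astype(int)

# Or bin into several salary levels at once (qcut may complain if many salaries are identical)
salary_level = pd.qcut(jobs['final_salary'], 4, labels=['low', 'mid_low', 'mid_high', 'high'])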
In [424]:
# Median salary is 72,500 per year.
# Consider splitting at the mean or a higher percentile instead, as there aren't that many salaries above 72,500.
In [427]:
jobs.final_salary.describe()
Out[427]:
In [428]:
# Label the upper half (salaries above 108,750) as high (1) and the rest as low (0).
jobs['high_or_low'] = jobs['final_salary'].map(lambda x: 1 if x > 108750 else 0)
In [429]:
jobs
Out[429]:
In [448]:
jobs_with_locations = pd.concat([jobs, pd.get_dummies(jobs.location)], axis=1)
In [450]:
jobs_with_locations.head(3)
Out[450]:
This gives us a measure of how well our selected features predict a high or low salary; it is worth comparing against the baseline accuracy of always guessing the majority class (see the sketch below).
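A minimal sketch of that baseline, using the high_or_low label built above (not part of the original notebook):

# Baseline: accuracy obtained by always predicting the most common class.
baseline = jobs['high_or_low'].value_counts(normalize=True).max()
print baseline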
In [612]:
## YOUR CODE HERE
X_1 = jobs_with_locations.drop(jobs_with_locations.columns[[0, 1, 2, 3, 4]], axis=1)  # keep only the location dummies
y_1 = jobs_with_locations.high_or_low
In [ ]:
# Train/test split on the location features, then fit the random forest.
In [490]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)
rfc.fit(X_train, y_train)
Out[490]:
In [491]:
# Accuracy score on the training set.
rfc.score(X_train,y_train)
Out[491]:
In [532]:
from sklearn.cross_validation import cross_val_score
In [493]:
cross_val_score(rfc, X_train, y_train)
Out[493]:
In [ ]:
# The training accuracy is high, but the cross-validation score is substantially lower.
# This may be due to the small sample size.
In [614]:
#Try with KNN
In [668]:
knn.fit(X_train, y_train)
Out[668]:
In [669]:
knn.score(X_train,y_train)
Out[669]:
In [621]:
#Less accurate than with rfc.
In [622]:
cross_val_score(knn, X_train, y_train)
Out[622]:
In [ ]:
#Same cross val score as with rfc.
In [520]:
## YOUR CODE HERE
senior_variable = jobs['title'].map(lambda x: 1 if 'Senior' in x else 0)
In [521]:
senior_variable = pd.DataFrame(senior_variable)
In [523]:
senior_variable = senior_variable.rename(columns = {'title':'senior_variable'})
In [525]:
jobs_with_seniors = pd.concat([jobs, senior_variable], axis=1)
In [526]:
jobs_with_seniors
Out[526]:
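The same pattern generalizes to any keyword in the title; a minimal sketch building several indicator columns at once (the keyword list and the jobs_with_keywords name are illustrative; only 'Senior' and, later, 'Quantitative' are used in the original):

# Build one 0/1 indicator column per title keyword.
keywords = ['Senior', 'Quantitative', 'Manager']  # 'Manager' is just an illustrative extra
title_flags = pd.DataFrame({
    kw.lower() + '_in_title': jobs['title'].str.contains(kw).astype(int)
    for kw in keywords
})
jobs_with_keywords = pd.concat([jobs, title_flags], axis=1)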
In [606]:
X_2 = jobs_with_seniors.drop(jobs_with_seniors.columns[[0, 1, 2, 3, 4]], axis=1)  # keep only the senior indicator
y_2 = jobs_with_seniors.high_or_low
In [624]:
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.5, random_state=50)
rfc.fit(X_train, y_train)
Out[624]:
In [625]:
rfc.score(X_train,y_train)
Out[625]:
In [ ]:
# Not very accurate using "Senior" in the job title as the only feature, but that could be due to the small sample size.
In [626]:
cross_val_score(rfc, X_train, y_train)
Out[626]:
In [ ]:
# "Senior" not highly predictive of high salaries, at least from this dataset
In [671]:
knn.fit(X_train,y_train)
Out[671]:
In [672]:
knn.score(X_train,y_train)
Out[672]:
In [630]:
cross_val_score(knn, X_train, y_train)
Out[630]:
In [597]:
# Saw a few high salaries with "Quantitative" in the title; worth testing for that.
In [598]:
quant_variable = jobs['title'].map(lambda x: 1 if 'Quantitative' in x else 0)
In [600]:
quant_variable = pd.DataFrame(quant_variable)
quant_variable = quant_variable.rename(columns = {'title':'quant_variable'})
In [601]:
jobs_with_quant = pd.concat([jobs, quant_variable], axis=1)
In [31]:
X_3 = jobs_with_quant[['quant_variable']]  # keep it as a DataFrame so sklearn receives a 2-D X
y_3 = jobs_with_quant.high_or_low
In [673]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)
In [674]:
rfc.fit(X_train,y_train)
Out[674]:
In [609]:
rfc.score(X_train,y_train)
Out[609]:
In [610]:
#Accuracy score likely lower than usual because of the limited sample from the scraping.
In [611]:
cross_val_score(rfc, X_train,y_train)
Out[611]:
In [ ]:
# Not predictive according to cross val score.
In [675]:
knn.fit(X_train,y_train)
Out[675]:
In [676]:
knn.score(X_train,y_train)
Out[676]:
In [677]:
cross_val_score(knn,X_train,y_train)
Out[677]:
In [659]:
# The features are already 0/1 dummies, so I'm not sure scaling would be beneficial.
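Scaling indeed does little for 0/1 dummies, but if continuous features were added, a standardization step could be slotted in before a distance-based model like KNN. A minimal sketch, not run as part of the original workflow (knn_scaled is a fresh, hypothetical estimator):

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Sketch: standardize features to zero mean / unit variance before fitting KNN.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the same transform on the test split
knn_scaled = KNeighborsClassifier()
knn_scaled.fit(X_train_scaled, y_train)
print knn_scaled.score(X_test_scaled, y_test)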
In [476]:
## YOUR CODE HERE
In [ ]:
## YOUR CODE HERE
In [635]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()
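As the earlier note points out, we could also regress on the continuous salary itself rather than the binary label. A minimal sketch using the location dummies and final_salary built above (the _reg and _cont names are illustrative and not used elsewhere):

# Sketch: predict the continuous final_salary from the location dummies.
X_reg = jobs_with_locations.drop(jobs_with_locations.columns[[0, 1, 2, 3, 4]], axis=1)
y_reg = jobs_with_locations['final_salary']
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, test_size=0.5, random_state=50)
rfr_cont = RandomForestRegressor()
rfr_cont.fit(Xr_train, yr_train)
print rfr_cont.score(Xr_test, yr_test)  # R^2 on the held-out half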
In [ ]:
#X_1, y_1: location; X_2, y_2: senior; X_3, y_3: quant
In [636]:
#First for location
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)
In [637]:
rfr.fit(X_train,y_train)
Out[637]:
In [665]:
import matplotlib.pyplot as plt
In [638]:
rfr.score(X_train,y_train)
Out[638]:
In [639]:
# A slightly improved score compared to the others.
In [642]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[642]:
In [ ]:
# All negative scores: sklearn's mean_squared_error scoring follows a greater-is-better convention, so it reports negated MSE.
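If a positive, interpretable error is wanted, the sign can simply be flipped (and a square root taken for RMSE); a minimal sketch:

import numpy as np

# cross_val_score negates MSE so that higher is better; undo that for reporting.
neg_mse = cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
print np.sqrt(-neg_mse).mean()  # average RMSE across the five folds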
In [645]:
#Now for "seniors"
In [646]:
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.5, random_state=50)
In [647]:
rfr.fit(X_train,y_train)
Out[647]:
In [648]:
rfr.score(X_train,y_train)
Out[648]:
In [649]:
#Not accurate at all.
In [650]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[650]:
In [ ]:
#Again, negative scores.
In [651]:
#Maybe Quantitative could work.
In [653]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)
In [654]:
rfr.fit(X_train,y_train)
Out[654]:
In [655]:
rfr.score(X_train, y_train)
Out[655]:
In [656]:
#Also very low.
In [657]:
cross_val_score(rfr, X_train, y_train, cv=5, scoring='mean_squared_error')
Out[657]: