Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.

  • Question: Why would we want this to be a classification problem?
  • Answer: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.
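
To make the classification framing concrete, here is a minimal sketch of turning numeric salaries into two classes. The assignment does not say how the classes are defined, so the median split and the helper name `to_high_low` are assumptions made for illustration; the input values are the first five averaged salaries shown later in this notebook.

```python
from statistics import median

def to_high_low(salaries):
    """Label each salary 1 if it is above the median of the list, else 0."""
    cutoff = median(salaries)  # assumed threshold; any other cutoff would work the same way
    return [1 if salary > cutoff else 0 for salary in salaries]

labels = to_high_low([72500.0, 75000.0, 105000.0, 102500.0, 177500.0])
print(labels)
```

A salary exactly at the median falls into the low class here; whether that boundary case should count as high or low is a design choice.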

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10)

Notice, each job listing is underneath a div tag with a class name of result. We can use BeautifulSoup to extract those.

Set up a request (using requests) to the URL below. Use BeautifulSoup to parse the page and extract all results. (HINT: Look for div tags with class name result.)
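
As a rough sketch of the hint, the snippet below runs BeautifulSoup over a small hand-written page instead of a live download, so it works offline. The HTML in `sample_html` is made up for illustration; with the real site you would pass the body of a `requests.get(url)` response instead, and the live markup may differ from what is assumed here.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for the downloaded HTML (an assumption).
sample_html = """
<html><body>
  <div class=" row result" id="p_1"><h2 class="jobtitle">Data Scientist</h2></div>
  <div class="row result" id="p_2"><h2 class="jobtitle">Data Analyst</h2></div>
  <div class="related">Not a job listing</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# class_="result" matches any div whose class list contains "result".
results = soup.find_all("div", class_="result")
result_ids = [div["id"] for div in results]
print(result_ids)
```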

The URL here has several query parameters:

  • q for the job search
  • This is followed by "+$20,000" (URL-encoded as +%2420%2C000) to return results with salaries, or expected salaries, above $20,000
  • l for a location
  • start for what result number to start on
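
The same URL can be assembled from its parts, which makes the role of each parameter easier to see. This is a sketch for Python 3 (the notebook itself runs on Python 2); the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(query, location, start=0):
    """Build an Indeed search URL from the q, l and start parameters."""
    params = {"q": query, "l": location, "start": start}
    # urlencode turns spaces into "+" and escapes "$" and "," as %24 and %2C.
    return "http://www.indeed.com/jobs?" + urlencode(params)

search_url = build_search_url("data scientist $20,000", "New York", start=10)
print(search_url)
```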

In [2]:
#Using a random forest classifier, with one other classifier.

In [3]:
url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [4]:
import requests
import bs4
from bs4 import BeautifulSoup
import urllib
# Fetch the page with requests, as the instructions ask; .content is the raw bytes.
html = requests.get(url).content

In [5]:
b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

In [6]:
#http://stackoverflow.com/questions/9907492/how-to-get-firefox-working-with-selenium-webdriver-on-mac-osx

In [7]:
## YOUR CODE HERE
# attrs must be a dict ({'class': 'summary'}); {'class','summary'} is a set.
b.find_all('span', {'class': 'summary'})


Out[7]:
[<span class="summary">Manage, mentor, and grow our team of <b>data</b> <b>scientists</b> and <b>data</b> analysts. 2+ years leading a team of <b>data</b> <b>scientists</b> and or/analysts and a track record of being a...</span>,
 <span class="summary">KPMG is currently seeking a Manager - Cognitive <b>Data</b> <b>Scientist</b> Natural Language Processing, to join our National Organization....</span>,
 <span class="summary">A passion for manipulating massive amounts of <b>data</b>. You will also work on our Personalization engine which uses Machine Learning and Big <b>Data</b> technologies to...</span>,
 <span class="summary" itemprop="description">\nExperience with big <b>data</b>. Work with partners in <b>Data</b>. The in-depth, statistical &amp; quantitative analyses of consumer <b>data</b>....</span>,
 <span class="summary" itemprop="description">\nHands-on programming skills experience with R, Python etc. Experience in machine learning, <b>data</b> mining, and predictive analysis Experience with natural language...</span>,
 <span class="summary" itemprop="description">\nBe part of a team of <b>Data</b> <b>Scientists</b> that focus on the design and execution of advanced analytical <b>data</b> exploration, mining, inference, models and systems...</span>,
 <span class="summary" itemprop="description">\n(JPMIS) is a new group considering ways to transform our <b>data</b> assets into opportunities for JPMorgan Chase by leveraging the vast amount of proprietary <b>data</b>...</span>,
 <span class="summary" itemprop="description">\nIntegral Ad Science is seeking <b>Data</b> <b>Scientist</b> to work on challenging fundamental <b>data</b> science problems in online advertising;...</span>,
 <span class="summary" itemprop="description">\nCustomer Graph Digital <b>Data</b> <b>Scientist</b>. We are currently looking for a <b>data</b> <b>scientist</b> that has extensive theoretical and practical experience implementing graph...</span>,
 <span class="summary" itemprop="description">\n<b>Data</b> <b>Scientists</b> analyze PlaceIQ hyperlocal <b>data</b> sources to develop accurate predictions of audience and behavior....</span>,
 <span class="summary" itemprop="description">\nAND/ORUtilizes <b>data</b> wrangling/<b>data</b> matching/ETL techniques while programming in several languages to explore a variety of <b>data</b> sources, gain <b>data</b> expertise,...</span>,
 <span class="summary" itemprop="description">\nExperience with <b>data</b> visualization (e.g. As well as third-party <b>data</b> partners. Tests new statistical analysis methods, software and <b>data</b> sources for continual...</span>,
 <span class="summary" itemprop="description">\nWe\u2019re looking for a <b>Data</b> <b>Scientist</b> to join our Analytics team. Deep familiarity with the core concepts, advantages, and tradeoffs of relational and...</span>,
 <span class="summary">Employees are encouraged and expected to build their expertise as <b>data</b> <b>scientists</b>, and deploy analytics to business problems....</span>,
 <span class="summary">1-3 years' experience and evidence with real world <b>data</b> wrangling (querying, storing, cleaning, aggregating &amp; summarizing <b>data</b>)....</span>]

In [8]:
#List of summaries for New York 20,000.
for entry in b.find_all('span', {'class': 'summary'}):
    print(entry.text)


Manage, mentor, and grow our team of data scientists and data analysts. 2+ years leading a team of data scientists and or/analysts and a track record of being a...
KPMG is currently seeking a Manager - Cognitive Data Scientist Natural Language Processing, to join our National Organization....
A passion for manipulating massive amounts of data. You will also work on our Personalization engine which uses Machine Learning and Big Data technologies to...

Experience with big data. Work with partners in Data. The in-depth, statistical & quantitative analyses of consumer data....

Hands-on programming skills experience with R, Python etc. Experience in machine learning, data mining, and predictive analysis Experience with natural language...

Be part of a team of Data Scientists that focus on the design and execution of advanced analytical data exploration, mining, inference, models and systems...

(JPMIS) is a new group considering ways to transform our data assets into opportunities for JPMorgan Chase by leveraging the vast amount of proprietary data...

Integral Ad Science is seeking Data Scientist to work on challenging fundamental data science problems in online advertising;...

Customer Graph Digital Data Scientist. We are currently looking for a data scientist that has extensive theoretical and practical experience implementing graph...

Data Scientists analyze PlaceIQ hyperlocal data sources to develop accurate predictions of audience and behavior....

AND/ORUtilizes data wrangling/data matching/ETL techniques while programming in several languages to explore a variety of data sources, gain data expertise,...

Experience with data visualization (e.g. As well as third-party data partners. Tests new statistical analysis methods, software and data sources for continual...

We’re looking for a Data Scientist to join our Analytics team. Deep familiarity with the core concepts, advantages, and tradeoffs of relational and...
Employees are encouraged and expected to build their expertise as data scientists, and deploy analytics to business problems....
1-3 years' experience and evidence with real world data wrangling (querying, storing, cleaning, aggregating & summarizing data)....

Let's look at one result more closely. A single result looks like this:

<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&campaignid=serp-linkcompanyname&fromjk=2480d203f7e97210&jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>

While this has some more verbose elements removed, we can see that there is some structure to the above:

  • The salary is available in a nobr element inside of a td element with class='snip'.
  • The title of a job is in a link with data-tn-element="jobTitle", inside an h2 with class='jobtitle'.
  • The location is set in a span with class='location'.
  • The company is set in a span with class='company'.

Write 4 functions to extract each item: location, company, job, and salary.

Example

def extract_location_from_result(result):
    return result.find ...
  • Make sure these functions are robust and can handle cases where the data/field may not be available.
  • Remember to check if a field is empty or None before attempting to call methods on it.
  • Remember to use try/except if you anticipate errors.
  • Test the functions on the results above and on simple examples.
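
One way to meet these requirements is to have each function take a single parsed result and return None when a field is missing. This is only a sketch: the `SAMPLE` markup is a trimmed, hand-balanced version of the listing shown above, and the helper `_text_or_none` is a hypothetical name. The notebook's own cells below take a different approach and pass a search URL to each function.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for one listing; it deliberately has no location span.
SAMPLE = """
<div class=" row result">
  <h2 class="jobtitle"><a data-tn-element="jobTitle">AVP/Quantitative Analyst</a></h2>
  <span class="company"><span itemprop="name"><a href="#">
    AllianceBernstein</a></span></span>
  <table><tr><td class="snip">
    <nobr>$117,500 - $127,500 a year</nobr>
  </td></tr></table>
</div>
"""

def _text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def extract_location_from_result(result):
    return _text_or_none(result.find('span', class_='location'))

def extract_company_from_result(result):
    return _text_or_none(result.find('span', class_='company'))

def extract_job_from_result(result):
    return _text_or_none(result.find('a', attrs={'data-tn-element': 'jobTitle'}))

def extract_salary_from_result(result):
    snip = result.find('td', class_='snip')
    return _text_or_none(snip.find('nobr')) if snip is not None else None

result = BeautifulSoup(SAMPLE, 'html.parser').find('div', class_='result')
print(extract_job_from_result(result), extract_salary_from_result(result))
```

Returning None rather than a placeholder string keeps missing values easy to filter later, for example with pandas `dropna`.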

In [9]:
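def extract_job_from_result(result):
    # Here `result` is the URL of a results page, not a single parsed listing.
    html = requests.get(result).content
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    # Return the titles so the caller can use them (the loop used to discard them).
    return [entry.text for entry in b.find_all('h2', {'class': 'jobtitle'})]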

In [10]:
extract_job_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')

In [11]:
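def extract_location_from_result(result):
    # Here `result` is the URL of a results page, not a single parsed listing.
    html = requests.get(result).content
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    # Return the locations so the caller can use them.
    return [entry.text for entry in b.find_all('span', {'class': 'location'})]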

In [12]:
extract_location_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')

In [13]:
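def extract_company_from_result(result):
    # Here `result` is the URL of a results page, not a single parsed listing.
    html = requests.get(result).content
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    # Return the company names so the caller can use them.
    return [entry.text for entry in b.find_all('span', {'class': 'company'})]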

In [14]:
extract_company_from_result('https://www.indeed.com/jobs?q=data+scientist+$20,000&l=New+York&start=10')

In [15]:
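#The salary is available in a nobr element inside of a td element with class='snip'.
def extract_salary_from_result(result):
    # Here `result` is the URL of a results page, not a single parsed listing.
    html = requests.get(result).content
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    salaries = []
    for entry in b.find_all('td', {'class': 'snip'}):
        nobr = entry.find('nobr')
        # Listings without a salary have no nobr element.
        salaries.append(nobr.text if nobr is not None else 'NONE LISTED')
    return salaries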

In [16]:
extract_salary_from_result('http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10')

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

There are two query parameters here we can alter to collect more results, the l=New+York and the start=10. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

Complete the following code to collect results from multiple cities and starting points.
  • Enter your city below to add it to the search
  • Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different
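
As a compact alternative to keeping one list per city, the URLs can be grouped in a dictionary keyed by city. This is a sketch: `build_city_urls` and `page_size` are hypothetical names, and the city list is shortened for the example.

```python
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

def build_city_urls(cities, max_results_per_city, page_size=10):
    """Map each city to its list of result-page URLs, one URL per page of results."""
    urls_by_city = {}
    for city in sorted(set(cities)):  # set() drops a repeated city
        urls_by_city[city] = [
            url_template.format(city, start)
            for start in range(0, max_results_per_city, page_size)
        ]
    return urls_by_city

urls_by_city = build_city_urls(['New+York', 'Chicago', 'Boston', 'Boston'], 30)
# Flatten to a single list when the city grouping is not needed.
all_urls = [url for urls in urls_by_city.values() for url in urls]
print(len(all_urls))
```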

In [17]:
YOUR_CITY = 'Boston'

In [22]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 10000 # A high value (e.g. 5000 or more) generates more results.
# Crawling more results will also take much longer. First test your code on a small number of results and then expand.

results = []
ny = []
chic = []
sf = []
aus = []
sea = []
la = []
phil = []
atl = []
dal = []
pitt = []
port = []
ph = []
den = []
hou = []
mi = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        url = url_template.format(city, start)
        # Make a list for each city
        if city=='New+York':
            ny.append(url)
        if city=='Chicago':
            chic.append(url)
        if city=='San+Francisco':
            sf.append(url)
        if city=='Austin':
            aus.append(url)
        if city=='Seattle':
            sea.append(url)
        if city=='Los+Angeles':
            la.append(url)
        if city=='Philadelphia':
            phil.append(url)
        if city=='Atlanta':
            atl.append(url)
        if city=='Dallas':
            dal.append(url)
        if city=='Pittsburgh':
            pitt.append(url)
        if city=='Portland':
            port.append(url)
        if city=='Phoenix':
            ph.append(url)
        if city=='Denver':
            den.append(url)
        if city=='Houston':
            hou.append(url)
        if city=='Miami':
            mi.append(url)
        # Make a full set of results just in case
        results.append(url)
        pass

Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.
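
A common pattern for this step is to collect one dictionary per listing and build the dataframe once at the end, which avoids growing the dataframe row by row. The rows below are made-up placeholders for illustration, not scraped data.

```python
import pandas as pd

# Placeholder rows standing in for the output of the four extract functions.
rows = [
    {'location': 'Boston, MA', 'title': 'Data Analyst',
     'company': 'Example Co', 'salary': '$100,000 - $110,000 a year'},
    {'location': 'Chicago, IL', 'title': 'Data Scientist',
     'company': 'Another Co', 'salary': None},  # no salary listed
]

job_rows = pd.DataFrame(rows, columns=['location', 'title', 'company', 'salary'])
print(job_rows.shape)
```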


In [34]:
import pandas as pd
job_details = pd.DataFrame(columns=['location','title','company', 'salary'])
b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")

In [39]:
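#Parse each results page, then extract the four fields from each listing on it.
for result in results:
    html = requests.get(result).content
    # Re-parse for every URL; otherwise the same page is read over and over.
    b = BeautifulSoup(html, 'html.parser', from_encoding="utf-8")
    for entry in b.find_all('div', {'class': 'result'}):
        # Search inside `entry`, not the whole page, so each row gets its own values.
        try:
            location = entry.find('span', {'class': 'location'}).text
        except AttributeError:
            location = 'NA'
        try:
            title = entry.find('h2', {'class': 'jobtitle'}).text
        except AttributeError:
            title = 'NA'
        try:
            company = entry.find('span', {'class': 'company'}).text
        except AttributeError:
            company = 'NA'
        try:
            salary = entry.find('td', {'class': 'snip'}).find('nobr').text
        except AttributeError:
            salary = 'NONE LISTED'
        # Append one row per listing (inside the inner loop).
        job_details.loc[len(job_details)] = [location, title, company, salary]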

Lastly, we need to clean up salary data.

  1. Only a small number of the scraped results have salary information - only these will be used for modeling.
  2. Some of the salaries are not yearly but hourly or weekly; these will not be useful to us for now.
  3. Some of the entries may be duplicated
  4. The salaries are given as text and usually with ranges.

Find the entries with annual salaries by filtering out the entries without salaries and those whose salaries are not yearly (for example, those that refer to an hour or a week). Also, remove duplicate entries.
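
The filtering described above can be sketched on a small made-up dataframe. Keeping only rows whose salary text contains "a year" drops both the missing salaries and the hourly or weekly ones in one step; the sample rows and the 'NONE LISTED' marker are assumptions for illustration.

```python
import pandas as pd

listings = pd.DataFrame({
    'title': ['Data Scientist', 'Data Scientist', 'Analyst', 'Intern', 'Engineer'],
    'salary': ['$100,000 a year', '$100,000 a year', '$25 an hour',
               'NONE LISTED', '$90,000 - $110,000 a year'],
})

# Keep yearly salaries only, then drop exact duplicate rows.
is_yearly = listings['salary'].str.contains('a year', regex=False)
yearly = listings[is_yearly].drop_duplicates().reset_index(drop=True)
print(yearly)
```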


In [51]:
# Drop listings without a salary; both placeholder values used above are covered.
job_details = job_details[~job_details.salary.isin(['NONE LISTED', 'NA'])]

In [53]:
# drop=True discards the old index instead of adding it as a column,
# so no follow-up drop of 'index' or 'level_0' is needed.
job_details = job_details.reset_index(drop=True)

In [61]:
## YOUR CODE HERE
# Keep only yearly salaries: drop monthly, hourly, weekly and daily rates.
non_yearly = job_details.salary.str.contains("a month|an hour|a week|a day")
job_details = job_details[~non_yearly]

In [147]:
job_details


Out[147]:
    | location | title | company | salary
0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 70000 75000
1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000
3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 100000 110000
5 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 75000 130000
6 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 75000 130000
7 | Houston, TX | Quantitative Risk Analyst | SearchBankingJobs | 150000 205000
9 | Lexington, MA 02420 | Statistical Programmer | BioPier Inc. | 60000 120000
11 | Boston, MA 02116 (South End area) | Back End Java Developer-Machine Learning | Jobspring Partners | 130000
12 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 100000 165000
13 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 120000 165000
14 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 60096 87045
15 | Houston, TX | Lead Bioinformatics Programmer | Baylor College of Medicine | 90000
16 | Quincy, MA 02169 | Machine Learning Engineer | NRS Global Partners | 95000 120000
19 | Cambridge, MA 02138 (West Cambridge area) | Program Assistant | Union of Concerned Scientists | 35000
20 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 100000 165000
21 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 120000 165000
22 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 60096 87045
23 | Braintree, MA | Inside Sales Representative | PayFactors | 70000 90000
24 | Scottsdale, AZ | Web Security Research Analyst | SiteLock | 50000 55000
25 | Chicago, IL | Lead Data Scientist | Analytic Recruiting | 140000
26 | Chicago, IL | Senior Data Scientist | Selby Jennings | 125000 175000
27 | Sausalito, CA 94965 | Associate Director of Analytics | Workbridge Associates | 150000
28 | San Francisco, CA | Senior Data Science Manager | Harnham | 185000
29 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 59708 72246
30 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 59708 72246
31 | New York, NY | Health Scientist | Centers for Disease Control and Preven... | 88305 146570
32 | New York, NY | Technical Researcher | HOUSING PRESERVATION & DVLPMNT | 70286
33 | New York, NY | Senior Data Scientist - Bio-tech | Harnham | 170000
34 | New York, NY | Quantitative Research Analyst for Multi-Billio... | Averity | 150000 200000
35 | New York, NY 10013 (Tribeca area) | Machine Learning Engineer | Workbridge Associates | 125000
... | ... | ... | ... | ...
722 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
723 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
724 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
725 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
726 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
727 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
728 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
729 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
730 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
731 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
732 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
733 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
734 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
735 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
736 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
737 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
738 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
739 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
740 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
741 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
742 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
743 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
744 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
745 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
747 | Philadelphia, PA | Quantitative Research Analyst | Liberty Personnel Services | 140000
748 | San Fernando, CA | CLS Certifying Scientist (Toxicologist | Lighthouse Recruiting | 65000 85000
749 | Atlanta, GA | Lead Quantitative Analyst | SearchBankingJobs | 150000 205000
750 | Seattle, WA | Assistant Director Data Science | Liberty Mutual | 109900 126100
751 | Seattle, WA | Clinical User Operations | Quartet Health | 50000
752 | Seattle, WA | Variant Scientist (Remote | Lighthouse Recruiting | 90000 100000

533 rows × 4 columns

Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary
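
A minimal sketch of such a function, using only the standard library. The name `salary_to_number` is hypothetical, and the function assumes salaries look like the Indeed strings shown earlier (for example "$117,500 - $127,500 a year"). The cells that follow reach the same result with pandas string operations instead.

```python
import re

def salary_to_number(salary_text):
    """Return the mean of the dollar amounts in a salary string, or None if there are none."""
    if not salary_text:
        return None
    # Match runs of digits that may contain thousands separators, e.g. "117,500".
    amounts = re.findall(r'\d[\d,]*(?:\.\d+)?', salary_text)
    if not amounts:
        return None
    values = [float(amount.replace(',', '')) for amount in amounts]
    return sum(values) / len(values)

print(salary_to_number('$117,500 - $127,500 a year'))
```

A single figure is returned unchanged and a range is averaged, so one function covers both cases.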


In [63]:
## YOUR CODE HERE
# Strip the " a year" suffix, dollar signs, commas and whitespace from salaries,
# leaving e.g. "70000-75000". (The old pattern '[\a year,)]' relied on "\a" being
# read as a bell character and only worked by accident.)
job_details['salary'] = job_details['salary'].replace(r' a year|[$,\s]', '', regex=True)
# Strip newlines, commas and closing parentheses from company and title text.
job_details['company'] = job_details['company'].replace(r'[\n,)]', '', regex=True)
job_details['title'] = job_details['title'].replace(r'[\n,)]', '', regex=True)

In [388]:
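#Checkpoint. Note: all_jobs is not defined in any cell shown here, so this line
#depends on notebook state from a cell that was run but not kept.
job_details = all_jobs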

In [390]:
# .copy() gives an independent dataframe, so the column assignment below
# does not raise a SettingWithCopyWarning.
job_details_2 = job_details.drop_duplicates().copy()
#533 results. Left with 34 in total.

In [392]:
#Need to convert the ranges.

In [393]:
# Replace the hyphen in a range with a space, e.g. "70000-75000" -> "70000 75000".
job_details_2['salary'] = job_details_2['salary'].replace(r'[-,)]', ' ', regex=True)

In [394]:
job_details_2.reset_index()


Out[394]:
    | index | location | title | company | salary
0 | 0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 70000 75000
1 | 1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000
2 | 3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 100000 110000
3 | 5 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 75000 130000
4 | 7 | Houston, TX | Quantitative Risk Analyst | SearchBankingJobs | 150000 205000
5 | 9 | Lexington, MA 02420 | Statistical Programmer | BioPier Inc. | 60000 120000
6 | 11 | Boston, MA 02116 (South End area) | Back End Java Developer-Machine Learning | Jobspring Partners | 130000
7 | 12 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 100000 165000
8 | 13 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 120000 165000
9 | 14 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 60096 87045
10 | 15 | Houston, TX | Lead Bioinformatics Programmer | Baylor College of Medicine | 90000
11 | 16 | Quincy, MA 02169 | Machine Learning Engineer | NRS Global Partners | 95000 120000
12 | 19 | Cambridge, MA 02138 (West Cambridge area) | Program Assistant | Union of Concerned Scientists | 35000
13 | 23 | Braintree, MA | Inside Sales Representative | PayFactors | 70000 90000
14 | 24 | Scottsdale, AZ | Web Security Research Analyst | SiteLock | 50000 55000
15 | 25 | Chicago, IL | Lead Data Scientist | Analytic Recruiting | 140000
16 | 26 | Chicago, IL | Senior Data Scientist | Selby Jennings | 125000 175000
17 | 27 | Sausalito, CA 94965 | Associate Director of Analytics | Workbridge Associates | 150000
18 | 28 | San Francisco, CA | Senior Data Science Manager | Harnham | 185000
19 | 29 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 59708 72246
20 | 31 | New York, NY | Health Scientist | Centers for Disease Control and Preven... | 88305 146570
21 | 32 | New York, NY | Technical Researcher | HOUSING PRESERVATION & DVLPMNT | 70286
22 | 33 | New York, NY | Senior Data Scientist - Bio-tech | Harnham | 170000
23 | 34 | New York, NY | Quantitative Research Analyst for Multi-Billio... | Averity | 150000 200000
24 | 35 | New York, NY 10013 (Tribeca area) | Machine Learning Engineer | Workbridge Associates | 125000
25 | 36 | New York, NY 10167 (Midtown area) | FTR Quantitative Risk Analyst | Selby Jennings | 90000 180000
26 | 37 | New York, NY | Senior Statistician for Fortune 500 Company | Averity | 100000 120000
27 | 251 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 80000 110000
28 | 747 | Philadelphia, PA | Quantitative Research Analyst | Liberty Personnel Services | 140000
29 | 748 | San Fernando, CA | CLS Certifying Scientist (Toxicologist | Lighthouse Recruiting | 65000 85000
30 | 749 | Atlanta, GA | Lead Quantitative Analyst | SearchBankingJobs | 150000 205000
31 | 750 | Seattle, WA | Assistant Director Data Science | Liberty Mutual | 109900 126100
32 | 751 | Seattle, WA | Clinical User Operations | Quartet Health | 50000
33 | 752 | Seattle, WA | Variant Scientist (Remote | Lighthouse Recruiting | 90000 100000

In [395]:
salaries = job_details_2.salary.str.split(' ', expand=True)

In [397]:
salaries = salaries.astype(float)

In [398]:
salaries.dtypes


Out[398]:
0    float64
1    float64
dtype: object

In [399]:
salaries = salaries.rename(columns = {0:'salary_1', 1:'salary_2'})

In [401]:
salaries.salary_2 = salaries.salary_2.fillna(salaries.salary_1)

In [403]:
final_salary = salaries.median(axis=1)

In [404]:
final_salary = pd.DataFrame(final_salary)

In [412]:
final_salary = final_salary.rename(columns = {0:'final_salary'})

In [413]:
final_salary.head()


Out[413]:
final_salary
0 72500.0
1 75000.0
3 105000.0
5 102500.0
7 177500.0

In [414]:
jobs = pd.concat([job_details_2, final_salary], axis=1)

In [415]:
jobs = jobs.drop('salary', axis=1)

In [416]:
jobs


Out[416]:
    | location | title | company | final_salary
0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 72500.0
1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000.0
3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 105000.0
5 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 102500.0
7 | Houston, TX | Quantitative Risk Analyst | SearchBankingJobs | 177500.0
9 | Lexington, MA 02420 | Statistical Programmer | BioPier Inc. | 90000.0
11 | Boston, MA 02116 (South End area) | Back End Java Developer-Machine Learning | Jobspring Partners | 130000.0
12 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 132500.0
13 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 142500.0
14 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 73570.5
15 | Houston, TX | Lead Bioinformatics Programmer | Baylor College of Medicine | 90000.0
16 | Quincy, MA 02169 | Machine Learning Engineer | NRS Global Partners | 107500.0
19 | Cambridge, MA 02138 (West Cambridge area) | Program Assistant | Union of Concerned Scientists | 35000.0
23 | Braintree, MA | Inside Sales Representative | PayFactors | 80000.0
24 | Scottsdale, AZ | Web Security Research Analyst | SiteLock | 52500.0
25 | Chicago, IL | Lead Data Scientist | Analytic Recruiting | 140000.0
26 | Chicago, IL | Senior Data Scientist | Selby Jennings | 150000.0
27 | Sausalito, CA 94965 | Associate Director of Analytics | Workbridge Associates | 150000.0
28 | San Francisco, CA | Senior Data Science Manager | Harnham | 185000.0
29 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 65977.0
31 | New York, NY | Health Scientist | Centers for Disease Control and Preven... | 117437.5
32 | New York, NY | Technical Researcher | HOUSING PRESERVATION & DVLPMNT | 70286.0
33 | New York, NY | Senior Data Scientist - Bio-tech | Harnham | 170000.0
34 | New York, NY | Quantitative Research Analyst for Multi-Billio... | Averity | 175000.0
35 | New York, NY 10013 (Tribeca area) | Machine Learning Engineer | Workbridge Associates | 125000.0
36 | New York, NY 10167 (Midtown area) | FTR Quantitative Risk Analyst | Selby Jennings | 135000.0
37 | New York, NY | Senior Statistician for Fortune 500 Company | Averity | 110000.0
251 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 95000.0
747 | Philadelphia, PA | Quantitative Research Analyst | Liberty Personnel Services | 140000.0
748 | San Fernando, CA | CLS Certifying Scientist (Toxicologist | Lighthouse Recruiting | 75000.0
749 | Atlanta, GA | Lead Quantitative Analyst | SearchBankingJobs | 177500.0
750 | Seattle, WA | Assistant Director Data Science | Liberty Mutual | 118000.0
751 | Seattle, WA | Clinical User Operations | Quartet Health | 50000.0
752 | Seattle, WA | Variant Scientist (Remote | Lighthouse Recruiting | 95000.0

In [417]:
jobs.dtypes


Out[417]:
location         object
title            object
company          object
final_salary    float64
dtype: object

Save your results as a CSV
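
For reference, `DataFrame.to_csv` has to be called, usually with a file path, before anything is written. The sketch below writes to an in-memory buffer rather than a file so that it leaves nothing on disk, then reads the result back to check it; `jobs_demo` is a made-up two-row stand-in for the real `jobs` dataframe.

```python
import io
import pandas as pd

jobs_demo = pd.DataFrame({
    'location': ['Boston, MA', 'Chicago, IL'],
    'final_salary': [72500.0, 140000.0],
})

buffer = io.StringIO()
# index=False leaves the row index out of the CSV.
jobs_demo.to_csv(buffer, index=False)

# Read the CSV text back to confirm that it round-trips.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With a real file, the call would take a path instead of the buffer, for example `jobs.to_csv('jobs.csv', index=False)`, where the file name is arbitrary.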


In [418]:
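# Export to csv. to_csv must be called, with a file path, to write anything;
# the bare attribute on the last line is only a reference to the method, as the output below shows.
jobs.to_csv('jobs.csv', index=False)  # 'jobs.csv' is a placeholder file name
job_details_csv = jobs.to_csv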

In [419]:
job_details_csv


Out[419]:
<bound method DataFrame.to_csv of                                          location  \
0                                      Boston, MA   
1                                      Boston, MA   
3    Boston, MA 02108 (Back Bay-Beacon Hill area)   
5                                   Cambridge, MA   
7                                     Houston, TX   
9                             Lexington, MA 02420   
11              Boston, MA 02116 (South End area)   
12                                     Boston, MA   
13                                     Boston, MA   
14                                     Boston, MA   
15                                    Houston, TX   
16                               Quincy, MA 02169   
19      Cambridge, MA 02138 (West Cambridge area)   
23                                  Braintree, MA   
24                                 Scottsdale, AZ   
25                                    Chicago, IL   
26                                    Chicago, IL   
27                            Sausalito, CA 94965   
28                              San Francisco, CA   
29                                  Manhattan, NY   
31                                   New York, NY   
32                                   New York, NY   
33                                   New York, NY   
34                                   New York, NY   
35              New York, NY 10013 (Tribeca area)   
36              New York, NY 10167 (Midtown area)   
37                                   New York, NY   
251                                     Plano, TX   
747                              Philadelphia, PA   
748                              San Fernando, CA   
749                                   Atlanta, GA   
750                                   Seattle, WA   
751                                   Seattle, WA   
752                                   Seattle, WA   

                                                 title  \
0    SAS Statistical Data Analyst-Healthcare Marketing   
1        SAS Program / Statistical Analyst-Data Mining   
3                                         Data Analyst   
5    Front-End Developer - JavaScript/HTML5 for EdT...   
7                            Quantitative Risk Analyst   
9                               Statistical Programmer   
11            Back End Java Developer-Machine Learning   
12      Spark/Hadoop/Scala at Top Growth Stage Startup   
13   Principal Software Engineer Lead Software Engi...   
14                        Clinical Laboratory Surveyor   
15                      Lead Bioinformatics Programmer   
16                           Machine Learning Engineer   
19                                   Program Assistant   
23                         Inside Sales Representative   
24                       Web Security Research Analyst   
25                                 Lead Data Scientist   
26                               Senior Data Scientist   
27                     Associate Director of Analytics   
28                         Senior Data Science Manager   
29   City Research Scientist Bureau of the Public H...   
31                                    Health Scientist   
32                                Technical Researcher   
33                    Senior Data Scientist - Bio-tech   
34   Quantitative Research Analyst for Multi-Billio...   
35                           Machine Learning Engineer   
36                       FTR Quantitative Risk Analyst   
37         Senior Statistician for Fortune 500 Company   
251                Genetic Counselor Clinical Genomics   
747                      Quantitative Research Analyst   
748             CLS Certifying Scientist (Toxicologist   
749                          Lead Quantitative Analyst   
750                    Assistant Director Data Science   
751                           Clinical User Operations   
752                          Variant Scientist (Remote   

                                               company  final_salary  
0                              ALC Staffing Associates       72500.0  
1                                        Alexander Bec       75000.0  
3                     Infotek Consulting Services Inc.      105000.0  
5                                     The Bivium Group      102500.0  
7                                    SearchBankingJobs      177500.0  
9                                         BioPier Inc.       90000.0  
11                                  Jobspring Partners      130000.0  
12                               Workbridge Associates      132500.0  
13                                    The Bivium Group      142500.0  
14                       Commonwealth of Massachusetts       73570.5  
15                          Baylor College of Medicine       90000.0  
16                                 NRS Global Partners      107500.0  
19                       Union of Concerned Scientists       35000.0  
23                                          PayFactors       80000.0  
24                                            SiteLock       52500.0  
25                                 Analytic Recruiting      140000.0  
26                                      Selby Jennings      150000.0  
27                               Workbridge Associates      150000.0  
28                                             Harnham      185000.0  
29                       DEPT OF HEALTH/MENTAL HYGIENE       65977.0  
31           Centers for Disease Control and Preven...      117437.5  
32                      HOUSING PRESERVATION & DVLPMNT       70286.0  
33                                             Harnham      170000.0  
34                                             Averity      175000.0  
35                               Workbridge Associates      125000.0  
36                                      Selby Jennings      135000.0  
37                                             Averity      110000.0  
251                              Lighthouse Recruiting       95000.0  
747                         Liberty Personnel Services      140000.0  
748                              Lighthouse Recruiting       75000.0  
749                                  SearchBankingJobs      177500.0  
750                                     Liberty Mutual      118000.0  
751                                     Quartet Health       50000.0  
752                              Lighthouse Recruiting       95000.0  >

Predicting salaries using Random Forests + Another Classifier

Load in the data of scraped salaries


In [616]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
# sklearn.cross_validation was removed in later scikit-learn releases; model_selection replaces it
from sklearn.model_selection import train_test_split

In [617]:
rfc = RandomForestClassifier()
knn = KNeighborsClassifier()

We want to predict a binary variable: whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median).

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a binary classification problem, by predicting two classes, HIGH vs LOW salary.

While regression may be more precise, classification can help remove some of the noise from extreme salaries. We don't have to choose the median as the splitting point; we could also split on the 75th percentile or any other reasonable breakpoint.

In fact, the ideal scenario may be to predict many levels of salaries.
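The alternative split points mentioned above can be sketched as follows. This is a minimal illustration on a handful of made-up salaries (the `salaries` Series and the level names are assumptions, not the scraped data); it shows a median split, a 75th-percentile split, and a four-level split using `pd.qcut`.

```python
import pandas as pd

# Hypothetical salaries standing in for jobs.final_salary
salaries = pd.Series(
    [35000, 52500, 72500, 90000, 108750, 125000, 140000, 185000],
    name="final_salary",
)

# Binary target: 1 when the salary is strictly above the median
median_split = (salaries > salaries.median()).astype(int)

# Binary target: 1 only for the top quarter of salaries
p75_split = (salaries > salaries.quantile(0.75)).astype(int)

# Four ordered salary levels built from the quartiles (level names are illustrative)
levels = pd.qcut(salaries, q=4, labels=["low", "mid-low", "mid-high", "high"])

print(median_split.tolist())
print(p75_split.tolist())
print(levels.tolist())
```

A higher threshold makes the positive class rarer, which also raises the baseline accuracy that any classifier has to beat.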


In [424]:
# Median salary was noted as 72,500 per year; verify it against describe() below,
# which reports the mean and the 50th percentile (the 50th percentile is the median).

In [427]:
jobs.final_salary.describe()


Out[427]:
count        34.000000
mean     112066.794118
std       40383.170629
min       35000.000000
25%       76250.000000
50%      108750.000000
75%      140000.000000
max      185000.000000
Name: final_salary, dtype: float64

In [428]:
# High salary = strictly above the median (108,750 for this sample)
median_salary = jobs['final_salary'].median()
jobs['high_or_low'] = (jobs['final_salary'] > median_salary).astype(int)

In [429]:
jobs


Out[429]:
index | location | title | company | final_salary | high_or_low
0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 72500.0 | 0
1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000.0 | 0
3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 105000.0 | 0
5 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 102500.0 | 0
7 | Houston, TX | Quantitative Risk Analyst | SearchBankingJobs | 177500.0 | 1
9 | Lexington, MA 02420 | Statistical Programmer | BioPier Inc. | 90000.0 | 0
11 | Boston, MA 02116 (South End area) | Back End Java Developer-Machine Learning | Jobspring Partners | 130000.0 | 1
12 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 132500.0 | 1
13 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 142500.0 | 1
14 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 73570.5 | 0
15 | Houston, TX | Lead Bioinformatics Programmer | Baylor College of Medicine | 90000.0 | 0
16 | Quincy, MA 02169 | Machine Learning Engineer | NRS Global Partners | 107500.0 | 0
19 | Cambridge, MA 02138 (West Cambridge area) | Program Assistant | Union of Concerned Scientists | 35000.0 | 0
23 | Braintree, MA | Inside Sales Representative | PayFactors | 80000.0 | 0
24 | Scottsdale, AZ | Web Security Research Analyst | SiteLock | 52500.0 | 0
25 | Chicago, IL | Lead Data Scientist | Analytic Recruiting | 140000.0 | 1
26 | Chicago, IL | Senior Data Scientist | Selby Jennings | 150000.0 | 1
27 | Sausalito, CA 94965 | Associate Director of Analytics | Workbridge Associates | 150000.0 | 1
28 | San Francisco, CA | Senior Data Science Manager | Harnham | 185000.0 | 1
29 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 65977.0 | 0
31 | New York, NY | Health Scientist | Centers for Disease Control and Preven... | 117437.5 | 1
32 | New York, NY | Technical Researcher | HOUSING PRESERVATION & DVLPMNT | 70286.0 | 0
33 | New York, NY | Senior Data Scientist - Bio-tech | Harnham | 170000.0 | 1
34 | New York, NY | Quantitative Research Analyst for Multi-Billio... | Averity | 175000.0 | 1
35 | New York, NY 10013 (Tribeca area) | Machine Learning Engineer | Workbridge Associates | 125000.0 | 1
36 | New York, NY 10167 (Midtown area) | FTR Quantitative Risk Analyst | Selby Jennings | 135000.0 | 1
37 | New York, NY | Senior Statistician for Fortune 500 Company | Averity | 110000.0 | 1
251 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 95000.0 | 0
747 | Philadelphia, PA | Quantitative Research Analyst | Liberty Personnel Services | 140000.0 | 1
748 | San Fernando, CA | CLS Certifying Scientist (Toxicologist | Lighthouse Recruiting | 75000.0 | 0
749 | Atlanta, GA | Lead Quantitative Analyst | SearchBankingJobs | 177500.0 | 1
750 | Seattle, WA | Assistant Director Data Science | Liberty Mutual | 118000.0 | 1
751 | Seattle, WA | Clinical User Operations | Quartet Health | 50000.0 | 0
752 | Seattle, WA | Variant Scientist (Remote | Lighthouse Recruiting | 95000.0 | 0

In [448]:
jobs_with_locations = pd.concat([jobs, pd.get_dummies(jobs.location)], axis=1)

In [450]:
jobs_with_locations.head(3)


Out[450]:
index | location | title | company | final_salary | high_or_low | Atlanta, GA | Boston, MA | Boston, MA 02108 (Back Bay-Beacon Hill area) | Boston, MA 02116 (South End area) | Braintree, MA | ... | New York, NY 10013 (Tribeca area) | New York, NY 10167 (Midtown area) | Philadelphia, PA | Plano, TX | Quincy, MA 02169 | San Fernando, CA | San Francisco, CA | Sausalito, CA 94965 | Scottsdale, AZ | Seattle, WA
0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 72500.0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000.0 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 105000.0 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0

3 rows × 27 columns

Thought experiment: What is the baseline accuracy for this model?

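The baseline accuracy is what we would get by always predicting the most common class, without using any features. Because the classes were split at the median, 17 of the 34 listings are high and 17 are low, so the baseline is 50%. Our selected features are only useful if the model's held-out accuracy beats that figure.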
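As a quick check of the baseline, the sketch below recomputes it from the `high_or_low` labels shown in the table above (copied here by hand into a list) and confirms it with scikit-learn's `DummyClassifier`. The feature matrix passed to the dummy model is a placeholder of zeros, since a majority-class predictor ignores its inputs.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# high_or_low labels in the row order of the table above
high_or_low = pd.Series(
    [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
     1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
)

# Share of the most common class = accuracy of always guessing that class
baseline = high_or_low.value_counts(normalize=True).max()

# Same idea via scikit-learn; the zero matrix is a placeholder for real features
placeholder_X = np.zeros((len(high_or_low), 1))
dummy = DummyClassifier(strategy="most_frequent").fit(placeholder_X, high_or_low)
dummy_accuracy = dummy.score(placeholder_X, high_or_low)

print(baseline, dummy_accuracy)
```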

Create a Random Forest model to predict High/Low salary. Start by ONLY using the location as a feature.


In [612]:
## YOUR CODE HERE
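# Drop the first five columns (location, title, company, final_salary, high_or_low),
# leaving only the location dummies as features
X_1 = jobs_with_locations.drop(jobs_with_locations.columns[:5], axis=1)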
y_1 = jobs_with_locations.high_or_low

In [ ]:
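# No train/test split cell was recorded here. The calls below need X_train and y_train,
# so this presumably mirrors the 50/50 split used for the regressor later on.
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)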

In [490]:
rfc.fit(X_train,y_train)


Out[490]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [491]:
# Accuracy on the training data (not a held-out score).
rfc.score(X_train,y_train)


Out[491]:
0.88235294117647056

In [532]:
# sklearn.cross_validation was removed in later scikit-learn releases; model_selection replaces it
from sklearn.model_selection import cross_val_score

In [493]:
cross_val_score(rfc, X_train, y_train)


Out[493]:
array([ 0.57142857,  0.4       ,  0.4       ])

In [ ]:
# Training accuracy is high (0.88), but the cross-validation scores (0.40 to 0.57) are far lower,
# which points to overfitting. The small sample likely contributes.

In [614]:
#Try with KNN

In [668]:
knn.fit(X_train, y_train)


Out[668]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [669]:
knn.score(X_train,y_train)


Out[669]:
0.58823529411764708

In [621]:
# Lower training accuracy than rfc.

In [622]:
cross_val_score(knn, X_train, y_train)


Out[622]:
array([ 0.57142857,  0.6       ,  0.6       ])

In [ ]:
# Cross-val scores are slightly higher than rfc's on two of the three folds (0.6 vs 0.4).

Create a few new variables in your dataframe to represent interesting features of a job title.

  • For example, create a feature that represents whether 'Senior' is in the title
  • or whether 'Manager' is in the title.
  • Then build a new Random Forest with these features. Do they add any value?
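One way to build several title features at once is a keyword loop over `Series.str.contains`. This is only a sketch: the toy `titles` Series and the `keywords` list are illustrative assumptions, and the match here is case-insensitive, which differs from a plain `'Senior' in x` check.

```python
import pandas as pd

# Toy titles; in the notebook this would be jobs['title']
titles = pd.Series(
    [
        "Senior Data Scientist",
        "Senior Data Science Manager",
        "Quantitative Risk Analyst",
        "Program Assistant",
    ],
    name="title",
)

# Hypothetical keyword list; extend it with any term worth testing
keywords = ["Senior", "Manager", "Quantitative"]

# One 0/1 column per keyword, e.g. has_senior, has_manager
features = pd.DataFrame(
    {
        "has_" + kw.lower(): titles.str.contains(kw, case=False, regex=False).astype(int)
        for kw in keywords
    }
)

print(features)
```

The resulting frame can be joined to the location dummies with `pd.concat(..., axis=1)` so that a single model sees both kinds of feature.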

In [520]:
## YOUR CODE HERE
senior_variable = jobs['title'].map(lambda x: 1 if 'Senior' in x else 0)

In [521]:
senior_variable = pd.DataFrame(senior_variable)

In [523]:
senior_variable = senior_variable.rename(columns = {'title':'senior_variable'})

In [525]:
jobs_with_seniors = pd.concat([jobs, senior_variable], axis=1)

In [526]:
jobs_with_seniors


Out[526]:
index | location | title | company | final_salary | high_or_low | senior_variable
0 | Boston, MA | SAS Statistical Data Analyst-Healthcare Marketing | ALC Staffing Associates | 72500.0 | 0 | 0
1 | Boston, MA | SAS Program / Statistical Analyst-Data Mining | Alexander Bec | 75000.0 | 0 | 0
3 | Boston, MA 02108 (Back Bay-Beacon Hill area) | Data Analyst | Infotek Consulting Services Inc. | 105000.0 | 0 | 0
5 | Cambridge, MA | Front-End Developer - JavaScript/HTML5 for EdT... | The Bivium Group | 102500.0 | 0 | 0
7 | Houston, TX | Quantitative Risk Analyst | SearchBankingJobs | 177500.0 | 1 | 0
9 | Lexington, MA 02420 | Statistical Programmer | BioPier Inc. | 90000.0 | 0 | 0
11 | Boston, MA 02116 (South End area) | Back End Java Developer-Machine Learning | Jobspring Partners | 130000.0 | 1 | 0
12 | Boston, MA | Spark/Hadoop/Scala at Top Growth Stage Startup | Workbridge Associates | 132500.0 | 1 | 0
13 | Boston, MA | Principal Software Engineer Lead Software Engi... | The Bivium Group | 142500.0 | 1 | 0
14 | Boston, MA | Clinical Laboratory Surveyor | Commonwealth of Massachusetts | 73570.5 | 0 | 0
15 | Houston, TX | Lead Bioinformatics Programmer | Baylor College of Medicine | 90000.0 | 0 | 0
16 | Quincy, MA 02169 | Machine Learning Engineer | NRS Global Partners | 107500.0 | 0 | 0
19 | Cambridge, MA 02138 (West Cambridge area) | Program Assistant | Union of Concerned Scientists | 35000.0 | 0 | 0
23 | Braintree, MA | Inside Sales Representative | PayFactors | 80000.0 | 0 | 0
24 | Scottsdale, AZ | Web Security Research Analyst | SiteLock | 52500.0 | 0 | 0
25 | Chicago, IL | Lead Data Scientist | Analytic Recruiting | 140000.0 | 1 | 0
26 | Chicago, IL | Senior Data Scientist | Selby Jennings | 150000.0 | 1 | 1
27 | Sausalito, CA 94965 | Associate Director of Analytics | Workbridge Associates | 150000.0 | 1 | 0
28 | San Francisco, CA | Senior Data Science Manager | Harnham | 185000.0 | 1 | 1
29 | Manhattan, NY | City Research Scientist Bureau of the Public H... | DEPT OF HEALTH/MENTAL HYGIENE | 65977.0 | 0 | 0
31 | New York, NY | Health Scientist | Centers for Disease Control and Preven... | 117437.5 | 1 | 0
32 | New York, NY | Technical Researcher | HOUSING PRESERVATION & DVLPMNT | 70286.0 | 0 | 0
33 | New York, NY | Senior Data Scientist - Bio-tech | Harnham | 170000.0 | 1 | 1
34 | New York, NY | Quantitative Research Analyst for Multi-Billio... | Averity | 175000.0 | 1 | 0
35 | New York, NY 10013 (Tribeca area) | Machine Learning Engineer | Workbridge Associates | 125000.0 | 1 | 0
36 | New York, NY 10167 (Midtown area) | FTR Quantitative Risk Analyst | Selby Jennings | 135000.0 | 1 | 0
37 | New York, NY | Senior Statistician for Fortune 500 Company | Averity | 110000.0 | 1 | 1
251 | Plano, TX | Genetic Counselor Clinical Genomics | Lighthouse Recruiting | 95000.0 | 0 | 0
747 | Philadelphia, PA | Quantitative Research Analyst | Liberty Personnel Services | 140000.0 | 1 | 0
748 | San Fernando, CA | CLS Certifying Scientist (Toxicologist | Lighthouse Recruiting | 75000.0 | 0 | 0
749 | Atlanta, GA | Lead Quantitative Analyst | SearchBankingJobs | 177500.0 | 1 | 0
750 | Seattle, WA | Assistant Director Data Science | Liberty Mutual | 118000.0 | 1 | 0
751 | Seattle, WA | Clinical User Operations | Quartet Health | 50000.0 | 0 | 0
752 | Seattle, WA | Variant Scientist (Remote | Lighthouse Recruiting | 95000.0 | 0 | 0

In [606]:
# Drop the first five columns, leaving only senior_variable as the feature
X_2 = jobs_with_seniors.drop(jobs_with_seniors.columns[:5], axis=1)
y_2 = jobs_with_seniors.high_or_low

In [624]:
rfc.fit(X_train,y_train)


Out[624]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [625]:
rfc.score(X_train,y_train)


Out[625]:
0.6470588235294118

In [ ]:
#Not the most accurate for "senior" in job title, but could be because of the small size of sample

In [626]:
cross_val_score(rfc, X_train, y_train)


Out[626]:
array([ 0.57142857,  0.6       ,  0.6       ])

In [ ]:
# "Senior" not highly predictive of high salaries, at least from this dataset

In [671]:
knn.fit(X_train,y_train)


Out[671]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [672]:
knn.score(X_train,y_train)


Out[672]:
0.58823529411764708

In [630]:
cross_val_score(knn, X_train, y_train)


Out[630]:
array([ 0.57142857,  0.6       ,  0.6       ])

In [597]:
# Saw a few high salaries with "Quantitative" in there--should test for that

In [598]:
quant_variable = jobs['title'].map(lambda x: 1 if 'Quantitative' in x else 0)

In [600]:
quant_variable = pd.DataFrame(quant_variable)
quant_variable = quant_variable.rename(columns = {'title':'quant_variable'})

In [601]:
jobs_with_quant = pd.concat([jobs, quant_variable], axis=1)

In [31]:
# Keep X_3 two-dimensional (a one-column DataFrame), as scikit-learn estimators expect
X_3 = jobs_with_quant[['quant_variable']]
y_3 = jobs_with_quant.high_or_low

In [673]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)

In [674]:
rfc.fit(X_train,y_train)


Out[674]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [609]:
rfc.score(X_train,y_train)


Out[609]:
0.58823529411764708

In [610]:
#Accuracy score likely lower than usual because of the limited sample from the scraping.

In [611]:
cross_val_score(rfc, X_train,y_train)


Out[611]:
array([ 0.42857143,  0.6       ,  0.4       ])

In [ ]:
# Not predictive according to cross val score.

In [675]:
knn.fit(X_train,y_train)


Out[675]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [676]:
knn.score(X_train,y_train)


Out[676]:
0.58823529411764708

In [677]:
cross_val_score(knn,X_train,y_train)


Out[677]:
array([ 0.57142857,  0.6       ,  0.6       ])

Rebuild this model with scikit-learn.

  • You can either create the dummy features manually or use the dmatrix function from patsy
  • Remember to scale the feature variables as well!

In [659]:
#I've already turned them to dummies, so I'm not sure scaling would be beneficial.

In [476]:
## YOUR CODE HERE

Use cross-validation in scikit-learn to evaluate the model above.

  • Evaluate the accuracy of the model.
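The evaluation step could look like the sketch below. It uses the `high_or_low` labels and the `senior_variable` flags from the tables above (copied by hand), with logistic regression standing in for whichever model is being evaluated. The fold count, the shuffling seed and the choice of `StratifiedKFold` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# high_or_low labels in the row order of the table above
y = np.array(
    [0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1,
     1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
)

# senior_variable: 1 for the four titles containing 'Senior' (rows 26, 28, 33, 37)
senior = np.zeros(34)
senior[[16, 18, 22, 26]] = 1
X = senior.reshape(-1, 1)

# Stratified folds keep the high/low balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler sits inside the pipeline, so it is refit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print(scores, mean_accuracy)
```

Comparing `mean_accuracy` with the 50% baseline is a more honest measure than the training score, especially with only 34 rows.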

In [ ]:
## YOUR CODE HERE

Random Forest Regressor

Let's try treating this as a regression problem.

  • Train a random forest regressor on the regression problem and predict your dependent.
  • Evaluate the score with a 5-fold cross-validation
  • Do a scatter plot of the predicted vs actual values for each of the 5 folds. Do they match?
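The three steps above can be sketched as follows. The data here is synthetic (random 0/1 features and a made-up salary formula), so the numbers mean nothing; the point is the shape of the loop: fit on four folds, predict the fifth, record the R^2, and plot predicted against actual values per fold.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in data: three 0/1 features and a continuous salary target
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(40, 3)).astype(float)
y = 90000 + 30000 * X[:, 0] + 15000 * X[:, 1] + rng.normal(0, 5000, size=40)

fig, ax = plt.subplots()
fold_scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    predicted = model.predict(X[test_idx])
    # R^2 on the held-out fold
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    ax.scatter(y[test_idx], predicted, label="fold %d" % fold)

# Points on the dashed diagonal are perfect predictions
limits = [y.min(), y.max()]
ax.plot(limits, limits, linestyle="--", color="gray")
ax.set_xlabel("actual salary")
ax.set_ylabel("predicted salary")
ax.legend()

print(fold_scores)
```

Note that the target in this sketch is a continuous salary; fitting a regressor to the 0/1 `high_or_low` label instead produces predictions between 0 and 1 rather than salary estimates.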

In [635]:
## YOUR CODE HERE
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()

In [ ]:
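# X_1, y_1: location; X_2, y_2: senior; X_3, y_3: quant
# Note: each y is still the binary high_or_low label, so the regressor predicts a 0/1 value, not final_salary.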

In [636]:
#First for location
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, test_size=0.5, random_state=50)

In [637]:
rfr.fit(X_train,y_train)


Out[637]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [665]:
import matplotlib.pyplot as plt

In [638]:
rfr.score(X_train,y_train)


Out[638]:
0.59722817460317446

In [639]:
# This is R^2 on the training data, so it is not directly comparable to the classifiers' accuracy scores.

In [642]:
# Newer scikit-learn releases name this scorer 'neg_mean_squared_error'
cross_val_score(rfr, X_train, y_train, cv=5, scoring='neg_mean_squared_error')


Out[642]:
array([-0.18777778, -0.4975    , -0.47893333, -0.66      , -0.32      ])

In [ ]:
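# The scores are negative because scikit-learn negates MSE so that higher is better;
# the errors themselves range from 0.19 to 0.66 on a 0/1 target.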

In [645]:
#Now for "seniors"

In [646]:
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, test_size=0.5, random_state=50)

In [647]:
rfr.fit(X_train,y_train)


Out[647]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [648]:
rfr.score(X_train,y_train)


Out[648]:
0.083613968253968163

In [649]:
# R^2 of 0.08 even on the training data: "Senior" alone explains almost none of the variance.

In [650]:
# Newer scikit-learn releases name this scorer 'neg_mean_squared_error'
cross_val_score(rfr, X_train, y_train, cv=5, scoring='neg_mean_squared_error')


Out[650]:
array([-0.1895362 , -0.26093463, -0.3225    , -0.3435363 , -0.21301775])

In [ ]:
# Negative again by convention (negated MSE); the errors range from 0.19 to 0.34.

In [651]:
#Maybe Quantitative could work.

In [653]:
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, test_size=0.5, random_state=50)

In [654]:
rfr.fit(X_train,y_train)


Out[654]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [655]:
rfr.score(X_train, y_train)


Out[655]:
0.086571800833340373

In [656]:
#Also very low.

In [657]:
# Newer scikit-learn releases name this scorer 'neg_mean_squared_error'
cross_val_score(rfr, X_train, y_train, cv=5, scoring='neg_mean_squared_error')


Out[657]:
array([-0.26600446, -0.25525259, -0.29889602, -0.25791682, -0.3144752 ])