Scraping Indeed Job listings for the term "Data Scientist"

This little demo will scrape job listings and company information from Indeed.com

This code has been taken from a post on the NYC Data Science Academy website titled Project 3: Web Scraping Company Data from Indeed.com and Dice.com by Sung Pil Moon.

I have modified the code to work properly with Indeed.com's new HTML structure.

Scraping the listings

First we need to get the job listings, if you click the link below you can see the page we are going to scrape: https://www.indeed.com/jobs?q=data+scientist&jt=fulltime&sort=date

There are 21,209 listings! This is way too many to copy and past by hand, lets automate it!



In [12]:

    
# import libraries
from bs4 import BeautifulSoup as Soup
import requests
import pandas as pd

# indeed.com url
base_url = 'http://www.indeed.com/jobs?q=data+scientist&jt=fulltime&sort='
sort_by = 'date'          # sort by data
start_from = '&start='    # start page number
home_url = "http://www.indeed.com"
print(home_url)









    



http://www.indeed.com



In [6]:

    
# Create a list to contain all the job postings
job_listings = []

for page in range(0,500,10): # page from 1 to 100 (last page we can scrape is 100)
    if page % 100 == 0:
        print("Scraping page {}".format(page // 10))
    url = "%s%s%s%d" % (base_url, sort_by, start_from, page) # get full url
    target = Soup(requests.get(url).text, "lxml") 

    targetElements = target.findAll('div', attrs={'class' : 'result'}) # we're interested in each row (= each job)
    
    # trying to get each specific job information (such as company name, job title, urls, ...)
    for elem in targetElements:
        
        try:
            comp_name = elem.find('span', "company").text.strip()
            job_title = elem.find('a', attrs={'class':'turnstileLink'}).attrs['title']
            job_addr = elem.find('span',"location").text
            job_link = "%s%s" % (home_url,elem.find('a').get('href'))
            job_summary = elem.find('span',"summary").text.strip()

            if elem.find('span', "company").find("a"):

                company_link = elem.find('span', "company").find("a")
                comp_link_overall = "%s%s" % (home_url, company_link['href'])
            else:
                comp_link_overall = None

            # add a job info to our data frame
            job_listings.append({'company_name': comp_name, 
                                 'job_title': job_title, 
                                 'job_link': job_link,
                                 'job_summary': job_summary,
                                 'company_link': comp_link_overall, 
                                 'job_location': job_addr})
        
        # Some ofthe listings are missing information, we are going to skip them
        except:
            print("Bad data on search page")
            print(url)
            
print("Scrapting Finish! Collected {} job postings!".format(len(job_listings)))









    



Scraping page 0
Scraping page 10
Scraping page 20
Scraping page 30
Scraping page 40
Scrapting Finish! Collected 748 job postings!



In [8]:

    
jobs_dataframe = pd.DataFrame(job_listings)
jobs_dataframe.to_csv("jobs-data.csv", index=False)
jobs_dataframe.head()









    Out[8]:






  
    
      
      company_link
      company_name
      job_link
      job_location
      job_summary
      job_title
    
  
  
    
      0
      http://www.indeed.com/cmp/Allegheny-General-Ho...
      Allegheny General Hospital
      http://www.indeed.com/rc/clk?jk=805f6ed65b49dd...
      Pittsburgh, PA
      Allegheny Health Network’s clinical expertise ...
      Research Data Analyst - Neurosurgery
    
    
      1
      http://www.indeed.com/cmp/Marathon-Oil
      Marathon Petroleum Corporation
      http://www.indeed.com/rc/clk?jk=b28767d68e08b1...
      Findlay, OH
      The vision for the Advanced Analytics team is ...
      Data Scientist
    
    
      2
      http://www.indeed.com/cmp/Oracle
      Oracle
      http://www.indeed.com/rc/clk?jk=56e6eae3c307f9...
      United States
      The Cloud Data Curation team is looking for Sc...
      Data Scientist 5
    
    
      3
      http://www.indeed.com/cmp/Google
      Google
      http://www.indeed.com/rc/clk?jk=35f4fa4d806cc9...
      Mountain View, CA
      From creating experiments and prototyping impl...
      Research Scientist, Google Brain (United States)
    
    
      4
      http://www.indeed.com/cmp/Childrens-Hospital-L...
      Childrens Hospital Los Angeles
      http://www.indeed.com/rc/clk?jk=ff4d4e425d9c76...
      Los Angeles, CA
      The Data Scientist conducts research in medica...
      Data Scientist, VPICU

Getting Information about the company

OK, now that I have information about all of the listings, why don't we try and get some information about the companies posting those jobs. Here is an example of a company page: https://www.indeed.com/cmp/Kpmg?from=SERP&fromjk=2cc9b68015bf617f&jcid=2dd390c3a48a7ed0&attributionid=serp-linkcompanyname



In [9]:

    
# remove duplicate company URLs
company_urls = set(listing['company_link'] for listing in job_listings)
print(len(job_listings))
print(len(company_urls))



In [10]:

    
company_info = []

for i,company_url in enumerate(company_urls):
    if i % 50 == 0:
        total_listings = len(company_urls)
        print("Scraping company {} of {}".format(i, total_listings))
    
    # skip the None 
    if not company_url:
        continue

    company_page = Soup(requests.get(company_url).text, "lxml")
    
    
    # get the company ratings
    ratings = company_page.find_all('span','cmp-star-rating')
    
    company_info.append({
    'url'                          : company_url,
    'overall_rating'               : float(company_page.find('span','cmp-average-rating').text),
    'wl_balanace_rating'           : float(ratings[0].text),
    'compensation_benefits_rating' : float(ratings[1].text),
    'js_advancement_rating'        : float(ratings[2].text),
    'management_rating'            : float(ratings[3].text),
    'culture_rating'               : float(ratings[4].text)})









    



Scraping company 0 of 219
Scraping company 50 of 219
Scraping company 100 of 219
Scraping company 150 of 219
Scraping company 200 of 219



In [80]:

    
company_dataframe = pd.DataFrame(company_info)
company_dataframe.to_csv("company-data.csv", index=False)
company_dataframe.head()









    Out[80]:






  
    
      
      compensation_benefits_rating
      culture_rating
      js_advancement_rating
      management_rating
      overall_rating
      url
      wl_balanace_rating
    
  
  
    
      0
      3.9
      3.5
      3.3
      3.4
      3.9
      http://www.indeed.com/cmp/Population-Council
      3.8
    
    
      1
      3.6
      3.8
      3.2
      3.9
      4.0
      http://www.indeed.com/cmp/Ensco,-Inc.
      4.1
    
    
      2
      3.9
      3.6
      3.3
      3.4
      3.8
      http://www.indeed.com/cmp/General-Dynamics-Inf...
      3.7
    
    
      3
      3.2
      3.8
      2.9
      3.4
      3.5
      http://www.indeed.com/cmp/Mintel
      3.7
    
    
      4
      3.8
      3.8
      3.6
      3.5
      4.0
      http://www.indeed.com/cmp/ADP
      3.8



In [ ]:



In [ ]:

	company_link	company_name	job_link	job_location	job_summary	job_title
0	http://www.indeed.com/cmp/Allegheny-General-Ho...	Allegheny General Hospital	http://www.indeed.com/rc/clk?jk=805f6ed65b49dd...	Pittsburgh, PA	Allegheny Health Network’s clinical expertise ...	Research Data Analyst - Neurosurgery
1	http://www.indeed.com/cmp/Marathon-Oil	Marathon Petroleum Corporation	http://www.indeed.com/rc/clk?jk=b28767d68e08b1...	Findlay, OH	The vision for the Advanced Analytics team is ...	Data Scientist
2	http://www.indeed.com/cmp/Oracle	Oracle	http://www.indeed.com/rc/clk?jk=56e6eae3c307f9...	United States	The Cloud Data Curation team is looking for Sc...	Data Scientist 5
3	http://www.indeed.com/cmp/Google	Google	http://www.indeed.com/rc/clk?jk=35f4fa4d806cc9...	Mountain View, CA	From creating experiments and prototyping impl...	Research Scientist, Google Brain (United States)
4	http://www.indeed.com/cmp/Childrens-Hospital-L...	Childrens Hospital Los Angeles	http://www.indeed.com/rc/clk?jk=ff4d4e425d9c76...	Los Angeles, CA	The Data Scientist conducts research in medica...	Data Scientist, VPICU

	compensation_benefits_rating	culture_rating	js_advancement_rating	management_rating	overall_rating	url	wl_balanace_rating
0	3.9	3.5	3.3	3.4	3.9	http://www.indeed.com/cmp/Population-Council	3.8
1	3.6	3.8	3.2	3.9	4.0	http://www.indeed.com/cmp/Ensco,-Inc.	4.1
2	3.9	3.6	3.3	3.4	3.8	http://www.indeed.com/cmp/General-Dynamics-Inf...	3.7
3	3.2	3.8	2.9	3.4	3.5	http://www.indeed.com/cmp/Mintel	3.7
4	3.8	3.8	3.6	3.5	4.0	http://www.indeed.com/cmp/ADP	3.8