Scraping Reviews

This notebook shows how to scrape reviews from Indeed and Glassdoor. To visualize the ratings, go to the Ratings notebook; for topic modeling, go to the Topic Modeling notebook.

Before you start, make sure MongoDB is up and running and that the pymongo package is installed.
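
A quick way to confirm that the database is reachable before scraping is to try opening a TCP connection to it. The sketch below is only an illustration: it assumes MongoDB is listening on its default address (localhost, port 27017, which is also what MongoClient() uses when called without arguments), and the helper name mongo_is_reachable is made up for this example. It only checks that something accepts connections on that port, not that it is actually MongoDB.

```python
import socket


def mongo_is_reachable(host="localhost", port=27017, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; 27017 is MongoDB's default port.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, timed out, or otherwise failed.
        return False
    conn.close()
    return True


if not mongo_is_reachable():
    print("MongoDB does not seem to be running on localhost:27017")
```

If the message is printed, start the MongoDB server before running the cells below.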

Parameters


In [1]:
# Search settings
KEYWORD_FILTER = "Data Scientist"
LOCATION_FILTER = "New York City, NY"

# Other settings
MAX_PAGES_COMPANIES = 500
MAX_PAGES_REVIEWS = 500

In [3]:
import os
import re
from datetime import datetime
from pymongo import MongoClient
import indeed
import glassdoor
import utils


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-be267ff36641> in <module>()
      2 import re
      3 from datetime import datetime
----> 4 from pymongo import MongoClient
      5 import indeed
      6 import glassdoor

ImportError: No module named pymongo

In [4]:
# DB settings
client = MongoClient()
indeed_db = client.indeed
indeed_jobs = indeed_db.jobs
indeed_reviews = indeed_db.reviews
glassdoor_db = client.glassdoor
glassdoor_reviews = glassdoor_db.reviews


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-5e444593d89b> in <module>()
      1 # DB settings
----> 2 client = MongoClient()
      3 indeed_db = client.indeed
      4 indeed_jobs = indeed_db.jobs
      5 indeed_reviews = indeed_db.reviews

NameError: name 'MongoClient' is not defined

Scrape job listings from Indeed


In [ ]:
jobs = indeed.get_jobs(KEYWORD_FILTER, LOCATION_FILTER, indeed_jobs, MAX_PAGES_COMPANIES)

Scrape company reviews from Indeed

This collects reviews for every company that appears in the scraped job listings.


In [ ]:
indeed.get_all_company_reviews(jobs, indeed_reviews, MAX_PAGES_REVIEWS)

In [5]:
indeed_reviews.find_one()


Out[5]:
{u'_id': ObjectId('54f763e3bcccd9197dbbdb91'),
 u'company_name': u'American Express',
 u'date': datetime.datetime(2013, 4, 3, 0, 0),
 u'employment_status': u'\xa0(Former Employee),\xa0',
 u'job_title': u'Shipping Clerk',
 u'location': u'Piscataway, New Jersey',
 u'rating': u'5.0',
 u'review_cons': u'Cons: long hours doing the christmas season',
 u'review_pros': u'Pros: you are able to apply for a credit card',
 u'review_text': u'If you are looking for a job to retire from and the work is not hard,then American Express is that company.',
 u'review_title': u'A Company with a future',
 u'stars': {u'Compensation/Benefits': 5,
  u'Job Culture': 5,
  u'Job Security/Advancement': 5,
  u'Job Work/Life Balance': 5,
  u'Management': 5}}

Fix Company Names

Indeed's company names are inconsistent: the same company can be listed several times with different spellings, typos, or extra words. Look through the company names and fix them before continuing. The utils module has a function that takes a dictionary mapping each old name to its new name (names not in the dictionary are left as is). See below for an example (the one I used had over 30 name fixes).


In [17]:
companies = list(set(utils.get_company_names(indeed_reviews)))
companies[:5]


Out[17]:
[u'Financial Times',
 u'McGraw Hill Financial',
 u'The Nielsen Company',
 u'Continuum',
 u'RUSSELL INVESTMENTS']

In [ ]:
fix_companies = {
    'Argus, ISO, Verisk Analytics, Verisk Climate, Veri...': 'Verisk Analytics',
    'Barclays Investment Bank': 'Barclays',
    'Dun & Brandstreet': u'Dun & Bradstreet',
    'Dun & Broadstreet': u'Dun & Bradstreet',
    'World Business Lenders - New York, NY': 'World Business Lenders',
}
utils.fix_all_company_names(indeed_reviews, fix_companies)
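
The renaming rule described above (look the name up in the dictionary and keep it unchanged if it is not there) can be sketched without a database. This is not the code in the utils module, whose implementation is not shown here; apply_name_fix is a hypothetical stand-in that only illustrates the lookup on a plain list of names.

```python
def apply_name_fix(name, fixes):
    """Return the corrected name, or the original if no fix is listed.

    Hypothetical helper; it mirrors the described behaviour, not utils itself.
    """
    return fixes.get(name, name)


fixes = {
    "Dun & Brandstreet": "Dun & Bradstreet",
    "Barclays Investment Bank": "Barclays",
}
names = ["Dun & Brandstreet", "Barclays Investment Bank", "Financial Times"]

# "Financial Times" has no entry in the dictionary, so it passes through.
fixed_names = [apply_name_fix(name, fixes) for name in names]
```

Here fixed_names ends up as ["Dun & Bradstreet", "Barclays", "Financial Times"].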

In [ ]:
companies = list(set(utils.get_company_names(indeed_reviews)))

Scrape Glassdoor


In [ ]:
visited_companies, failed_companies = glassdoor.get_all_company_reviews(
    companies, glassdoor_reviews, MAX_PAGES_REVIEWS)

Final Fixes

Look at the failed companies. Often they couldn't be found on Glassdoor because of an issue with their name. You might need to fix the names again (search Glassdoor to find the name a company is listed under). Beware of encoding issues: utils.fix_company_name accepts an optional flag that encodes the company names to ASCII.

Note: scraping Glassdoor is usually quite a bit slower than scraping Indeed because there are many more reviews (e.g. Goldman Sachs has 198 pages!).
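
To show what encoding a company name to ASCII can look like, here is one common approach using only the standard library. How the flag in utils.fix_company_name actually does this is not shown in this notebook, so treat the helper below (to_ascii, a made-up name) and the sample company name as assumptions for illustration only.

```python
import unicodedata


def to_ascii(name):
    """Strip accents and other non-ASCII characters from a unicode string.

    Hypothetical helper: NFKD normalization splits accented letters into a
    base letter plus a combining mark, and the ASCII encode step then drops
    whatever cannot be represented.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Illustrative name, not taken from the scraped data.
ascii_name = to_ascii(u"Soci\u00e9t\u00e9 G\u00e9n\u00e9rale")
```

With this input, ascii_name is "Societe Generale". Characters with no ASCII equivalent are dropped entirely, so check the result before searching with it.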


In [ ]:
# fix_companies = {u'SigmaTek':u'SigmaTek Consulting LLC',
#                 }
# utils.fix_all_company_names(indeed_reviews, fix_companies)
# fixed_failed_companies = [utils.fix_company_name(company, fix_companies, True)
#                           for company in failed_companies]
# visited_companies2, failed_companies = glassdoor.get_all_company_reviews(fixed_failed_companies, 
#                                               glassdoor_reviews, MAX_PAGES_REVIEWS)

Here I would do one last check to see which companies were scraped from both Glassdoor and Indeed. Occasionally the wrong company might have been scraped on Glassdoor.


In [ ]:
glassdoor_companies = set(utils.get_company_names(glassdoor_reviews))
indeed_companies = set(utils.get_company_names(indeed_reviews))

# Remove the extra companies:
extra_companies = glassdoor_companies - indeed_companies
for company in extra_companies:
    glassdoor_reviews.remove({'company' : company})

print "Missing companies", indeed_companies - glassdoor_companies
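
The two set differences in the cell above point in opposite directions, which is easy to mix up. A toy example with in-memory sets makes the direction explicit; the names below are sample values chosen for illustration ("Continuum Analytics" in particular is invented), not results from the scrape.

```python
indeed_names = {"Barclays", "Financial Times", "Continuum"}
glassdoor_names = {"Barclays", "Financial Times", "Continuum Analytics"}

# On Glassdoor but not on Indeed: probably the wrong company was scraped.
extra = glassdoor_names - indeed_names

# On Indeed but not on Glassdoor: still needs to be scraped or renamed.
missing = indeed_names - glassdoor_names
```

Here extra contains only "Continuum Analytics" and missing contains only "Continuum".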

Done!

Now all of the data is in the MongoDB database. To visualize the ratings, go to the Ratings notebook; for topic modeling, go to the Topic Modeling notebook.