This notebook shows how to scrape reviews from Indeed and Glassdoor. To visualize the ratings, go to the Ratings notebook; for topic modeling, go to the Topic Modeling notebook.
Before you start, make sure MongoDB is up and running.
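A quick way to check that something is listening before running the cells is to try opening a TCP connection. This is only a minimal sketch: `mongo_port_open` is a hypothetical helper (not part of this project), and it assumes MongoDB runs on localhost with its default port 27017, which is also what `MongoClient()` connects to when called without arguments. It only confirms that the port accepts connections, not that the process behind it is MongoDB.

```python
import socket


def mongo_port_open(host="localhost", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it checks that the port accepts connections,
    not that the server behind it is actually MongoDB.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.error:
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True
```

Calling `mongo_port_open()` before the DB settings cell should return `True` when a local server is accepting connections on the default port.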
In [1]:
# Search settings
KEYWORD_FILTER = "Data Scientist"
LOCATION_FILTER = "New York City, NY"
# Other settings
MAX_PAGES_COMPANIES = 500
MAX_PAGES_REVIEWS = 500
In [3]:
import os
import re
from datetime import datetime
from pymongo import MongoClient
import indeed
import glassdoor
import utils
In [4]:
# DB settings
client = MongoClient()
indeed_db = client.indeed
indeed_jobs = indeed_db.jobs
indeed_reviews = indeed_db.reviews
glassdoor_db = client.glassdoor
glassdoor_reviews = glassdoor_db.reviews
In [ ]:
jobs = indeed.get_jobs(KEYWORD_FILTER, LOCATION_FILTER, indeed_jobs, MAX_PAGES_COMPANIES)
In [ ]:
indeed.get_all_company_reviews(jobs, indeed_reviews, MAX_PAGES_REVIEWS)
In [5]:
indeed_reviews.find_one()
Out[5]:
Indeed's company names are inconsistent: the same company can be listed several times with different spellings, typos, or extra words. You need to look through the company names and fix them. The utils module has a function that takes a dictionary mapping each old name to its new name (names not in the dictionary are left as is). See below for an example (the one I used had over 30 name fixes).
In [17]:
companies = list(set(utils.get_company_names(indeed_reviews)))
companies[:5]
Out[17]:
In [ ]:
fix_companies = {
    'Argus, ISO, Verisk Analytics, Verisk Climate, Veri...': 'Verisk Analytics',
    'Barclays Investment Bank': 'Barclays',
    'Dun & Brandstreet': u'Dun & Bradstreet',
    'Dun & Broadstreet': u'Dun & Bradstreet',
    'World Business Lenders - New York, NY': 'World Business Lenders',
}
utils.fix_all_company_names(indeed_reviews, fix_companies)
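The renaming rule itself (use the new name if the old one is in the dictionary, otherwise keep the name as is) can be sketched as a dictionary lookup with a default. This is an illustration only: `rename_company` is a hypothetical name, and the actual utils functions, which also update the Mongo collection, may be implemented differently.

```python
def rename_company(name, fixes):
    """Return the corrected name if `name` is a key in `fixes`, else `name` unchanged.

    Hypothetical stand-in for the per-name lookup; not the utils implementation.
    """
    return fixes.get(name, name)


fixes = {
    "Barclays Investment Bank": "Barclays",
    "Dun & Brandstreet": "Dun & Bradstreet",
}
names = ["Barclays Investment Bank", "Dun & Brandstreet", "World Business Lenders"]

# Names without an entry in `fixes` pass through untouched.
cleaned = [rename_company(n, fixes) for n in names]
print(cleaned)
```

Here the first two names are replaced and "World Business Lenders" is returned unchanged because it has no entry in the dictionary.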
In [ ]:
companies = list(set(utils.get_company_names(indeed_reviews)))
In [ ]:
visited_companies, failed_companies = glassdoor.get_all_company_reviews(
    companies, glassdoor_reviews, MAX_PAGES_REVIEWS)
Look at the failed companies. Often they couldn't be found on Glassdoor because of an issue with their name. You might need to fix the names again (and search Glassdoor for the name a company is listed under). Beware of encoding issues: utils.fix_company_name accepts an optional flag that encodes the company names to ASCII.
Note: scraping Glassdoor is usually quite a bit slower than scraping Indeed because there are many more reviews (e.g. Goldman Sachs has 198 pages!).
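As a rough idea of what encoding a company name to ASCII can involve, here is a minimal sketch using only the standard library. `to_ascii` is a hypothetical helper, and I am assuming a "strip accents, drop the rest" policy; the flag in utils.fix_company_name may behave differently.

```python
import unicodedata


def to_ascii(name):
    """Decompose accented characters, then drop everything that is not ASCII.

    Hypothetical helper; expects a unicode string (a `u'...'` literal in Python 2).
    """
    # NFKD splits e.g. "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # The combining marks (and any other non-ASCII characters) are discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Soci\u00e9t\u00e9 G\u00e9n\u00e9rale"))
```

Characters with no ASCII decomposition are removed entirely, so it is worth eyeballing the result before using it as a search term.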
In [ ]:
# fix_companies = {u'SigmaTek':u'SigmaTek Consulting LLC',
# }
# utils.fix_all_company_names(indeed_reviews, fix_companies)
# fixed_failed_companies = [utils.fix_company_name(company,
#     fix_companies, True) for company in failed_companies]
# visited_companies2, failed_companies = glassdoor.get_all_company_reviews(fixed_failed_companies,
# glassdoor_reviews, MAX_PAGES_REVIEWS)
Here I would do one last check to see which companies were scraped from both Glassdoor and Indeed. Occasionally the wrong company might have been scraped on Glassdoor.
In [ ]:
glassdoor_companies = set(utils.get_company_names(glassdoor_reviews))
indeed_companies = set(utils.get_company_names(indeed_reviews))
# Remove the extra companies:
extra_companies = glassdoor_companies - indeed_companies
for company in extra_companies:
    glassdoor_reviews.remove({'company': company})
print "Missing companies", indeed_companies - glassdoor_companies
Now all of the data is in the Mongo database. To visualize the ratings go to the Ratings notebook and to do topic modeling go to the Topic Modeling notebook.