Web Scraping Indeed Data Science Salaries

The goal of the following project was to acquire Indeed.com job listings, discover what I could about data scientist salaries from the job listings that did include salary data, and then predict whether salaries that did not include listings were above or below the median salary.

First, I imported all necessary libraries.


In [1]:
# libraries to import

# related to webscraping - to acquire data
import requests
import bs4
from bs4 import BeautifulSoup

# for working with and visualizing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# for modeling
from sklearn.cross_validation import cross_val_score, StratifiedKFold , train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# cleaning up the notebook
import warnings
warnings.filterwarnings('ignore')

Next, I wrote a scraper to obtain job postings for data scientist positions from several different markets.


In [2]:
# string of the indeed.com URL we want to search across cities
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

max_results_per_city = 1000

# each result is a set of 10 job listings from one page
results = []
# results city allows cleaner analysis of jobs by location later - what city was searched for?
result_city = []

# loop through set of cities to get job postings
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Boston', 'San+Jose',
    'San+Diego','San+Antonio','Portland', 'Phoenix', 'Denver', 'Houston', 'Washington+DC']):
        for start in range(0, max_results_per_city, 10):
        # Grab the results from the request
            url = url_template.format(city, start)
            job = requests.get(url)
        # Append to the full set of results
            b = BeautifulSoup(job.text)
            results.append(b)
            result_city.append(city)
            pass

The cell below creates an empty dataframe with columns for all the relevant information that I need, and then loops through the scraped results to add each job to the dataframe.


In [3]:
# create empty dataframe
indeed = pd.DataFrame(columns = ['title','location','search_city','company','salary'])

# loop through results list to extract information
for i in results:
    indx = results.index(i)
    for job in i.find_all('div', {'class' : ' row result'}):
        title = job.find('a', {'class':'turnstileLink'}).text
        location = job.find('span', {'class': 'location'}).text
        search_city = result_city[indx]
        try:
            company = job.find('span', {'class' : 'company'}).text.strip()
        except:
            company = 'NA'
        salary = job.find('nobr')
        # add result to end of dataframe
        indeed.loc[len(indeed)]=[title, location, search_city, company, salary]

Below creates a dataframe for all jobs without salaries, then drops duplicate records from that dataframe.


In [4]:
## Extracting all fields with missing salaries for analysis and estimation later

indeed.salary = indeed.salary.astype(str)
indeed_missing = indeed[indeed['salary'] == 'None']

## Drop duplicate scraped records
indeed_missing[['title','location','company','salary']].drop_duplicates(inplace = True)

Below creates a dataframe with only jobs that list annual salaries, then strips out all extra items from the salary string aside from the salary numbers and a '-' indicating if there is a salary range.

The second step removes the '+' from the cities that were searched for.


In [5]:
## Getting only annual salaries, stripping all information out aside from numbers and dash to mark range

indeed_year = indeed[indeed['salary'].str.contains("a year", na=False)]
indeed_year.salary = indeed_year.salary.replace('[/$<nobraye>,]', '', regex = True)

indeed_year.search_city = indeed_year.search_city.replace('+', ' ', regex = True)

The function below gets the mean of the two salaries if there is a range, and converts to a float if there is only one listed. The function is then applied to the dataframe, replacing the original salary.


In [6]:
## function to get average salaries when applicable
## try to split on dash, get mean

def sal_split(i):
    try:
        splt = i.split(' - ',1)
        low = float(splt[0])
        high = float(splt[1])
        return (low + high)/2
    except:
        return float(i)
    
## apply above function to all salaries in df
indeed_year['salary'] = indeed_year['salary'].apply(sal_split)

## dropping dupes
indeed_year[['title','location','company','salary']].drop_duplicates(inplace = True)

Export file to csv - will not run this cell, and will instead use the originally scraped listings.

indeed_year.to_csv('indeed.csv', encoding='utf-8')

Begin process for model building below

We need to calculate median salary to determine if listing with a salary is above or below the median. We are making this into a classification problem rather than a regression problem, since salaries can fluctuate and we do not have a very large sample of jobs to work with.


In [7]:
## Importing dataset of salaries that was previously saved
df = pd.read_csv('indeed.csv', index_col = 0)

## Found 303 records with salaries
print df.shape


(303, 5)

In [8]:
## calculate median
## lambda applies a 1 if salary is above median, 0 if below
med = np.median(df.salary)
print "The median salary is", med
df['high_low'] = df['salary'].map(lambda x: 1 if x > med else 0)


The median salary is 108702.5

The two graphs below show that San Jose and San Francisco have the highest mean salaries, while our dataset has the most job listings with salaries from New York.


In [9]:
df.groupby('search_city').salary.mean().sort_values(ascending = False).plot(kind='bar')
plt.xlabel('City')
plt.ylabel('Mean Salary')
plt.show()



In [10]:
df.search_city.value_counts().plot(kind='bar')
plt.xlabel('City')
plt.ylabel('Job Results')
plt.show()


The model is created below.

I tested several different classification methods, and several different metrics, but ultimately found that a logistic regression was most accurate, and that using a dummy variable for the search city and a tf-idf vectorizer of the job title with n-grams of 2 words provided the best results. 78.9% of salaries in our test set were correctly predicted using these features with a logistic regression.


In [11]:
## Text analysis of job titles

job_titles = df.title

## Using a count vectorizer to ID all individual words across job titles
## Limit ngrams to pairs of 2 words since titles are shorter
tfv = TfidfVectorizer(lowercase = True, strip_accents = 'unicode', ngram_range=(2,2), stop_words = 'english')
tfv_title = tfv.fit_transform(job_titles).todense()


# create dataframe on count vectorized data
title_counts = pd.DataFrame(tfv_title, columns = tfv.get_feature_names())

In [12]:
random_state = 43
X = pd.concat([pd.get_dummies(df['search_city'], drop_first = True).reset_index(), title_counts.reset_index()], axis = 1)
X.drop('index', axis = 1, inplace = True)
y = df['high_low']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = random_state)

In [13]:
logit = LogisticRegression(random_state = random_state)

logit_fit = logit.fit(X_train, y_train)
logit_y_pred = logit.predict(X_test)

print classification_report(y_test, logit_y_pred, target_names=['Low Salary', 'High Salary'])
print confusion_matrix(y_test, logit_y_pred)
print accuracy_score(y_test, logit_y_pred)


             precision    recall  f1-score   support

 Low Salary       0.79      0.82      0.80        40
High Salary       0.79      0.75      0.77        36

avg / total       0.79      0.79      0.79        76

[[33  7]
 [ 9 27]]
0.789473684211

In [15]:
coefs = logit_fit.coef_[0]
names = X.columns

print "Features sorted by their logistic regression coefficients:"
print sorted(zip(map(lambda x: round(x, 4), coefs), names), 
             reverse=True)


Features sorted by their logistic regression coefficients:
[(1.3831, 'San Jose'), (1.3705, u'quantitative analyst'), (1.3575, 'San Francisco'), (1.1074, u'machine learning'), (1.0704, u'senior data'), (1.0361, u'data scientist'), (0.9986, u'quantitative research'), (0.9146, u'data engineer'), (0.91, u'multiple vacancies'), (0.91, u'analyst multiple'), (0.8576, u'data architect'), (0.7954, u'director data'), (0.7136, u'learning engineer'), (0.7103, u'risk analyst'), (0.7103, u'quantitative risk'), (0.692, u'data science'), (0.6565, u'software engineer'), (0.6256, u'lead data'), (0.5902, 'Philadelphia'), (0.4971, u'supervisory statistician'), (0.4554, u'scientist machine'), (0.4402, u'big data'), (0.4184, u'analyst manager'), (0.3997, u'scientist fraud'), (0.3994, 'New York'), (0.3889, 'Chicago'), (0.3712, u'learning researcher'), (0.3472, u'government enterprise'), (0.3472, u'enterprise architect'), (0.3317, u'intelligence scientist'), (0.3317, u'business intelligence'), (0.3284, u'svp quantitative'), (0.3264, u'scientist commerce'), (0.3096, u'principal data'), (0.3088, u'scientist program'), (0.3088, u'program lead'), (0.3073, u'visualization engineer'), (0.3049, u'hedge fund'), (0.3019, u'lead quantitative'), (0.3014, u'multi billion'), (0.3014, u'analyst multi'), (0.2938, u'test manager'), (0.2938, u'integrated test'), (0.2898, u'scientist leading'), (0.2813, 'Boston'), (0.2719, u'data visualization'), (0.2616, u'software data'), (0.2616, u'senior software'), (0.2579, u'veteran military'), (0.2579, u'scientist veteran'), (0.2579, u'military connected'), (0.2499, u'scientist healthcare'), (0.2499, u'healthcare data'), (0.2486, u'virology unit'), (0.2486, u'unit manager'), (0.2486, u'ts sci'), (0.2486, u'supervisor virology'), (0.2486, u'specialist ts'), (0.2486, u'project delivery'), (0.2486, u'fs supervisor'), (0.2486, u'delivery specialist'), (0.2478, u'auto risk'), (0.2428, u'scientist recommendation'), (0.2422, u'life science'), (0.2422, u'commercial director'), (0.2404, u'data analytics'), (0.2404, u'biostatistician emeryville'), (0.2383, u'learning scientist'), (0.2369, u'statistical programmer'), (0.2369, u'principal statistical'), (0.2369, u'clincal principal'), (0.2364, u'corpus linguist'), (0.229, u'assistant director'), (0.2259, u'director life'), (0.2245, u'researcher machine'), (0.2245, u'quantitative researcher'), (0.2211, u'vitro dx'), (0.2211, u'learning vitro'), (0.2211, u'dx imaging'), (0.2154, u'scientist hft'), (0.2154, u'hpc scientist'), (0.2154, u'hft hedge'), (0.212, u'scientist fortune'), (0.2086, u'sr predictive'), (0.2086, u'president quantitative'), (0.2086, u'nyc fintech'), (0.2086, u'modeler nyc'), (0.2065, u'scientist financial'), (0.2065, u'financial services'), (0.2051, u'rapid development'), (0.2051, u'fintech rapid'), (0.2051, u'engineer fintech'), (0.1992, u'scientist graph'), (0.1992, u'graph analytics'), (0.1977, u'fortune 500'), (0.1977, u'500 company'), (0.1946, u'vice president'), (0.1945, u'predictive modeler'), (0.1939, u'stage startup'), (0.1939, u'spark hadoop'), (0.1939, u'scala growth'), (0.1939, u'hadoop scala'), (0.1939, u'growth stage'), (0.1834, u'engineer big'), (0.1834, u'data python'), (0.183, u'dollar hedge'), (0.183, u'billion dollar'), (0.1819, u'scientist bio'), (0.1819, u'bio tech'), (0.1787, u'geospatial data'), (0.1758, u'associate director'), (0.17, u'leading insurance'), (0.17, u'insurance business'), (0.1673, u'engineer machine'), (0.1672, u'scientist global'), (0.1672, u'global bank'), (0.1672, u'engineer data'), (0.1672, u'director business'), (0.1672, u'dev engineer'), (0.1672, u'business development'), (0.1672, u'bank 150k'), (0.1639, u'director analytics'), (0.1614, u'scientist digital'), (0.1614, u'digital content'), (0.1614, u'content advertising'), (0.1599, u'principal software'), (0.1599, u'lead software'), (0.1599, u'engineer lead'), (0.1546, u'mathematical statistician'), (0.1485, u'quant equities'), (0.1485, u'hedge fu'), (0.1485, u'fundamental quant'), (0.1485, u'equities hedge'), (0.1485, u'analyst fundamental'), (0.1476, u'scientist natural'), (0.1407, u'tech platform'), (0.1407, u'platform comp'), (0.1407, u'leading data'), (0.1407, u'driven tech'), (0.1407, u'data driven'), (0.1407, u'comp eq'), (0.1402, u'systematic quantitative'), (0.1402, u'senior systematic'), (0.1402, u'billion dolla'), (0.1387, u'scientist medical'), (0.1387, u'scientist human'), (0.1387, u'medical ai'), (0.1387, u'human computation'), (0.1363, u'scientist innovation'), (0.1363, u'innovation team'), (0.1349, u'recommendation engine'), (0.1345, u'staff data'), (0.1345, u'scientist deep'), (0.1319, u'talend developer'), (0.1319, u'data talend'), (0.118, u'wearable company'), (0.118, u'fitness wearable'), (0.1159, u'scientist iot'), (0.1159, u'network security'), (0.1159, u'iot network'), (0.115, u'senior machine'), (0.1114, u'scientist fundamental'), (0.1114, u'fundamental hedge'), (0.1101, u'scientist fitness'), (0.1074, u'learning data'), (0.1052, u'lead devops'), (0.1052, u'devops engineer'), (0.102, 'Seattle'), (0.1006, u'staff machine'), (0.0915, u'vision machine'), (0.0915, u'computer vision'), (0.0905, u'engineer ad'), (0.0905, u'ad tech'), (0.0855, u'security experience'), (0.0855, u'scientist security'), (0.0855, u'experience huge'), (0.0776, u'recommendation platform'), (0.0776, u'engineer recommendation'), (0.0741, u'learning artificial'), (0.0741, u'ios software'), (0.0741, u'artificial intelli'), (0.0518, u'opir scientist'), (0.0278, u'risk data'), (0.0, u'wellness firm'), (0.0, u'water demand'), (0.0, u'vp data'), (0.0, u'visualization developer'), (0.0, u'technology specialist'), (0.0, u'team leader'), (0.0, u'statistician level'), (0.0, u'statistician fortune'), (0.0, u'sr machine'), (0.0, u'sponsor funded'), (0.0, u'specialist data'), (0.0, u'source data'), (0.0, u'software engineers'), (0.0, u'share data'), (0.0, u'senior statistician'), (0.0, u'senior product'), (0.0, u'senior predictive'), (0.0, u'scientist research'), (0.0, u'scientist python'), (0.0, u'scientist nlp'), (0.0, u'scientist episode'), (0.0, u'scientist e2'), (0.0, u'scientist customer'), (0.0, u'science manager'), (0.0, u'science director'), (0.0, u'scala java'), (0.0, u'research statistical'), (0.0, u'research specialist'), (0.0, u'research project'), (0.0, u'research investigation'), (0.0, u'research evaluation'), (0.0, u'research enrollment'), (0.0, u'research chemist'), (0.0, u'research assessment'), (0.0, u'reprting researc'), (0.0, u'python hadoop'), (0.0, u'python flask'), (0.0, u'python data'), (0.0, u'python algo'), (0.0, u'projections manager'), (0.0, u'project coordinator'), (0.0, u'programmer associate'), (0.0, u'product manager'), (0.0, u'president research'), (0.0, u'predictive modeling'), (0.0, u'persistant infrared'), (0.0, u'overhead persistant'), (0.0, u'open source'), (0.0, u'new d3'), (0.0, u'mutual surety'), (0.0, u'molecular laboratory'), (0.0, u'modeling director'), (0.0, u'modeler sr'), (0.0, u'mid level'), (0.0, u'mechanical market'), (0.0, u'mathematical statisticians'), (0.0, u'marketing statistician'), (0.0, u'manager global'), (0.0, u'manager financial'), (0.0, u'life sciences'), (0.0, u'level python'), (0.0, u'learning application'), (0.0, u'laboratory supervisor'), (0.0, u'junior statistical'), (0.0, u'junior mid'), (0.0, u'job 2133'), (0.0, u'java python'), (0.0, u'iv business'), (0.0, u'ios learning'), (0.0, u'investigation analyst'), (0.0, u'infrared scientist'), (0.0, u'information technology'), (0.0, u'ii department'), (0.0, u'health 7444'), (0.0, u'head data'), (0.0, u'hadoop machine'), (0.0, u'global travel'), (0.0, u'funded administrative'), (0.0, u'flask angula'), (0.0, u'fitness wellness'), (0.0, u'financial reprting'), (0.0, u'evaluation group'), (0.0, u'enrollment informat'), (0.0, u'engineers scala'), (0.0, u'engineers junior'), (0.0, u'engineer auto'), (0.0, u'director vp'), (0.0, u'developer new'), (0.0, u'developer data'), (0.0, u'department research'), (0.0, u'demand projections'), (0.0, u'data sciences'), (0.0, u'data management'), (0.0, u'd3 based'), (0.0, u'customer analytic'), (0.0, u'coordinator data'), (0.0, u'compliance team'), (0.0, u'chemical industry'), (0.0, u'bike share'), (0.0, u'based development'), (0.0, u'associate sponsor'), (0.0, u'associate analytics'), (0.0, u'assistant vice'), (0.0, u'assessment data'), (0.0, u'applications developer'), (0.0, u'analytics research'), (0.0, u'analytics analyst'), (0.0, u'analytical chemist'), (0.0, u'analytic manager'), (0.0, u'analyst scientist'), (0.0, u'analyst programmer'), (0.0, u'analyst job'), (0.0, u'analyst iv'), (0.0, u'analyst ii'), (0.0, u'analyst health'), (0.0, u'analyst experience'), (0.0, u'algo developer'), (-0.008, 'Denver'), (-0.0385, 'Washington DC'), (-0.0584, u'industry market'), (-0.0626, u'automotive industry'), (-0.077, u'iii behavioral'), (-0.077, u'chief research'), (-0.077, u'analyst iii'), (-0.077, u'analyst behavioral'), (-0.0802, u'specially funded'), (-0.0802, u'institutional research'), (-0.0802, u'funded open'), (-0.0802, u'analyst specially'), (-0.0832, u'management research'), (-0.0834, 'San Diego'), (-0.088, u'proj spct'), (-0.088, u'prg proj'), (-0.088, u'analyst prg'), (-0.0885, u'ms clinical'), (-0.0885, u'lc ms'), (-0.0892, u'intelligence management'), (-0.0892, u'competitive intelligence'), (-0.0989, u'web security'), (-0.0989, u'sr data'), (-0.0989, u'security research'), (-0.1018, u'investment research'), (-0.1018, u'investigations research'), (-0.1042, u'research applications'), (-0.1042, u'project analyst'), (-0.1042, u'coordinator research'), (-0.1042, u'analyst coordinator'), (-0.1096, 'Austin'), (-0.1134, u'technologist data'), (-0.1134, u'scientist toxicologist'), (-0.1134, u'medical technologist'), (-0.1134, u'data systems'), (-0.1134, u'cls certifying'), (-0.1134, u'certifying scientist'), (-0.1137, u'scientist assistant'), (-0.1137, u'assistant ii'), (-0.1155, u'budget research'), (-0.1166, u'value management'), (-0.1166, u'management cost'), (-0.1166, u'earned value'), (-0.1166, u'cost research'), (-0.118, u'research associate'), (-0.118, u'associate statistical'), (-0.1191, u'operations research'), (-0.1191, u'general engineer'), (-0.1191, u'engineer operations'), (-0.1191, u'analyst interdisciplin'), (-0.1202, u'natural language'), (-0.1202, u'language processing'), (-0.1296, u'teaching learning'), (-0.1296, u'analyst teaching'), (-0.1327, u'research evaluatio'), (-0.1327, u'mental health'), (-0.1327, u'health research'), (-0.1327, u'bureau mental'), (-0.1368, u'credit risk'), (-0.1368, u'analyst credit'), (-0.1378, u'genetic counselor'), (-0.1378, u'counselor clinical'), (-0.1378, u'clinical genomics'), (-0.1388, u'statistical clerk'), (-0.1388, u'senior statistical'), (-0.139, u'operations development'), (-0.139, u'network operations'), (-0.139, u'development progra'), (-0.139, u'cryptanalytic computer'), (-0.139, u'computer network'), (-0.1393, u'corporate account'), (-0.1397, u'registry cohort'), (-0.1397, u'panel maintenance'), (-0.1397, u'maintenance analyst'), (-0.1397, u'health registry'), (-0.1397, u'health registr'), (-0.1397, u'cohort analyst'), (-0.14, u'tx time'), (-0.14, u'learning dallas'), (-0.14, u'java machine'), (-0.14, u'dallas tx'), (-0.1436, u'behavioral sciences'), (-0.1465, u'assistant research'), (-0.1466, u'city research'), (-0.1488, u'analyst junior'), (-0.1525, u'research policy'), (-0.1544, u'specialist space'), (-0.1544, u'space grant'), (-0.1557, u'jr statistical'), (-0.1572, u'scientist bureau'), (-0.1572, u'public health'), (-0.1572, u'health laborat'), (-0.1572, u'bureau public'), (-0.16, u'upstream mammalian'), (-0.16, u'supervising clinical'), (-0.16, u'senior scientist'), (-0.16, u'scientist upstream'), (-0.16, u'scientist ii'), (-0.16, u'mammalian cell'), (-0.16, u'culture developmen'), (-0.16, u'cell culture'), (-0.1647, u'sas program'), (-0.1647, u'program statistical'), (-0.1647, u'data mining'), (-0.1647, u'analyst data'), (-0.1697, u'vital statistics'), (-0.1697, u'surveillance analyst'), (-0.1697, u'senior environmental'), (-0.1697, u'quality assurance'), (-0.1697, u'environmental surveillance'), (-0.1697, u'bureau vital'), (-0.1697, u'bureau childca'), (-0.1697, u'assurance analyst'), (-0.1705, u'scientist hadoop'), (-0.1705, u'hadoop duluth'), (-0.1705, u'duluth georgia'), (-0.1744, 'Houston'), (-0.1752, u'viz pythonic'), (-0.1752, u'stack dev'), (-0.1752, u'pythonic design'), (-0.1752, u'dev data'), (-0.1752, u'data viz'), (-0.1827, u'manager environemental'), (-0.1827, u'environemental cleanup'), (-0.1827, u'cleanup projects'), (-0.194, u'technician laboratory'), (-0.194, u'laboratory technician'), (-0.194, u'extractions technician'), (-0.1964, u'program manager'), (-0.2031, u'sr bioinformatics'), (-0.2031, u'bioinformatics programmer'), (-0.2053, u'liberty mutual'), (-0.2053, u'analyst liberty'), (-0.2084, u'postdoctoral researcher'), (-0.2084, u'fishery biologist'), (-0.2104, u'research data'), (-0.2142, u'reporting analyst'), (-0.2142, u'data reporting'), (-0.2157, u'business manager'), (-0.218, u'sr market'), (-0.218, u'market risk'), (-0.2183, u'senior epidemiologist'), (-0.2198, u'post doctoral'), (-0.2198, u'lab assistant'), (-0.2198, u'doctoral scientist'), (-0.2198, u'biochemistry lab'), (-0.2201, u'mutual investments'), (-0.2212, u'policy research'), (-0.2214, u'deep learning'), (-0.2236, u'senior application'), (-0.2236, u'application developer'), (-0.2246, u'scientist ar'), (-0.2246, u'ar som'), (-0.2302, u'laboratory scientist'), (-0.2312, u'manager radars'), (-0.2364, u'travel company'), (-0.2376, u'user operations'), (-0.2376, u'clinical user'), (-0.2386, u'database engineer'), (-0.2475, u'scientist policy'), (-0.2475, u'policy strategy'), (-0.2475, u'modeler data'), (-0.2532, u'scientist specialist'), (-0.2535, u'scientist ranked'), (-0.2535, u'ranked travel'), (-0.2537, u'supply chain'), (-0.2537, u'scientist supply'), (-0.26, u'worker client'), (-0.26, u'social worker'), (-0.26, u'clinical social'), (-0.26, u'client care'), (-0.26, u'care navigation'), (-0.2606, u'world trade'), (-0.2606, u'trade center'), (-0.2606, u'center health'), (-0.2606, u'analyst world'), (-0.2611, u'sales representative'), (-0.2611, u'inside sales'), (-0.2614, u'laboratory surveyor'), (-0.2631, u'senior research'), (-0.2705, u'scientist apache'), (-0.2705, u'data tools'), (-0.2705, u'apache big'), (-0.273, u'modeler policy'), (-0.2748, u'statistical research'), (-0.2764, u'processing data'), (-0.2771, u'statistical data'), (-0.2771, u'staff assistant'), (-0.2771, u'research staff'), (-0.2771, u'research idnyc'), (-0.2771, u'manager statistical'), (-0.2771, u'director research'), (-0.3108, u'physical scientist'), (-0.3108, u'health scientist'), (-0.3176, u'laboratory aide'), (-0.3176, u'behavioral scientist'), (-0.3204, u'account manager'), (-0.3386, u'level gis'), (-0.3386, u'gis technician'), (-0.3386, u'entry level'), (-0.3473, u'statistical analyst'), (-0.3693, u'program assistant'), (-0.3847, u'learning internship'), (-0.3847, u'learning deep'), (-0.3856, u'analyst machine'), (-0.3919, u'technical researcher'), (-0.3919, u'research director'), (-0.3967, u'python software'), (-0.3969, u'policy analyst'), (-0.4178, u'analyst bureau'), (-0.4422, u'clinical laboratory'), (-0.4744, 'Dallas'), (-0.4867, u'junior data'), (-0.5299, u'project manager'), (-0.5342, u'research scientist'), (-0.5508, u'senior quantitative'), (-0.5815, u'stack designer'), (-0.61, 'Phoenix'), (-0.6634, 'San Antonio'), (-0.6794, 'Portland'), (-0.7656, 'Los Angeles'), (-0.7729, u'market research'), (-1.257, u'data analyst'), (-1.5387, u'research analyst')]

The search locations San Jose and San Francisco were the most indicative of a salary being high paying above the median, while the word pairs 'quantitative analyst' and 'machine learning' in the job title were most indicative of a high paying salary.

Phoenix, San Antonio, Portland, and Los Angeles had similar negative coefficients, indicating that they have about equal weight in determining that a job will have a lower paying salary below the median. The word pairs 'data analyst' and 'research analyst' were the most indicative of a lower paying salary.

Missing Salary Predictions

Below, we will predict if the remaining job postings that did not list salaries will be low or high paying using the above model. The data will need to be transformed so that the search location is a dummy variable, and the job titles are converted using the tf-idf vectorizer fit above.


In [31]:
indeed_missing.search_city.value_counts().plot(kind='bar')
plt.xlabel('City')
plt.ylabel('Job Results - No Listed Salary')
plt.show()



In [32]:
print("There are {} total job postings with missing salaries".format(len(indeed_missing)))


There are 3428 total job postings with missing salaries

Transforming job listings so they can be predicted as high or low paying...


In [33]:
## Count Vectorizing missing salary job titles
## Will use same words from the fit count vectorizer from our training data to assist in classification

missing_salary_titles = indeed_missing.title
missing_title_counts = pd.DataFrame(tfv.transform(missing_salary_titles).todense(), columns=tfv.get_feature_names())

## Get dummies for job location to allow it to be predicted

miss_city_dum = pd.get_dummies(indeed_missing['search_city'], drop_first = True)
miss_city_dum.reset_index(inplace = True)

## Combine two dfs

missing_sals = pd.concat([miss_city_dum, missing_title_counts], axis = 1)
missing_sals.drop('index', axis = 1, inplace = True)

In [34]:
logit_title_pred = logit_fit.predict(missing_sals)

indeed_missing['high_low'] = logit_title_pred

print(indeed_missing.head())


                                        title  \
0                              Data Scientist   
1                            Research Analyst   
2  Data Scientist (Corporate Experience Must)   
3                            Research Analyst   
4                               Data Engineer   

                            location search_city  \
0  Houston, TX 77002 (Downtown area)     Houston   
1                        Houston, TX     Houston   
2                        Houston, TX     Houston   
3                        Houston, TX     Houston   
4  Houston, TX 77002 (Downtown area)     Houston   

                              company salary  high_low  
0                    General Electric   None         1  
1                 KPRC - TV Channel 2   None         0  
2  Genpact headstrong capital markets   None         1  
3                Pennwell Corporation   None         0  
4               The Talance Group, LP   None         1  

In [42]:
## Aggregating information by city to plot

indeed_missing_agg = pd.concat([indeed_missing.groupby('search_city').sum(), 
                                pd.DataFrame(indeed_missing.groupby('search_city').title.count())], axis = 1)

## Getting percent of jobs that are high paying by city

indeed_missing_agg['high_pct'] = indeed_missing_agg.high_low / indeed_missing_agg.title

indeed_missing_agg.head()


Out[42]:
high_low title high_pct
search_city
Atlanta 47 194 0.242268
Austin 26 59 0.440678
Boston 94 373 0.252011
Chicago 68 184 0.369565
Dallas 24 82 0.292683

In [43]:
indeed_missing_agg.high_pct.plot.bar()
plt.xlabel('City')
plt.ylabel('Rate of High Salary Jobs')
plt.title('Rate of High Salary Jobs Predicted by City')
plt.show()


Summary

Using the model created, most jobs in San Jose, San Francisco, and Philadelphia are expected to pay above the median salary for all data scientist related jobs in these markets. Most jobs in other cities are expected to be below the median salary for these jobs.