Romain Choukroun, Matthias Leroy, Alain Milliet, Hector Parmantier
What's the best developer job like? (depending on your own definition of "best")
We used the StackOverflow survey dataset provided on Kaggle; download it and you can run the notebook directly. It contains about fifty thousand responses from a sample of the active StackOverflow population, covering 154 questions. This gives us tremendous insight into what makes a programmer unique, and also lets us answer a lot of interesting questions.
We check the distributions of all useful features, along with outliers and quantiles. Questions we could answer with this exploration:
What features are more correlated with satisfaction?
Does salary equate to happiness/fulfilment in your job?
How much is job satisfaction linked to education?
Are "gif" people more satisfied with their job than "jif" people?
What does the population that answered this survey look like?
Derive a metric to measure the distance between users
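One simple way to derive such a metric is sketched below: one-hot encode the categorical answers and turn the Pearson correlation between the resulting binary vectors into a distance. The column names and values here are purely illustrative placeholders, not the survey's actual schema.

```python
# Hedged sketch of a user-distance metric: one-hot encode categorical
# answers and use 1 - Pearson correlation between the binary vectors.
# The columns and values below are illustrative, not the survey schema.
import pandas as pd

users = pd.DataFrame({
    "Country": ["France", "India", "France"],
    "FormalEducation": ["Master's degree", "Bachelor's degree", "Master's degree"],
}, index=["u1", "u2", "u3"])

encoded = pd.get_dummies(users).astype(float)  # binary feature vectors
similarity = encoded.T.corr()                  # user-by-user correlation
distance = 1 - similarity                      # identical users -> distance 0
print(distance.round(2))
```

Identical users end up at distance 0, while users who disagree on every answer land at the maximum distance of 2.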
Data cleaning: categorizing values, checking their distributions, selecting columns, and removing bad values if needed.
The graph will be built the following way:
Users will be the nodes
Correlations between users (above a threshold) used as edges
The idea of the recommender system is to be able to recommend users to recruiters. To do so, we would simply check which existing node is the closest to the artificial one that we create for the chosen features a recruiter is looking for.
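A minimal sketch of that lookup, with purely illustrative columns and values (the real system works on the survey's encoded features): we append an artificial node holding the recruiter's requested features, one-hot encode everything together so the columns line up, and return the most correlated real user.

```python
# Hedged sketch of the recruiter query: build an "artificial" user from the
# requested features and return the closest real user correlation-wise.
# Column names and values are illustrative placeholders.
import pandas as pd

users = pd.DataFrame({
    "Country": ["France", "India", "Germany"],
    "HaveWorkedLanguage": ["Python", "JavaScript", "Python"],
}, index=["u1", "u2", "u3"])

request = {"Country": "Germany", "HaveWorkedLanguage": "Python"}

# Encode the real users and the artificial node together
full = pd.concat([users, pd.DataFrame(request, index=["query"])])
encoded = pd.get_dummies(full).astype(float)

# Correlate every real user's row against the artificial node
scores = encoded.drop("query").corrwith(encoded.loc["query"], axis=1)
print(scores.idxmax())  # closest existing user to the recruiter's request
```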
In [1]:
# Importing a few utilities
%config InlineBackend.figure_format = 'retina'
from helper_functions import *
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from scipy.stats import ks_2samp
from subprocess import check_output
import colorlover as cl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.plotly as py
import plotly.graph_objs as go
import seaborn as sns
import networkx as nx
import warnings
from sklearn import preprocessing
init_notebook_mode()
warnings.filterwarnings('ignore')
In [2]:
# Let's import the data
stack = pd.read_csv("data/survey_results_public.csv")
# We only keep the following columns
# for the analysis and the recommender system
kept_columns = ['Respondent', 'Professional', 'ProgramHobby', 'Country', 'University',
                'EmploymentStatus', 'FormalEducation', 'MajorUndergrad', 'CompanySize',
                'CompanyType', 'YearsProgram', 'YearsCodedJob', 'DeveloperType',
                'WebDeveloperType', 'NonDeveloperType', 'CareerSatisfaction',
                'JobSatisfaction', 'PronounceGIF', 'ProblemSolving', 'BuildingThings',
                'LearningNewTech', 'BoringDetails', 'JobSecurity', 'DiversityImportant',
                'FriendsDevelopers', 'WorkPayCare', 'ChallengeMyself', 'ImportantBenefits',
                'ClickyKeys', 'Overpaid', 'TabsSpaces', 'EducationImportant',
                'EducationTypes', 'SelfTaughtTypes', 'WorkStart', 'HaveWorkedLanguage',
                'WantWorkLanguage', 'IDE', 'AuditoryEnvironment', 'Methodology',
                'EquipmentSatisfiedMonitors', 'StackOverflowSatisfaction',
                'StackOverflowFoundAnswer', 'StackOverflowCopiedCode',
                'StackOverflowWhatDo', 'Gender', 'HighestEducationParents', 'Race',
                'Salary', 'ExpectedSalary']
stack = stack[kept_columns]
stack.set_index("Respondent", inplace=True)
stack.head()
Out[2]:
In [3]:
# Splitting the dataframe into
# students and professionals
stack = stack[stack.apply(lambda row: row_filter(stack,row), axis=1)]
prof_stack = stack[stack.Professional == "Professional developer"]
stud_stack = stack[stack.Professional == "Student"]
metadata = pd.read_csv("data/survey_results_schema.csv")
metadata.head()
Out[3]:
In this section we explore different columns of our dataframe to get an idea of what our population looks like. We split professionals and students so that we can build two different recommender systems, depending on what the recruiter is looking for. For some features it makes sense to compare professionals and students, while for others it makes more sense to compute statistics on the whole dataset.
In [4]:
stack['Professional'].value_counts()[0:10].plot(kind='bar',figsize=(5,5))
plt.ylabel("Count")
plt.show()
This graph shows the proportion of professionals and students in our dataset: there are far more professionals than students.
In [5]:
plot_stud_prof(prof_stack=prof_stack, stud_stack=stud_stack, column='Country', title="Country")
On this graph, we can see that the percentage of professionals is not always proportional to the percentage of students in the same country. The United States and India are the two most obvious ones. While 30% of professionals come from the USA, less than 20% of the students come from there. On the other hand, only 5% of professionals in our data come from India while nearly 15% of students come from there.
In [6]:
DevTypes = pd.Series([devtype for sublist in [str(devtypes).replace(" ", "").split(";") for devtypes in stack['DeveloperType'].dropna()] for devtype in sublist])
DevTypes.value_counts(normalize=True)[0:10].plot(kind='bar',figsize=(7,7))
plt.ylabel("Ratio")
plt.show()
We observe that the largest part of the population in our dataset defines itself as Web Developer. But this result does not tell us much on its own, as there are multiple types of web developers.
In [7]:
prof_languages = pd.Series([lang for sublist in [str(langs).replace(" ", "").split(";") for langs in prof_stack['HaveWorkedLanguage'].dropna()] for lang in sublist])
stud_languages = pd.Series([lang for sublist in [str(langs).replace(" ", "").split(";") for langs in stud_stack['HaveWorkedLanguage'].dropna()] for lang in sublist])
plot_stud_prof(prof=prof_languages, stud=stud_languages, title="Languages")
This corresponds well: the most represented programming language is directly linked to web development. We were also interested in the most important languages per country, so we plotted them on a map.
In [8]:
# We prepare the data for the map
codes = [MAP_COUNTRIES[country] if country != 'I prefer not to say' else None for country in stack['Country']]
stack['Code']=codes
country_stack = stack[['HaveWorkedLanguage','Code']]
country_stack["HaveWorkedLanguage"] = country_stack["HaveWorkedLanguage"].apply(lambda x: str(x).replace(" ", "").split(";"))
country_stack = country_stack.set_index('Code')
language_country = pd.get_dummies(pd.DataFrame(country_stack['HaveWorkedLanguage'].tolist(), index=country_stack.index).stack()).sum(level=0)
language_country = language_country.T.idxmax()
languages = list(language_country.unique())
languages.remove('nan')
# Get a 12 colors scale
paired = cl.scales['12']['qual']['Paired']
# Create dict between language and color
language_color_dict = {}
for i, language in enumerate(languages):
    language_color_dict[language] = paired[i]
# Create dict between language and countries
language_country_dict = {}
for index, row in language_country.iteritems():
    if row in language_country_dict:
        language_country_dict[row].append(index)
    else:
        language_country_dict[row] = [index]
data = []
for i, lang in enumerate(languages):
    trace1 = go.Choropleth(
        z=['1'] * len(language_country_dict[lang]),
        autocolorscale=False,
        colorscale=[[0, 'rgb(255,255,255)'], [1, language_color_dict[lang]]],
        hoverinfo='text',
        locations=language_country_dict[lang],
        name=lang,
        showscale=False,
        text=lang
    )
    data.append(trace1)
layout = dict(
    title='Most represented programming language per country',
    geo=dict(
        projection=dict(
            type='Mercator'
        ),
        showframe=False
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
This map shows how dominant JavaScript is among respondents in most countries, and hence the real importance of StackOverflow for JavaScript developers around the world.
In [9]:
# Let's take a look at the satisfaction of each surveyee
jobSat = stack['JobSatisfaction']/stack['JobSatisfaction'].max()
carrSat = stack['CareerSatisfaction']/stack['CareerSatisfaction'].max()
j_satisfaction = jobSat.value_counts(normalize=True).sort_index()
carr = carrSat.value_counts(normalize=True)
c_satisfaction = carr.loc[j_satisfaction.index]
df = pd.DataFrame([j_satisfaction, c_satisfaction])
df = df.T
df.columns = ["Job Satisfaction", "Career Satisfaction"]
df.plot.bar(figsize=(7,7))
plt.title('Satisfaction')
plt.ylabel("Ratio")
plt.show()
The graph above makes it pretty clear that job satisfaction and career satisfaction are directly linked. However, the aggregated distributions could hide offsetting cases, such as one person rating their job 4 and their career 8 while another rates their job 8 and their career 4. To rule this out, we look at the per-person difference between the two scores and check whether large gaps are common.
In [10]:
satisfaction = stack[['JobSatisfaction','CareerSatisfaction']]
satisfaction['Difference']= np.abs(satisfaction['JobSatisfaction']-satisfaction['CareerSatisfaction'])
satisfaction['Difference'].describe()
Out[10]:
As we can see, people generally give approximately the same satisfaction score to their job and their career, which confirms that the similarity of the distributions above is not an artifact of offsetting differences. Let's now see what makes developers happy or unhappy in their professional life.
In [11]:
# We keep the columns linked to satisfaction,
# average them and normalize the result
satisfaction_mean = prof_stack[["JobSatisfaction", "CareerSatisfaction"]].mean(axis=1)
satisfaction_mean = (satisfaction_mean - satisfaction_mean.min()) / (satisfaction_mean.max() - satisfaction_mean.min())
# Separate the other factors
other_factors = prof_stack.drop(["StackOverflowSatisfaction", "CareerSatisfaction", "JobSatisfaction", "ExpectedSalary"], axis=1)
other_factors = other_factors.fillna("")
# In order to measure the correlation,
# we encode the categorical labels
for c in other_factors.columns:
    le = preprocessing.LabelEncoder()
    le.fit(other_factors[c])
    other_factors[c] = le.transform(other_factors[c])
# Which columns does the satisfaction correlate the most with ?
most_satisfactory = other_factors.corrwith(satisfaction_mean).nlargest(10)
most_satisfactory.plot.bar()
plt.ylabel("Correlation")
plt.xlabel("Other factors")
plt.title("Column correlation for the measure of satisfaction")
plt.show()
As expected, the higher people's salary, the more satisfied they are. Interestingly, satisfaction also correlates with general wellness at work, for example the equipment or the environment, but also with free time! Respondents seem to be happier if they program as a hobby and build things on their own. Finally, the biggest factors like Country, Education and Race matter a lot too, which was to be expected.
In [12]:
plot_stud_prof(prof_stack=prof_stack, stud_stack=stud_stack, column='DiversityImportant', title="Diversity Importance")
The two distributions are really close: the importance given to diversity does not seem to differ between professionals and students.
In [13]:
f, (box, dist) = plt.subplots(nrows=1, ncols=2, figsize=(10,6))
f.subplots_adjust(wspace=0.7)
# Boxplot of the salaries
sns.boxplot(prof_stack.Salary, orient='v', ax=box)
box.set_title("Box-plot of the total salaries")
# Comparing India and the USA salary-wise
usa_salary = prof_stack[prof_stack.Country == "United States"].Salary
usa_salary.rename("United States", inplace=True)
india_salary = prof_stack[prof_stack.Country == "India"].Salary
india_salary.rename("India", inplace=True)
india_salary.plot.kde(legend=True, ax=dist)
usa_salary.plot.kde(legend=True, ax=dist)
dist.set_title("Salary distribution")
dist.set_xlabel("Salary in USD")
plt.show()
As expected, the boxplot shows a wide spread of salaries. Comparing the distributions of two different countries shows why: each country simply has its own revenue model for the profession. What about the students: do their salary expectations meet reality?
In [14]:
# We compare the student expectations to the young professionals
stud_stack.ExpectedSalary.plot(kind='kde', figsize=(10,8), color='r', legend=True)
prof_stack[(prof_stack.YearsProgram == "Less than a year")].Salary.plot(kind='kde', figsize=(7,7), legend=True)
plt.xlabel("Salary/Expected Salary in USD")
plt.title("Distribution of expected salary and salary for the students and professionals")
plt.show()
Indeed, they match quite nicely! Students are not the highest dreamers after all.
Another really important aspect of salary is equality between women and men. Let's check whether it holds for the StackOverflow community.
In [15]:
female_salary = stack[stack.Gender == "Female"].Salary
male_salary = stack[stack.Gender == "Male"].Salary
male_salary.rename("Male", inplace=True)
female_salary.rename("Female", inplace=True)
print("There are {} male samples and {} female samples.".format(male_salary.count(), female_salary.count()))
As we can see, the sample sizes are very unequal, so it would be unwise to draw conclusions directly, even though the salaries look quite close. Instead, we run a pairwise matching to find a one-to-one mapping from each woman to a comparable man, without taking salary into account, and then compare the distributions. This way the comparison makes sense.
In [16]:
# We select the columns to use as features
similarity_features = ['Country', 'University',
'EmploymentStatus', 'FormalEducation', 'CompanySize',
'CompanyType', 'YearsCodedJob', 'DeveloperType',
'WebDeveloperType', 'NonDeveloperType',
'EducationTypes', 'HaveWorkedLanguage', 'Methodology',
'HighestEducationParents', 'Race']
# The following columns we want a 1 to 1 matching for
# as they are the most important features
perfect_matches = ['Country', 'EmploymentStatus', 'FormalEducation',
'YearsCodedJob','HighestEducationParents', 'Race']
sim_stack = prof_stack[similarity_features].copy()
to_dummy = ["HaveWorkedLanguage", "DeveloperType", "Methodology", "EducationTypes", "NonDeveloperType", "Race"]
for sub in to_dummy:
    sim_stack[sub] = sim_stack[sub].apply(lambda x: str(x).replace(" ", "").split(";"))
    if sub == "Race":
        sim_stack[sub] = sim_stack[sub].apply(lambda x: ["Race_" + s for s in x])
    sim_stack = pd.concat([sim_stack, pd.get_dummies(pd.DataFrame(sim_stack[sub].tolist(), index=sim_stack.index).stack()).sum(level=0)], axis=1).drop(sub, axis=1)
# Final dummies we will be using
dummied_prof = pd.get_dummies(sim_stack)
dum_female = dummied_prof[prof_stack.Gender == "Female"]
dum_male = dummied_prof[prof_stack.Gender == "Male"]
# Set load_from_pickle to True to load
# precomputed results instead of recomputing
load_from_pickle = False
matching = pd.DataFrame(index=dum_female.index, columns=["male_matched", "correlation"])
for i, fem in enumerate(matching.index):
    print("Progress: {:.2f}%".format((i+1)/matching.shape[0]*100), end="\r")
    # The female sample we try to match
    female_to_match = dum_female.loc[fem]
    # We want an exact match for the
    # columns specified in the previous cell
    perfect_columns = [col for c in perfect_matches for col in dum_male if col.startswith(c)]
    male_possibilities = dum_male[(dum_male[perfect_columns] == female_to_match[perfect_columns]).all(axis=1)]
    # For the other columns, we select
    # the closest sample correlation-wise
    correlations = male_possibilities.drop(perfect_columns, axis=1).corrwith(dum_female.drop(perfect_columns, axis=1).loc[fem], axis=1)
    # If we find no perfect match, the
    # female sample is discarded
    if correlations.shape[0] == 0:
        continue
    # Select the best male match and remove
    # it from all possible matches
    best_match = correlations.idxmax()
    dum_male.drop(best_match, inplace=True)
    matching.loc[fem, "male_matched"] = best_match
    matching.loc[fem, "correlation"] = correlations.max()
# Remove the unmatched pairs
matching.dropna(inplace=True)
matching = matching.reset_index()
matched_indices = matching.T.iloc[0].tolist() + matching.T.iloc[1].tolist()
matched_data = prof_stack.loc[matched_indices]
male_salary = matched_data[matched_data.Gender == "Male"].Salary
female_salary = matched_data[matched_data.Gender == "Female"].Salary
male_salary.rename("Male", inplace=True)
female_salary.rename("Female", inplace=True)
# Let's plot it
male_salary.plot.kde(color="r", legend=True, figsize=(10,8))
female_salary.plot.kde(legend=True, figsize=(10,8))
plt.title("Distribution of male and female salaries")
plt.show()
# We perform a Kolmogorov-Smirnoff test
statistic, p_value = ks_2samp(female_salary, male_salary)
print("We get a statistic of {:.4f} and a p_value of {:.4f} for {} male samples and {} female samples."
.format(statistic, p_value, male_salary.count(), female_salary.count()))
# We also take a look at the quartiles
print("\nThe male salaries quartiles:\n{}".format(male_salary.describe()))
print("\nThe female salaries quartiles:\n{}".format(female_salary.describe()))
As the p-value is above the 0.05 threshold, we cannot reject the hypothesis that the two salary samples are drawn from the same distribution, which the low statistic confirms. Still, the female salary distribution is shifted slightly to the left of the male one, as the quartiles confirm: men come out ahead at almost every quartile.
We should keep in mind that we only have 447 samples of each population, which might not be enough to base firm conclusions on.
To get a final feel of the data, we take a look at the distribution of genders and races of the respondents.
In [17]:
f, (race, gender) = plt.subplots(nrows=1, ncols=2, figsize=(10,6))
f.subplots_adjust(wspace=0.7)
prof_stack.Race.value_counts(normalize=True)[0:6].plot(kind='bar', ax=race)
race.set_title("Distribution of the races")
race.set_ylabel("Ratio")
prof_stack.Gender.value_counts(normalize=True)[0:4].plot(kind='bar', ax=gender)
gender.set_title("Distribution of the genders")
gender.set_ylabel("Ratio")
plt.show()
In [18]:
prof_education = pd.Series([ed for sublist in [str(educs).replace(" ", "").split(";") for educs in prof_stack['EducationTypes'].dropna()] for ed in sublist])
stud_education = pd.Series([ed for sublist in [str(educs).replace(" ", "").split(";") for educs in stud_stack['EducationTypes'].dropna()] for ed in sublist])
plot_stud_prof(prof=prof_education, stud=stud_education, title="Education types")
Again unsurprisingly, the distributions are quite similar, apart from those tied to job/student life: for example, students are more likely to learn through online courses and professionals through on-the-job training.
In [19]:
gif_df = stack[["PronounceGIF", "Salary"]].dropna(how='any')
gif_df = gif_df.set_index("PronounceGIF")
g_df = gif_df.loc['With a hard "g," like "gift"'].Salary.values
j_df = gif_df.loc['With a soft "g," like "jiff"'].Salary.values
comparing_df = pd.DataFrame()
comparing_df['gif'] = pd.Series(g_df)
# Pad the shorter series with NaNs so both columns have the same length
filling = np.full(len(g_df) - len(j_df), np.nan)
to_add = np.append(j_df, filling)
comparing_df['jif'] = pd.Series(to_add)
# Let's check it out
plot = sns.boxplot(data=comparing_df, orient="v",)
plt.ylabel("Salary")
plt.title("Distribution of salary for the gif and jif populations")
plt.show()
print(comparing_df.describe())
As we can quickly see, there is not much difference between the gif people and the jif people in terms of salary.
In this section we created a recommender system to help recruiters find the candidates they are looking for. The idea is to create a network using the individuals as nodes and their similarity as the weights of the edges. Recruiters can ask for certain skills/education/locations and we will show them the closest entities to their request.
In [20]:
#Keep important features
important_features_prof = ['ProgramHobby', 'Country', 'University', 'FormalEducation',
                           'MajorUndergrad', 'CompanyType', 'YearsCodedJob', 'YearsProgram',
                           'DeveloperType', 'CareerSatisfaction', 'JobSatisfaction',
                           'Overpaid', 'HaveWorkedLanguage', 'WantWorkLanguage']
important_features_stud = ['ProgramHobby', 'Country', 'University', 'FormalEducation',
                           'YearsProgram', 'WorkStart', 'HaveWorkedLanguage',
                           'WantWorkLanguage', 'AuditoryEnvironment']
final_prof_stack = prof_stack[important_features_prof].copy()
final_stud_stack = stud_stack[important_features_stud].copy()
We first compute the k-nearest neighbors of each individual in our dataset and plot them on the corresponding graph.
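`compute_knn_graph` lives in `helper_functions`; a minimal sketch of what such a helper could look like, assuming numeric (e.g. one-hot encoded) feature vectors, is shown below. `knn_graph` is a hypothetical stand-in, not the actual helper.

```python
# Hypothetical stand-in for compute_knn_graph: connect each user to its
# k nearest neighbours in feature space and return a networkx graph.
import networkx as nx
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_graph(features: np.ndarray, k: int = 3) -> nx.Graph:
    # Sparse adjacency: entry (i, j) is 1 if j is among i's k neighbours
    adjacency = kneighbors_graph(features, n_neighbors=k, mode="connectivity")
    return nx.from_scipy_sparse_array(adjacency)

# Toy usage on random feature vectors
rng = np.random.default_rng(0)
G = knn_graph(rng.random((20, 5)), k=3)
print(G.number_of_nodes())  # 20
```

Since each node links to its k neighbours and the graph is undirected, every node ends up with degree at least k. (`from_scipy_sparse_array` requires networkx ≥ 2.8; older versions use `from_scipy_sparse_matrix`.)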
In [21]:
# Let's take a look at the students network
preprocessed_dfs_stud = preprocessed(final_stud_stack, ["HaveWorkedLanguage", "WantWorkLanguage"],'WantWorkLanguage', False)
G_stud, pos_stud = draw_graph(compute_knn_graph(preprocessed_dfs_stud[0]), "Network containing StackOverflow's students")
In [22]:
# Now, onto the professionals network
preprocessed_dfs_prof = preprocessed(final_prof_stack, ["HaveWorkedLanguage", "WantWorkLanguage", 'DeveloperType'],'WantWorkLanguage', True)
G_prof, pos_prof = draw_graph(compute_knn_graph(preprocessed_dfs_prof[0]), "Network containing StackOverflow's professionals")
As we have fewer students than professionals, the students' network is sparser.
Next, we look at the patterns that emerge in our networks from the important features declared above.
In [23]:
map_stud, df_encode_stud = encode_label(preprocessed_dfs_stud[1], important_features_stud)
map_prof, df_encode_prof = encode_label(preprocessed_dfs_prof[1], important_features_prof)
# A few features for the student's network
draw_features(important_features_stud, df_encode_stud, map_stud, G_stud, pos_stud, "student's")
In [24]:
# A few features for the professional's network
draw_features(important_features_prof, df_encode_prof, map_prof, G_prof, pos_prof, "professional's")