Software Developer career satisfaction detection - DEMO

Created by Judit Acs



In [1]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

from sklearn.preprocessing import LabelEncoder

from keras.layers import Input, Dense, Dropout
from keras.models import Model
from keras.callbacks import EarlyStopping









    



Using TensorFlow backend.

Load data

We use pandas for data loading and preprocessing.

Data source (kaggle.com)



In [2]:

    
df = pd.read_csv("data/stackoverflow/survey_results_public.csv")
df.head()









    Out[2]:







  
    
      
      Respondent
      Professional
      ProgramHobby
      Country
      University
      EmploymentStatus
      FormalEducation
      MajorUndergrad
      HomeRemote
      CompanySize
      ...
      StackOverflowMakeMoney
      Gender
      HighestEducationParents
      Race
      SurveyLong
      QuestionsInteresting
      QuestionsConfusing
      InterestedAnswers
      Salary
      ExpectedSalary
    
  
  
    
      0
      1
      Student
      Yes, both
      United States
      No
      Not employed, and not looking for work
      Secondary school
      NaN
      NaN
      NaN
      ...
      Strongly disagree
      Male
      High school
      White or of European descent
      Strongly disagree
      Strongly agree
      Disagree
      Strongly agree
      NaN
      NaN
    
    
      1
      2
      Student
      Yes, both
      United Kingdom
      Yes, full-time
      Employed part-time
      Some college/university study without earning ...
      Computer science or software engineering
      More than half, but not all, the time
      20 to 99 employees
      ...
      Strongly disagree
      Male
      A master's degree
      White or of European descent
      Somewhat agree
      Somewhat agree
      Disagree
      Strongly agree
      NaN
      37500.0
    
    
      2
      3
      Professional developer
      Yes, both
      United Kingdom
      No
      Employed full-time
      Bachelor's degree
      Computer science or software engineering
      Less than half the time, but at least one day ...
      10,000 or more employees
      ...
      Disagree
      Male
      A professional degree
      White or of European descent
      Somewhat agree
      Agree
      Disagree
      Agree
      113750.0
      NaN
    
    
      3
      4
      Professional non-developer who sometimes write...
      Yes, both
      United States
      No
      Employed full-time
      Doctoral degree
      A non-computer-focused engineering discipline
      Less than half the time, but at least one day ...
      10,000 or more employees
      ...
      Disagree
      Male
      A doctoral degree
      White or of European descent
      Agree
      Agree
      Somewhat agree
      Strongly agree
      NaN
      NaN
    
    
      4
      5
      Professional developer
      Yes, I program as a hobby
      Switzerland
      No
      Employed full-time
      Master's degree
      Computer science or software engineering
      Never
      10 to 19 employees
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
    
  

5 rows × 154 columns

Most answers are categorical:



In [3]:

    
df.groupby("ProgramHobby").size()









    Out[3]:





ProgramHobby
No                                            9787
Yes, I contribute to open source projects     3048
Yes, I program as a hobby                    24801
Yes, both                                    13756
dtype: int64

The table has 154 columns but many values are empty. These columns have the most non-empty values:



In [4]:

    
df.count().sort_values(ascending=False)[:20]









    Out[4]:





Respondent                51392
Professional              51392
FormalEducation           51392
EmploymentStatus          51392
University                51392
Country                   51392
ProgramHobby              51392
YearsProgram              51145
PronounceGIF              51008
HomeRemote                44008
MajorUndergrad            42841
CareerSatisfaction        42695
ClickyKeys                42046
YearsCodedJob             40890
JobSatisfaction           40376
CompanySize               38922
TabsSpaces                38851
CompanyType               38823
StackOverflowDescribes    36932
WorkStart                 36696
dtype: int64

Feature extraction

We will use a few columns as features and CareerSatisfaction as the target variable:



In [5]:

    
feature_cols = ["Professional", "EmploymentStatus", "FormalEducation", "ProgramHobby", "HomeRemote",
                "IDE", "MajorUndergrad"]

# I do not include JobSatisfaction because it's too similar to the target variable
# Uncomment this line to include it in the features
# feature_cols.append("JobSatisfaction")

target_col = "CareerSatisfaction"

We filter all rows that do not define every feature column.



In [6]:

    
condition = (df[target_col].notnull())
for c in feature_cols:
    condition &= (df[c].notnull())
df = df[condition]
len(df)









    Out[6]:





27302

CareerSatisfaction values are distributed unevenly, so we may want to include fewer samples from very large classes. Uncomment the second to last line to filter these these samples:



In [7]:

    
minval = df.groupby(target_col).size().min() * 2

filt = None

for grouper, group in df.groupby(target_col):
    size = min(minval, len(group))
    if filt is None:
        filt = group.sample(size)
    else:
        filt = pd.concat((filt, group.sample(size)), axis=0)
#df = filt
len(df)









    Out[7]:





27302

Convert categorical features to one-hot vectors

Categorical features need to be encoded as one-hot vectors instead of a single integer value. LabelEncoder does this automatically:



In [8]:

    
X = None

for col in feature_cols:
    mtx = LabelEncoder().fit_transform(df[col])
    maxval = np.max(mtx)
    feat_mtx = np.zeros((mtx.shape[0], maxval+1))
    feat_mtx[np.arange(feat_mtx.shape[0]), mtx] = 1
    if X is None:
        X = feat_mtx
    else:
        X = np.concatenate((X, feat_mtx), axis=1)

Scale labels to [0, 1]



In [9]:

    
y = df[target_col].as_matrix() / 10

Shuffle data



In [10]:

    
rand_mtx = np.random.permutation(X.shape[0])
train_split = int(X.shape[0] * 0.9)
train_indices = rand_mtx[:train_split]
test_indices = rand_mtx[train_split:]

X_train = X[train_indices]
X_test = X[test_indices]
y_train = y[train_indices]
y_test = y[test_indices]

Define the model



In [11]:

    
input_layer = Input(batch_shape=(None, X.shape[1]))
layer = Dense(100, activation="sigmoid")(input_layer)
layer = Dropout(.2)(layer)
layer = Dense(100, activation="sigmoid")(input_layer)
layer = Dropout(.2)(layer)
layer = Dense(100, activation="sigmoid")(input_layer)
layer = Dropout(.2)(layer)
layer = Dense(1, activation="sigmoid")(layer)
model = Model(inputs=input_layer, outputs=layer)
model.compile("rmsprop", loss="mse")

Train the model



In [12]:

    
ea = EarlyStopping(patience=2)
model.fit(X_train, y_train, epochs=100, batch_size=512,
          validation_split=.2, callbacks=[ea])









    



Train on 19656 samples, validate on 4915 samples
Epoch 1/100
19656/19656 [==============================] - 0s - loss: 0.0360 - val_loss: 0.0326
Epoch 2/100
19656/19656 [==============================] - 0s - loss: 0.0348 - val_loss: 0.0327
Epoch 3/100
19656/19656 [==============================] - 0s - loss: 0.0344 - val_loss: 0.0324
Epoch 4/100
19656/19656 [==============================] - 0s - loss: 0.0338 - val_loss: 0.0329
Epoch 5/100
19656/19656 [==============================] - 0s - loss: 0.0331 - val_loss: 0.0324
Epoch 6/100
19656/19656 [==============================] - 0s - loss: 0.0327 - val_loss: 0.0327






    Out[12]:





<keras.callbacks.History at 0x7f7b0d5f1b00>

Predict labels



In [13]:

    
pred = model.predict(X_test)

Compute loss

We compute it manually.



In [14]:

    
np.sqrt(np.mean(pred - y_test) ** 2)









    Out[14]:





0.016742466623116976

What would be the loss if we guessed 0.5 every time?

It is always a good idea to perform sanity checks. Did our model learn anything more useful than a trivial solution or a random generator?



In [15]:

    
np.sqrt(np.mean(.5*np.ones(y_test.shape[0]) - y_test) ** 2)









    Out[15]:





0.24335408275357012

Histogram of predictions vs. gold labels



In [16]:

    
prediction = pd.DataFrame({'gold': y_test, 'prediction': pred[:, 0]})
prediction['diff'] = prediction.gold - prediction.prediction
prediction.hist(['gold', 'prediction'], bins=11)









    Out[16]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f7b4c61ec50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f7b0bc9c438>]], dtype=object)

Plot labels for 20 random samples



In [17]:

    
prediction.sample(20).plot(y=['gold', 'prediction'], kind='bar')









    Out[17]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7b0d44d128>

How similar are the answers for JobSafisfaction and CareerSatisfaction?



In [18]:

    
df[['JobSatisfaction', 'CareerSatisfaction']].sample(20).plot(kind='bar')









    Out[18]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7b0aa5b4e0>

They are very similar

Try training the model with and without this feature.



In [19]:

    
(df['JobSatisfaction'] - df['CareerSatisfaction']).hist(bins=20)









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f7b0aa5cda0>

	Respondent	Professional	ProgramHobby	Country	University	EmploymentStatus	FormalEducation	MajorUndergrad	HomeRemote	CompanySize	...	StackOverflowMakeMoney	Gender	HighestEducationParents	Race	SurveyLong	QuestionsInteresting	QuestionsConfusing	InterestedAnswers	Salary	ExpectedSalary
0	1	Student	Yes, both	United States	No	Not employed, and not looking for work	Secondary school	NaN	NaN	NaN	...	Strongly disagree	Male	High school	White or of European descent	Strongly disagree	Strongly agree	Disagree	Strongly agree	NaN	NaN
1	2	Student	Yes, both	United Kingdom	Yes, full-time	Employed part-time	Some college/university study without earning ...	Computer science or software engineering	More than half, but not all, the time	20 to 99 employees	...	Strongly disagree	Male	A master's degree	White or of European descent	Somewhat agree	Somewhat agree	Disagree	Strongly agree	NaN	37500.0
2	3	Professional developer	Yes, both	United Kingdom	No	Employed full-time	Bachelor's degree	Computer science or software engineering	Less than half the time, but at least one day ...	10,000 or more employees	...	Disagree	Male	A professional degree	White or of European descent	Somewhat agree	Agree	Disagree	Agree	113750.0	NaN
3	4	Professional non-developer who sometimes write...	Yes, both	United States	No	Employed full-time	Doctoral degree	A non-computer-focused engineering discipline	Less than half the time, but at least one day ...	10,000 or more employees	...	Disagree	Male	A doctoral degree	White or of European descent	Agree	Agree	Somewhat agree	Strongly agree	NaN	NaN
4	5	Professional developer	Yes, I program as a hobby	Switzerland	No	Employed full-time	Master's degree	Computer science or software engineering	Never	10 to 19 employees	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN