Why Do You Quit Your Job?

This notebook tries to find out what makes people want to quit. Some interesting facts will also surface as we dig through the data.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set(style='white', color_codes=True)

In [2]:
data = pd.read_csv('HR_comma_sep.csv')

In [3]:
data.head()


Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

The data is pretty clean. Love it! No nulls, and most columns are already numeric. Let's take a look at the values in the data.
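As an extra sanity check (a small sketch, not one of the original cells), counting nulls per column should come back all zeros:

# Count missing values per column; data.info() above already suggests none.
print(data.isnull().sum())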


In [5]:
data['sales'].unique()


Out[5]:
array(['sales', 'accounting', 'hr', 'technical', 'support', 'management',
       'IT', 'product_mng', 'marketing', 'RandD'], dtype=object)

In [6]:
data['Work_accident'].unique()


Out[6]:
array([0, 1], dtype=int64)

In [7]:
data['left'].unique()


Out[7]:
array([1, 0], dtype=int64)

In [8]:
data['promotion_last_5years'].unique()


Out[8]:
array([0, 1], dtype=int64)

In [9]:
correlation = data.corr()
correlation


Out[9]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
satisfaction_level 1.000000 0.105021 -0.142970 -0.020048 -0.100866 0.058697 -0.388375 0.025605
last_evaluation 0.105021 1.000000 0.349333 0.339742 0.131591 -0.007104 0.006567 -0.008684
number_project -0.142970 0.349333 1.000000 0.417211 0.196786 -0.004741 0.023787 -0.006064
average_montly_hours -0.020048 0.339742 0.417211 1.000000 0.127755 -0.010143 0.071287 -0.003544
time_spend_company -0.100866 0.131591 0.196786 0.127755 1.000000 0.002120 0.144822 0.067433
Work_accident 0.058697 -0.007104 -0.004741 -0.010143 0.002120 1.000000 -0.154622 0.039245
left -0.388375 0.006567 0.023787 0.071287 0.144822 -0.154622 1.000000 -0.061788
promotion_last_5years 0.025605 -0.008684 -0.006064 -0.003544 0.067433 0.039245 -0.061788 1.000000

Visualisation of the correlations


In [10]:
plt.figure(figsize=(10, 10))
sb.heatmap(correlation, vmax=0.8, square=True, annot=True)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xa9aca58>

left vs. satisfaction_level shows the strongest negative correlation (about -0.39).
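To back that up, a small sketch (not in the original cells) ranking every feature by its correlation with left:

# Rank features by their correlation with 'left'; satisfaction_level
# should come out most negative.
print(correlation['left'].drop('left').sort_values())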

Let's see what a box plot can show.


In [11]:
sb.boxplot(x='left', y='satisfaction_level', data=data)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xabca668>

Violin plot


In [12]:
sb.violinplot(x='left', y='satisfaction_level', data=data)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xc0636a0>

It looks like many more of the people who left had much lower satisfaction (no surprise there).


In [13]:
sb.FacetGrid(data, hue='left').map(sb.kdeplot, 'satisfaction_level').add_legend()


Out[13]:
<seaborn.axisgrid.FacetGrid at 0xc1596a0>

Separate the data by left/stayed and compare the group means.


In [14]:
left = data.groupby('left').mean()
left


Out[14]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years
left
0 0.666810 0.715473 3.786664 199.060203 3.380032 0.175009 0.026251
1 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 0.005321

More hours, fewer promotions... and they leave...
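The hours gap is easy to eyeball too; a sketch (beyond the original cells) splitting monthly hours by left:

# Compare the distribution of monthly hours for leavers vs. stayers.
sb.boxplot(x='left', y='average_montly_hours', data=data)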


In [15]:
salary = data.groupby('salary').mean()
salary


Out[15]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
salary
high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214 0.066289 0.058205
low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154 0.296884 0.009021
medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361 0.204313 0.028079

In [16]:
sales = data.groupby('sales').sum()
sales


Out[16]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
sales
IT 758.46 879.55 4683 248119 4256 164 273 3
RandD 487.80 560.44 3033 158030 2650 134 121 27
accounting 446.51 550.49 2934 154292 2702 96 204 14
hr 442.52 523.84 2701 146828 2480 89 215 15
management 391.45 456.12 2432 126787 2711 103 91 69
marketing 530.76 614.23 3164 171073 3063 138 203 43
product_mng 558.91 644.71 3434 180369 3135 132 198 0
sales 2543.81 2938.23 15634 831773 14631 587 1014 100
support 1378.19 1611.81 8479 447490 7563 345 555 20
technical 1653.48 1961.39 10548 550793 9279 381 697 28

It looks very much like a sales-driven company.
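The sums above scale with department size, so plain head counts make the point more directly (a small sketch):

# Head count per department; 'sales' should dominate.
print(data['sales'].value_counts())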


In [17]:
sales_mean = data.groupby('sales').mean()
sales_mean


Out[17]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
sales
IT 0.618142 0.716830 3.816626 202.215974 3.468623 0.133659 0.222494 0.002445
RandD 0.619822 0.712122 3.853875 200.800508 3.367217 0.170267 0.153748 0.034307
accounting 0.582151 0.717718 3.825293 201.162973 3.522816 0.125163 0.265971 0.018253
hr 0.598809 0.708850 3.654939 198.684709 3.355886 0.120433 0.290934 0.020298
management 0.621349 0.724000 3.860317 201.249206 4.303175 0.163492 0.144444 0.109524
marketing 0.618601 0.715886 3.687646 199.385781 3.569930 0.160839 0.236597 0.050117
product_mng 0.619634 0.714756 3.807095 199.965632 3.475610 0.146341 0.219512 0.000000
sales 0.614447 0.709717 3.776329 200.911353 3.534058 0.141787 0.244928 0.024155
support 0.618300 0.723109 3.803948 200.758188 3.393001 0.154778 0.248991 0.008973
technical 0.607897 0.721099 3.877941 202.497426 3.411397 0.140074 0.256250 0.010294

Oh! Product management got zero promotions...

Management had the most promotions on average. Now go and get a management position!
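To see that at a glance, a sketch (assuming the sales_mean frame from In [17] above) plotting promotion rates per department:

# Promotion rate per department, sorted so management lands on top.
sales_mean['promotion_last_5years'].sort_values().plot(kind='barh', color='b')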


In [18]:
data.groupby('sales').mean()['satisfaction_level'].plot(kind='bar', color='b')


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0xc63a320>

Model Training and Prediction


In [19]:
#from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
#from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#from sklearn.feature_selection import SelectFromModel

In [20]:
data2 = pd.get_dummies(data)
data2.head()


Out[20]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales_IT sales_RandD ... sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_high salary_low salary_medium
0 0.38 0.53 2 157 3 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1 0.80 0.86 5 262 6 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 0.11 0.88 7 272 4 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
3 0.72 0.87 5 223 5 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
4 0.37 0.52 2 159 3 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0

5 rows × 21 columns
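One hedged aside: the one-hot columns above are perfectly collinear within each group (the sales_* columns sum to 1, as do the salary_* columns), which can matter for linear models. If that is a concern, pandas can drop one dummy per category:

# Alternative encoding (not used below): drop one dummy per category to
# avoid perfect collinearity in linear models such as logistic regression.
data2_alt = pd.get_dummies(data, drop_first=True)
print(data2_alt.shape)  # two fewer columns than data2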


In [21]:
d2_copy = data2.copy()  # take an actual copy so data2 stays untouched

In [22]:
y = d2_copy['left'].values
X = d2_copy.drop(['left'], axis=1).values

In [23]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.5)
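Note this split is unseeded, so the accuracies below will wobble slightly between runs. A reproducible, class-stratified variant (an alternative sketch, not what the results below use) would be:

# Reproducible alternative: fix the seed and stratify on the label so both
# halves keep the same leave/stay ratio. Not used below, so the printed
# accuracies stay tied to the unseeded split above.
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, test_size=0.5,
                                          random_state=42, stratify=y)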

Logistic Regression


In [24]:
LR = LogisticRegression(n_jobs=-1)
LR.fit(Xtrain, ytrain)
y_predict_proba = LR.predict_proba(Xtest)
y_predict = LR.predict(Xtest)

In [25]:
print(sum(y_predict == ytest) / float(len(ytest)))


0.799333333333
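For context, roughly 76% of employees stayed, so a classifier that always predicts 'stayed' already scores about 0.76; logistic regression only modestly beats that baseline. A one-line check (a sketch):

# Majority-class baseline: always predict 'stayed' (left == 0).
print(1 - ytest.mean())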

Support Vector Machine


In [26]:
SV = svm.SVC()
SV.fit(Xtrain, ytrain)
y_predict = SV.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))


0.943466666667
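Much better. One caveat: RBF SVMs are sensitive to feature scale, and here monthly hours are in the hundreds while satisfaction lives in [0, 1]. A hedged sketch that standardizes first, which may shift the score:

from sklearn.preprocessing import StandardScaler

# Standardize features so no single column dominates the RBF kernel.
scaler = StandardScaler().fit(Xtrain)
SV2 = svm.SVC()
SV2.fit(scaler.transform(Xtrain), ytrain)
print(SV2.score(scaler.transform(Xtest), ytest))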

Random Forest


In [27]:
rc = RandomForestClassifier()
param_grid = {'max_features': [10, 12, 14], 'n_estimators': [500]}

gs = GridSearchCV(estimator=rc, param_grid=param_grid, cv=2, n_jobs=-1, verbose=1)
gs.fit(Xtrain, ytrain)


Fitting 2 folds for each of 3 candidates, totalling 6 fits
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   48.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   48.5s finished
Out[27]:
GridSearchCV(cv=2, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [10, 12, 14], 'n_estimators': [500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [28]:
print(gs.best_score_)
print(gs.best_params_)


0.982931057474
{'max_features': 10, 'n_estimators': 500}

In [29]:
y_f = gs.predict(Xtest)

In [30]:
print(sum(y_f == ytest) / float(len(ytest)))


0.9868
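Accuracy alone can hide mistakes on the minority class (the leavers), which is the class we actually care about. A quick sketch with per-class precision and recall:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1; check the 'left' == 1 row in particular.
print(classification_report(ytest, y_f))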

In [31]:
column_name = data2.drop(['left'], axis=1).columns  # feature names must exclude the target
column_name


Out[31]:
Index([u'satisfaction_level', u'last_evaluation', u'number_project',
       u'average_montly_hours', u'time_spend_company', u'Work_accident',
       u'promotion_last_5years', u'sales_IT', u'sales_RandD',
       u'sales_accounting', u'sales_hr', u'sales_management',
       u'sales_marketing', u'sales_product_mng', u'sales_sales',
       u'sales_support', u'sales_technical', u'salary_high', u'salary_low',
       u'salary_medium'],
      dtype='object')

Feature Importance with Random Forest


In [32]:
# Reuse the best parameters found by the grid search above.
rc = RandomForestClassifier(**gs.best_params_)
forest = rc.fit(Xtrain, ytrain)
feature_imp = forest.feature_importances_

In [33]:
feature_importance_RF = pd.DataFrame(list(zip(column_name, feature_imp)))

In [34]:
new_col_name = ['feature', 'importance']
feature_importance_RF.columns = new_col_name
# Plot the six most important features rather than the first six rows.
top_RF = feature_importance_RF.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_RF)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d14048>

Gradient Boosting


In [35]:
gb = GradientBoostingClassifier()
gb.fit(Xtrain,ytrain)
y_predict = gb.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))


0.974133333333

Feature Importance with Gradient Boosting


In [36]:
feature_importance_GC = pd.DataFrame(list(zip(column_name, gb.feature_importances_)))
feature_importance_GC.columns = new_col_name
top_GC = feature_importance_GC.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_GC)


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c5d860>

Who will leave the company?


In [37]:
# Predict leave probabilities for employees who have not (yet) left.
pred_of_stay = forest.predict_proba(X[y == 0])
stay = data[y == 0]
# Current employees the forest is fully confident will leave.
stay[pred_of_stay[:, 1] == 1]


Out[37]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
6358 0.81 0.98 5 243 6 0 0 0 sales medium
6466 0.39 0.57 2 132 3 0 0 0 support low
7762 0.82 0.87 5 273 6 0 0 0 support medium
9781 0.42 0.50 2 151 3 0 0 0 sales low
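Filtering on a predicted probability of exactly 1 is quite strict; a sketch that instead ranks current employees by predicted leave probability gives a fuller watch list:

# Rank employees who have not left by predicted probability of leaving.
stay2 = stay.copy()
stay2['leave_proba'] = pred_of_stay[:, 1]
print(stay2.sort_values('leave_proba', ascending=False).head(10))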
