Why Do You Quit Your Job?

This notebook tries to find out what makes people want to quit. Some interesting facts will also surface as we dig through the data.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set(style='white', color_codes=True)

In [2]:
data = pd.read_csv('HR_comma_sep.csv')

In [3]:
data.head()


Out[3]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low

In [4]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB

The data is pretty clean. Love it! No nulls, and most columns are already numeric. Let's take a look at the values in the data.
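As an extra sanity check (a small sketch, not one of the original cells), counting nulls per column should come back all zeros:

# Count missing values per column; data.info() above already suggests none.
print(data.isnull().sum())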


In [5]:
data['sales'].unique()


Out[5]:
array(['sales', 'accounting', 'hr', 'technical', 'support', 'management',
       'IT', 'product_mng', 'marketing', 'RandD'], dtype=object)

In [6]:
data['Work_accident'].unique()


Out[6]:
array([0, 1], dtype=int64)

In [7]:
data['left'].unique()


Out[7]:
array([1, 0], dtype=int64)

In [8]:
data['promotion_last_5years'].unique()


Out[8]:
array([0, 1], dtype=int64)

In [9]:
correlation = data.corr()
correlation


Out[9]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
satisfaction_level 1.000000 0.105021 -0.142970 -0.020048 -0.100866 0.058697 -0.388375 0.025605
last_evaluation 0.105021 1.000000 0.349333 0.339742 0.131591 -0.007104 0.006567 -0.008684
number_project -0.142970 0.349333 1.000000 0.417211 0.196786 -0.004741 0.023787 -0.006064
average_montly_hours -0.020048 0.339742 0.417211 1.000000 0.127755 -0.010143 0.071287 -0.003544
time_spend_company -0.100866 0.131591 0.196786 0.127755 1.000000 0.002120 0.144822 0.067433
Work_accident 0.058697 -0.007104 -0.004741 -0.010143 0.002120 1.000000 -0.154622 0.039245
left -0.388375 0.006567 0.023787 0.071287 0.144822 -0.154622 1.000000 -0.061788
promotion_last_5years 0.025605 -0.008684 -0.006064 -0.003544 0.067433 0.039245 -0.061788 1.000000

Visualisation of the correlations


In [10]:
plt.figure(figsize=(10, 10))
sb.heatmap(correlation, vmax=0.8, square=True, annot=True)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xa9aca58>

left vs. satisfaction_level shows the strongest negative correlation (about -0.39).
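To back that up, a small sketch (not in the original cells) ranking every feature by its correlation with left:

# Rank features by their correlation with 'left'; satisfaction_level
# should come out most negative.
print(correlation['left'].drop('left').sort_values())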

Let's see what a box plot can show.


In [11]:
sb.boxplot(x='left', y='satisfaction_level', data=data)


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xabca668>

Violin plot


In [12]:
sb.violinplot(x='left', y='satisfaction_level', data=data)


Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xc0636a0>

It looks like many more of the people who left had much lower satisfaction (no surprise there).


In [13]:
sb.FacetGrid(data, hue='left').map(sb.kdeplot, 'satisfaction_level').add_legend()


Out[13]:
<seaborn.axisgrid.FacetGrid at 0xc1596a0>

Separate the data by left/stayed and compare the group means.


In [14]:
left = data.groupby('left').mean()
left


Out[14]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident promotion_last_5years
left
0 0.666810 0.715473 3.786664 199.060203 3.380032 0.175009 0.026251
1 0.440098 0.718113 3.855503 207.419210 3.876505 0.047326 0.005321

More hours, fewer promotions... and they leave...
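The hours gap is easy to eyeball too; a sketch (beyond the original cells) splitting monthly hours by left:

# Compare the distribution of monthly hours for leavers vs. stayers.
sb.boxplot(x='left', y='average_montly_hours', data=data)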


In [15]:
salary = data.groupby('salary').mean()
salary


Out[15]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
salary
high 0.637470 0.704325 3.767179 199.867421 3.692805 0.155214 0.066289 0.058205
low 0.600753 0.717017 3.799891 200.996583 3.438218 0.142154 0.296884 0.009021
medium 0.621817 0.717322 3.813528 201.338349 3.529010 0.145361 0.204313 0.028079

In [16]:
sales = data.groupby('sales').sum()
sales


Out[16]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
sales
IT 758.46 879.55 4683 248119 4256 164 273 3
RandD 487.80 560.44 3033 158030 2650 134 121 27
accounting 446.51 550.49 2934 154292 2702 96 204 14
hr 442.52 523.84 2701 146828 2480 89 215 15
management 391.45 456.12 2432 126787 2711 103 91 69
marketing 530.76 614.23 3164 171073 3063 138 203 43
product_mng 558.91 644.71 3434 180369 3135 132 198 0
sales 2543.81 2938.23 15634 831773 14631 587 1014 100
support 1378.19 1611.81 8479 447490 7563 345 555 20
technical 1653.48 1961.39 10548 550793 9279 381 697 28

It looks very much like a sales-driven company.
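The sums above scale with department size, so plain head counts make the point more directly (a small sketch):

# Head count per department; 'sales' should dominate.
print(data['sales'].value_counts())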


In [17]:
sales_mean = data.groupby('sales').mean()
sales_mean


Out[17]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
sales
IT 0.618142 0.716830 3.816626 202.215974 3.468623 0.133659 0.222494 0.002445
RandD 0.619822 0.712122 3.853875 200.800508 3.367217 0.170267 0.153748 0.034307
accounting 0.582151 0.717718 3.825293 201.162973 3.522816 0.125163 0.265971 0.018253
hr 0.598809 0.708850 3.654939 198.684709 3.355886 0.120433 0.290934 0.020298
management 0.621349 0.724000 3.860317 201.249206 4.303175 0.163492 0.144444 0.109524
marketing 0.618601 0.715886 3.687646 199.385781 3.569930 0.160839 0.236597 0.050117
product_mng 0.619634 0.714756 3.807095 199.965632 3.475610 0.146341 0.219512 0.000000
sales 0.614447 0.709717 3.776329 200.911353 3.534058 0.141787 0.244928 0.024155
support 0.618300 0.723109 3.803948 200.758188 3.393001 0.154778 0.248991 0.008973
technical 0.607897 0.721099 3.877941 202.497426 3.411397 0.140074 0.256250 0.010294

Oh! Product management got zero promotions...

Management had the most promotions on average. Now go and get a management position!
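To see that at a glance, a sketch (assuming the sales_mean frame from In [17] above) plotting promotion rates per department:

# Promotion rate per department, sorted so management lands on top.
sales_mean['promotion_last_5years'].sort_values().plot(kind='barh', color='b')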


In [18]:
data.groupby('sales').mean()['satisfaction_level'].plot(kind='bar', color='b')


Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0xc63a320>

Model Training and Prediction


In [19]:
#from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
#from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#from sklearn.feature_selection import SelectFromModel

In [20]:
data2 = pd.get_dummies(data)
data2.head()


Out[20]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales_IT sales_RandD ... sales_hr sales_management sales_marketing sales_product_mng sales_sales sales_support sales_technical salary_high salary_low salary_medium
0 0.38 0.53 2 157 3 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
1 0.80 0.86 5 262 6 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 0.11 0.88 7 272 4 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
3 0.72 0.87 5 223 5 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0
4 0.37 0.52 2 159 3 0 1 0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0

5 rows × 21 columns
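One hedged aside: the one-hot columns above are perfectly collinear within each group (the sales_* columns sum to 1, as do the salary_* columns), which can matter for linear models. If that is a concern, pandas can drop one dummy per category:

# Alternative encoding (not used below): drop one dummy per category to
# avoid perfect collinearity in linear models such as logistic regression.
data2_alt = pd.get_dummies(data, drop_first=True)
print(data2_alt.shape)  # two fewer columns than data2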


In [21]:
d2_copy = data2.copy()  # take an actual copy so data2 stays untouched

In [22]:
y = d2_copy['left'].values
X = d2_copy.drop(['left'], axis=1).values

In [23]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.5)
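Note this split is unseeded, so the accuracies below will wobble slightly between runs. A reproducible, class-stratified variant (an alternative sketch, not what the results below use) would be:

# Reproducible alternative: fix the seed and stratify on the label so both
# halves keep the same leave/stay ratio. Not used below, so the printed
# accuracies stay tied to the unseeded split above.
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, test_size=0.5,
                                          random_state=42, stratify=y)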

Logistic Regression


In [24]:
LR = LogisticRegression(n_jobs=-1)
LR.fit(Xtrain, ytrain)
y_predict_proba = LR.predict_proba(Xtest)
y_predict = LR.predict(Xtest)

In [25]:
print(sum(y_predict == ytest) / float(len(ytest)))


0.799333333333
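For context, roughly 76% of employees stayed, so a classifier that always predicts 'stayed' already scores about 0.76; logistic regression only modestly beats that baseline. A one-line check (a sketch):

# Majority-class baseline: always predict 'stayed' (left == 0).
print(1 - ytest.mean())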

Support Vector Machine


In [26]:
SV = svm.SVC()
SV.fit(Xtrain, ytrain)
y_predict = SV.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))


0.943466666667
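Much better. One caveat: RBF SVMs are sensitive to feature scale, and here monthly hours are in the hundreds while satisfaction lives in [0, 1]. A hedged sketch that standardizes first, which may shift the score:

from sklearn.preprocessing import StandardScaler

# Standardize features so no single column dominates the RBF kernel.
scaler = StandardScaler().fit(Xtrain)
SV2 = svm.SVC()
SV2.fit(scaler.transform(Xtrain), ytrain)
print(SV2.score(scaler.transform(Xtest), ytest))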

Random Forest


In [27]:
rc = RandomForestClassifier()
param_grid = {'max_features': [10, 12, 14], 'n_estimators': [500]}

gs = GridSearchCV(estimator=rc, param_grid=param_grid, cv=2, n_jobs=-1, verbose=1)
gs.fit(Xtrain, ytrain)


Fitting 2 folds for each of 3 candidates, totalling 6 fits
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   48.5s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   48.5s finished
Out[27]:
GridSearchCV(cv=2, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [10, 12, 14], 'n_estimators': [500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [28]:
print(gs.best_score_)
print(gs.best_params_)


0.982931057474
{'max_features': 10, 'n_estimators': 500}

In [29]:
y_f = gs.predict(Xtest)

In [30]:
print(sum(y_f == ytest) / float(len(ytest)))


0.9868
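Accuracy alone can hide mistakes on the minority class (the leavers), which is the class we actually care about. A quick sketch with per-class precision and recall:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1; check the 'left' == 1 row in particular.
print(classification_report(ytest, y_f))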

In [31]:
column_name = data2.drop(['left'], axis=1).columns  # feature names must exclude the target
column_name


Out[31]:
Index([u'satisfaction_level', u'last_evaluation', u'number_project',
       u'average_montly_hours', u'time_spend_company', u'Work_accident',
       u'promotion_last_5years', u'sales_IT', u'sales_RandD',
       u'sales_accounting', u'sales_hr', u'sales_management',
       u'sales_marketing', u'sales_product_mng', u'sales_sales',
       u'sales_support', u'sales_technical', u'salary_high', u'salary_low',
       u'salary_medium'],
      dtype='object')

Feature Importance with Random Forest


In [32]:
# Reuse the best parameters found by the grid search above.
rc = RandomForestClassifier(**gs.best_params_)
forest = rc.fit(Xtrain, ytrain)
feature_imp = forest.feature_importances_

In [33]:
feature_importance_RF = pd.DataFrame(list(zip(column_name, feature_imp)))

In [34]:
new_col_name = ['feature', 'importance']
feature_importance_RF.columns = new_col_name
# Plot the six most important features rather than the first six rows.
top_RF = feature_importance_RF.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_RF)


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d14048>

Gradient Boosting


In [35]:
gb = GradientBoostingClassifier()
gb.fit(Xtrain,ytrain)
y_predict = gb.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))


0.974133333333

Feature Importance with Gradient Boosting


In [36]:
feature_importance_GC = pd.DataFrame(list(zip(column_name, gb.feature_importances_)))
feature_importance_GC.columns = new_col_name
top_GC = feature_importance_GC.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_GC)


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c5d860>

Who will leave the company?


In [37]:
# Predict leave probabilities for employees who have not (yet) left.
pred_of_stay = forest.predict_proba(X[y == 0])
stay = data[y == 0]
# Current employees the forest is fully confident will leave.
stay[pred_of_stay[:, 1] == 1]


Out[37]:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
6358 0.81 0.98 5 243 6 0 0 0 sales medium
6466 0.39 0.57 2 132 3 0 0 0 support low
7762 0.82 0.87 5 273 6 0 0 0 support medium
9781 0.42 0.50 2 151 3 0 0 0 sales low
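Filtering on a predicted probability of exactly 1 is quite strict; a sketch that instead ranks current employees by predicted leave probability gives a fuller watch list:

# Rank employees who have not left by predicted probability of leaving.
stay2 = stay.copy()
stay2['leave_proba'] = pred_of_stay[:, 1]
print(stay2.sort_values('leave_proba', ascending=False).head(10))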
