This notebook tries to find out why people quit. Along the way, a few interesting facts will surface as we dig through the data.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set(style='white', color_codes=True)
In [2]:
data = pd.read_csv('HR_comma_sep.csv')
In [3]:
data.head()
Out[3]:
In [4]:
data.info()
The data is pretty clean. Love it! No nulls, and most columns are already numeric. Let's take a look at the values in the data.
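A quick sanity check confirms this (a small addition, not part of the original run):

# Confirm there are no missing values and inspect the column dtypes
print(data.isnull().sum())
print(data.dtypes)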
In [5]:
data['sales'].unique()
Out[5]:
In [6]:
data['Work_accident'].unique()
Out[6]:
In [7]:
data['left'].unique()
Out[7]:
In [8]:
data['promotion_last_5years'].unique()
Out[8]:
In [9]:
correlation = data.corr()
correlation
Out[9]:
Visualisation of the correlations
In [10]:
plt.figure(figsize = (10,10))
sb.heatmap(correlation, vmax=0.8, square=True, annot=True)
Out[10]:
left vs satisfaction_level shows a strong negative correlation.
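The exact numbers can be read off the matrix computed above (a small addition for reference):

# Correlation of every numeric column with 'left', from most negative to most positive
correlation['left'].sort_values()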
Let's see what a box plot shows.
In [11]:
sb.boxplot(x='left', y='satisfaction_level', data=data)
Out[11]:
Violin plot
In [12]:
sb.violinplot(x='left', y='satisfaction_level', data=data)
Out[12]:
It looks like the people who left have, on the whole, much lower satisfaction (as expected).
In [13]:
sb.FacetGrid(data, hue='left').map(sb.kdeplot,'satisfaction_level').add_legend()
Out[13]:
Separate the data by left/stay.
In [14]:
left = data.groupby('left').mean()
left
Out[14]:
More hours, fewer promotions... and they leave.
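A quick bar plot of the grouped means makes the gap visible (an added sketch; the hours column keeps the dataset's own spelling, 'average_montly_hours'):

# Stayers (0) vs leavers (1): average monthly hours and promotion rate
left[['average_montly_hours', 'promotion_last_5years']].plot(kind='bar', subplots=True, legend=False)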
In [15]:
salary = data.groupby('salary').mean()
salary
Out[15]:
In [16]:
sales = data.groupby('sales').sum()
sales
Out[16]:
It looks more like a 'sales-type' company.
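Headcount per department supports that (an added check; note the department column is literally named 'sales' in this dataset):

# Number of employees per department
data['sales'].value_counts()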
In [17]:
sales_mean = data.groupby('sales').mean()
sales_mean
Out[17]:
In [18]:
data.groupby('sales').mean()['satisfaction_level'].plot(kind='bar', color ='b')
Out[18]:
In [19]:
#from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
#from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
#from sklearn.feature_selection import SelectFromModel
In [20]:
data2 = pd.get_dummies(data)
data2.head()
Out[20]:
In [21]:
d2_copy = data2.copy()  # take an explicit copy so the original data2 is left untouched
In [22]:
y=d2_copy['left'].values
X = d2_copy.drop(['left'], axis=1).values
In [23]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size = 0.5)
In [24]:
LR = LogisticRegression(n_jobs=-1)
LR.fit(Xtrain, ytrain)
y_predict_proba = LR.predict_proba(Xtest)
y_predict = LR.predict(Xtest)
In [25]:
print(sum(y_predict == ytest) / float(len(ytest)))
In [26]:
SV = svm.SVC()
SV.fit(Xtrain, ytrain)
y_predict = SV.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))
In [27]:
rc = RandomForestClassifier()
param_grid = {'max_features':[10,12,14], 'n_estimators' : [500]}
gs = GridSearchCV(estimator = rc, param_grid= param_grid, cv=2, n_jobs=-1, verbose=1)
gs.fit(Xtrain, ytrain)
Out[27]:
In [28]:
print(gs.best_score_)
print(gs.best_params_)
In [29]:
y_f = gs.predict(Xtest)
In [30]:
print(sum(y_f == ytest) / float(len(ytest)))
#print (sum(pd.DataFrame(y).idxmax(axis=1).values == ytest)/float(len(ytest)))
In [31]:
column_name = data2.columns
column_name
Out[31]:
Feature importance with Random Forest
In [32]:
#{'max_features': 12, 'n_estimators': 500}
rc = RandomForestClassifier(n_estimators=500, max_features=12)
forest = rc.fit(Xtrain, ytrain)
feature_imp = forest.feature_importances_
In [33]:
feature_importance_RF = pd.DataFrame(list(zip(column_name, feature_imp)))
In [34]:
new_col_name = ['feature', 'importance']
feature_importance_RF.columns = new_col_name
# Plot the six most important features, sorted so the strongest come first
top_rf = feature_importance_RF.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_rf)
Out[34]:
In [35]:
gb = GradientBoostingClassifier()
gb.fit(Xtrain,ytrain)
y_predict = gb.predict(Xtest)
print(sum(y_predict == ytest) / float(len(ytest)))
Feature importance with Gradient Boosting
In [36]:
feature_importance_GC = pd.DataFrame(list(zip(column_name, gb.feature_importances_)))
feature_importance_GC.columns = new_col_name
# Again sort so the six most important features are plotted
top_gb = feature_importance_GC.sort_values('importance', ascending=False)[:6]
sb.barplot(x='importance', y='feature', data=top_gb)
Out[36]:
In [37]:
pred_of_stay = forest.predict_proba(X[y == 0])
stay = data[y == 0]
# Employees who have not left but whom the forest is certain will leave
stay[pred_of_stay[:, 1] == 1]
Out[37]:
In [38]:
### Who will leave?
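One way to answer this, sketched here as an extension of the previous cell rather than part of the original run: score the employees who are still around with the fitted random forest and rank them by their predicted probability of leaving. The 0.5 cut-off is an assumption.

# Score current employees (y == 0) with the random forest fitted above
pred_of_current = forest.predict_proba(X[y == 0])
current = data[y == 0].copy()
current['leave_probability'] = pred_of_current[:, 1]

# Rank by predicted probability of leaving and show the ten most at risk
likely_to_leave = current.sort_values('leave_probability', ascending=False)
likely_to_leave[likely_to_leave['leave_probability'] > 0.5].head(10)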