IBM Employee Attrition Prediction

Introduction

Source notebook: https://github.com/ghvn7777/kaggle/blob/master/ibm_employee/predict_ibm_attrition.ipynb

Keeping employees happy and satisfied with the company is an age-old challenge. If you invest heavily in an employee who then leaves, you have to spend even more time and money hiring someone else. In the spirit of Kaggle, let's build a predictive model on the IBM dataset to predict IBM employee attrition.

This notebook covers the following:

  1. Exploratory Data Analysis: explore the distributions in the dataset, how the features relate to one another, and visualise them
  2. Feature Engineering and Categorical Encoding: perform some feature engineering and encode our categorical features as numerical dummy variables
  3. Implementing Machine Learning models: implement a Random Forest and a Gradient Boosting model, then look at which features are important in these models

Let's Go.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Import statements required for Plotly 
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from imblearn.over_sampling import SMOTE
import xgboost

# Import and suppress warnings
import warnings
warnings.filterwarnings('ignore')


/home/kaka/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning:

This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

1. Exploratory Data Analysis

Let's load the dataset with Pandas and take a quick look at the first few rows; the column we care most about is Attrition.


In [2]:
attrition = pd.read_csv('./inputs/WA_Fn-UseC_-HR-Employee-Attrition.csv')
attrition.head()


Out[2]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 1 80 0 8 0 1 6 4 0 5
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 4 80 1 10 3 3 10 7 1 7
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 2 80 0 7 3 3 0 0 0 0
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 3 80 0 8 3 3 8 7 3 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 4 80 1 6 3 3 2 2 2 2

5 rows × 35 columns

Looking at the dataset, our target column is Attrition.

Also, our data is a mix of categorical and numerical values. We will encode the non-numerical categories later. First, let's explore the dataset and check its completeness with a simple look for null or infinite values.

Data quality checks

We can use the isnull() function to check for null values.


In [3]:
#Looking for NaN
attrition.isnull().any()


Out[3]:
Age                         False
Attrition                   False
BusinessTravel              False
DailyRate                   False
Department                  False
DistanceFromHome            False
Education                   False
EducationField              False
EmployeeCount               False
EmployeeNumber              False
EnvironmentSatisfaction     False
Gender                      False
HourlyRate                  False
JobInvolvement              False
JobLevel                    False
JobRole                     False
JobSatisfaction             False
MaritalStatus               False
MonthlyIncome               False
MonthlyRate                 False
NumCompaniesWorked          False
Over18                      False
OverTime                    False
PercentSalaryHike           False
PerformanceRating           False
RelationshipSatisfaction    False
StandardHours               False
StockOptionLevel            False
TotalWorkingYears           False
TrainingTimesLastYear       False
WorkLifeBalance             False
YearsAtCompany              False
YearsInCurrentRole          False
YearsSinceLastPromotion     False
YearsWithCurrManager        False
dtype: bool
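
The text above also mentions infinite values, which isnull() does not catch. As a quick hedged follow-up (a minimal sketch, not a cell from the original notebook), we could test the numeric columns for finiteness:

import numpy as np
# np.isfinite flags inf, -inf and NaN; .all() reduces the result per column
np.isfinite(attrition.select_dtypes(include=[np.number])).all()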

Distribution of the dataset

A common early step is to explore how the features in the dataset are distributed. To do this, we call the kdeplot() function from the Seaborn plotting library and generate the bivariate plots below:


In [4]:
# Plotting the KDEplots
f, axes = plt.subplots(3, 3, figsize=(10, 10), sharex=False, sharey=False)

# Defining our colormap scheme
# (s was originally meant to drive the palette hues; in the end the start
# values are specified by hand in 0.333... steps below)
#s = np.linspace(0, 3, 10) # 10 evenly spaced values in [0, 3]
# cubehelix_palette creates a sequential palette; light sets the intensity of the
# lightest colour (1 is the lightest), and as_cmap=True returns a matplotlib colormap
cmap = sns.cubehelix_palette(start=0.0, light=1, as_cmap=True)

# Generate and plot
x = attrition['Age'].values
y = attrition['TotalWorkingYears'].values
# Plot a univariate or bivariate kernel density estimate; shade=True fills
# the contours when the data is bivariate
# cut=5 extends the evaluation grid a few bandwidths (bw, another kdeplot
# parameter that controls how closely the estimate fits the data) beyond the
# extreme data points; the larger cut is, the smaller the data region appears
# ax specifies which axes to draw on (defaults to the current axes)
sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=axes[0,0])
axes[0,0].set( title = 'Age against Total working years')

cmap = sns.cubehelix_palette(start=0.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['Age'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,1])
axes[0,1].set( title = 'Age against Daily Rate')

cmap = sns.cubehelix_palette(start=0.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsInCurrentRole'].values
y = attrition['Age'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,2])
axes[0,2].set( title = 'Years in role against Age')

cmap = sns.cubehelix_palette(start=1.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['DistanceFromHome'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[1,0])
axes[1,0].set( title = 'Daily Rate against DistancefromHome')

cmap = sns.cubehelix_palette(start=1.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[1,1])
axes[1,1].set( title = 'Daily Rate against Job satisfaction')

cmap = sns.cubehelix_palette(start=1.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[1,2])
axes[1,2].set( title = 'Years at company against Job satisfaction')

cmap = sns.cubehelix_palette(start=2.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[2,0])
axes[2,0].set( title = 'Years at company against Daily Rate')

cmap = sns.cubehelix_palette(start=2.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['RelationshipSatisfaction'].values
y = attrition['YearsWithCurrManager'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[2,1])
axes[2,1].set( title = 'Relationship Satisfaction vs years with manager')

cmap = sns.cubehelix_palette(start=2.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['WorkLifeBalance'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True,  ax=axes[2,2])
axes[2,2].set( title = 'WorklifeBalance against Satisfaction')

f.tight_layout()



In [5]:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
attrition["Attrition_numerical"] = attrition["Attrition"].apply(lambda x: target_map[x])

In [6]:
attrition


Out[6]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager Attrition_numerical
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 80 0 8 0 1 6 4 0 5 1
1 49 No Travel_Frequently 279 Research & Development 8 1 Life Sciences 1 2 ... 80 1 10 3 3 10 7 1 7 0
2 37 Yes Travel_Rarely 1373 Research & Development 2 2 Other 1 4 ... 80 0 7 3 3 0 0 0 0 1
3 33 No Travel_Frequently 1392 Research & Development 3 4 Life Sciences 1 5 ... 80 0 8 3 3 8 7 3 0 0
4 27 No Travel_Rarely 591 Research & Development 2 1 Medical 1 7 ... 80 1 6 3 3 2 2 2 2 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

1470 rows × 36 columns

Correlation of Features

Our next exploratory tool is the correlation matrix. By plotting a correlation matrix we get a good picture of how the features relate to one another. For a Pandas dataframe, the corr() method computes the Pearson correlation coefficient (a statistic that measures the degree of linear correlation between two variables) for every pair of numerical columns.

Here I will use the Heatmap() function from the Plotly library to plot the Pearson correlation matrix:


In [7]:
# creating a list of only numerical values
numerical = [u'Age', u'DailyRate', u'DistanceFromHome', u'Education', u'EmployeeNumber', u'EnvironmentSatisfaction',
       u'HourlyRate', u'JobInvolvement', u'JobLevel', u'JobSatisfaction',
       u'MonthlyIncome', u'MonthlyRate', u'NumCompaniesWorked',
       u'PercentSalaryHike', u'PerformanceRating', u'RelationshipSatisfaction',
       u'StockOptionLevel', u'TotalWorkingYears',
       u'TrainingTimesLastYear', u'WorkLifeBalance', u'YearsAtCompany',
       u'YearsInCurrentRole', u'YearsSinceLastPromotion',
       u'YearsWithCurrManager']
data = [
    go.Heatmap(
        z= attrition[numerical].astype(float).corr().values, # Generating the Pearson correlation
        x=attrition[numerical].columns.values,
        y=attrition[numerical].columns.values,
        colorscale='Viridis',
        reversescale = False, # reverse the color scale
        text = True,
        opacity = 1.0 # opacity
        
    )
]


layout = go.Layout(
    title='Pearson Correlation of numerical features',
    xaxis = dict(ticks='', nticks=36),
    yaxis = dict(ticks='' ),
    width = 900, height = 700,
    
)


fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')


Takeaway from the plots

From the plot above we can see that quite a few of the columns appear to be poorly correlated with one another. Generally, when building a predictive model, it is better to train it on features that are not strongly correlated with each other, so that we do not keep redundant data. Since we do have quite a few correlated features here, perhaps we could apply a technique such as PCA (Principal Component Analysis) to reduce the feature space, as sketched below.
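
As a hedged illustration of that idea (a sketch, not part of the original analysis), scikit-learn's PCA could compress the scaled numerical columns into a smaller set of uncorrelated components:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep only the numeric columns and leave the target out of the decomposition
numeric_part = attrition.select_dtypes(include=[np.number]).drop('Attrition_numerical', axis=1)
X_scaled = StandardScaler().fit_transform(numeric_part)  # PCA is scale-sensitive
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)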

Pairplot Visualisations

Now let's create some Seaborn pairplots, setting the Attrition column as the hue, to see how the distribution of each feature relates to attrition.


In [8]:
# Refining our list of numerical variables
numerical = [u'Age', u'DailyRate',  u'JobSatisfaction',
       u'MonthlyIncome', u'PerformanceRating',
        u'WorkLifeBalance', u'YearsAtCompany', u'Attrition_numerical']

# Pairplot left commented out here (rendering a large pairplot can be slow)
#g = sns.pairplot(attrition[numerical], hue='Attrition_numerical', palette='seismic', diag_kind = 'kde',diag_kws=dict(shade=True))
#g.set(xticklabels=[])

2. Feature Engineering & Categorical Encoding

We have done a brief exploration of the dataset; now let's turn to feature engineering and numerically encoding the categorical values. In a nutshell, feature engineering means creating new features and relationships from the ones we already have, and it is extremely important.

Before we start, we separate the numerical columns from the categorical ones using the dtype attribute.


In [9]:
attrition


Out[9]:
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education EducationField EmployeeCount EmployeeNumber ... StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager Attrition_numerical
0 41 Yes Travel_Rarely 1102 Sales 1 2 Life Sciences 1 1 ... 80 0 8 0 1 6 4 0 5 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1470 rows × 36 columns


In [10]:
# Drop the Attrition_numerical column from attrition dataset first - Don't want to include that
attrition = attrition.drop(['Attrition_numerical'], axis=1)

# Empty list to store columns with categorical data
categorical = []
for col, value in attrition.iteritems():
    if value.dtype == 'object':
        categorical.append(col)

# Store the numerical columns in a list numerical
print(categorical)
numerical = attrition.columns.difference(categorical)


['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']

In [11]:
numerical


Out[11]:
Index(['Age', 'DailyRate', 'DistanceFromHome', 'Education', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobSatisfaction', 'MonthlyIncome',
       'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
       'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
       'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

Having identified which of our features contain categorical data, we can encode them numerically using Pandas' get_dummies() method.


In [12]:
# Store the categorical data in a dataframe called attrition_cat
attrition_cat = attrition[categorical] # extract the non-numerical columns
attrition_cat = attrition_cat.drop(['Attrition'], axis=1) # Dropping the target column
print(attrition_cat)


         BusinessTravel              Department    EducationField  Gender  \
0         Travel_Rarely                   Sales     Life Sciences  Female   
1     Travel_Frequently  Research & Development     Life Sciences    Male   
2         Travel_Rarely  Research & Development             Other    Male   
3     Travel_Frequently  Research & Development     Life Sciences  Female   
4         Travel_Rarely  Research & Development           Medical    Male   
...                 ...                     ...               ...     ...   

                        JobRole MaritalStatus Over18 OverTime  
0               Sales Executive        Single      Y      Yes  
1            Research Scientist       Married      Y       No  
2         Laboratory Technician        Single      Y      Yes  
3            Research Scientist       Married      Y      Yes  
4         Laboratory Technician       Married      Y       No  
...                         ...           ...    ...      ...  

[1470 rows x 8 columns]

Applying the get_dummies() method performs the encoding automatically, and we can conveniently inspect the encoded result with the following code:


In [13]:
attrition_cat = pd.get_dummies(attrition_cat)
attrition_cat.head(3)


Out[13]:
BusinessTravel_Non-Travel BusinessTravel_Travel_Frequently BusinessTravel_Travel_Rarely Department_Human Resources Department_Research & Development Department_Sales EducationField_Human Resources EducationField_Life Sciences EducationField_Marketing EducationField_Medical ... JobRole_Research Director JobRole_Research Scientist JobRole_Sales Executive JobRole_Sales Representative MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Over18_Y OverTime_No OverTime_Yes
0 0 0 1 0 0 1 0 1 0 0 ... 0 0 1 0 0 0 1 1 0 1
1 0 1 0 0 1 0 0 1 0 0 ... 0 1 0 0 0 1 0 1 1 0
2 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 1

3 rows × 29 columns

Extract the numerical columns:


In [14]:
# Store the numerical features to a dataframe attrition_num
attrition_num = attrition[numerical]

We have encoded the non-numerical variables and extracted the numerical ones; now we merge them into the final training data.


In [15]:
# Concat the two dataframes together columnwise
attrition_final = pd.concat([attrition_num, attrition_cat], axis=1)

Target variable

Finally, we need the target variable, given by the Attrition column. We encode it so that 1 represents Yes and 0 represents No.


In [16]:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
target = attrition["Attrition"].apply(lambda x: target_map[x])
target.head(3)


Out[16]:
0    1
1    0
2    1
Name: Attrition, dtype: int64

However, if we check the counts of Yes and No, we find that the data is heavily skewed:


In [17]:
data = [go.Bar(
            x=attrition["Attrition"].value_counts().index.values,
            y= attrition["Attrition"].value_counts().values
    )]

py.iplot(data, filename='basic-bar')


So our data is imbalanced. There are many ways to deal with class imbalance; here we will use the SMOTE oversampling technique, whose core idea is sketched below.
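
Before using the library version further down, here is a minimal sketch of the idea behind SMOTE (illustrative only, not imblearn's implementation, and assuming X_minority is a 2-D NumPy array of minority-class rows): a synthetic sample is created by interpolating between a minority point and one of its k nearest minority neighbours.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_one_sample(X_minority, k=5, seed=0):
    rng = np.random.RandomState(seed)
    # k+1 neighbours because the nearest neighbour of a point is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = rng.randint(len(X_minority))
    _, idx = nn.kneighbors(X_minority[i:i + 1])
    j = idx[0][rng.randint(1, k + 1)]   # pick one of the k real neighbours
    gap = rng.rand()                    # interpolation factor in [0, 1)
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])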

3. Implementing Machine Learning Models

Having done some exploratory analysis and simple feature engineering, and having made sure all of our data is encoded, we can now build our models.

At the start of this notebook we said the goal was to evaluate and compare the performance of a few different models.

Splitting the data into training and test sets

Before we train, we need a training set and a test set. Unlike in Kaggle competitions, where separate train and test sets are usually provided, here we split the data ourselves with sklearn.


In [18]:
# Import the train_test_split method
# (sklearn.cross_validation is deprecated; in sklearn >= 0.18 these live in sklearn.model_selection)
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import StratifiedShuffleSplit

# Split data into train and test sets as well as for validation and testing
train, test, target_train, target_val = train_test_split(attrition_final, target, train_size= 0.75,random_state=0);
#train, test, target_train, target_val = StratifiedShuffleSplit(attrition_final, target, random_state=0);

SMOTE to oversample due to the skewness in target

Since we have already noted the imbalance in the target values, let's oversample the minority class using SMOTE from the imblearn package.


In [19]:
oversampler = SMOTE(random_state=0)
# (fit_sample is renamed fit_resample in newer imblearn versions)
smote_train, smote_target = oversampler.fit_sample(train, target_train)

A. Random Forest Classifier

The random forest classifier builds on the ubiquitous decision tree. On its own, a decision tree is usually considered a "weak learner", as its predictive performance is relatively poor. A random forest, however, gathers a collection of decision trees and combines their predictive power into a "strong learner", as sketched below.
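
As a hedged sketch of that combining idea (illustrative only, not scikit-learn's implementation, and assuming NumPy arrays with binary 0/1 labels), a forest can be thought of as many trees trained on bootstrap samples whose predictions are majority-voted:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tiny_forest_predict(X_train, y_train, X_test, n_trees=25, seed=0):
    rng = np.random.RandomState(seed)
    votes = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(X_train), len(X_train))  # bootstrap sample of the rows
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
        tree.fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    # Majority vote across the trees
    return (np.mean(votes, axis=0) >= 0.5).astype(int)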

Initialising Random Forest parameters

We will use the RandomForestClassifier from the scikit-learn library; first we define our parameters:


In [20]:
seed = 0   # We set our random seed to zero for reproducibility
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 800,
    'warm_start': True, 
    'max_depth': 9,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'random_state' : seed,
    'verbose': 0
}

We initialise the random forest by passing these parameters into scikit-learn's RandomForestClassifier():


In [21]:
rf = RandomForestClassifier(**rf_params)

Now we train it:


In [22]:
rf.fit(smote_train, smote_target)
print("Fitting of Random Forest as finished")


Fitting of Random Forest has finished

Now we can make predictions on the test data:


In [24]:
rf_predictions = rf.predict(test)
print("Predictions finished")


Predictions finished

Scoring the predictions:


In [25]:
accuracy_score(target_val, rf_predictions)


Out[25]:
0.87771739130434778

Accuracy of the model

We observe that the random forest classifier reaches about 88% accuracy. At first glance this looks like a very good model, but if we consider that the class distribution is roughly 84% No to 16% Yes, a classifier that always predicted No would already score about 84%, so this model is only slightly better than guessing the majority class (see the quick check below).
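
Since accuracy alone is misleading on imbalanced data, a hedged sanity check (a sketch, not a cell from the original notebook) is to compare against the majority-class baseline and look at per-class precision and recall:

from sklearn.metrics import classification_report

# Baseline: always predict the majority class of the validation target
baseline = max(target_val.mean(), 1 - target_val.mean())
print("Majority-class baseline accuracy: {:.3f}".format(baseline))
print(classification_report(target_val, rf_predictions, target_names=['No', 'Yes']))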

Feature Ranking via the Random Forest

The sklearn random forest classifier comes with a very convenient and useful attribute, feature_importances_, which tells us which features are most important to the forest algorithm. The plot below shows the most important features:


In [26]:
# Scatter plot 
trace = go.Scatter(
    y = rf.feature_importances_,
    x = attrition_final.columns.values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = rf.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = attrition_final.columns.values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')


Most RF important features: Overtime, Marital Status

The plot above shows the features that matter most: the algorithm ranks the OverTime feature highest in importance, followed by marital status.

I don't know which features matter most to you, but for me overtime really does affect how satisfied I am with a job, so perhaps we should not be surprised that our classifier picked up on this and ranked overtime at the top.

Visualising Tree Diagram with Graphviz

Let's visualise a tree. We can train a single DecisionTreeClassifier and use the export_graphviz() function to render it as a PNG image:


In [27]:
from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re

decision_tree = tree.DecisionTreeClassifier(max_depth = 4)
decision_tree.fit(train, target_train)

# Predicting results for test dataset
y_pred = decision_tree.predict(test)

# Export our trained model as a .dot file
with open("tree1.dot", 'w') as f:
     f = tree.export_graphviz(decision_tree,
                              out_file=f,
                              max_depth = 4,
                              impurity = False,
                              feature_names = attrition_final.columns.values,
                              class_names = ['No', 'Yes'],
                              rounded = True,
                              filled= True )
        
#Convert .dot to .png to allow display in web notebook
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])

# Annotating chart with PIL
img = Image.open("tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")


Out[27]:
[decision tree diagram rendered from sample-out.png]
B. Gradient Boosted Classifier

Gradient boosting is another ensemble technique, much like the random forest, that combines weak tree learners into a strong learner. The technique involves defining a loss function and adding models that minimise it; as the name suggests, each new learner is fitted along the negative gradient of the loss, the direction that reduces the loss function the most. A minimal sketch of this idea follows.
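
As a hedged sketch of that gradient idea (for squared loss, illustrative only, not sklearn's implementation, and assuming X and y are NumPy arrays): each round fits a small tree to the residuals of the current ensemble, i.e. the negative gradient of the loss, and takes a small step in that direction.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tiny_boost(X, y, n_rounds=50, lr=0.1):
    pred = np.full(len(y), float(np.mean(y)))  # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                    # negative gradient of squared loss
        t = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred = pred + lr * t.predict(X)        # small step toward lower loss
        trees.append(t)
    return trees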

Using the Gradient Boosted classifier in sklearn is very simple and takes only a few lines of code; we first set the classifier's parameters:

Initialising Gradient Boosting Parameters

In general, there are a few key parameters when setting up a gradient boosting classifier: the number of estimators, the maximum depth of the model, and the minimum samples per leaf.


In [28]:
# Gradient Boosting Parameters
gb_params = {
    'n_estimators': 500,
    'learning_rate' : 0.2,
    'max_depth': 11,
    'min_samples_leaf': 2,
    'subsample': 1,
    'max_features' : 'sqrt',
    'random_state' : seed,
    'verbose': 0
}

With the parameters defined, we can train, predict, and score:


In [30]:
gb = GradientBoostingClassifier(**gb_params)
# Fit the model to our SMOTEd train and target
gb.fit(smote_train, smote_target)
# Get our predictions
gb_predictions = gb.predict(test)
print("Predictions have finished")
accuracy_score(target_val, gb_predictions)


Predictions have finished
Out[30]:
0.88858695652173914

Feature Ranking via the Gradient Boosting Model

Let's look at the most important features for the Gradient Boosting Model:


In [33]:
# Scatter plot 
trace = go.Scatter(
    y = gb.feature_importances_,
    x = attrition_final.columns.values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 13,
        #size= rf.feature_importances_,
        #color = np.random.randn(500), #set color equal to a variable
        color = gb.feature_importances_,
        colorscale='Portland',
        showscale=True
    ),
    text = attrition_final.columns.values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Model Feature Importance',
    hovermode= 'closest',
     xaxis= dict(
         ticklen= 5,
         showgrid=False,
        zeroline=False,
        showline=False
     ),
    yaxis=dict(
        title= 'Feature Importance',
        showgrid=False,
        zeroline=False,
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter')


GBM most important features

Monthly Income, Overtime, Daily and Monthly Rate

CONCLUSION

We briefly analysed the employee attributes, applied some feature engineering, implemented two algorithms, and reached about 89% accuracy.

There is still room for improvement: we could apply more feature engineering to the data, and we could make the model more accurate by blending models, for example running several models at once and voting on their combined predictions, as sketched below.
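
As a hedged sketch of that voting idea (reusing the parameters and data from the cells above; a sketch, not a tuned model), scikit-learn's VotingClassifier can average the predicted class probabilities of both models:

from sklearn.ensemble import VotingClassifier

voter = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(**rf_params)),
                ('gb', GradientBoostingClassifier(**gb_params))],
    voting='soft')  # 'soft' averages class probabilities instead of hard votes
voter.fit(smote_train, smote_target)
print(accuracy_score(target_val, voter.predict(test)))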