Notebook address: https://github.com/ghvn7777/kaggle/blob/master/ibm_employee/predict_ibm_attrition.ipynb
Keeping employees happy and satisfied with the company is an age-old challenge. If you invest heavily in an employee who then leaves, you have to spend even more time hiring someone else. In the spirit of Kaggle, let's build a predictive model on the IBM dataset to predict IBM employee attrition.
This notebook covers: exploratory data analysis (KDE plots and a Pearson correlation heatmap), encoding the categorical features, rebalancing the target with SMOTE, and training and interpreting Random Forest and Gradient Boosting classifiers.
Let's go.
In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Import statements required for Plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from imblearn.over_sampling import SMOTE
import xgboost
# Import and suppress warnings
import warnings
warnings.filterwarnings('ignore')
In [2]:
attrition = pd.read_csv('./inputs/WA_Fn-UseC_-HR-Employee-Attrition.csv')
attrition.head()
Out[2]:
In [3]:
#Looking for NaN
attrition.isnull().any()
Out[3]:
In [4]:
# Plotting the KDE plots
f, axes = plt.subplots(3, 3, figsize=(10, 10), sharex=False, sharey=False)
# Defining our colormap scheme
# s was originally meant to generate the palette start values, but in the end
# they were specified by hand in steps of 0.333...
#s = np.linspace(0, 3, 10) # 10 evenly spaced numbers over [0, 3]
# Create a palette: light sets the intensity of the lightest color (1 is lightest),
# and as_cmap=True returns a matplotlib colormap
cmap = sns.cubehelix_palette(start=0.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['Age'].values
y = attrition['TotalWorkingYears'].values
# Draw a univariate or bivariate kernel density estimate; shade=True fills the
# contours when the data is bivariate
# cut=5 extends the evaluation grid 5 bandwidths (bw, another kdeplot parameter
# that controls how closely the estimate fits the data) past the extreme data points;
# the larger cut is, the smaller the plot and the denser the data appears
# The ax parameter selects the axes to draw on; the current axes is the default
sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=axes[0,0])
axes[0,0].set( title = 'Age against Total working years')
cmap = sns.cubehelix_palette(start=0.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['Age'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,1])
axes[0,1].set( title = 'Age against Daily Rate')
cmap = sns.cubehelix_palette(start=0.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsInCurrentRole'].values
y = attrition['Age'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[0,2])
axes[0,2].set( title = 'Years in role against Age')
cmap = sns.cubehelix_palette(start=1.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['DistanceFromHome'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,0])
axes[1,0].set( title = 'Daily Rate against Distance from Home')
cmap = sns.cubehelix_palette(start=1.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['DailyRate'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,1])
axes[1,1].set( title = 'Daily Rate against Job satisfaction')
cmap = sns.cubehelix_palette(start=1.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[1,2])
axes[1,2].set( title = 'Years at company against Job satisfaction')
cmap = sns.cubehelix_palette(start=2.0, light=1, as_cmap=True)
# Generate and plot
x = attrition['YearsAtCompany'].values
y = attrition['DailyRate'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,0])
axes[2,0].set( title = 'Years at company against Daily Rate')
cmap = sns.cubehelix_palette(start=2.333333333333, light=1, as_cmap=True)
# Generate and plot
x = attrition['RelationshipSatisfaction'].values
y = attrition['YearsWithCurrManager'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,1])
axes[2,1].set( title = 'Relationship Satisfaction vs years with manager')
cmap = sns.cubehelix_palette(start=2.666666666667, light=1, as_cmap=True)
# Generate and plot
x = attrition['WorkLifeBalance'].values
y = attrition['JobSatisfaction'].values
sns.kdeplot(x, y, cmap=cmap, shade=True, ax=axes[2,2])
axes[2,2].set( title = 'Work-life Balance against Job Satisfaction')
f.tight_layout()
In [5]:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
attrition["Attrition_numerical"] = attrition["Attrition"].apply(lambda x: target_map[x])
In [6]:
attrition
Out[6]:
In [7]:
# creating a list of only numerical values
numerical = [u'Age', u'DailyRate', u'DistanceFromHome', u'Education', u'EmployeeNumber', u'EnvironmentSatisfaction',
u'HourlyRate', u'JobInvolvement', u'JobLevel', u'JobSatisfaction',
u'MonthlyIncome', u'MonthlyRate', u'NumCompaniesWorked',
u'PercentSalaryHike', u'PerformanceRating', u'RelationshipSatisfaction',
u'StockOptionLevel', u'TotalWorkingYears',
u'TrainingTimesLastYear', u'WorkLifeBalance', u'YearsAtCompany',
u'YearsInCurrentRole', u'YearsSinceLastPromotion',
u'YearsWithCurrManager']
data = [
go.Heatmap(
z= attrition[numerical].astype(float).corr().values, # Generating the Pearson correlation
x=attrition[numerical].columns.values,
y=attrition[numerical].columns.values,
colorscale='Viridis',
reversescale = False, # whether to reverse the colorscale
#text = True, # Heatmap's text expects an array of labels; newer plotly versions reject a bare True
opacity = 1.0 # opacity of the heatmap
)
]
layout = go.Layout(
title='Pearson Correlation of numerical features',
xaxis = dict(ticks='', nticks=36),
yaxis = dict(ticks='' ),
width = 900, height = 700,
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='labelled-heatmap')
In [8]:
# Refining our list of numerical variables
numerical = [u'Age', u'DailyRate', u'JobSatisfaction',
u'MonthlyIncome', u'PerformanceRating',
u'WorkLifeBalance', u'YearsAtCompany', u'Attrition_numerical']
#g = sns.pairplot(attrition[numerical], hue='Attrition_numerical', palette='seismic', diag_kind = 'kde',diag_kws=dict(shade=True))
#g.set(xticklabels=[])
In [9]:
attrition
Out[9]:
In [10]:
# Drop the Attrition_numerical column from attrition dataset first - Don't want to include that
attrition = attrition.drop(['Attrition_numerical'], axis=1)
# Empty list to store columns with categorical data
categorical = []
for col, value in attrition.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    if value.dtype == 'object':
        categorical.append(col)
print(categorical)
# Store the numerical columns in a list called numerical
numerical = attrition.columns.difference(categorical)
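The same split can be obtained more compactly with pandas' select_dtypes (a minimal equivalent sketch):
categorical = attrition.select_dtypes(include='object').columns.tolist()
numerical = attrition.columns.difference(categorical)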
In [11]:
numerical
Out[11]:
Having confirmed which of our features contain categorical data, we can encode them numerically; Pandas' get_dummies() method makes this easy.
In [12]:
# Store the categorical data in a dataframe called attrition_cat
attrition_cat = attrition[categorical] # extract the non-numerical columns
attrition_cat = attrition_cat.drop(['Attrition'], axis=1) # Dropping the target column
print(attrition_cat)
Applying the get_dummies() method encodes the categories automatically, and we can conveniently inspect the encoded result with the following code:
In [13]:
attrition_cat = pd.get_dummies(attrition_cat)
attrition_cat.head(3)
Out[13]:
Extract the numerical columns:
In [14]:
# Store the numerical features to a dataframe attrition_num
attrition_num = attrition[numerical]
We have encoded the non-numerical variables and extracted the numerical ones; now we merge them into the final training data.
In [15]:
# Concat the two dataframes together columnwise
attrition_final = pd.concat([attrition_num, attrition_cat], axis=1)
In [16]:
# Define a dictionary for the target mapping
target_map = {'Yes':1, 'No':0}
# Use the pandas apply method to numerically encode our attrition target variable
target = attrition["Attrition"].apply(lambda x: target_map[x])
target.head(3)
Out[16]:
However, if we check the counts of Yes and No, we find the data is heavily imbalanced:
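The imbalance can also be quantified directly before plotting (a minimal sketch):
# Share of each class in the target; in this dataset roughly 84% No vs 16% Yes
print(attrition["Attrition"].value_counts(normalize=True))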
In [17]:
data = [go.Bar(
x=attrition["Attrition"].value_counts().index.values,
y= attrition["Attrition"].value_counts().values
)]
py.iplot(data, filename='basic-bar')
In [18]:
# Import the train_test_split method
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed; model_selection is the current module
from sklearn.model_selection import StratifiedShuffleSplit
# Split data into train and test sets as well as for validation and testing
train, test, target_train, target_val = train_test_split(attrition_final, target, train_size=0.75, random_state=0)
#train, test, target_train, target_val = StratifiedShuffleSplit(attrition_final, target, random_state=0);
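Since the classes are imbalanced, it may be worth passing stratify=target so that both splits keep the same Yes/No ratio; a minimal alternative sketch:
train, test, target_train, target_val = train_test_split(
    attrition_final, target, train_size=0.75, stratify=target, random_state=0)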
In [19]:
oversampler = SMOTE(random_state=0)
# fit_resample is named fit_sample in older imbalanced-learn releases
smote_train, smote_target = oversampler.fit_resample(train, target_train)
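A quick sanity check that SMOTE actually balanced the training classes (a minimal sketch reusing the variables above):
import collections
print(collections.Counter(target_train))   # original, imbalanced counts
print(collections.Counter(smote_target))   # both classes equal after SMOTE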
In [20]:
seed = 0 # We set our random seed to zero for reproducibility
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 800,
    'warm_start': True,
    'max_depth': 9,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',  # the original dict listed 'max_features' twice (0.3 and 'sqrt'); Python keeps the later 'sqrt'
    'random_state': seed,
    'verbose': 0
}
We can initialize the random forest with scikit-learn's RandomForestClassifier() and pass in the parameters:
In [21]:
rf = RandomForestClassifier(**rf_params)
Now we start training:
In [22]:
rf.fit(smote_train, smote_target)
print("Fitting of Random Forest as finished")
Now we can make predictions on the test data:
In [24]:
rf_predictions = rf.predict(test)
print("Predictions finished")
Score the predictions:
In [25]:
accuracy_score(target_val, rf_predictions)
Out[25]:
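Because the test split keeps the original class imbalance, accuracy alone can be misleading here; a confusion matrix and per-class metrics give a fuller picture (a minimal sketch using the predictions above):
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(target_val, rf_predictions))
print(classification_report(target_val, rf_predictions, target_names=['No', 'Yes']))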
In [26]:
# Scatter plot
trace = go.Scatter(
y = rf.feature_importances_,
x = attrition_final.columns.values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 13,
#size= rf.feature_importances_,
#color = np.random.randn(500), #set color equal to a variable
color = rf.feature_importances_,
colorscale='Portland',
showscale=True
),
text = attrition_final.columns.values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Random Forest Feature Importance',
hovermode= 'closest',
xaxis= dict(
ticklen= 5,
showgrid=False,
zeroline=False,
showline=False
),
yaxis=dict(
title= 'Feature Importance',
showgrid=False,
zeroline=False,
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
From the plot above we can see which features matter most to the model: the algorithm ranks the OverTime feature highest, followed by marital status.
I don't know which factor matters most to you, but for me overtime genuinely affects how satisfied I am with my job, so perhaps we should not be surprised that the classifier ranks overtime importance at the top.
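To read the top features off numerically rather than from the scatter plot, we can sort the importances (a minimal sketch using the fitted rf):
importances = pd.Series(rf.feature_importances_, index=attrition_final.columns)
print(importances.sort_values(ascending=False).head(10))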
Let's display one of our trees. We can fit a single DecisionTreeClassifier and use the export_graphviz() function to render it as a PNG image:
In [27]:
from sklearn import tree
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re
decision_tree = tree.DecisionTreeClassifier(max_depth = 4)
decision_tree.fit(train, target_train)
# Predicting results for test dataset
y_pred = decision_tree.predict(test)
# Export our trained model as a .dot file
with open("tree1.dot", 'w') as f:
f = tree.export_graphviz(decision_tree,
out_file=f,
max_depth = 4,
impurity = False,
feature_names = attrition_final.columns.values,
class_names = ['No', 'Yes'],
rounded = True,
filled= True )
# Convert .dot to .png for display in the web notebook (requires the Graphviz 'dot' binary on the PATH)
check_call(['dot','-Tpng','tree1.dot','-o','tree1.png'])
# Annotating chart with PIL (a draw handle is created, though no annotation is added before saving)
img = Image.open("tree1.png")
draw = ImageDraw.Draw(img)
img.save('sample-out.png')
PImage("sample-out.png")
Out[27]:
Gradient boosting is an ensemble technique, much like the random forest, that combines weak tree learners into a single strong learner. The technique involves defining some method (algorithm) to minimize a loss function. As the name suggests, the minimization is done by gradient descent: each step moves in the direction that reduces the value of the loss function.
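To make the idea concrete, here is a minimal, self-contained sketch (not from the original notebook) of boosting with squared loss, where each new tree is fit to the residuals, i.e. the negative gradient of the loss, and added to the ensemble with a small learning rate:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())      # start from a constant model
for _ in range(100):
    residuals = y - pred              # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # step in the direction that reduces the loss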
Using the Gradient Boosted classifier in sklearn is very simple and takes only a few lines of code. First we set the classifier parameters.
Generally, there are a few key parameters when setting up a gradient boosting classifier: the number of estimators, the maximum depth of the model, and the minimum number of samples per leaf.
In [28]:
# Gradient Boosting Parameters
gb_params = {
    'n_estimators': 500,
    'learning_rate': 0.2,
    'max_depth': 11,
    'min_samples_leaf': 2,
    'subsample': 1,
    'max_features': 'sqrt',  # the original dict listed 'max_features' twice (0.9 and 'sqrt'); Python keeps the later 'sqrt'
    'random_state': seed,
    'verbose': 0
}
With the parameters defined, we can train, predict, and score:
In [30]:
gb = GradientBoostingClassifier(**gb_params)
# Fit the model to our SMOTEd train and target
gb.fit(smote_train, smote_target)
# Get our predictions
gb_predictions = gb.predict(test)
print("Predictions have finished")
accuracy_score(target_val, gb_predictions)
Out[30]:
In [33]:
# Scatter plot
trace = go.Scatter(
y = gb.feature_importances_,
x = attrition_final.columns.values,
mode='markers',
marker=dict(
sizemode = 'diameter',
sizeref = 1,
size = 13,
#size = gb.feature_importances_,
#color = np.random.randn(500), #set color equal to a variable
color = gb.feature_importances_,
colorscale='Portland',
showscale=True
),
text = attrition_final.columns.values
)
data = [trace]
layout= go.Layout(
autosize= True,
title= 'Gradient Boosting Model Feature Importance',
hovermode= 'closest',
xaxis= dict(
ticklen= 5,
showgrid=False,
zeroline=False,
showline=False
),
yaxis=dict(
title= 'Feature Importance',
showgrid=False,
zeroline=False,
ticklen= 5,
gridwidth= 2
),
showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter')