In [2]:

    
# %load nbinit.py
from IPython.display import HTML
HTML("""
<style>
.container { width: 100% !important; padding-left: 1em; padding-right: 2em; }
div.output_stderr { background: #FFA; }
</style>
""")









    Out[2]:

Please, rename this file to HW6.ipynb and save it in MSA8010F16/HW6

Homework 6: Preprocessing Data

We use a data set from the UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Bank+Marketing to experiment with a Decision Tree classifier http://www.saedsayad.com/decision_tree.htm

Scikit-Learn: http://scikit-learn.org/stable/modules/tree.html#tree

Book slides:

Bank Marketing Data Set

The data relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Data Set Information:

The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

  1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
  2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs
  3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs)
  4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs)

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

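Attribute Information (as documented for the bank-additional files; the older bank.csv loaded below differs slightly, e.g. it contains balance and day but neither day_of_week nor the social and economic context attributes):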

Input variables:

bank client data:

  1 age (numeric)
  2 job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
  3 marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
  4 education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
  5 default: has credit in default? (categorical: 'no','yes','unknown')
  6 housing: has housing loan? (categorical: 'no','yes','unknown')
  7 loan: has personal loan? (categorical: 'no','yes','unknown')

related with the last contact of the current campaign:

  8 contact: contact communication type (categorical: 'cellular','telephone') 
  9 month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
  10 day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
  11 duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

  12 campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13 pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14 previous: number of contacts performed before this campaign and for this client (numeric)
  15 poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

social and economic context attributes

  16 emp.var.rate: employment variation rate - quarterly indicator (numeric)
  17 cons.price.idx: consumer price index - monthly indicator (numeric) 
  18 cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
  19 euribor3m: euribor 3 month rate - daily indicator (numeric)
  20 nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target): 21 y - has the client subscribed a term deposit? (binary: 'yes','no')



In [3]:

    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt



In [4]:

    
DATAFILE = '/home/data/archive.ics.uci.edu/BankMarketing/bank.csv'
###DATAFILE = 'data/bank.csv'  ### using locally



In [5]:

    
df = pd.read_csv(DATAFILE, sep=';')
list(df.columns)









    Out[5]:





['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'y']

Step 1: Investigate Data Set

We have a number of categorical variables: What is their cardinality? How are the levels distributed?
What is the distribution of the numeric values? Do we see any correlations?

Let's first look at columns (i.e. variables) with continuous values. We can get a sense of the distribution from aggregate functions like mean, standard deviation, quantiles, as well as minimum and maximum values.

The Pandas method describe creates a table view of those metrics. (The method can also be used to identify the numeric features in the data frame.)
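
As a minimal sketch of that idea, the following uses a tiny made-up frame (the column names mirror bank.csv, but the values are illustrative only) to show how describe and select_dtypes can separate numeric from categorical columns. The variable names are assumptions for this example.

```python
import pandas as pd

# Hypothetical toy data; not taken from bank.csv.
toy = pd.DataFrame({
    "age": [30, 33, 35, 59],
    "balance": [1787, 4789, 1350, 0],
    "job": ["unemployed", "services", "management", "blue-collar"],
    "y": ["no", "no", "no", "yes"],
})

# describe() summarizes only the numeric columns of a mixed frame,
# so its column index doubles as a list of numeric features.
numeric_via_describe = list(toy.describe().columns)

# select_dtypes states the same intent explicitly.
numeric_cols = list(toy.select_dtypes(include="number").columns)
categorical_cols = list(toy.select_dtypes(exclude="number").columns.drop("y"))

print(numeric_via_describe)
print(numeric_cols)
print(categorical_cols)
```

Both approaches should agree on a frame like this one; select_dtypes is arguably easier to read because it does not compute statistics just to obtain column names.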



In [6]:

    
### use sets and the difference operator '-' (A-B); there is also a symmetric difference operator '^'
all_features = set(df.columns)-set(['y'])
num_features = set(df.describe().columns)
cat_features = all_features-num_features

print("All features:         ", ", ".join(all_features), "\nNumerical features:   ", ", ".join(num_features), "\nCategorical features: ", ", ".join(cat_features))









    



All features:          balance, default, campaign, age, housing, contact, marital, month, previous, job, education, day, pdays, loan, duration, poutcome 
Numerical features:    balance, day, pdays, campaign, age, duration, previous 
Categorical features:  job, education, default, loan, housing, contact, poutcome, marital, month



In [26]:

    
set(df.columns)-set(df.describe().columns)-set('y')









    Out[26]:





{'contact',
 'default',
 'education',
 'housing',
 'job',
 'loan',
 'marital',
 'month',
 'poutcome',
 'y'}



In [7]:

    
### Describe Columns
help(pd.DataFrame.describe)









    



Help on function describe in module pandas.core.generic:

describe(self, percentiles=None, include=None, exclude=None)
    Generate various summary statistics, excluding NaN values.
    
    Parameters
    ----------
    percentiles : array-like, optional
        The percentiles to include in the output. Should all
        be in the interval [0, 1]. By default `percentiles` is
        [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
    include, exclude : list-like, 'all', or None (default)
        Specify the form of the returned result. Either:
    
        - None to both (default). The result will include only
          numeric-typed columns or, if none are, only categorical columns.
        - A list of dtypes or strings to be included/excluded.
          To select all numeric types use numpy numpy.number. To select
          categorical objects use type object. See also the select_dtypes
          documentation. eg. df.describe(include=['O'])
        - If include is the string 'all', the output column-set will
          match the input one.
    
    Returns
    -------
    summary: NDFrame of summary statistics
    
    Notes
    -----
    The output DataFrame index depends on the requested dtypes:
    
    For numeric dtypes, it will include: count, mean, std, min,
    max, and lower, 50, and upper percentiles.
    
    For object dtypes (e.g. timestamps or strings), the index
    will include the count, unique, most common, and frequency of the
    most common. Timestamps also include the first and last items.
    
    For mixed dtypes, the index will be the union of the corresponding
    output types. Non-applicable entries will be filled with NaN.
    Note that mixed-dtype outputs can only be returned from mixed-dtype
    inputs and appropriate use of the include/exclude arguments.
    
    If multiple values have the highest count, then the
    `count` and `most common` pair will be arbitrarily chosen from
    among those with the highest count.
    
    The include, exclude arguments are ignored for Series.
    
    See Also
    --------
    DataFrame.select_dtypes



In [8]:

    
### Let's get the description of the numeric data for each of the target values separately.
### We need to rename the columns before we can properly join the tables. The column names may look strange...
desc_yes = df[df.y=='yes'].describe().rename(columns=lambda c: "%s|A" % c)
desc_no  = df[df.y=='no'].describe().rename(columns=lambda c: "%s|B" % c)



In [ ]:

    
### ...but this way we can get them in the desired order...
desc = desc_yes.join(desc_no)
### sort over ALL 14 joined columns; sorting only desc_yes.columns keeps just the 7 '|A' columns
desc = desc[sorted(desc.columns)]
### ...because we're changing them anyway:



In [12]:

    
### Two-level column index: feature name on top, target value below ('|A' = 'yes', '|B' = 'no').
### An earlier attempt failed here with
###   ValueError: Length mismatch: Expected axis has 7 elements, new values have 14 elements
### because desc contained only the 7 '|A' columns at that point.
desc.columns = pd.MultiIndex.from_arrays([sorted(list(num_features)*2), ['yes', 'no']*len(num_features)])
desc

Let's look at the distribution of numerical features...



In [13]:

    
%matplotlib inline
fig = plt.figure(figsize=(32, 8))
### one histogram per numeric feature, arranged on a 2x4 grid
for i, f in enumerate(num_features):
    plt.subplot(2, 4, i+1)
    hst = plt.hist(df[f], alpha=0.5)
    plt.title(f)
plt.suptitle('Distribution of Numeric Values', fontsize=20)
None

Now, let's look at the categorical variables and their distribution...



In [14]:

    
for f in cat_features:
    tab = df[f].value_counts()
    ### print each level with its count, most frequent first
    print('%s:\t%s' % (f, ', '.join("%s(%d)" % (level, count) for level, count in tab.items())))









    



job:	management(969), blue-collar(946), technician(768), admin.(478), services(417), retired(230), self-employed(183), entrepreneur(168), unemployed(128), housemaid(112), student(84), unknown(38)
education:	secondary(2306), tertiary(1350), primary(678), unknown(187)
default:	no(4445), yes(76)
loan:	no(3830), yes(691)
housing:	yes(2559), no(1962)
contact:	cellular(2896), unknown(1324), telephone(301)
poutcome:	unknown(3705), failure(490), other(197), success(129)
marital:	married(2797), single(1196), divorced(528)
month:	may(1398), jul(706), aug(633), jun(531), nov(389), apr(293), feb(222), jan(148), oct(80), sep(52), mar(49), dec(20)

Results in a data frame:



In [15]:

    
mat = pd.DataFrame(
    [ df[f].value_counts() for f in list(cat_features) ],
    index=list(cat_features)
    ).stack()

pd.DataFrame(mat.values, index=mat.index)









    Out[15]:






  
    
      
      
                               0
job        admin.          478.0
           blue-collar     946.0
           entrepreneur    168.0
           housemaid       112.0
           management      969.0
           retired         230.0
           self-employed   183.0
           services        417.0
           student          84.0
           technician      768.0
           unemployed      128.0
           unknown          38.0
education  primary         678.0
           secondary      2306.0
           tertiary       1350.0
           unknown         187.0
default    no             4445.0
           yes              76.0
loan       no             3830.0
           yes             691.0
housing    no             1962.0
           yes            2559.0
contact    cellular       2896.0
           telephone       301.0
           unknown        1324.0
poutcome   failure         490.0
           other           197.0
           success         129.0
           unknown        3705.0
marital    divorced        528.0
           married        2797.0
           single         1196.0
month      apr             293.0
           aug             633.0
           dec              20.0
           feb             222.0
           jan             148.0
           jul             706.0
           jun             531.0
           mar              49.0
           may            1398.0
           nov             389.0
           oct              80.0
           sep              52.0



In [ ]:

Step 2: Prepare for ML algorithm

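The ML algorithms in Scikit-Learn operate on matrices of numeric values. We need to convert our data frame into a feature matrix X and a target vector y. Many algorithms also require the features to be in the same range. Decision trees are not affected by this, because each split compares a single feature against a threshold and never combines values across features.

Use the values attribute of a DataFrame (or the to_numpy method in pandas 0.24 and later) to convert it into a matrix. The older pd.DataFrame.as_matrix method did the same job but was removed in pandas 1.0.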
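
A minimal sketch of such a conversion on a made-up three-row frame follows. It uses pd.factorize as one possible way to turn levels into integers, which is not necessarily how the cells below do it; the frame and all variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "age": [30, 33, 35],
    "marital": ["married", "single", "married"],
    "y": ["no", "yes", "no"],
})

encoded = toy.copy()
# pd.factorize assigns 0, 1, 2, ... in order of first appearance and
# also returns the levels, so the encoding can be reversed later.
encoded["marital"], marital_levels = pd.factorize(toy["marital"])
encoded["y"] = toy["y"].map({"no": 0, "yes": 1})

X = encoded[["age", "marital"]].values  # 2-D feature matrix
y = encoded["y"].values                 # 1-D target vector

print(X.shape, y.shape)
print(list(marital_levels))
```

Note that integer codes impose an arbitrary order on the levels. A decision tree can still split on them, but the resulting thresholds have no natural meaning for nominal variables such as job.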



In [ ]:

    
### as_matrix was removed in pandas 1.0; to_numpy (pandas >= 0.24) is its replacement
help(pd.DataFrame.to_numpy)



In [16]:

    
## We copy our original dataframe into a new one, and then perform replacements on categorical levels.
## We also keep track of our replacements so that the integer codes can be decoded later.
level_substitution = {}

def levels2index(levels):
    ## map each level to its position, e.g. ['no', 'yes'] -> {'no': 0, 'yes': 1}
    return {level: i for i, level in enumerate(levels)}

df_num = df.copy()

for c in cat_features:
    level_substitution[c] = levels2index(df[c].unique())
    ## map + assignment avoids calling replace(inplace=True) on a column selection
    df_num[c] = df[c].map(level_substitution[c])

## same for target
df_num['y'] = df['y'].map({'no': 0, 'yes': 1})

df_num









    Out[16]:






  
    
      
      age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  y
0      30    0        0          0        0     1787        0     0        0   19      0        79         1     -1         0         0  0
1      33    1        0          1        0     4789        1     1        0   11      1       220         1    339         4         1  0
2      35    2        1          2        0     1350        1     0        0   16      2       185         1    330         1         1  0
3      30    2        0          2        0     1476        1     1        1    3      3       199         4     -1         0         0  0
4      59    3        0          1        0        0        1     0        1    5      1       226         1     -1         0         0  0
5      35    2        1          2        0      747        0     0        0   23      4       141         2    176         3         1  0
6      36    4        0          2        0      307        1     0        0   14      1       341         1    330         2         2  0
7      39    5        0          1        0      147        1     0        0    6      1       151         2     -1         0         0  0
8      41    6        0          2        0      221        1     0        1   14      1        57         2     -1         0         0  0
9      43    1        0          0        0      -88        1     1        0   17      2       313         1    147         2         1  0
10     39    1        0          1        0     9374        1     0        1   20      1       273         1     -1         0         0  0
11     43    7        0          1        0      264        1     0        0   17      2       113         2     -1         0         0  0
12     36    5        0          2        0     1109        0     0        0   13      5       328         2     -1         0         0  0
13     20    8        1          1        0      502        0     0        0   30      2       261         1     -1         0         0  1
14     31    3        0          1        0      360        1     1        0   29      6        89         1    241         1         1  0
15     40    2        0          2        0      194        0     1        0   29      5       189         2     -1         0         0  0
16     56    5        0          1        0     4073        0     0        0   27      5       239         5     -1         0         0  0
17     37    7        1          2        0     2317        1     0        0   20      2       114         1    152         2         1  0
18     25    3        1          0        0     -221        1     0        1   23      1       250         1     -1         0         0  0
19     31    1        0          1        0      132        0     0        0    7      7       148         1    152         1         2  0
20     38    2        2          3        0        0        1     0        0   18      8        96         2     -1         0         0  0
21     42    2        2          2        0       16        0     0        0   19      8       140         3     -1         0         0  0
22     44    1        1          1        0      106        0     0        1   12      3       109         2     -1         0         0  0
23     44    6        0          1        0       93        0     0        0    7      7       125         2     -1         0         0  0
24     26    9        0          2        0      543        0     0        0   30      6       169         3     -1         0         0  0
25     41    2        0          2        0     5883        0     0        0   20      8       182         2     -1         0         0  0
26     55    3        0          0        0      627        1     0        1    5      1       247         1     -1         0         0  0
27     67   10        0          3        0      696        0     0        2   17      5       119         1    105         2         1  0
28     56    4        0          1        0      784        0     1        0   30      7       149         2     -1         0         0  0
29     53    7        0          1        0      105        0     1        0   21      5        74         2     -1         0         0  0
...   ...  ...      ...        ...      ...      ...      ...   ...      ...  ...    ...       ...       ...    ...       ...       ... ..
4491   35    3        1          1        0        0        1     0        0   16      2       169         1     -1         0         0  0
4492   32    5        1          1        0      309        1     1        0   16      2       346         1    234         3         1  0
4493   28    5        1          2        0        0        1     0        1    4      3       205         6     -1         0         0  0
4494   26    5        1          1        0      668        1     0        1   28      1       576         3     -1         0         0  1
4495   48    2        0          2        0     1175        1     0        2   18      8      1476         3     -1         0         0  0
4496   30    3        1          1        0      363        0     0        0   28      7       171         3     -1         0         0  0
4497   31    6        1          2        0       38        0     0        0   20      8       185         2     -1         0         0  0
4498   31    2        0          2        0     1183        1     0        1   27      1       676         6     -1         0         0  0
4499   45    3        2          0        0      942        0     0        0   21      8       362         1     -1         0         0  0
4500   38    7        0          1        0     4196        1     0        0   12      1       193         2     -1         0         0  0
4501   34    2        0          2        0      297        1     0        0   26      5        63         4     -1         0         0  0
4502   42    1        0          1        0      -91        1     1        0    5      4        43         1     -1         0         0  0
4503   60    4        0          0        0      362        0     1        0   29      7       816         6     -1         0         0  1
4504   42    3        1          1        0     1080        1     1        0   13      1       951         3    370         4         1  1
4505   32    7        1          1        0      620        1     0        1   26      1      1234         3     -1         0         0  1
4506   42    0        2          2        0     -166        0     0        0   29      5        85         4     -1         0         0  0
4507   33    1        0          1        0      288        1     0        0   17      2       306         1     -1         0         0  0
4508   42    7        0          3        0      642        1     1        1   16      1       509         2     -1         0         0  0
4509   51    5        0          2        0     2506        0     0        0   30      8       210         3     -1         0         0  0
4510   36    5        2          1        0      566        1     0        1   20      1       129         2     -1         0         0  0
4511   46    3        0          1        0      668        1     0        1   15      1      1263         2     -1         0         0  1
4512   40    3        0          1        0     1100        1     0        1   29      1       660         2     -1         0         0  0
4513   49    3        0          1        0      322        0     0        0   14      5       356         2     -1         0         0  0
4514   38    3        0          1        0     1205        1     0        0   20      2        45         4    153         1         1  0
4515   32    1        1          1        0      473        1     0        0    7      7       624         5     -1         0         0  0
4516   33    1        0          1        0     -333        1     0        0   30      7       329         5     -1         0         0  0
4517   57    4        0          2        1    -3313        1     1        1    9      1       153         1     -1         0         0  0
4518   57    5        0          1        0      295        0     0        0   19      5       151        11     -1         0         0  0
4519   28    3        0          1        0     1137        0     0        0    6      4       129         4    211         3         2  0
4520   44    6        1          2        0     1136        1     1        0    3      2       345         2    249         7         2  0
    
  

4521 rows × 17 columns



In [17]:

    
level_substitution









    Out[17]:





{'contact': {'cellular': 0, 'telephone': 2, 'unknown': 1},
 'default': {'no': 0, 'yes': 1},
 'education': {'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3},
 'housing': {'no': 0, 'yes': 1},
 'job': {'admin.': 7,
  'blue-collar': 3,
  'entrepreneur': 6,
  'housemaid': 9,
  'management': 2,
  'retired': 10,
  'self-employed': 4,
  'services': 1,
  'student': 8,
  'technician': 5,
  'unemployed': 0,
  'unknown': 11},
 'loan': {'no': 0, 'yes': 1},
 'marital': {'divorced': 2, 'married': 0, 'single': 1},
 'month': {'apr': 2,
  'aug': 5,
  'dec': 11,
  'feb': 4,
  'jan': 6,
  'jul': 7,
  'jun': 3,
  'mar': 10,
  'may': 1,
  'nov': 8,
  'oct': 0,
  'sep': 9},
 'poutcome': {'failure': 1, 'other': 2, 'success': 3, 'unknown': 0}}
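
When reading the fitted tree later, it helps to translate integer codes back into level names. The following sketch inverts a mapping of this shape; only two of the features shown above are copied into it, and index_to_level is a hypothetical helper name that does not appear elsewhere in the notebook.

```python
# Excerpt of the mapping shown above (only 'marital' and 'contact' are copied here).
level_substitution = {
    "marital": {"divorced": 2, "married": 0, "single": 1},
    "contact": {"cellular": 0, "telephone": 2, "unknown": 1},
}

# Invert each inner dictionary: integer code -> original level name.
index_to_level = {
    feature: {code: level for level, code in mapping.items()}
    for feature, mapping in level_substitution.items()
}

print(index_to_level["marital"][2])   # divorced
print(index_to_level["contact"][1])   # unknown
```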



In [ ]:



In [ ]:

Step 3: Training

Now that we have our DataFrame prepared, we can create the feature matrix X and target vector y:

split data into training and test sets
fit the model



In [18]:

    
### as_matrix() was removed in pandas 1.0; .values works in old and new versions
X = df_num[list(all_features)].values
y = df_num.y.values
X, y









    Out[18]:





(array([[1787,    0,    1, ...,    0,   79,    0],
        [4789,    0,    1, ...,    1,  220,    1],
        [1350,    0,    1, ...,    0,  185,    1],
        ..., 
        [ 295,    0,   11, ...,    0,  151,    0],
        [1137,    0,    4, ...,    0,  129,    2],
        [1136,    0,    2, ...,    1,  345,    2]]),
 array([0, 0, 0, ..., 0, 0, 0]))



In [19]:

    
### Scikit-learn provides us with a nice function to split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
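
One optional refinement, sketched here under the assumption that the target is imbalanced: train_test_split accepts a stratify argument that keeps the class proportions roughly equal in both parts. The data below are synthetic and the variable names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 'no' (0) and 20 'yes' (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single-column features

# stratify=y_demo asks for the same share of each class in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)

print("share of 'yes' in training set:", y_tr.mean())
print("share of 'yes' in test set:    ", y_te.mean())
```

Without stratification the share of the rare class can differ between the two parts purely by chance, which makes test scores harder to compare.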



In [20]:

    
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)



In [21]:

    
clf.fit(X_train, y_train)
score_train = clf.score(X_train, y_train)
score_test = clf.score(X_test, y_test)
print('Ratio of correctly classified samples for:\n\tTraining-set:\t%f\n\tTest-set:\t%f'%(score_train, score_test))









    



Ratio of correctly classified samples for:
	Training-set:	0.913717
	Test-set:	0.903261

score returns the mean accuracy on the given test data and labels. For binary classification, this is the fraction of correctly classified samples, so a value close to 1 is desirable. (In multi-label classification it is the subset accuracy, a harsh metric because the entire label set of each sample must be predicted correctly.) A single number does not tell the whole story, though...
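
To illustrate why accuracy alone can mislead, here is a small sketch with made-up, deliberately imbalanced labels (not the bank data): a predictor that always answers 'no' reaches a high accuracy while never finding a single 'yes'.

```python
import numpy as np

# Hypothetical labels: 90 'no' (0) and 10 'yes' (1).
y_true = np.array([0] * 90 + [1] * 10)

# A "classifier" that ignores its input and always predicts 'no'.
y_pred = np.zeros_like(y_true)

# Fraction of correct predictions.
accuracy = (y_true == y_pred).mean()
# Fraction of actual 'yes' samples that were predicted as 'yes'.
recall_yes = ((y_true == 1) & (y_pred == 1)).sum() / (y_true == 1).sum()

print("accuracy:", accuracy)
print("recall for 'yes':", recall_yes)
```

On data like this, the accuracy equals the share of the majority class, so any model score should be compared against that baseline.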

Step 4: Evaluate Model

predict $\hat y$ for your model on test set
calculate confusion matrix and derive measures
visualize if suitable

Let's see what we got. We can actually print the entire decision tree and trace for each sample ... though you may need to use the viz-wall for that.



In [33]:

    
import sklearn.tree
import pydot_ng as pdot
dot_data = sklearn.tree.export_graphviz(clf, out_file=None, feature_names = list(all_features), class_names=['no', 'yes'])
graph = pdot.graph_from_dot_data(dot_data)
#--- we can save the graph into a file ... preferably vector graphics
#graph.write_svg('mydt.svg')
graph.write_pdf('/home/pmolnar/public_html/mydt.pdf')

#--- or display right here 
##from IPython.display import HTML
HTML(str(graph.create_svg().decode('utf-8')))









    Out[33]:

Now we use our classifier to predict on the test set. (To get the ŷ character, type 'y\hat' followed by the TAB key.)



In [ ]:

    
ŷ = clf.predict(X_test)



In [ ]:

    
## a function that produces the confusion matrix: 1. parameter y=actual target, 2. parameter ŷ=predicted
def binary_confusion_matrix(y,ŷ):
    TP = ((y+ŷ)== 2).sum()
    TN = ((y+ŷ)== 0).sum()
    FP = ((y-ŷ)== -1).sum()
    FN = ((y-ŷ)== 1).sum()
    return pd.DataFrame( [[TP, FP], [FN, TN]], index=[['Prediction', 'Prediction'],['Yes', 'No']], columns=[['Actual', 'Actual'],['Yes', 'No']])

cm = binary_confusion_matrix(y_test, ŷ)
cm



In [ ]:

    
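### Scikit-Learn can do that too ... not quite as nicely labelled, though.
### Note: sklearn puts the actual classes in rows and the predicted classes in columns,
### ordered by label value, i.e. [0, 1] = ['no', 'yes'].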
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, ŷ)
cm



In [ ]:

    
### Here are some metrics 
from sklearn.metrics import classification_report
print(classification_report(y_test, ŷ))
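
The measures in such a report can be derived by hand from the four confusion-matrix counts. The sketch below shows the arithmetic; binary_metrics is a hypothetical helper and the counts are invented to keep the numbers easy to follow.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common measures from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    # precision: how many predicted 'yes' were really 'yes'
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # recall: how many actual 'yes' were found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts, chosen only for illustration.
m = binary_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in m.items():
    print("%-10s %.3f" % (name, value))
```

The guards against empty denominators return 0.0, which is one common convention; other conventions (such as reporting the value as undefined) are equally defensible.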



In [ ]:



In [ ]:

    
### http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
import itertools
np.set_printoptions(precision=2)
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    ### normalize first, so that the image, the colorbar and the printed numbers agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    ### two decimals for ratios, plain integers for counts
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



In [ ]:

    
%matplotlib inline

fig = plt.figure()
plot_confusion_matrix(cm, classes=['No', 'Yes'], normalize=True, title='Normalized confusion matrix')
plt.show()

Step 5: Figure out how to improve and go back to Step 2 or 3

This is an experiment. What can we change to improve the performance of the model?

Include or exclude certain features
Scale or transform values of feature vectors
Identify outliers (noise) and remove them
Adjust parameters of the ML algorithm
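
As one possible starting point for the last item, the following sketch compares several values of max_depth with 5-fold cross-validation. It runs on synthetic data from make_classification as a stand-in for the prepared X and y, so the scores say nothing about the bank data; the names X_demo, y_demo and depth_scores are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared feature matrix and target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=42)

# Compare several tree depths; None lets the tree grow until leaves are pure.
depth_scores = {}
for depth in (2, 3, 5, 8, None):
    clf_demo = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # F1 on the positive class is less flattering than accuracy on imbalanced data.
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring="f1")
    depth_scores[depth] = scores.mean()

for depth, score in depth_scores.items():
    print("max_depth=%s\tmean F1=%.3f" % (depth, score))
```

The same loop could be applied to the other items in the list, for example by dropping a feature such as duration from X before cross-validating.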


