In [2]:
# %load nbinit.py
from IPython.display import HTML
HTML("""
<style>
.container { width: 100% !important; padding-left: 1em; padding-right: 2em; }
div.output_stderr { background: #FFA; }
</style>
""")


Out[2]:
Please rename this file to HW6.ipynb and save it in MSA8010F16/HW6

Homework 6: Preprocessing Data

We use a data set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) to experiment with a Decision Tree classifier (http://www.saedsayad.com/decision_tree.htm).

Scikit-Learn: http://scikit-learn.org/stable/modules/tree.html#tree

Book slides:

Bank Marketing Data Set

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Data Set Information:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

There are four datasets:

  1. bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
  2. bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
  3. bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
  4. bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smaller datasets are provided to test the more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no; variable y).

Attribute Information:

Input variables:

  • bank client data:
      1 age (numeric)
      2 job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
      3 marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
      4 education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
      5 default: has credit in default? (categorical: 'no','yes','unknown')
      6 housing: has housing loan? (categorical: 'no','yes','unknown')
      7 loan: has personal loan? (categorical: 'no','yes','unknown')
  • related with the last contact of the current campaign:
      8 contact: contact communication type (categorical: 'cellular','telephone') 
      9 month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
      10 day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
      11 duration: last contact duration, in seconds (numeric). Important note: this attribute strongly determines the output target (e.g., if duration=0 then y='no'), yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model (see the sketch after this attribute list).
  • other attributes:
      12 campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
      13 pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
      14 previous: number of contacts performed before this campaign and for this client (numeric)
      15 poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
  • social and economic context attributes
      16 emp.var.rate: employment variation rate - quarterly indicator (numeric)
      17 cons.price.idx: consumer price index - monthly indicator (numeric) 
      18 cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
      19 euribor3m: euribor 3 month rate - daily indicator (numeric)
      20 nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target): 21 y - has the client subscribed a term deposit? (binary: 'yes','no')
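
As noted for duration above, a realistic model should discard that column. A minimal sketch, assuming the data frame df loaded in Step 1 below:

In [ ]:
### discard 'duration' when building a realistic predictive model
df_realistic = df.drop('duration', axis=1)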


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
DATAFILE = '/home/data/archive.ics.uci.edu/BankMarketing/bank.csv'
###DATAFILE = 'data/bank.csv'  ### using locally

In [5]:
df = pd.read_csv(DATAFILE, sep=';')
list(df.columns)


Out[5]:
['age',
 'job',
 'marital',
 'education',
 'default',
 'balance',
 'housing',
 'loan',
 'contact',
 'day',
 'month',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'y']

Step 1: Investigate Data Set

(Note that bank.csv is the older, 17-input version of the data set, so the columns above differ slightly from the 20-input attribute list: e.g. it has balance and day, but none of the social and economic context attributes.)

  • We have a number of categorical variables: What is their cardinality? How are the levels distributed?
  • What is the distribution of the numeric values? Do we see any correlations?

Let's first look at the columns (i.e. variables) with continuous values. We can get a sense of the distribution from aggregate functions like mean, standard deviation, and quantiles, as well as minimum and maximum values.

The Pandas method describe creates a table view of those metrics. (The method can also be used to identify the numeric features in the data frame.)


In [6]:
### use sets and the '-' difference operation 'A-B'. There is also a symmetric difference '^' (demonstrated below).
all_features = set(df.columns)-set(['y'])
num_features = set(df.describe().columns)
cat_features = all_features-num_features

print("All features:         ", ", ".join(all_features))
print("Numerical features:   ", ", ".join(num_features))
print("Categorical features: ", ", ".join(cat_features))


All features:          balance, default, campaign, age, housing, contact, marital, month, previous, job, education, day, pdays, loan, duration, poutcome 
Numerical features:    balance, day, pdays, campaign, age, duration, previous 
Categorical features:  job, education, default, loan, housing, contact, poutcome, marital, month
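
The symmetric difference '^' mentioned above keeps the elements that are in exactly one of the two sets. Since the numeric and categorical features are disjoint, their symmetric difference is exactly the union:

In [ ]:
### True: num_features and cat_features partition all_features
num_features ^ cat_features == all_features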

In [26]:
### the categorical columns again, as a one-liner (note: use {'y'}, not set('y'),
### which would split a multi-character string into single characters)
set(df.columns) - set(df.describe().columns) - {'y'}


Out[26]:
{'contact',
 'default',
 'education',
 'housing',
 'job',
 'loan',
 'marital',
 'month',
 'poutcome'}

In [7]:
### Describe Columns
help(pd.DataFrame.describe)


Help on function describe in module pandas.core.generic:

describe(self, percentiles=None, include=None, exclude=None)
    Generate various summary statistics, excluding NaN values.
    
    Parameters
    ----------
    percentiles : array-like, optional
        The percentiles to include in the output. Should all
        be in the interval [0, 1]. By default `percentiles` is
        [.25, .5, .75], returning the 25th, 50th, and 75th percentiles.
    include, exclude : list-like, 'all', or None (default)
        Specify the form of the returned result. Either:
    
        - None to both (default). The result will include only
          numeric-typed columns or, if none are, only categorical columns.
        - A list of dtypes or strings to be included/excluded.
          To select all numeric types use numpy numpy.number. To select
          categorical objects use type object. See also the select_dtypes
          documentation. eg. df.describe(include=['O'])
        - If include is the string 'all', the output column-set will
          match the input one.
    
    Returns
    -------
    summary: NDFrame of summary statistics
    
    Notes
    -----
    The output DataFrame index depends on the requested dtypes:
    
    For numeric dtypes, it will include: count, mean, std, min,
    max, and lower, 50, and upper percentiles.
    
    For object dtypes (e.g. timestamps or strings), the index
    will include the count, unique, most common, and frequency of the
    most common. Timestamps also include the first and last items.
    
    For mixed dtypes, the index will be the union of the corresponding
    output types. Non-applicable entries will be filled with NaN.
    Note that mixed-dtype outputs can only be returned from mixed-dtype
    inputs and appropriate use of the include/exclude arguments.
    
    If multiple values have the highest count, then the
    `count` and `most common` pair will be arbitrarily chosen from
    among those with the highest count.
    
    The include, exclude arguments are ignored for Series.
    
    See Also
    --------
    DataFrame.select_dtypes


In [8]:
### Let's get the description of the numeric data for each of the target values separately.
### We need to rename the columns before we can properly join the tables. The column names may look strange...
desc_yes = df[df.y=='yes'].describe().rename_axis(lambda c: "%s|A"%c, axis='columns')
desc_no  = df[df.y=='no'].describe().rename_axis(lambda c: "%s|B"%c, axis='columns')

In [ ]:
### ...but this way we can get them in the desired order...
desc = desc_yes.join(desc_no).reindex_axis(sorted(desc_yes.columns), axis=1)
### ...because we're changing them anyway:

In [12]:
desc.set_axis(1, [sorted(list(num_features)*2), ['yes', 'no']*len(num_features)])
#desc


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-e6a6450c77ef> in <module>()
----> 1 desc.set_axis(1, [sorted(list(num_features)*2), ['yes', 'no']*len(num_features)])
      2 #desc

/usr/lib64/python3.4/site-packages/pandas/core/generic.py in set_axis(self, axis, labels)
    423     def set_axis(self, axis, labels):
    424         """ public verson of axis assignment """
--> 425         setattr(self, self._get_axis_name(axis), labels)
    426 
    427     def _set_axis(self, axis, labels):

/usr/lib64/python3.4/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   2683         try:
   2684             object.__getattribute__(self, name)
-> 2685             return object.__setattr__(self, name, value)
   2686         except AttributeError:
   2687             pass

pandas/src/properties.pyx in pandas.lib.AxisProperty.__set__ (pandas/lib.c:44748)()

/usr/lib64/python3.4/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    426 
    427     def _set_axis(self, axis, labels):
--> 428         self._data.set_axis(axis, labels)
    429         self._clear_item_cache()
    430 

/usr/lib64/python3.4/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
   2633             raise ValueError('Length mismatch: Expected axis has %d elements, '
   2634                              'new values have %d elements' %
-> 2635                              (old_len, new_len))
   2636 
   2637         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 7 elements, new values have 14 elements

The ValueError above is a length mismatch: set_axis received 14 labels (the 7 numeric features, twice), but desc apparently held only 7 columns, i.e. a single describe table rather than the joined one. This is stale notebook state: note the empty In [ ]: prompt on the join cell above, which was never executed.

Let's look at the distribution of the numerical features...


In [13]:
%matplotlib inline
fig = plt.figure(figsize=(32, 8))
for i, f in enumerate(num_features):
    plt.subplot(2, 4, i+1)
    plt.hist(df[f], alpha=0.5)
    plt.title(f)
plt.suptitle('Distribution of Numeric Values', fontsize=20)
None


Now, let's look at the categorical variables and their distribution...


In [14]:
for f in cat_features:
    tab = df[f].value_counts()
    print('%s:\t%s' % (f, ', '.join([ ("%s(%d)" %(tab.index[i], tab.values[i])) for i in range(len(tab))]) ))


job:	management(969), blue-collar(946), technician(768), admin.(478), services(417), retired(230), self-employed(183), entrepreneur(168), unemployed(128), housemaid(112), student(84), unknown(38)
education:	secondary(2306), tertiary(1350), primary(678), unknown(187)
default:	no(4445), yes(76)
loan:	no(3830), yes(691)
housing:	yes(2559), no(1962)
contact:	cellular(2896), unknown(1324), telephone(301)
poutcome:	unknown(3705), failure(490), other(197), success(129)
marital:	married(2797), single(1196), divorced(528)
month:	may(1398), jul(706), aug(633), jun(531), nov(389), apr(293), feb(222), jan(148), oct(80), sep(52), mar(49), dec(20)

The same counts, collected in a data frame:


In [15]:
mat = pd.DataFrame(
    [ df[f].value_counts() for f in list(cat_features) ],
    index=list(cat_features)
    ).stack()

pd.DataFrame(mat.values, index=mat.index)


Out[15]:
0
job admin. 478.0
blue-collar 946.0
entrepreneur 168.0
housemaid 112.0
management 969.0
retired 230.0
self-employed 183.0
services 417.0
student 84.0
technician 768.0
unemployed 128.0
unknown 38.0
education primary 678.0
secondary 2306.0
tertiary 1350.0
unknown 187.0
default no 4445.0
yes 76.0
loan no 3830.0
yes 691.0
housing no 1962.0
yes 2559.0
contact cellular 2896.0
telephone 301.0
unknown 1324.0
poutcome failure 490.0
other 197.0
success 129.0
unknown 3705.0
marital divorced 528.0
married 2797.0
single 1196.0
month apr 293.0
aug 633.0
dec 20.0
feb 222.0
jan 148.0
jul 706.0
jun 531.0
mar 49.0
may 1398.0
nov 389.0
oct 80.0
sep 52.0

In [ ]:

Step 2: Prepare for ML algorithm

The ML algorithms in Scikit-Learn operate on matrices of numeric values. We need to convert our data frame into a feature matrix X and a target vector y. Many algorithms also require all features to be on a similar scale; decision trees don't care, because they never combine values across features.
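
For scale-sensitive algorithms (e.g. SVM or k-NN) we would rescale the numeric columns first; a minimal sketch using sklearn.preprocessing, not needed for the decision tree below:

In [ ]:
### not needed for decision trees; shown for scale-sensitive algorithms
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(df[list(num_features)])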

Use the pd.DataFrame.as_matrix method to convert a DataFrame into a matrix.


In [ ]:
help(pd.DataFrame.as_matrix)

In [16]:
## We copy our original dataframe into a new one, and then replace the categorical levels with integer codes.
## We also keep track of our replacements.
level_substitution = {}

def levels2index(levels):
    ### map each level to its position: ['a','b'] -> {'a': 0, 'b': 1}
    return {level: i for i, level in enumerate(levels)}

df_num = df.copy()

for c in cat_features:
    level_substitution[c] = levels2index(df[c].unique())
    df_num[c].replace(level_substitution[c], inplace=True)

## same for target
df_num.y.replace({'no':0, 'yes':1}, inplace=True)

df_num


Out[16]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 0 0 0 0 1787 0 0 0 19 0 79 1 -1 0 0 0
1 33 1 0 1 0 4789 1 1 0 11 1 220 1 339 4 1 0
2 35 2 1 2 0 1350 1 0 0 16 2 185 1 330 1 1 0
3 30 2 0 2 0 1476 1 1 1 3 3 199 4 -1 0 0 0
4 59 3 0 1 0 0 1 0 1 5 1 226 1 -1 0 0 0
5 35 2 1 2 0 747 0 0 0 23 4 141 2 176 3 1 0
6 36 4 0 2 0 307 1 0 0 14 1 341 1 330 2 2 0
7 39 5 0 1 0 147 1 0 0 6 1 151 2 -1 0 0 0
8 41 6 0 2 0 221 1 0 1 14 1 57 2 -1 0 0 0
9 43 1 0 0 0 -88 1 1 0 17 2 313 1 147 2 1 0
10 39 1 0 1 0 9374 1 0 1 20 1 273 1 -1 0 0 0
11 43 7 0 1 0 264 1 0 0 17 2 113 2 -1 0 0 0
12 36 5 0 2 0 1109 0 0 0 13 5 328 2 -1 0 0 0
13 20 8 1 1 0 502 0 0 0 30 2 261 1 -1 0 0 1
14 31 3 0 1 0 360 1 1 0 29 6 89 1 241 1 1 0
15 40 2 0 2 0 194 0 1 0 29 5 189 2 -1 0 0 0
16 56 5 0 1 0 4073 0 0 0 27 5 239 5 -1 0 0 0
17 37 7 1 2 0 2317 1 0 0 20 2 114 1 152 2 1 0
18 25 3 1 0 0 -221 1 0 1 23 1 250 1 -1 0 0 0
19 31 1 0 1 0 132 0 0 0 7 7 148 1 152 1 2 0
20 38 2 2 3 0 0 1 0 0 18 8 96 2 -1 0 0 0
21 42 2 2 2 0 16 0 0 0 19 8 140 3 -1 0 0 0
22 44 1 1 1 0 106 0 0 1 12 3 109 2 -1 0 0 0
23 44 6 0 1 0 93 0 0 0 7 7 125 2 -1 0 0 0
24 26 9 0 2 0 543 0 0 0 30 6 169 3 -1 0 0 0
25 41 2 0 2 0 5883 0 0 0 20 8 182 2 -1 0 0 0
26 55 3 0 0 0 627 1 0 1 5 1 247 1 -1 0 0 0
27 67 10 0 3 0 696 0 0 2 17 5 119 1 105 2 1 0
28 56 4 0 1 0 784 0 1 0 30 7 149 2 -1 0 0 0
29 53 7 0 1 0 105 0 1 0 21 5 74 2 -1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4491 35 3 1 1 0 0 1 0 0 16 2 169 1 -1 0 0 0
4492 32 5 1 1 0 309 1 1 0 16 2 346 1 234 3 1 0
4493 28 5 1 2 0 0 1 0 1 4 3 205 6 -1 0 0 0
4494 26 5 1 1 0 668 1 0 1 28 1 576 3 -1 0 0 1
4495 48 2 0 2 0 1175 1 0 2 18 8 1476 3 -1 0 0 0
4496 30 3 1 1 0 363 0 0 0 28 7 171 3 -1 0 0 0
4497 31 6 1 2 0 38 0 0 0 20 8 185 2 -1 0 0 0
4498 31 2 0 2 0 1183 1 0 1 27 1 676 6 -1 0 0 0
4499 45 3 2 0 0 942 0 0 0 21 8 362 1 -1 0 0 0
4500 38 7 0 1 0 4196 1 0 0 12 1 193 2 -1 0 0 0
4501 34 2 0 2 0 297 1 0 0 26 5 63 4 -1 0 0 0
4502 42 1 0 1 0 -91 1 1 0 5 4 43 1 -1 0 0 0
4503 60 4 0 0 0 362 0 1 0 29 7 816 6 -1 0 0 1
4504 42 3 1 1 0 1080 1 1 0 13 1 951 3 370 4 1 1
4505 32 7 1 1 0 620 1 0 1 26 1 1234 3 -1 0 0 1
4506 42 0 2 2 0 -166 0 0 0 29 5 85 4 -1 0 0 0
4507 33 1 0 1 0 288 1 0 0 17 2 306 1 -1 0 0 0
4508 42 7 0 3 0 642 1 1 1 16 1 509 2 -1 0 0 0
4509 51 5 0 2 0 2506 0 0 0 30 8 210 3 -1 0 0 0
4510 36 5 2 1 0 566 1 0 1 20 1 129 2 -1 0 0 0
4511 46 3 0 1 0 668 1 0 1 15 1 1263 2 -1 0 0 1
4512 40 3 0 1 0 1100 1 0 1 29 1 660 2 -1 0 0 0
4513 49 3 0 1 0 322 0 0 0 14 5 356 2 -1 0 0 0
4514 38 3 0 1 0 1205 1 0 0 20 2 45 4 153 1 1 0
4515 32 1 1 1 0 473 1 0 0 7 7 624 5 -1 0 0 0
4516 33 1 0 1 0 -333 1 0 0 30 7 329 5 -1 0 0 0
4517 57 4 0 2 1 -3313 1 1 1 9 1 153 1 -1 0 0 0
4518 57 5 0 1 0 295 0 0 0 19 5 151 11 -1 0 0 0
4519 28 3 0 1 0 1137 0 0 0 6 4 129 4 211 3 2 0
4520 44 6 1 2 0 1136 1 1 0 3 2 345 2 249 7 2 0

4521 rows × 17 columns


In [17]:
level_substitution


Out[17]:
{'contact': {'cellular': 0, 'telephone': 2, 'unknown': 1},
 'default': {'no': 0, 'yes': 1},
 'education': {'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3},
 'housing': {'no': 0, 'yes': 1},
 'job': {'admin.': 7,
  'blue-collar': 3,
  'entrepreneur': 6,
  'housemaid': 9,
  'management': 2,
  'retired': 10,
  'self-employed': 4,
  'services': 1,
  'student': 8,
  'technician': 5,
  'unemployed': 0,
  'unknown': 11},
 'loan': {'no': 0, 'yes': 1},
 'marital': {'divorced': 2, 'married': 0, 'single': 1},
 'month': {'apr': 2,
  'aug': 5,
  'dec': 11,
  'feb': 4,
  'jan': 6,
  'jul': 7,
  'jun': 3,
  'mar': 10,
  'may': 1,
  'nov': 8,
  'oct': 0,
  'sep': 9},
 'poutcome': {'failure': 1, 'other': 2, 'success': 3, 'unknown': 0}}
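
Scikit-Learn's LabelEncoder does the same bookkeeping; note that it assigns codes in sorted level order rather than in order of appearance. A sketch for a single column:

In [ ]:
### equivalent bookkeeping with scikit-learn (codes follow sorted level order)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
marital_codes = le.fit_transform(df['marital'])
list(le.classes_)   ### ['divorced', 'married', 'single']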

In [ ]:


In [ ]:

Step 3: Training

Now that we have our DataFrame prepared, we can create the feature matrix X and target vector y:

  1. split data into training and test sets
  2. fit the model

In [18]:
X = df_num[list(all_features)].as_matrix()
y = df_num.y.as_matrix()
X, y


Out[18]:
(array([[1787,    0,    1, ...,    0,   79,    0],
        [4789,    0,    1, ...,    1,  220,    1],
        [1350,    0,    1, ...,    0,  185,    1],
        ..., 
        [ 295,    0,   11, ...,    0,  151,    0],
        [1137,    0,    4, ...,    0,  129,    2],
        [1136,    0,    2, ...,    1,  345,    2]]),
 array([0, 0, 0, ..., 0, 0, 0]))

In [19]:
### Scikit-learn provides us with a nice function to split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4, random_state=42)
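
The target is quite imbalanced (roughly 88% 'no' in this sample), so a stratified split, which preserves the class ratio in both sets, can be worth trying; a sketch using the same function:

In [ ]:
### optional: stratify on y to preserve the (imbalanced) class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.4, random_state=42, stratify=y)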

In [20]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=5)

In [21]:
clf.fit(X_train, y_train)
score_train = clf.score(X_train, y_train)
score_test = clf.score(X_test, y_test)
print('Ratio of correctly classified samples for:\n\tTraining-set:\t%f\n\tTest-set:\t%f'%(score_train, score_test))


Ratio of correctly classified samples for:
	Training-set:	0.913717
	Test-set:	0.903261

score returns the mean accuracy on the given test data and labels. (In multi-label classification this is the subset accuracy, a harsh metric, since it requires each sample's entire label set to be predicted correctly; for binary classification it is simply the percentage of correctly classified samples.) The score should be close to 1, though a single number does not tell the whole story...
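
For binary classification we can recompute the accuracy directly as the fraction of correct predictions; a quick sanity check:

In [ ]:
### fraction of correctly classified test samples; should match score_test
(clf.predict(X_test) == y_test).mean()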

Step 4: Evaluate Model

  1. predict $\hat y$ for your model on test set
  2. calculate confusion matrix and derive measures
  3. visualize if suitable

Let's see what we got. We can actually print the entire decision tree and trace the path of each sample ... though you may need to use the viz-wall for that.


In [33]:
import sklearn.tree
import pydot_ng as pdot
dot_data = sklearn.tree.export_graphviz(clf, out_file=None, feature_names = list(all_features), class_names=['no', 'yes'])
graph = pdot.graph_from_dot_data(dot_data)
#--- we can save the graph into a file ... preferably vector graphics
#graph.write_svg('mydt.svg')
graph.write_pdf('/home/pmolnar/public_html/mydt.pdf')

#--- or display right here 
##from IPython.display import HTML
HTML(str(graph.create_svg().decode('utf-8')))


Out[33]:
[Inline SVG of the fitted decision tree (2712 training samples, depth 5). The root splits on duration <= 561.0 (gini = 0.2149, class = no); lower splits use poutcome, month, day, pdays, contact, age, campaign, marital, education, balance, job, and loan.]

Now we use our classifier to predict on the test set. (To get the ŷ character, type 'y\hat' followed by the TAB key.)


In [ ]:
ŷ = clf.predict(X_test)

In [ ]:
## a function that produces the confusion matrix: 1st parameter y=actual target, 2nd parameter ŷ=predicted
def binary_confusion_matrix(y, ŷ):
    TP = ((y+ŷ) ==  2).sum()
    TN = ((y+ŷ) ==  0).sum()
    FP = ((y-ŷ) == -1).sum()
    FN = ((y-ŷ) ==  1).sum()
    return pd.DataFrame( [[TP, FP], [FN, TN]], index=[['Prediction', 'Prediction'],['Yes', 'No']], columns=[['Actual', 'Actual'],['Yes', 'No']])

cm = binary_confusion_matrix(y_test, ŷ)
cm

In [ ]:
### Scikit-Learn can do that too ... though its output is a plain array, not as nicely labeled
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, ŷ)
cm

In [ ]:
### Here are some metrics 
from sklearn.metrics import classification_report
print(classification_report(y_test, ŷ))

In [ ]:


In [ ]:
### http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
import itertools
np.set_printoptions(precision=2)
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    ### normalize first, so that the heatmap and the printed values agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [ ]:
%matplotlib inline

fig = plt.figure()
plot_confusion_matrix(cm, classes=['No', 'Yes'], normalize=True, title='Normalized confusion matrix')
plt.show()

Step 5: Figure out how to improve and go back to Step 2 or 3

This is an experiment. What can we change to improve the performance of the model?

  • Include or exclude certain features
  • Scale or transform values of feature vectors
  • Identify outliers (noise) and remove them
  • Adjust parameters of the ML algorithm (see the sketch below)
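
A minimal sketch for the last point, assuming the train/test split from Step 3: vary max_depth and compare training and test accuracy; a widening gap indicates overfitting.

In [ ]:
### sweep max_depth; compare train vs. test accuracy
from sklearn.tree import DecisionTreeClassifier
for depth in [2, 3, 5, 8, 12, None]:
    clf_d = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf_d.fit(X_train, y_train)
    print(depth, clf_d.score(X_train, y_train), clf_d.score(X_test, y_test))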

In [ ]: