HW 3: KNN & Random Forest

Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y, yes/no).

Assignment

  • Preprocess your data (you may find LabelEncoder useful; a sketch follows this list)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curves (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?
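A minimal sketch of the LabelEncoder route mentioned in the first bullet (not run in this notebook, which uses get_dummies instead; `label_encode` is a hypothetical helper):

In [ ]:
from sklearn.preprocessing import LabelEncoder

def label_encode(df):
    # map every object-typed column to integer codes; note this imposes an
    # arbitrary ordering on the categories, which get_dummies avoids
    df = df.copy()
    for col in df.columns[df.dtypes == object]:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df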

In [2]:
# Download data:

! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"


--2015-01-12 13:42:24--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
Resolving archive.ics.uci.edu... 128.195.1.95
Connecting to archive.ics.uci.edu|128.195.1.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 579043 (565K) [application/zip]
Saving to: 'bank.zip.1'

100%[======================================>] 579,043     3.38MB/s   in 0.2s   

2015-01-12 13:42:25 (3.38 MB/s) - 'bank.zip.1' saved [579043/579043]

--2015-01-12 13:42:25--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
Resolving archive.ics.uci.edu... 128.195.1.95
Connecting to archive.ics.uci.edu|128.195.1.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 444572 (434K) [application/zip]
Saving to: 'bank-additional.zip'

100%[======================================>] 444,572     --.-K/s   in 0.1s    

2015-01-12 13:42:25 (3.61 MB/s) - 'bank-additional.zip' saved [444572/444572]


In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split

%matplotlib inline

In [2]:
ldata = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=';') # ldata means large data
sdata = pd.read_csv("./bank-additional/bank-additional.csv", sep=';') # sdata means... yup, small.

In [3]:
sdata.head()


Out[3]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns


In [4]:
sdata.describe()


Out[4]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000
mean 40.113620 256.788055 2.537266 960.422190 0.190337 0.084972 93.579704 -40.499102 3.621356 5166.481695
std 10.313362 254.703736 2.568159 191.922786 0.541788 1.563114 0.579349 4.594578 1.733591 73.667904
min 18.000000 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.635000 4963.600000
25% 32.000000 103.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.334000 5099.100000
50% 38.000000 181.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 47.000000 317.000000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 88.000000 3643.000000 35.000000 999.000000 6.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000

In [5]:
test = sdata.copy()
test.drop('y', axis=1)  # note: drop returns a new frame; `test` itself keeps its 'y' column


Out[5]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 30 blue-collar married basic.9y no yes no cellular may fri 487 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1
1 39 services single high.school no no no telephone may fri 346 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0
2 25 services married high.school no yes no telephone jun wed 227 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1
3 38 services married basic.9y no unknown unknown telephone jun fri 17 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1
4 47 admin. married university.degree no yes no cellular nov mon 58 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8
5 32 services single university.degree no no no cellular sep thu 128 3 999 2 failure -1.1 94.199 -37.5 0.884 4963.6
6 32 admin. single university.degree no yes no cellular sep mon 290 4 999 0 nonexistent -1.1 94.199 -37.5 0.879 4963.6
7 41 entrepreneur married university.degree unknown yes no cellular nov mon 44 2 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8
8 31 services divorced professional.course no no no cellular nov tue 68 1 999 1 failure -0.1 93.200 -42.0 4.153 5195.8
9 35 blue-collar married basic.9y unknown no no telephone may thu 170 1 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0
10 25 services single basic.6y unknown yes no cellular jul thu 301 1 999 0 nonexistent 1.4 93.918 -42.7 4.958 5228.1
11 36 self-employed single basic.4y no no no cellular jul thu 148 1 999 0 nonexistent 1.4 93.918 -42.7 4.968 5228.1
12 36 admin. married high.school no no no telephone may wed 97 2 999 0 nonexistent 1.1 93.994 -36.4 4.859 5191.0
13 47 blue-collar married basic.4y no yes no telephone jun thu 211 2 999 0 nonexistent 1.4 94.465 -41.8 4.958 5228.1
14 29 admin. single high.school no no no cellular may fri 553 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1
15 27 services single university.degree no no no cellular jul wed 698 2 999 0 nonexistent 1.4 93.918 -42.7 4.963 5228.1
16 44 admin. divorced university.degree no no no cellular jul wed 191 6 999 0 nonexistent 1.4 93.918 -42.7 4.957 5228.1
17 46 admin. divorced university.degree no yes no telephone jul mon 59 4 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1
18 45 entrepreneur married university.degree unknown yes yes cellular aug mon 38 2 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1
19 50 blue-collar married basic.4y no no yes cellular jul tue 849 1 999 0 nonexistent 1.4 93.918 -42.7 4.961 5228.1
20 55 services married basic.6y unknown yes no cellular jul tue 326 6 999 0 nonexistent 1.4 93.918 -42.7 4.962 5228.1
21 39 technician divorced high.school no no no cellular mar mon 222 1 12 2 success -1.8 93.369 -34.8 0.639 5008.7
22 29 technician single university.degree no yes yes cellular aug wed 626 3 999 0 nonexistent 1.4 93.444 -36.1 4.967 5228.1
23 40 management married high.school no no yes cellular aug wed 119 1 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1
24 44 technician married professional.course unknown yes no telephone may fri 388 7 999 0 nonexistent 1.1 93.994 -36.4 4.864 5191.0
25 38 technician married professional.course no yes no cellular aug mon 479 1 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1
26 36 technician divorced professional.course no no no telephone may wed 446 1 999 0 nonexistent 1.1 93.994 -36.4 4.856 5191.0
27 28 blue-collar married basic.6y unknown no no cellular may mon 68 2 999 1 failure -1.8 92.893 -46.2 1.299 5099.1
28 47 admin. single unknown unknown no no telephone may thu 127 1 999 0 nonexistent 1.1 93.994 -36.4 4.860 5191.0
29 34 admin. married university.degree no no no cellular aug tue 109 1 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4089 25 admin. single university.degree no yes yes cellular oct fri 115 1 999 1 failure -3.4 92.431 -26.9 0.739 5017.5
4090 43 blue-collar married basic.4y unknown yes yes telephone may tue 593 2 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0
4091 38 management married high.school unknown no no telephone may thu 879 2 999 0 nonexistent 1.1 93.994 -36.4 4.860 5191.0
4092 30 blue-collar single high.school no no no telephone jul wed 71 1 999 0 nonexistent 1.4 93.918 -42.7 4.956 5228.1
4093 56 retired married basic.4y unknown no no cellular jul tue 580 3 999 0 nonexistent 1.4 93.918 -42.7 4.961 5228.1
4094 62 blue-collar married basic.4y no yes no cellular nov mon 152 1 6 1 success -3.4 92.649 -30.1 0.719 5017.5
4095 36 admin. single university.degree no no yes cellular aug fri 69 2 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1
4096 33 services married high.school no no no telephone may mon 146 2 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0
4097 41 blue-collar divorced basic.9y no no no cellular aug tue 102 1 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1
4098 34 housemaid single university.degree no yes no cellular aug thu 159 3 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1
4099 58 admin. divorced high.school no no no cellular aug tue 290 1 999 0 nonexistent 1.4 93.444 -36.1 4.963 5228.1
4100 41 admin. divorced high.school no no no cellular apr fri 620 1 999 0 nonexistent -1.8 93.075 -47.1 1.405 5099.1
4101 35 entrepreneur single university.degree no yes no cellular jul mon 88 5 999 0 nonexistent 1.4 93.918 -42.7 4.960 5228.1
4102 31 blue-collar single basic.9y unknown no yes telephone jun fri 70 2 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1
4103 43 services married high.school no no no telephone may mon 77 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0
4104 42 technician divorced professional.course no yes no cellular aug mon 408 1 999 0 nonexistent 1.4 93.444 -36.1 4.970 5228.1
4105 47 housemaid married basic.4y unknown yes no telephone jul tue 159 2 999 0 nonexistent 1.4 93.918 -42.7 4.961 5228.1
4106 45 entrepreneur divorced basic.9y no yes no cellular may tue 29 3 999 0 nonexistent -1.8 92.893 -46.2 1.344 5099.1
4107 36 admin. married university.degree unknown yes no cellular aug wed 155 11 999 0 nonexistent 1.4 93.444 -36.1 4.964 5228.1
4108 32 admin. married university.degree no yes no telephone may thu 151 5 999 0 nonexistent -1.8 92.893 -46.2 1.266 5099.1
4109 63 retired married high.school no no no cellular oct wed 1386 1 999 0 nonexistent -3.4 92.431 -26.9 0.740 5017.5
4110 53 housemaid divorced basic.6y unknown unknown unknown telephone may fri 85 2 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0
4111 30 technician married university.degree no no yes cellular jun fri 131 1 999 1 failure -1.7 94.055 -39.8 0.748 4991.6
4112 31 technician single professional.course no yes no cellular nov thu 155 1 999 0 nonexistent -0.1 93.200 -42.0 4.076 5195.8
4113 31 admin. single university.degree no yes no cellular nov thu 463 1 999 0 nonexistent -0.1 93.200 -42.0 4.076 5195.8
4114 30 admin. married basic.6y no yes yes cellular jul thu 53 1 999 0 nonexistent 1.4 93.918 -42.7 4.958 5228.1
4115 39 admin. married high.school no yes no telephone jul fri 219 1 999 0 nonexistent 1.4 93.918 -42.7 4.959 5228.1
4116 27 student single high.school no no no cellular may mon 64 2 999 1 failure -1.8 92.893 -46.2 1.354 5099.1
4117 58 admin. married high.school no no no cellular aug fri 528 1 999 0 nonexistent 1.4 93.444 -36.1 4.966 5228.1
4118 34 management single high.school no yes no cellular nov wed 175 1 999 0 nonexistent -0.1 93.200 -42.0 4.120 5195.8

4119 rows × 20 columns


In [6]:
# 1. Preprocessing!
# Instead of label encoding, I'm just going to use get_dummies.
# Then we'll scale the numerical (non-categorical) values.
def dummies_and_scale(df):
    # separate the target; note `target` is a reference into df, so the
    # inplace replaces below also rewrite the caller's 'y' column
    # (that's why sdata['y'].value_counts() shows 0/1 in Out[9])
    target = df['y']
    target.replace('no', '0', inplace=True)
    target.replace('yes', '1', inplace=True)  # labels end up as the strings '0'/'1'
    df = df.drop('y', axis=1)
    # scale the numerical values (describe() only covers numeric columns)
    numeric_cols = df[df.describe().columns]
    scaled_cols = pd.DataFrame(preprocessing.scale(numeric_cols), index=df.index, columns=df.describe().columns)
    df[df.describe().columns] = scaled_cols
    # one-hot encode whatever is left (the categorical columns)
    dummies = pd.get_dummies(df)

    return dummies, target
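For comparison, a side-effect-free variant (a sketch with the hypothetical name `dummies_and_scale_pure`, not run here) that maps the target to real integers on a copy, leaving the caller's dataframe untouched:

In [ ]:
def dummies_and_scale_pure(df):
    df = df.copy()
    target = df.pop('y').map({'no': 0, 'yes': 1})  # integer labels, no mutation of the caller
    numeric = df.describe().columns                # describe() covers only numeric columns
    df[numeric] = preprocessing.scale(df[numeric])
    return pd.get_dummies(df), target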

In [7]:
data, target = dummies_and_scale(sdata)

In [8]:
target.value_counts()


Out[8]:
0    3668
1     451
dtype: int64

In [9]:
sdata['y'].value_counts()


Out[9]:
0    3668
1     451
dtype: int64

In [10]:
sdata.describe()


Out[10]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
count 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000
mean 40.113620 256.788055 2.537266 960.422190 0.190337 0.084972 93.579704 -40.499102 3.621356 5166.481695
std 10.313362 254.703736 2.568159 191.922786 0.541788 1.563114 0.579349 4.594578 1.733591 73.667904
min 18.000000 0.000000 1.000000 0.000000 0.000000 -3.400000 92.201000 -50.800000 0.635000 4963.600000
25% 32.000000 103.000000 1.000000 999.000000 0.000000 -1.800000 93.075000 -42.700000 1.334000 5099.100000
50% 38.000000 181.000000 2.000000 999.000000 0.000000 1.100000 93.749000 -41.800000 4.857000 5191.000000
75% 47.000000 317.000000 3.000000 999.000000 0.000000 1.400000 93.994000 -36.400000 4.961000 5228.100000
max 88.000000 3643.000000 35.000000 999.000000 6.000000 1.400000 94.767000 -26.900000 5.045000 5228.100000

In [11]:
data.describe()


Out[11]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed ... month_oct month_sep day_of_week_fri day_of_week_mon day_of_week_thu day_of_week_tue day_of_week_wed poutcome_failure poutcome_nonexistent poutcome_success
count 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 4.119000e+03 ... 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000 4119.000000
mean -1.794038e-16 -5.175111e-18 5.045733e-17 -2.022606e-16 -3.708830e-17 -2.415052e-17 1.332074e-14 -1.932041e-16 1.397280e-16 1.656036e-16 ... 0.016752 0.015538 0.186453 0.207575 0.208789 0.204176 0.193008 0.110221 0.855305 0.034474
std 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 1.000121e+00 ... 0.128355 0.123693 0.389519 0.405620 0.406492 0.403147 0.394707 0.313203 0.351836 0.182466
min -2.144432e+00 -1.008306e+00 -5.986595e-01 -5.004819e+00 -3.513560e-01 -2.229776e+00 -2.380037e+00 -2.242241e+00 -1.722850e+00 -2.754338e+00 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -7.868050e-01 -6.038652e-01 -5.986595e-01 2.010313e-01 -3.513560e-01 -1.206054e+00 -8.712637e-01 -4.790790e-01 -1.319592e+00 -9.147793e-01 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% -2.049648e-01 -2.975899e-01 -2.092283e-01 2.010313e-01 -3.513560e-01 6.494413e-01 2.922527e-01 -2.831721e-01 7.128522e-01 3.328625e-01 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
75% 6.677955e-01 2.364286e-01 1.802029e-01 2.010313e-01 -3.513560e-01 8.413892e-01 7.151926e-01 8.922691e-01 7.728506e-01 8.365351e-01 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
max 4.643704e+00 1.329632e+01 1.264200e+01 2.010313e-01 1.072442e+01 8.413892e-01 2.049611e+00 2.960175e+00 8.213108e-01 8.365351e-01 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 63 columns


In [12]:
# Just double checking: after scaling, std is close to 1 and mean is basically 0 (except for the dummy columns, obviously).
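The same check, programmatically (a sketch; note pandas' std uses ddof=1 while preprocessing.scale uses ddof=0, which is where the 1.000121 stds in Out[11] come from):

In [ ]:
numeric = sdata.describe().columns
print np.allclose(data[numeric].mean(), 0)            # scaled means ~ 0
print np.allclose(data[numeric].std(), 1, atol=1e-3)  # stds ~ 1 up to the ddof difference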

Training the Models & Plotting the Learning Curves


In [13]:
from sklearn.learning_curve import learning_curve
from sklearn import cross_validation


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 10)):
    """
Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [14]:
# Above from: http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html

In [15]:
cv = cross_validation.ShuffleSplit(data.shape[0], n_iter=2,
                                   test_size=0.2, random_state=0)
estimator=KNeighborsClassifier()
plot_learning_curve(estimator, "KNN", data, target, cv=cv, n_jobs=4)

plt.show()
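This curve varies only the training-set size with the default n_neighbors=5. To choose n_neighbors itself, a simple sweep with cross_val_score is enough (a sketch, not run here; the k values are arbitrary):

In [ ]:
from sklearn.cross_validation import cross_val_score

for k in (1, 3, 5, 11, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), data, target, cv=3)
    print k, scores.mean()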



In [16]:
cv = cross_validation.ShuffleSplit(data.shape[0], n_iter=2,
                                   test_size=0.2, random_state=0)
estimator=RandomForestClassifier()
plot_learning_curve(estimator, "Random Forest", data, target, cv=cv, n_jobs=4)

plt.show()



In [17]:
# Let's try again with a different cross-validation scheme (plain 2-fold instead of ShuffleSplit):

cv = 2
estimator=KNeighborsClassifier()
plot_learning_curve(estimator, "KNN", data, target, cv=cv, n_jobs=4)

plt.show()



In [20]:
# That is looking better! I'll try random forest too...

cv = 2
estimator=RandomForestClassifier()
plot_learning_curve(estimator, "Random Forest", data, target, cv=cv)

plt.show()



In [21]:
# Ok, so after running the larger set as well, I suspect a few features may be dominating the model.

In [22]:
from sklearn.cross_validation import train_test_split

In [25]:
train_test_split?  # look up the signature first (the ? help syntax can't be combined with assignment)

In [26]:
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.2)

In [27]:
estimator = RandomForestClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)

In [28]:
from sklearn.metrics import confusion_matrix

In [29]:
# sklearn's convention is confusion_matrix(y_true, y_pred); with (y_pred, y_test)
# the rows below are predictions and the columns are true labels
confusion_matrix(y_pred, y_test)


Out[29]:
array([[716,  65],
       [ 16,  27]])
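Reading this matrix (remembering the swapped arguments, so rows are predictions): 716 'no' predictions are correct, 16 clients were wrongly flagged 'yes', and of the 92 actual subscribers (65 + 27) only 27 were caught. Hand-computed, precision for 'yes' is 27/(27+16) ≈ 0.63 and recall is 27/92 ≈ 0.29.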

In [31]:
estimator.feature_importances_.max()


Out[31]:
0.23830676743965801

In [32]:
estimator.feature_importances_.sum()


Out[32]:
1.0

In [ ]:
# So for random forest, the single most important feature carries only ~24% of the total importance

In [33]:
# Let's try the KNN too
estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
confusion_matrix(y_pred, y_test)  # same (pred, true) argument order as above


Out[33]:
array([[710,  56],
       [ 22,  36]])

In [36]:
# So random forest has slightly better 'yes' precision here (0.63 vs 0.62),
# though KNN actually catches more of the subscribers (recall 0.39 vs 0.29)

In [38]:
from sklearn.grid_search import GridSearchCV as gs
rf = RandomForestClassifier()
rf_parameters = {'n_estimators': range(10, 50, 10)}

In [40]:
rf.fit(x_train, y_train)  # this fit is redundant; GridSearchCV refits on its own
rfg = gs(rf, rf_parameters)
rfg.fit(x_train, y_train)


Out[40]:
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [10, 20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [41]:
rfg.best_params_


Out[41]:
{'n_estimators': 30}

In [42]:
rf = RandomForestClassifier()
rf_parameters = {'n_estimators': range(25, 50, 1)}
rf.fit(x_train, y_train)  # again redundant before the grid search
rfg = gs(rf, rf_parameters)
rfg.fit(x_train, y_train)
rfg.best_params_


Out[42]:
{'n_estimators': 32}
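Grid-searching only n_estimators leaves tree depth and feature subsampling at their defaults; widening the grid is cheap to sketch (hypothetical parameter values, not run here):

In [ ]:
rf_parameters = {'n_estimators': [32],
                 'max_depth': [None, 5, 10],
                 'max_features': ['sqrt', 'log2', None]}
rfg_wide = gs(RandomForestClassifier(), rf_parameters)
rfg_wide.fit(x_train, y_train)
print rfg_wide.best_params_, rfg_wide.best_score_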

In [43]:
rfg.best_score_


Out[43]:
0.90986342943854326

In [44]:
# That seems too good? Note the baseline, though: always predicting 'no' is
# already right ~89% of the time (3668/4119), so 0.91 accuracy is a modest gain.
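A quick look at that majority-class floor (a sketch; `target` is the full label Series from the preprocessing step):

In [ ]:
counts = target.value_counts()
print counts / float(counts.sum())  # ~0.89 for '0': the accuracy floor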

In [45]:
from sklearn.metrics import classification_report

In [46]:
classification_report?

In [47]:
# note: y_pred here is still the KNN prediction from In [33], not the grid-searched forest
classification_report(y_test, y_pred)


Out[47]:
'             precision    recall  f1-score   support\n\n          0       0.93      0.97      0.95       732\n          1       0.62      0.39      0.48        92\n\navg / total       0.89      0.91      0.90       824\n'

In [48]:
import pprint

In [49]:
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.93      0.97      0.95       732
          1       0.62      0.39      0.48        92

avg / total       0.89      0.91      0.90       824


In [50]:
rf = RandomForestClassifier(n_estimators=32).fit(x_train, y_train)
y_pred = rf.predict(x_test)
print confusion_matrix(y_test, y_pred)  # (true, pred) order this time, so rows are true labels


[[715  17]
 [ 63  29]]

In [51]:
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.92      0.98      0.95       732
          1       0.63      0.32      0.42        92

avg / total       0.89      0.90      0.89       824


In [56]:
# Bug alert: rf was trained on the dummy-encoded `data` (63 columns), but this
# zips its 63 importances against sdata's 21 raw column names, so the labels
# below are wrong (note 'y' even shows up with a nonzero "importance").
features = zip(rf.feature_importances_, sdata.columns)

In [61]:
features.sort(reverse=True)

In [62]:
features


Out[62]:
[(0.23787866999090784, 'job'),
 (0.092757495516714999, 'month'),
 (0.074543934592155192, 'age'),
 (0.058628621282640972, 'day_of_week'),
 (0.032383812326831657, 'marital'),
 (0.03207739641329347, 'contact'),
 (0.030651974427814175, 'education'),
 (0.028920355419687741, 'housing'),
 (0.027574574885258604, 'loan'),
 (0.015489395304804142, 'default'),
 (0.014177318492882254, 'duration'),
 (0.0098731345417975187, 'nr.employed'),
 (0.0082567916457971686, 'campaign'),
 (0.0072101629135299236, 'emp.var.rate'),
 (0.0070700707499528944, 'poutcome'),
 (0.0054013289400914599, 'cons.conf.idx'),
 (0.0043193754372234001, 'y'),
 (0.0037179746568354031, 'euribor3m'),
 (0.0033291093715349398, 'cons.price.idx'),
 (0.0030211469603017029, 'pdays'),
 (0.00207814438331417, 'previous')]
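A corrected pairing, for reference (a sketch, not run here): zip against the columns the forest was actually trained on, then optionally fold the one-hot columns back into their source feature. This assumes, as get_dummies' naming guarantees, that every dummy column name starts with its source column's name.

In [ ]:
imp = sorted(zip(rf.feature_importances_, data.columns), reverse=True)
print imp[:10]

# re-aggregate dummy-column importances by original feature
agg = dict.fromkeys(sdata.columns.drop('y'), 0.0)
for w, name in zip(rf.feature_importances_, data.columns):
    source = max((c for c in agg if name.startswith(c)), key=len)
    agg[source] += w
print sorted(((v, k) for k, v in agg.items()), reverse=True)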

In [69]:
# top 7 copied from the (mislabeled) list above
seven = [
(0.23787866999090784, 'job'),
 (0.092757495516714999, 'month'),
 (0.074543934592155192, 'age'),
 (0.058628621282640972, 'day_of_week'),
 (0.032383812326831657, 'marital'),
 (0.03207739641329347, 'contact'),
 (0.030651974427814175, 'education')
]

In [72]:
f7 = [i for i,j in seven]

In [74]:
sum(f7)


Out[74]:
0.5589219045503583

In [ ]:
# So the top 7 entries cover about 56% of the importance mass...

In [1]:
# same idea with more entries (note: this list actually holds 11 tuples)
ten = [
(0.23787866999090784, 'job'),
 (0.092757495516714999, 'month'),
 (0.074543934592155192, 'age'),
 (0.058628621282640972, 'day_of_week'),
 (0.032383812326831657, 'marital'),
 (0.03207739641329347, 'contact'),
 (0.030651974427814175, 'education'),
 (0.028920355419687741, 'housing'),
 (0.027574574885258604, 'loan'),
 (0.015489395304804142, 'default'),
 (0.014177318492882254, 'duration')
]

f10 = [i for i,j in ten]
sum(f10)


Out[1]:
0.645083548652991

In [3]:
# So the next few features add another ~10% or so... it's hard to know how many I'd actually want to drop.

all_features = [i for i,j in [(0.23787866999090784, 'job'),
 (0.092757495516714999, 'month'),
 (0.074543934592155192, 'age'),
 (0.058628621282640972, 'day_of_week'),
 (0.032383812326831657, 'marital'),
 (0.03207739641329347, 'contact'),
 (0.030651974427814175, 'education'),
 (0.028920355419687741, 'housing'),
 (0.027574574885258604, 'loan'),
 (0.015489395304804142, 'default'),
 (0.014177318492882254, 'duration'),
 (0.0098731345417975187, 'nr.employed'),
 (0.0082567916457971686, 'campaign'),
 (0.0072101629135299236, 'emp.var.rate'),
 (0.0070700707499528944, 'poutcome'),
 (0.0054013289400914599, 'cons.conf.idx'),
 (0.0043193754372234001, 'y'),
 (0.0037179746568354031, 'euribor3m'),
 (0.0033291093715349398, 'cons.price.idx'),
 (0.0030211469603017029, 'pdays'),
 (0.00207814438331417, 'previous')]]

sum(all_features)


Out[3]:
0.6993607882533698

In [4]:
# All 21 entries sum to only 0.699, even though Out[32] showed the full importance
# vector sums to 1.0. That's the zip bug again: these are just the first 21 of the
# 63 dummy-column importances.

In [75]:
# How might we use certain features? It seems that if we can reduce the complexity of the model, it would be good.
# Yet, all those little features add up...

# One idea is to do more grid search on the variables that matter most, 
# then see how close you can get to the whole feature list

# Basically, if we can simplify the model by getting rid of irrelevant features, that is ideal.
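A sketch of that idea (not run here; the 0.01 threshold is an arbitrary choice): keep only columns whose importance clears the threshold, refit on a fresh split, and compare classification reports.

In [ ]:
keep = data.columns[rf.feature_importances_ > 0.01]
xs_train, xs_test, ys_train, ys_test = train_test_split(data[keep], target, test_size=.2)
rf_small = RandomForestClassifier(n_estimators=32).fit(xs_train, ys_train)
print classification_report(ys_test, rf_small.predict(xs_test))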

In [ ]: