Exercises for Chapter 4

Imputing missing data

Load the Pima diabetes dataset as a pandas dataframe. (Note that the data does not include a header row. You'll have to build that yourself based on the documentation.)


In [61]:
import pandas

names = ['num_times_pregnant', 'glucose_concentration',
          'blood_pressure', 'skin_fold_thickness', 'insulin',
          'bmi', 'diabetes_pedigree', 'age', 'target']

data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/' +\
           'pima-indians-diabetes/pima-indians-diabetes.data'

df = pandas.read_csv(data_url, header=None, index_col=False, names=names)
print(df)


     num_times_pregnant  glucose_concentration  blood_pressure  \
0                     6                    148              72   
1                     1                     85              66   
2                     8                    183              64   
3                     1                     89              66   
4                     0                    137              40   
5                     5                    116              74   
6                     3                     78              50   
7                    10                    115               0   
8                     2                    197              70   
9                     8                    125              96   
10                    4                    110              92   
11                   10                    168              74   
12                   10                    139              80   
13                    1                    189              60   
14                    5                    166              72   
15                    7                    100               0   
16                    0                    118              84   
17                    7                    107              74   
18                    1                    103              30   
19                    1                    115              70   
20                    3                    126              88   
21                    8                     99              84   
22                    7                    196              90   
23                    9                    119              80   
24                   11                    143              94   
25                   10                    125              70   
26                    7                    147              76   
27                    1                     97              66   
28                   13                    145              82   
29                    5                    117              92   
..                  ...                    ...             ...   
738                   2                     99              60   
739                   1                    102              74   
740                  11                    120              80   
741                   3                    102              44   
742                   1                    109              58   
743                   9                    140              94   
744                  13                    153              88   
745                  12                    100              84   
746                   1                    147              94   
747                   1                     81              74   
748                   3                    187              70   
749                   6                    162              62   
750                   4                    136              70   
751                   1                    121              78   
752                   3                    108              62   
753                   0                    181              88   
754                   8                    154              78   
755                   1                    128              88   
756                   7                    137              90   
757                   0                    123              72   
758                   1                    106              76   
759                   6                    190              92   
760                   2                     88              58   
761                   9                    170              74   
762                   9                     89              62   
763                  10                    101              76   
764                   2                    122              70   
765                   5                    121              72   
766                   1                    126              60   
767                   1                     93              70   

     skin_fold_thickness  insulin   bmi  diabetes_pedigree  age  target  
0                     35        0  33.6              0.627   50       1  
1                     29        0  26.6              0.351   31       0  
2                      0        0  23.3              0.672   32       1  
3                     23       94  28.1              0.167   21       0  
4                     35      168  43.1              2.288   33       1  
5                      0        0  25.6              0.201   30       0  
6                     32       88  31.0              0.248   26       1  
7                      0        0  35.3              0.134   29       0  
8                     45      543  30.5              0.158   53       1  
9                      0        0   0.0              0.232   54       1  
10                     0        0  37.6              0.191   30       0  
11                     0        0  38.0              0.537   34       1  
12                     0        0  27.1              1.441   57       0  
13                    23      846  30.1              0.398   59       1  
14                    19      175  25.8              0.587   51       1  
15                     0        0  30.0              0.484   32       1  
16                    47      230  45.8              0.551   31       1  
17                     0        0  29.6              0.254   31       1  
18                    38       83  43.3              0.183   33       0  
19                    30       96  34.6              0.529   32       1  
20                    41      235  39.3              0.704   27       0  
21                     0        0  35.4              0.388   50       0  
22                     0        0  39.8              0.451   41       1  
23                    35        0  29.0              0.263   29       1  
24                    33      146  36.6              0.254   51       1  
25                    26      115  31.1              0.205   41       1  
26                     0        0  39.4              0.257   43       1  
27                    15      140  23.2              0.487   22       0  
28                    19      110  22.2              0.245   57       0  
29                     0        0  34.1              0.337   38       0  
..                   ...      ...   ...                ...  ...     ...  
738                   17      160  36.6              0.453   21       0  
739                    0        0  39.5              0.293   42       1  
740                   37      150  42.3              0.785   48       1  
741                   20       94  30.8              0.400   26       0  
742                   18      116  28.5              0.219   22       0  
743                    0        0  32.7              0.734   45       1  
744                   37      140  40.6              1.174   39       0  
745                   33      105  30.0              0.488   46       0  
746                   41        0  49.3              0.358   27       1  
747                   41       57  46.3              1.096   32       0  
748                   22      200  36.4              0.408   36       1  
749                    0        0  24.3              0.178   50       1  
750                    0        0  31.2              1.182   22       1  
751                   39       74  39.0              0.261   28       0  
752                   24        0  26.0              0.223   25       0  
753                   44      510  43.3              0.222   26       1  
754                   32        0  32.4              0.443   45       1  
755                   39      110  36.5              1.057   37       1  
756                   41        0  32.0              0.391   39       0  
757                    0        0  36.3              0.258   52       1  
758                    0        0  37.5              0.197   26       0  
759                    0        0  35.5              0.278   66       1  
760                   26       16  28.4              0.766   22       0  
761                   31        0  44.0              0.403   43       1  
762                    0        0  22.5              0.142   33       0  
763                   48      180  32.9              0.171   63       0  
764                   27        0  36.8              0.340   27       0  
765                   23      112  26.2              0.245   30       0  
766                    0        0  30.1              0.349   47       1  
767                   31        0  30.4              0.315   23       0  

[768 rows x 9 columns]

Check the dataframe to see which columns contain 0's. Based on the data type of each column, do these 0's all make sense? Which 0's are suspicious?


In [62]:
for name in names:
    print(name, ':', any(df.loc[:, name] == 0))


num_times_pregnant : True
glucose_concentration : True
blood_pressure : True
skin_fold_thickness : True
insulin : True
bmi : True
diabetes_pedigree : False
age : False
target : True

Answer: Columns 2-6 (glucose, blood pressure, skin fold thickness, insulin, and BMI) all contain zeros, but none of these measurements should ever be 0 in a human.

Assume that 0s indiciate missing values, and fix them in the dataset by eliminating samples with missing features. Then run a logistic regression, and measure the performance of the model.


In [63]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

for i in range(1,6):
    df.loc[df.loc[:, names[i]] == 0, names[i]] = np.nan

df_no_nan = df.dropna(axis=0, how='any')

X = df_no_nan.iloc[:, :8].values
y = df_no_nan.iloc[:, 8].values

def fit_and_score_rlr(X, y, normalize=True):
    
    if normalize:
        scaler = StandardScaler().fit(X)
        X_std = scaler.transform(X)
    else:
        X_std = X
        
    X_train, X_test, y_train, y_test = train_test_split(X_std, y,
                                                        test_size=0.33,
                                                        random_state=42)

    rlr = LogisticRegression(C=1)

    rlr.fit(X_train, y_train)
    return rlr.score(X_test, y_test)

fit_and_score_rlr(X, y)


Out[63]:
0.7384615384615385

Next, replace missing features through mean imputation. Run a regression and measure the performance of the model.


In [64]:
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values='NaN', strategy='mean', axis=1)
X = imputer.fit_transform(df.iloc[:, :8].values)

y = df.iloc[:, 8].values

fit_and_score_rlr(X, y)


Out[64]:
0.76377952755905509

Comment on your results.

Answer: Interestingly, there's not a huge performance improvement between the two approaches! In my run, using mean imputation corresponded to about a 3 point increase in model performance. Some ideas for why this might be:

  1. This is a small dataset to start out with, so removing ~half its samples doesn't change performance very much
  2. There's not much information contained in the features with missing data
  3. There are other effects underlying poor performance of the model (e.g. regularization parameters) that are having a greater impact

Preprocessing categorical variables

Load the TA evaluation dataset. As before, the data and header are split into two files, so you'll have to combine them yourself.


In [65]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data'

names = ['native_speaker', 'instructor', 'course', 'season', 'class_size', 'rating']

df = pandas.read_csv(data_url, header=None, index_col=False, names=names)
print(df)


     native_speaker  instructor  course  season  class_size  rating
0                 1          23       3       1          19       3
1                 2          15       3       1          17       3
2                 1          23       3       2          49       3
3                 1           5       2       2          33       3
4                 2           7      11       2          55       3
5                 2          23       3       1          20       3
6                 2           9       5       2          19       3
7                 2          10       3       2          27       3
8                 1          22       3       1          58       3
9                 2          15       3       1          20       3
10                2          10      22       2           9       3
11                2          13       1       2          30       3
12                2          18      21       2          29       3
13                2           6      17       2          39       3
14                2           6      17       2          42       2
15                2           6      17       2          43       2
16                2           7      11       2          10       2
17                2          22       3       2          46       2
18                2          13       3       1          10       2
19                2           7      25       2          42       2
20                2          25       7       2          27       2
21                2          25       7       2          23       2
22                2           2       9       2          31       2
23                2           1      15       1          22       2
24                2          15      13       2          37       2
25                2           7      11       2          13       2
26                2           8       3       2          24       2
27                2          14      15       2          38       2
28                2          21       2       2          42       1
29                2          22       3       2          28       1
..              ...         ...     ...     ...         ...     ...
121               2          13      14       2          17       3
122               2           9       6       2           7       3
123               1          10       3       2          21       3
124               2          14      15       2          36       3
125               1          13       1       2          54       3
126               1           8       3       2          29       3
127               2          20       2       2          45       3
128               2          22       1       2          11       2
129               2          18      12       2          16       2
130               2          20      15       2          18       2
131               1          17      18       2          44       2
132               2          14      23       2          17       2
133               2          24      26       2          21       2
134               2           9      24       2          20       2
135               2          12       8       2          24       2
136               2           9       6       2           5       2
137               2          22       1       2          42       2
138               2           7      11       2          30       1
139               2          10       3       2          19       1
140               2          23       3       2          11       1
141               2          17      18       2          29       1
142               2          16      20       2          15       1
143               2           3       2       2          37       1
144               2          19       4       2          10       1
145               2          23       3       2          24       1
146               2           3       2       2          26       1
147               2          10       3       2          12       1
148               1          18       7       2          48       1
149               2          22       1       2          51       1
150               2           2      10       2          27       1

[151 rows x 6 columns]

Which of the features are categorical? Are they ordinal, or nominal? Which features are numeric?

Answer: According to the documentation:

  1. Native speaker: categorical (nominal)
  2. Instructor: categorical (nominal)
  3. Course: categorical (nominal)
  4. Season: categorical (nominal)
  5. Class size: numeric
  6. Rating: categorical (ordinal)

Encode the categorical variables in a naive fashion, by leaving them in place as numerics. Run a classification and measure performance against a test set.


In [70]:
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

fit_and_score_rlr(X, y, normalize=True)


/home/jean/.virtualenvs/learning/lib/python3.5/site-packages/sklearn/utils/validation.py:429: DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
  warnings.warn(msg, _DataConversionWarning)
Out[70]:
0.54000000000000004

Now, encode the categorical variables with a one-hot encoder. Again, run a classification and measure performance.


In [71]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(categorical_features=range(5))
X_encoded = enc.fit_transform(X)

fit_and_score_rlr(X_encoded, y, normalize=False)


Out[71]:
0.56000000000000005

Comment on your results.


In [ ]:

Feature scaling

Raschka mentions that decision trees and random forests do not require standardized features prior to classification, while the rest of the classifiers we've seen so far do. Why might that be? Explain the intuition behind this idea based on the differences between tree-based classifiers and the other classifiers we've seen.

Now, we'll test the two scaling algorithms on the wine dataset. Start by loading the wine dataset.


In [ ]:

Scale the features via "standardization" (as Raschka describes it). Classify and measure performance.


In [ ]:

Scale the features via "normalization" (as Raschka describes it). Again, classify and measure performance.


In [ ]:

Comment on your results.


In [ ]:

Feature selection

  • Implement SBS below. Then, run the tests.

In [68]:
class SBS(object):
    """
    Class to select the k-best features in a dataset via sequential backwards selection.
    """
    def __init__(self):
        """
        Initialize the SBS model.
        """
        pass
    
    def fit(self):
        """
        Fit SBS to a dataset.
        """
        pass
    
    def transform(self):
        """
        Transform a dataset based on the model.
        """
        pass
    
    def fit_transform(self):
        """
        Fit SBS to a dataset and transform it, returning the k-best features.
        """
        pass

Now, we'll practice feature selection. Start by loading the breast cancer dataset.


In [ ]:

Use a random forest to determine the feature importances. Plot the features and their importances.


In [ ]:

Use L1 regularization with a standard C value (0.1) to eliminate low-information features. Again, plot the feature importances using the coef_ attribute of the model.


In [ ]:

How do the feature importances from the random forest/L1 regularization compare?