Copyright (c) 2015, 2016 Sebastian Raschka
Li-Yi Wei, 2016
Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).
In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,sklearn
The use of watermark is optional. You can install this IPython extension via "pip install watermark". For more information, please see: https://github.com/rasbt/watermark.
In [2]:
from IPython.display import Image
%matplotlib inline
# Added version check for the scikit-learn 0.18 API changes
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
The training data might be incomplete for various reasons
Most machine learning algorithms/implementations cannot robustly deal with missing data
Thus we need to handle missing data before training models
We use the pandas (Python data analysis) library to deal with missing data in the examples below
In [3]:
import pandas as pd
from io import StringIO
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)
df = pd.read_csv(StringIO(csv_data))
df
Out[3]:
The columns (A, B, C, D) are features.
The rows (0, 1, 2) are samples.
Missing values become NaN (not a number).
In [4]:
df.isnull()
Out[4]:
In [5]:
df.isnull().sum(axis=0)
Out[5]:
In [6]:
# the default is to drop samples/rows
df.dropna()
Out[6]:
In [7]:
# but we can also elect to drop features/columns
df.dropna(axis=1)
Out[7]:
In [8]:
# only drop rows where all columns are NaN
df.dropna(how='all')
Out[8]:
In [9]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
Out[9]:
In [10]:
# only drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])
Out[10]:
Dropping data might not be desirable, as the resulting data set might become too small.
Interpolating missing values from existing ones can preserve the original data better.
Imputation: in statistics, the process of replacing missing data with substituted values
In [11]:
from sklearn.preprocessing import Imputer
# imputation strategy options include mean, median, and most_frequent
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df)
imputed_data = imr.transform(df.values)
imputed_data
Out[11]:
For example, 7.5 is the average of 3 and 12. 6 is the average of 4 and 8.
In [12]:
df.values
Out[12]:
We can do better than this by selecting only the most similar rows for interpolation, instead of all rows. This is how recommendation systems can work, e.g. predicting your potential rating of a movie or book you have not seen, based on item ratings from you and other users.
Programming Collective Intelligence: Building Smart Web 2.0 Applications, by Toby Segaran
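As an illustrative sketch (not part of the original notebook): newer scikit-learn versions provide a KNNImputer that fills each missing entry from the most similar rows instead of the column mean.
# Sketch: impute from the most similar rows instead of the column mean.
# Assumes scikit-learn >= 0.22, which provides KNNImputer (newer than the version used in this notebook).
import numpy as np
from sklearn.impute import KNNImputer

X_demo = np.array([[1.0, 2.0, 3.0, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [10.0, 11.0, 12.0, np.nan]])

# each missing entry is filled from the 2 most similar rows (nearest neighbors)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X_demo))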
Transformer class for data transformation
Key methods: fit() learns parameters from the data; transform() applies them
Good API designs are consistent. For example, the fit() method has a similar meaning across different classes, such as transformers and estimators.
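As a small, self-contained illustration (not in the original notebook), the Imputer used above follows the same fit/transform pattern as every other scikit-learn transformer, and fit_transform() simply combines the two steps:
# Sketch of the consistent transformer API, using the Imputer from above as an example.
# fit() learns the per-column means; transform() fills the NaNs with them.
import numpy as np
from sklearn.preprocessing import Imputer   # scikit-learn < 0.20, as used in this notebook

X_demo = np.array([[1.0, 2.0, np.nan],
                   [4.0, np.nan, 6.0],
                   [7.0, 8.0, 9.0]])

imr_demo = Imputer(missing_values='NaN', strategy='mean', axis=0)
print(imr_demo.fit(X_demo).transform(X_demo))   # two steps: fit, then transform
print(imr_demo.fit_transform(X_demo))           # one step: equivalent shortcut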
There are different types of feature data: numerical and categorical. Numerical features are numbers, often "continuous" like real numbers. Categorical features are "discrete" and can be either nominal (unordered) or ordinal (ordered).
In the example below, color is nominal, size is ordinal, price is numerical, and classlabel holds the class labels.
A given dataset can contain features of different types. It is important to handle them carefully. For example, do not treat nominal values as numbers without a proper mapping.
In [13]:
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
Out[13]:
For some estimators such as decision trees that handle one feature at a time, it is OK to keep the features as they are.
However, for other estimators that need to handle multiple features together, we need to convert the categorical features into compatible numerical forms before proceeding:
Ordinal features can be converted into numbers, but the conversion often depends on semantics and thus needs to be specified manually (by a human) instead of automatically (by a machine).
In the example below, we can map sizes into numbers. Intuitively, larger sizes should map to larger values. Exactly which values to map to is often a judgment call.
Below, we use a Python dictionary to define the mapping.
In [14]:
size_mapping = {'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
Out[14]:
In [15]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
Out[15]:
In [16]:
import numpy as np
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping
Out[16]:
In [17]:
# forward map
df['classlabel'] = df['classlabel'].map(class_mapping)
df
Out[17]:
In [18]:
# inverse map
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df
Out[18]:
We can use LabelEncoder in scikit-learn to convert class labels automatically.
In [19]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
Out[19]:
In [20]:
class_le.inverse_transform(y)
Out[20]:
However, unlike class labels, we cannot simply convert nominal features (such as colors) directly into integers.
A common mistake is to map nominal features to numerical values, e.g. for colors, as in the following:
In [21]:
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X
Out[21]:
For nominal features, it is important to keep the mapped values "equidistant"
For example, for the colors red, green, and blue, we want to encode them so that each color is equally distant from every other color.
This cannot be done in 1D, but it is doable in 2D (how? think about it).
One-hot encoding is a straightforward way to achieve this, by mapping an n-value nominal feature into an n-dimensional binary vector.
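As a quick sanity check (illustrative, not in the original notebook): one-hot vectors are pairwise equidistant, and for three categories the same can be achieved in 2D with an equilateral triangle.
# Sketch: verify the "equal distance" claim for three categories (e.g. red, green, blue).
import numpy as np

one_hot = np.eye(3)   # the three colors as 3D one-hot vectors
triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])   # 2D equilateral triangle

for points, name in [(one_hot, 'one-hot'), (triangle, 'triangle')]:
    for i in range(3):
        for j in range(i + 1, 3):
            # all pairwise distances are equal within each encoding
            print(name, i, j, np.linalg.norm(points[i] - points[j]))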
In [22]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()
Out[22]:
In [23]:
# automatic conversion via the get_dummies function in pandas
pd.get_dummies(df[['price', 'color', 'size']])
Out[23]:
In [24]:
df
Out[24]:
Training set: used to train the models
Test set: used to evaluate the trained models
Keep the two separate, so that the evaluation is not biased by over-fitting to the training data
Validation set: used for tuning hyper-parameters (see the sketch below)
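A minimal sketch (illustrative, not from the book) of carving out a separate validation set by calling train_test_split twice; the names and split sizes here are assumptions for illustration.
# Sketch: split into 60% training, 20% validation, 20% test.
import numpy as np
from sklearn.model_selection import train_test_split   # scikit-learn >= 0.18

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# first split off the test set, then split the remainder into train/validation
X_rest, X_te, y_rest, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_tr), len(X_val), len(X_te))   # 6 2 2, i.e. a 60:20:20 split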
In [25]:
wine_data_remote = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
wine_data_local = '../datasets/wine/wine.data'
df_wine = pd.read_csv(wine_data_local,
header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
'Proline']
print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()
Out[25]:
If the link to the Wine dataset provided above does not work for you, you can find a local copy in this repository at ./../datasets/wine/wine.data.
Or you could fetch it via
In [26]:
df_wine = pd.read_csv('https://raw.githubusercontent.com/1iyiwei/pyml/master/code/datasets/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()
Out[26]:
As much training data as possible for an accurate model
As much test data as possible for a reliable evaluation
Usual splits are 60:40, 70:30, or 80:20 (train:test)
Larger datasets can devote a larger portion to training
Other partitions are possible
In [27]:
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3, random_state=0)
print(X.shape)
import numpy as np
print(np.unique(y))
In [28]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
In [29]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
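For reference, the two rescalings applied above are min-max normalization and standardization (z-score): $$x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}, \qquad x^{(i)}_{std} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$ where $x_{min}$, $x_{max}$, $\mu_x$, and $\sigma_x$ are computed per feature on the training set (via fit) and then reused on the test set (via transform).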
In [30]:
ex = pd.DataFrame([0, 1, 2, 3, 4, 5])
# standardize
ex[1] = (ex[0] - ex[0].mean()) / ex[0].std(ddof=0)
# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and StandardScaler
# use ddof=0 (population standard deviation)
# normalize
ex[2] = (ex[0] - ex[0].min()) / (ex[0].max() - ex[0].min())
ex.columns = ['input', 'standardized', 'normalized']
ex
Out[30]:
Amount of data should be sufficient relative to model complexity.
Every machine learning task has a goal, which can be formalized as a loss function: $$L(\mathbf{X}, \mathbf{T}, \mathbf{Y})$$ where $\mathbf{T}$ is some form of target or auxiliary information, such as class labels for classification or target values for regression.
In addition to the loss, we often care about the simplicity of the model, for better efficiency and generalization (avoiding over-fitting). The complexity of the model can be measured by another penalty function: $$P(\Theta)$$ Common penalty functions involve the number and/or magnitude of the model parameters.
We can sum up the loss and regularization terms as the total objective: $$\Phi(\mathbf{X}, \mathbf{T}, \Theta) = L\left(\mathbf{X}, \mathbf{T}, \mathbf{Y}=f(\mathbf{X}, \Theta)\right) + P(\Theta)$$
During training, the goal is to optimize the parameters $\Theta$ with respect to the given training data $\mathbf{X}$ and $\mathbf{T}$: $$\mathrm{argmin}_\Theta \; \Phi(\mathbf{X}, \mathbf{T}, \Theta)$$ and hope that the trained model will generalize well to future data.
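Two common penalty functions, corresponding to penalty='l1' and penalty='l2' in the scikit-learn code below, are the $L_1$ and $L_2$ norms of the weight vector $\mathbf{w}$ (in scikit-learn, C is the inverse of the regularization strength $\lambda$): $$P_{L1}(\mathbf{w}) = \lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_j \left| w_j \right|, \qquad P_{L2}(\mathbf{w}) = \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2 = \frac{\lambda}{2} \sum_j w_j^2$$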
We are more likely to bump into the sharp corners of an object.
Experiment: drop a circle and a square onto a flat floor. What is the probability of hitting any particular point on the shape?
How about a non-flat floor, e.g. concave or convex with different curvatures?
This is the intuition for why the $L_1$ penalty, whose constraint region is a diamond with sharp corners on the coordinate axes, tends to produce sparse weights (many exactly zero), whereas the round $L_2$ constraint region does not.
In [31]:
from sklearn.linear_model import LogisticRegression
# l1 regularization
lr = LogisticRegression(penalty='l1', C=0.1)
lr.fit(X_train_std, y_train)
# compare training and test accuracy to see if there is overfitting
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
In [32]:
# 3 sets of parameters due to one-versus-rest with 3 classes
lr.intercept_
Out[32]:
In [33]:
# 13 coefficients for 13 wine features; notice many of them are 0
lr.coef_
Out[33]:
In [34]:
from sklearn.linear_model import LogisticRegression
# l2 regularization
lr = LogisticRegression(penalty='l2', C=0.1)
lr.fit(X_train_std, y_train)
# compare training and test accuracy to see if there is overfitting
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))
In [35]:
# notice that the zero coefficients have disappeared under L2 regularization
lr.coef_
Out[35]:
In [36]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'cyan',
'magenta', 'yellow', 'black',
'pink', 'lightgreen', 'lightblue',
'gray', 'indigo', 'orange']
weights, params = [], []
for c in np.arange(-4., 6.):  # float exponents so that 10**c also works for negative c
    lr = LogisticRegression(penalty='l1', C=10**c, random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)

weights = np.array(weights)

for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center',
bbox_to_anchor=(1.38, 1.03),
ncol=1, fancybox=True)
# plt.savefig('./figures/l1_path.png', dpi=300)
plt.show()
$L_1$ regularization implicitly selects features by zeroing out weights
Feature selection: keep a subset of the original features
Note: two important features might be highly correlated, so it may suffice to select only one of them (see the sketch below)
Feature extraction: derive new features from the original ones
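A small sketch (not in the original notebook) of spotting such correlated pairs, using the df_wine data frame loaded above:
# Sketch: rank feature pairs in the wine data by absolute correlation;
# strongly correlated features carry largely redundant information.
import numpy as np

corr = df_wine.iloc[:, 1:].corr().abs()            # absolute pairwise correlations (features only)
mask = np.triu(np.ones(corr.shape, dtype=bool))    # hide the diagonal and upper triangle
pairs = corr.where(~mask).unstack().dropna().sort_values(ascending=False)
print(pairs.head(5))                               # the most correlated feature pairs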
Feature selection is a way to reduce input data dimensionality. You can think of it as reducing the number of columns of the input data table/frame.
How do we decide which features/columns to keep? Intuitively, we want to keep relevant ones and remove the rest.
We can select these features sequentially, either forward or backward.
Sequential backward selection (SBS) is a simple heuristic. The basic idea is to start with $n$ features, consider all possible subsets of $n-1$ features, and remove the feature whose removal hurts model performance the least.
We then continue reducing the number of features ($n-2, n-3, \cdots$) until reaching the desired number of features.
In [37]:
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split
class SBS():
    """Sequential backward selection: greedily remove the least useful feature."""
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # hold out part of the data to score candidate feature subsets
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        # start with the full feature set
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            # try all subsets with one feature removed
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # keep the best-scoring subset and shrink the dimension by one
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        # keep only the selected feature columns
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        # fit the estimator on the given feature subset and return its test score
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
Below, we apply the SBS class above using a KNN classifier, which can suffer from the curse of dimensionality.
In [38]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)
# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.1])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('./sbs.png', dpi=300)
plt.show()
In [39]:
# list the best 5-feature subset found by SBS
k5 = list(sbs.subsets_[8])  # subsets_[8] has 13 - 8 = 5 features
print(df_wine.columns[1:][k5])
In [40]:
knn.fit(X_train_std, y_train)
print('Training accuracy:', knn.score(X_train_std, y_train))
print('Test accuracy:', knn.score(X_test_std, y_test))
In [41]:
knn.fit(X_train_std[:, k5], y_train)
print('Training accuracy:', knn.score(X_train_std[:, k5], y_train))
print('Test accuracy:', knn.score(X_test_std[:, k5], y_test))
Note the improved test accuracy when fitting with the lower-dimensional training/test data.
Recall:
The information gain (impurity reduction) at each node measures the importance of the feature used for the split
In [42]:
# feature_importances_ from random forest classifier records this info
from sklearn.ensemble import RandomForestClassifier
feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=10000,
random_state=0,
n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
plt.title('Feature Importances')
plt.bar(range(X_train.shape[1]),
importances[indices],
color='lightblue',
align='center')
plt.xticks(range(X_train.shape[1]),
feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('./random_forest.png', dpi=300)
plt.show()
In [43]:
threshold = 0.15
if False:  # Version(sklearn_version) < '0.18':
    X_selected = forest.transform(X_train, threshold=threshold)
else:
    from sklearn.feature_selection import SelectFromModel
    sfm = SelectFromModel(forest, threshold=threshold, prefit=True)
    X_selected = sfm.transform(X_train)
X_selected.shape
Out[43]:
Now, let's print the 3 features that met the threshold criterion for feature selection that we set earlier (note that this code snippet does not appear in the actual book but was added to this notebook later for illustrative purposes):
In [44]:
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
Data is important for machine learning: garbage in, garbage out, so pre-processing the data is important. This chapter covers various topics in data pre-processing, such as handling missing data, treating different types of features (numerical and categorical), and selecting relevant features to avoid over-fitting, which can improve both accuracy and speed.