In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd
Common Goal:
In [2]:
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=5, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)
from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(5)) + ["class"])
In [3]:
df[:5]
Out[3]:
In [4]:
x = df.boxplot(return_type='dict')
plt.show()
In [5]:
df.describe()
Out[5]:
In [6]:
# Pairwise feature plot
_ = sns.pairplot(df[:50], vars=[0, 1, 2, 3, 4], hue="class", height=1.5)
In [7]:
# Correlation Plot
plt.figure(figsize=(10, 10));
_ = sns.heatmap(df.corr(), annot=False)
In [8]:
from IPython.display import Image
Image(filename='images/azure_sheet.png', width=800, height=600)
Out[8]:
The importance of features
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering"
- Andrew Ng
"... some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."
- Pedro Domingos
Regularization is designed to impose simplicity by adding a penalty term that depends on the characteristics of the parameters.
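As a minimal sketch of this idea (assuming scikit-learn is available; the data and the penalty strength `alpha` are illustrative, not from the source), an L2 penalty on the coefficient vector shrinks the fitted parameters toward zero compared to an unregularized fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: many features, but only the first carries signal,
# so an unregularized fit tends to spread weight onto noise features.
rng = np.random.RandomState(0)
X = rng.randn(50, 20)
y = X[:, 0] + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # penalty term: alpha * ||w||^2

# The L2 penalty shrinks the coefficient vector toward zero.
print(np.linalg.norm(ols.coef_), np.linalg.norm(ridge.coef_))
```

The penalty trades a small increase in training error for a simpler (smaller-norm) hypothesis.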
(1) Randomly split the training data $D$ into $D_{train}$ and $D_{val}$, say 70% of the data and 30% of the data respectively.
(2) Train each model $M_i$ on $D_{train}$ only, each time getting a hypothesis $h_i$.
(3) Select and output hypothesis $h_i$ that had the smallest error on the held out validation set.
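Steps (1)-(3) can be sketched as follows (assuming scikit-learn; the two candidate models are illustrative stand-ins for the $M_i$):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(1000, n_features=5, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

# (1) Randomly split D into D_train (70%) and D_val (30%).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# (2) Train each candidate model M_i on D_train only.
models = {"logreg": LogisticRegression(),
          "tree": DecisionTreeClassifier(max_depth=3, random_state=0)}
val_errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_errors[name] = 1.0 - model.score(X_val, y_val)

# (3) Select the hypothesis h_i with the smallest validation error.
best = min(val_errors, key=val_errors.get)
print(best, val_errors)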
Disadvantages:
For each model $M_i$, we evaluate the model as follows:
If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets.
Training set: A set of examples used for learning the model's parameters.
Validation set: A set of examples used to select between models and tune hyperparameters.
Test set: A set of examples used only to estimate the true error of the final chosen model.
Assumes that there is a smooth but noisy relation that acts as a mapping from hyperparameters to the objective function.
Gather observations so as to evaluate the machine learning model as few times as possible while revealing as much information as possible about the mapping, in particular the location of the optimum.
Exploration vs. Exploitation problem.
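The loop above can be sketched with a Gaussian process surrogate and a lower-confidence-bound acquisition function (a hedged illustration, assuming scikit-learn; the toy objective, kernel, and exploration weight are invented for the example, not from the source):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy "expensive" objective: stands in for a model's validation error
# as a smooth but noisy function of one hyperparameter.
def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.RandomState(0)
X_obs = rng.uniform(-3, 3, size=(3, 1))     # a few initial evaluations
y_obs = objective(X_obs).ravel()
grid = np.linspace(-3, 3, 200).reshape(-1, 1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
for _ in range(10):
    gp.fit(X_obs, y_obs)                    # surrogate for the mapping
    mu, sigma = gp.predict(grid, return_std=True)
    # Lower confidence bound: exploit low predicted error (mu) while
    # exploring regions of high uncertainty (sigma).
    acq = mu - 2.0 * sigma
    x_next = grid[np.argmin(acq)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

print(X_obs[np.argmin(y_obs)], y_obs.min())
```

Each iteration spends one objective evaluation where the acquisition function says the most can be gained, which is exactly the exploration-exploitation trade-off.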
In [32]:
# Image from Andrew Ng's Stanford CS229 lecture titled "Advice for applying machine learning"
from IPython.display import Image
Image(filename='images/HighVariance.png', width=800, height=600)
# Testing error still decreasing as the training set size increases. Suggests increasing the training set size.
# Large gap Between Training and Test Error.
Out[32]:
In [33]:
# Image from Andrew Ng's Stanford CS229 lecture titled "Advice for applying machine learning"
from IPython.display import Image
Image(filename='images/HighBias.png', width=800, height=600)
# Training error is unacceptably high.
# Small gap between training error and testing error.
Out[33]:
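The diagnostics shown in the two figures above can be reproduced numerically with scikit-learn's `learning_curve` (a sketch; the estimator and the training-size grid are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

X, y = make_classification(1000, n_features=5, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

# Training vs. validation error as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    shuffle=True, random_state=0)

train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)
# High variance: large gap between the curves, validation error
# still decreasing. High bias: both errors high and close together.
for n, tr, va in zip(sizes, train_err, val_err):
    print(n, round(tr, 3), round(va, 3))
```

Plotting `train_err` and `val_err` against `sizes` gives the same kind of curves as the images above.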
Approach 1:
Approach 2:
Similar to the idea of cross-validation, except applied to components of a system rather than to models.
Example: Simple Logistic Regression on spam classification gives 94% accuracy.