Feature Transformations

- Standard Normal Transform
- Domain-Specific Transform
- Mystery Transform

Dimensionality Reduction

- PCA: Model and Learning
- PCA for Images
- PCA for Visualization

- Chapter 14 of
*Elements of Statistical Learning*by Hastie, Tibshirani, Friedman - A Few Useful Things to Know about Machine Learning
- SciKit-Learn's documentation on data preprocessing
- SciKit-Learn's documentation on dimensionality reduction

First we need to import Numpy, Pandas, MatPlotLib...

```
In [2]:
```from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time
%matplotlib inline

```
In [14]:
```from matplotlib.colors import ListedColormap
# Another messy looking function to make pretty plots of basketball courts
def visualize_court(log_reg_model, coord_type='cart', court_image = '../data/nba/nba_court.jpg'):
two_class_cmap = ListedColormap(['#FFAAAA', '#AAFFAA']) # light red for miss, light green for make
x_min, x_max = 0, 50 #width (feet) of NBA court
y_min, y_max = 0, 47 #length (feet) of NBA half-court
grid_step_size = 0.2
grid_x, grid_y = np.meshgrid(np.arange(x_min, x_max, grid_step_size), np.arange(y_min, y_max, grid_step_size))
features = np.c_[grid_x.ravel(), grid_y.ravel()]
# change coordinate system
if coord_type == 'polar':
features = np.c_[grid_x.ravel(), grid_y.ravel()]
hoop_location = np.array([25., 0.])
features -= hoop_location
dists = np.sqrt(np.sum(features**2, axis=1))
angles = np.arctan2(features[:,1], features[:,0])
features = np.hstack([dists[np.newaxis].T, angles[np.newaxis].T])
grid_predictions = log_reg_model.predict(features)
grid_predictions = grid_predictions.reshape(grid_x.shape)
fig, ax = plt.subplots()
court_image = plt.imread(court_image)
ax.imshow(court_image, interpolation='bilinear', origin='lower',extent=[x_min,x_max,y_min,y_max])
ax.imshow(grid_predictions, cmap=two_class_cmap, interpolation = 'nearest',
alpha = 0.60, origin='lower',extent=[x_min,x_max,y_min,y_max])
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.title( "Make / Miss Prediction Boundaries" )
plt.show()

We also need functions for shuffling the data and calculating classification errrors.

```
In [4]:
```### function for shuffling the data and labels
def shuffle_in_unison(features, labels):
rng_state = np.random.get_state()
np.random.shuffle(features)
np.random.set_state(rng_state)
np.random.shuffle(labels)
### calculate classification errors
# return a percentage: (number misclassified)/(total number of datapoints)
def calc_classification_error(predictions, class_labels):
n = predictions.size
num_of_errors = 0.
for idx in xrange(n):
if (predictions[idx] >= 0.5 and class_labels[idx]==0) or (predictions[idx] < 0.5 and class_labels[idx]==1):
num_of_errors += 1
return num_of_errors/n

```
In [7]:
```# load a dataset of recent movies and their ratings across several websites
movie_data = pd.read_csv('../data/movie_ratings.csv')
# reduce it to just the ratings categories
movie_data = movie_data[['FILM','RottenTomatoes','RottenTomatoes_User','Metacritic','Metacritic_User','Fandango_Ratingvalue', 'IMDB']]
movie_data.head()

```
Out[7]:
```

```
In [8]:
```movie_data.describe()

```
Out[8]:
```

**Data**

We observe pairs $(\mathbf{x}_{i},y_{i})$ where \begin{eqnarray*} y_{i} \in \{ 0, 1\} &:& \mbox{class label} \\ \mathbf{x}_{i} = (x_{i,1}, \dots, x_{i,D}) &:& \mbox{set of $D$ explanatory variables (aka features)} \end{eqnarray*}

** Parameters**

** Transformation Function **

**Error Function**

** Learning $\beta$ **

- Randomly initialize $\beta$
- Until $\alpha || \nabla \mathcal{L} || < tol $:
- $\mathbf{\beta}_{t+1} = \mathbf{\beta}_{t} - \alpha \nabla_{\mathbf{\beta}} \mathcal{L}$

Let's run a logistic regression classifier on this data via SciKit-Learn. If you need a refresher, check out the notebook from the first course and the SciKit-Learn documentation on logistic regression. The goal is to predict if the *IMDB* rating is under or over 7/10, using the other ratings as features. I've started the code. You just need to fill-in the lines for training and computing the error. Note there is no test set yet.

```
In [9]:
```from sklearn.linear_model import LogisticRegression
# set the random number generator for reproducability
np.random.seed(123)
# let's try to predict the IMDB rating from the others
features = movie_data[['RottenTomatoes','RottenTomatoes_User','Metacritic','Metacritic_User','Fandango_Ratingvalue']].as_matrix()
# create classes: more or less that 7/10 rating
labels = (movie_data['IMDB'] >= 7.).astype('int').tolist()
shuffle_in_unison(features, labels)
### Your Code Goes Here ###
# initialize and train a logistic regression model
# compute error on training data
model_LogReg = LogisticRegression()
model_LogReg.fit(features, labels)
predicted_labels = model_LogReg.predict(features)
train_error_rate = calc_classification_error(predicted_labels, labels)
###########################
print "Classification error on training set: %.2f%%" %(train_error_rate*100)
# compute the baseline error since the classes are imbalanced
print "Baseline Error: %.2f%%" %((sum(labels)*100.)/len(labels))

```
```

Good features are crucial for training well-performing classifiers: 'garbage in, garbage out.' In this section we introduce several transformations that are commonly applied to data as a preprocessing step before training a classifier.

Recall the formula of the standard linear model: $$\hat Y = f(\beta^{T} \mathbf{X}) $$ where $\hat Y$ is the predictions, $f(\cdot)$ is the transformation function, $\beta$ is the weights (parameters), and $X$ is the $N \times D$ matrix of features. For simplicity, assume there are just two features: $$ \beta^{T} \mathbf{x}_{i} = \beta_{1}x_{i,1} + \beta_{2}x_{i,2}.$$ Usually $x_{i,1}$ and $x_{i,2}$ will be measured in different units. For instance, in the movie ratings data, the Rotten Tomatoes dimension is on a $0-100$ scale and the Fandango ratings are on $0-5$. The difference in scale causes one dimension to dominate the inner product. Linear models can learn to cope with this imbalance by changing the scales of the weights accordingly, but this makes optimization harder because gradient steps are unequal across dimensions.

One way to get rid of hetergeneous scales is to standardize the data so that the values in each dimension are distributed according to the standard Normal distribution. In math, this means we'll transform the data like so: $$\mathbf{X}_{std} = \frac{\mathbf{X} - \boldsymbol{\mu}_{X}}{\boldsymbol{\sigma}_{X}}. $$ This is also called 'z-score scaling.' Let's examine the affect of this transformation on training error.

```
In [10]:
```# perform z-score scaling
features_mu = np.mean(features, axis=0)
features_sigma = np.std(features, axis=0)
std_features = (features - features_mu)/features_sigma
# re-train model
lm = LogisticRegression()
lm.fit(std_features, labels)
### compute error on training data
predictions = lm.predict(std_features)
print "Classification error on training set: %.3f%%" %(calc_classification_error(predictions, labels)*100)
# compute the baseline error since the classes are imbalanced
print "Baseline Error: %.2f%%" %((sum(labels)*100.)/len(labels))

```
```

```
In [12]:
```nba_shot_data = pd.read_csv('../data/nba/NBA_xy_features.csv')
nba_shot_data.head()

```
Out[12]:
```

And let's run logistic regression on the data just as we did before...

```
In [15]:
```# split data into train and test
train_set_size = int(.80*len(nba_shot_data))
train_features = nba_shot_data.ix[:train_set_size,['x_Coordinate','y_Coordinate']].as_matrix()
test_features = nba_shot_data.ix[train_set_size:,['x_Coordinate','y_Coordinate']].as_matrix()
train_class_labels = nba_shot_data.ix[:train_set_size,['shot_outcome']].as_matrix()
test_class_labels = nba_shot_data.ix[train_set_size:,['shot_outcome']].as_matrix()
#Train logistic regression model
start_time = time.time()
lm.fit(train_features, np.ravel(train_class_labels))
end_time = time.time()
print "Training ended after %.2f seconds." %(end_time-start_time)
# compute the classification error on training data
predictions = lm.predict(test_features)
print "Classification Error on the Test Set: %.2f%%" %(calc_classification_error(predictions, np.array(test_class_labels)) * 100)
# compute the baseline error since the classes are imbalanced
print "Baseline Error: %.2f%%" %(np.sum(test_class_labels)/len(test_class_labels)*100)
# visualize the boundary on the basketball court
visualize_court(lm)

```
```

Now let's transform the Cartesian coordinates into *polar* coordinates: (x,y) --> (radius, angle)...

```
In [16]:
```### Transform coordinate system
# radius coordinate: calculate distance from point to hoop
hoop_location = np.array([25.5, 0.])
train_features -= hoop_location
test_features -= hoop_location
train_dists = np.sqrt(np.sum(train_features**2, axis=1))
test_dists = np.sqrt(np.sum(test_features**2, axis=1))
# angle coordinate: use arctan2 function
train_angles = np.arctan2(train_features[:,1], train_features[:,0])
test_angles = np.arctan2(test_features[:,1], test_features[:,0])
# combine vectors into polar coordinates
polar_train_features = np.hstack([train_dists[np.newaxis].T, train_angles[np.newaxis].T])
polar_test_features = np.hstack([test_dists[np.newaxis].T, test_angles[np.newaxis].T])
pd.DataFrame(polar_train_features, columns=["Radius","Angle"]).head()

```
Out[16]:
```

```
In [17]:
```#Train model
start_time = time.time()
lm.fit(polar_train_features, np.ravel(train_class_labels))
end_time = time.time()
print "Training ended after %.2f seconds." %(end_time-start_time)
# compute the classification error on test data
predictions = lm.predict(polar_test_features)
print "Classification Error on the Test Set: %.2f%%" %(calc_classification_error(predictions, np.array(test_class_labels)) * 100)
# compute the baseline error since the classes are imbalanced
print "Baseline Error: %.2f%%" %(np.sum(test_class_labels)/len(test_class_labels)*100)
# visualize the boundary on the basketball court
visualize_court(lm, coord_type='polar')

```
```

The data folder contains some mysterious data that can't be modeled well with a linear function. Running the code below, we see the squared error is over 70. However, the error can be driven to zero using one of two transformations. See if you can find one or both. The transformations are common ones you surely know.

```
In [24]:
```from sklearn.linear_model import LinearRegression
# load (x,y) where y is the mystery data
x = np.arange(0, 30, .2)[np.newaxis].T
y = np.load(open('../data/mystery_data.npy','rb'))
### transformation goes here ###
x = np.cos(x)
################################
# initialize regression model
lm = LinearRegression()
lm.fit(x,y)
y_hat = lm.predict(x)
squared_error = np.sum((y - y_hat)**2)
if not np.isclose(squared_error,0):
print "The squared error should be zero! Yours is %.8f." %(squared_error)
else:
print "You found the secret transformation! Your squared error is %.8f." %(squared_error)

```
```

Sometimes the data calls for more aggressive transformations. High-dimensional data is usually hard to model because classifiers are likely to overfit. Regularization is one way to combat high dimensionality, but often it can not be enough. This section will cover *dimensionality reduction*--a technique for reducing the number of features while still preserving curcial information. This is a form of unsupervised learning since we use no class information.

Before we load the data, let's take a minute to review how image data is stored on a computer. Of course, all the computer sees are numbers ranging from 0 to 255. Each pixel takes on one of these values. Furthermore, there are three layers to color images--one for red, blue, and green values. Therefore, the painting we'll examine are represented as $300 \times 400 \times 3$-dimensional tensors (multi-dimensional arrays). This layering is depicted below.

While images need to be represented with three dimensions to be visualized, the learning algorithms we'll consider don't need any notion of color values so I've already flattened the images into vector form, i.e. to create a matrix of size $403 \times 360000$. Let's load the dataset...

```
In [26]:
```# un-zip the paintings file
import zipfile
zipper = zipfile.ZipFile('../data/bob_ross/bob_ross_paintings.npy.zip')
zipper.extractall('../data/bob_ross/')
# load the 403 x 360,000 matrix
br_paintings = np.load(open('../data/bob_ross/bob_ross_paintings.npy','rb'))
print "Dataset size: %d x %d"%(br_paintings.shape)

```
```

and then visualize two of the images...

```
In [27]:
```# subplot containing first image
ax1 = plt.subplot(1,2,1)
br_painting = br_paintings[70,:]
ax1.imshow(np.reshape(br_painting, (300, 400, 3)))
# subplot containing second image
ax2 = plt.subplot(1,2,2)
br_painting = br_paintings[33,:]
ax2.imshow(np.reshape(br_painting, (300, 400, 3)))
plt.show()

```
```

As we've seen, the dataset has many, many, many more features (columns) than examples (rows). Simple Lasso or Ridge regularization probably won't be enough to prevent overfitting so we have to do something more drastic. In this section, we'll cover Principal Component Analysis, a popular technique for reducing the dimensionality of data.

PCA does not take into consideration labels, only the input features. We can think of PCA as performing unsupervised 'inverse' prediction. Our goal is: for a datapoint $\mathbf{x}_{i}$, find a lower-dimensional representation $\mathbf{h}_{i}$ such that $\mathbf{x}_{i}$ can be 'predicted' from $\mathbf{h}_{i}$ using a linear transformation. In math, this statement can be written as $$\mathbf{\tilde x}_{i} = \mathbf{h}_{i} \mathbf{W}^{T} \text{ where } \mathbf{h}_{i} = \mathbf{x}_{i} \mathbf{W}. $$ $\mathbf{W}$ is a $D \times K$ matrix of parameters that need to be learned--much like the $\beta$ vector in regression models. $D$ is the dimensionality of the original data, and $K$ is the dimensionality of the compressed representation $\mathbf{h}_{i}$. The graphic below reiterates the above described PCA pipline:

Having defined the PCA model, we look to write learning as an optimization process. Recall that we wish to make a reconstruction of the data, denoted $\mathbf{\tilde x}_{i}$, as close as possible to the original input: $$\mathcal{L} = \sum_{i=1}^{N} (\mathbf{x}_{i} - \mathbf{\tilde x}_{i})^{2}.$$ We can make a substitution for $\mathbf{\tilde x}_{i}$ from the equation above: $$ = \sum_{i=1}^{N} (\mathbf{x}_{i} - \mathbf{h}_{i}\mathbf{W}^{T})^{2}.$$ And we can make another substitution for $\mathbf{h}_{i}$, bringing us to the final form of the loss function: $$ = \sum_{i=1}^{N} (\mathbf{x}_{i} - \mathbf{x}_{i}\mathbf{W}\mathbf{W}^{T})^{2}.$$

We could perform gradient descent on $\mathcal{L}$, just like we do for logistic regression models, but there exists a deterministic solution. We won't show the derivation here, but you can find it here. $\mathbf{W}$ is optimal when it contains the eigenvectors of the data's covariance matrix, and thus we can use a standard eigen decomposition to learn the transform: $$ \boldsymbol{\Sigma}_{\mathbf{X}} = \mathbf{W} \boldsymbol{\Lambda} \boldsymbol{W}^{T} $$ where $\boldsymbol{\Sigma}_{\mathbf{X}}$ is the data's empirical covariance matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix of eigenvalues. Eigen decompositions can be performed effeciently by any scientific computing library, including numpy.

The connection to the data's (co-)variance becomes a little more clear when the intuitions behind PCA are examined. The PCA transformation projects the data onto linear subspaces oriented in the directions of highest variance. To elaborate, assume the data resides in two dimensions according to the following scatter plot. The columns of $\mathbf{W}$--the $K=2$ principal components--would be the green lines below:

*Data*

We observe $\mathbf{x}_{i}$ where \begin{eqnarray*} \mathbf{x}_{i} = (x_{i,1}, \dots, x_{i,D}) &:& \mbox{set of $D$ explanatory variables (aka features). No labels.} \end{eqnarray*}

* Parameters*

$\mathbf{W}$: Matrix with dimensionality $D \times K$, where $D$ is the dimensionality of the original data and $K$ the dimensionality of the new features. The matrix encodes the transformation between the original and new feature spaces.

*Error Function*

Now let's run PCA on the Bob Ross paintings dataset...

```
In [28]:
```from sklearn.decomposition import PCA
pca = PCA(n_components=400)
start_time = time.time()
reduced_paintings = pca.fit_transform(br_paintings)
end_time = time.time()
print "Training took a total of %.2f seconds." %(end_time-start_time)
print "Preserved percentage of original variance: %.2f%%" %(pca.explained_variance_ratio_.sum() * 100)
print "Dataset is now of size: %d x %d"%(reduced_paintings.shape)

```
```

Let's visualize two of the paintings...

```
In [29]:
```img_idx = 70
reconstructed_img = pca.inverse_transform(reduced_paintings[img_idx,:])
original_img = br_paintings[70,:]
# subplot for original image
ax1 = plt.subplot(1,2,1)
ax1.imshow(np.reshape(original_img, (300, 400, 3)))
ax1.set_title("Original Painting")
# subplot for reconstruction
ax2 = plt.subplot(1,2,2)
ax2.imshow(np.reshape(reconstructed_img, (300, 400, 3)))
ax2.set_title("Reconstruction")
plt.show()

```
```

```
In [30]:
```# get the transformation matrix
transformation_mat = pca.components_ # This is the W^T matrix
# two components to show
comp1 = 13
comp2 = 350
# subplot
ax1 = plt.subplot(1,2,1)
filter1 = transformation_mat[comp1-1,:]
ax1.imshow(np.reshape(filter1, (300, 400, 3)))
ax1.set_title("%dth Principal Component"%(comp1))
# subplot
ax2 = plt.subplot(1,2,2)
filter2 = transformation_mat[comp2-1,:]
ax2.imshow(np.reshape(filter2, (300, 400, 3)))
ax2.set_title("%dth Principal Component"%(comp2))
plt.show()

```
```

```
In [32]:
```# get the movie features
movie_features = movie_data[['RottenTomatoes','RottenTomatoes_User','Metacritic','Metacritic_User','Fandango_Ratingvalue']].as_matrix()
# perform standard scaling again but via SciKit-Learn
from sklearn.preprocessing import StandardScaler
z_scaler = StandardScaler()
movie_features = z_scaler.fit_transform(movie_features)
pca = PCA(n_components=2)
start_time = time.time()
movie_2d_proj = pca.fit_transform(movie_features)
end_time = time.time()
print "Training took a total of %.4f seconds." %(end_time-start_time)
print "Preserved percentage of original variance: %.2f%%" %(pca.explained_variance_ratio_.sum() * 100)
print "Dataset is now of size: %d x %d"%(movie_2d_proj.shape)

```
```

```
In [34]:
```labels = movie_data['FILM'].tolist()
classes = movie_data['IMDB'].tolist()
# color the points by IMDB ranking
labels_to_show = []
colors = []
for idx, c in enumerate(classes):
if c > 7.25:
colors.append('g')
if c > 8.:
labels_to_show.append(labels[idx])
else:
colors.append('r')
if c < 4.75:
labels_to_show.append(labels[idx])
# plot data
plt.scatter(movie_2d_proj[:, 0], movie_2d_proj[:, 1], marker = 'o', c = colors, s = 150, alpha = .6)
# add movie title annotations
for label, x, y in zip(labels, movie_2d_proj[:, 0].tolist(), movie_2d_proj[:, 1].tolist()):
if label not in labels_to_show:
continue
if x < 0:
text_x = -20
else:
text_x = 150
plt.annotate(label.decode('utf-8'),xy = (x, y), xytext = (text_x, 40),
textcoords = 'offset points', ha = 'right', va = 'bottom',
arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3, rad=0'),
bbox = dict(boxstyle = 'round,pad=0.5', fc = 'b', alpha = 0.2))
plt.title('PCA Projection of Movies')
plt.show()

```
```

```
In [36]:
```from sklearn.datasets import fetch_olivetti_faces
faces_dataset = fetch_olivetti_faces(shuffle=True)
faces = faces_dataset.data # 400 flattened 64x64 images
person_ids = faces_dataset.target # denotes the identity of person (40 total)
print "Dataset size: %d x %d" %(faces.shape)
print "And the images look like this..."
plt.imshow(np.reshape(faces[200,:], (64, 64)), cmap='Greys_r')
plt.show()

```
```

```
In [37]:
```?PCA

```
In [40]:
```### Your code goes here ###
# train PCA model on 'faces'
from sklearn.decomposition import PCA
pca = PCA(n_components=100)
start_time = time.time()
faces_reduced = pca.fit_transform(faces)
end_time = time.time()
###########################
print "Training took a total of %.2f seconds." %(end_time-start_time)
print "Preserved percentage of original variance: %.2f%%" %(pca.explained_variance_ratio_.sum() * 100)
print "Dataset is now of size: %d x %d"%(faces_reduced.shape)

```
```

```
In [41]:
```### Your code goes here ###
# Use learned transformation matrix to project back to the original 4096-dimensional space
# Remember you need to use np.reshape()
###########################
img_idx = 70
reconstructed_img = pca.inverse_transform(faces_reduced[img_idx,:])
original_img = faces[70,:]
# subplot for original image
ax1 = plt.subplot(1,2,1)
ax1.imshow(np.reshape(original_img, (64, 64)), cmap='Greys_r')
ax1.set_title("Original Image")
# subplot for reconstruction
ax2 = plt.subplot(1,2,2)
ax2.imshow(np.reshape(reconstructed_img, (64, 64)), cmap='Greys_r')
ax2.set_title("Reconstruction")
plt.show()

```
```

Your output should look something like what's below (although could be a different face):

```
In [43]:
```### Your code goes here ###
# Now visualize one of the principal components
# Again, remember you need to use np.reshape()
###########################
transformation_mat = pca.components_
# two components to show
comp1 = 5
comp2 = 90
# subplot
ax1 = plt.subplot(1,2,1)
filter1 = transformation_mat[comp1,:]
ax1.imshow(np.reshape(filter1, (64, 64)), cmap='Greys_r')
ax1.set_title("%dth Principal Component"%(comp1))
# subplot
ax2 = plt.subplot(1,2,2)
filter2 = transformation_mat[comp2,:]
ax2.imshow(np.reshape(filter2, (64, 64)), cmap='Greys_r')
ax2.set_title("%dth Principal Component"%(comp2))
plt.show()

```
```

```
In [44]:
```### Your code goes here ###
# Train another PCA model to project the data into two dimensions
# Bonus: color the scatter plot according to the person_ids to see if any structure can be seen
# Run PCA for 2 components
# Generate plot
###########################
pca = PCA(n_components=2)
start_time = time.time()
faces_2d_proj = pca.fit_transform(faces)
end_time = time.time()
print "Training took a total of %.2f seconds." %(end_time-start_time)
print "Preserved percentage of original variance: %.2f%%" %(pca.explained_variance_ratio_.sum() * 100)
print "Dataset is now of size: %d x %d"%(faces_2d_proj.shape)
# Generate plot
# color the points by the person ids
colors = [plt.cm.Set1((c+1)/40.) for c in person_ids]
# plot data
plt.scatter(faces_2d_proj[:, 0], faces_2d_proj[:, 1], marker = 'o', c = colors, s = 175, alpha = .6)
plt.title('2D Projection of Faces Dataset')
plt.show()

```
```

```
In [ ]:
```