Introduction to Dimension Reduction

Professor Robert J. Brunner



Introduction

In the first IPython Notebook for this week, we began the data exploration process by exploring the Iris data that is included with scikit-learn. We next applied a variety of supervised learning methods to classify these data. In this notebook we turn to a different type of data mining, where we look for correlations in the data without explicit training. This Notebook focuses solely on dimension reduction by using PCA; other readings will demonstrate other techniques.

First, we need to load the data into this Notebook and redefine our helper functions.



In [1]:
%matplotlib inline

# Set up Notebook

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="white")

# Load the Iris Data
iris = sns.load_dataset("iris")

In [2]:
# Convenience function to plot confusion matrix

import numpy as np
import pandas as pd

# This method produces a colored heatmap that displays the relationship
# between predicted and actual types from a machine learning method.

def confusion(test, predict, title):
    # Define names for the three Iris types
    names = ['setosa', 'versicolor', 'virginica']

    # Make a 2D histogram from the test and result arrays
    pts, xe, ye = np.histogram2d(test, predict, bins=3)

    # For simplicity we create a new DataFrame
    pd_pts = pd.DataFrame(pts.astype(int), index=names, columns=names)
    
    # Display heatmap and add decorations
    hm = sns.heatmap(pd_pts, annot=True, fmt="d")
    hm.axes.set_title(title)
    
    return None

# This method produces a colored scatter plot that displays the intrinsic
# clustering of a particular data set. The different types are colored
# uniquely.

def splot_data(col1, col2, data, hue_col, label1, label2, xls, yls, sz=8):
    
    # Make the scatter plot from the DataFrame
    jp = sns.lmplot(x=col1, y=col2, data=data, fit_reg=False,
                    hue=hue_col, size=sz, scatter_kws={'s': 60})
    
    # Decorate the plot and set limits
    jp.set_axis_labels(label1, label2)

    jp.axes[0,0].set_xlim(xls)
    jp.axes[0,0].set_ylim(yls)

    sns.despine(offset=0, trim=True)
    sns.set(style="ticks", font_scale=2.0)

Our next step is to build explicit data and label NumPy arrays. We do this primarily because scikit-learn does not work natively with Pandas DataFrames. We can easily grab the underlying two-dimensional NumPy array from a DataFrame by using the values attribute; in this case we first select the four attribute columns. Next, we create a numerical array for the data types, where 0, 1, and 2 map to setosa, versicolor, and virginica, respectively.



In [3]:
# Now lets get the data and labels

data = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
labels = np.array([i//50 for i in range(iris.shape[0])])
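
The integer-division trick above relies on the Iris rows being grouped in blocks of 50 per species, which happens to hold for the bundled data set. As a sketch of an order-independent alternative (not part of the original analysis), we could instead derive the labels directly from the species column by using pandas categorical codes.

In [ ]:
# Order-independent label construction: pd.Categorical assigns integer
# codes in sorted category order (setosa=0, versicolor=1, virginica=2).
alt_labels = pd.Categorical(iris['species']).codes

# Sanity check: both constructions should agree for the bundled Iris data
assert np.array_equal(alt_labels, labels)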

Dimensionality Reduction

When confronted with a large, multi-dimensional data set, one approach to simplify any subsequent analysis is to reduce the number of dimensions that must be processed. In some cases, dimensions can be removed from analysis based on business logic. More generally, however, we can employ machine learning to seek out relationships between the original dimensions (or attributes or columns of a DataFrame) to identify new dimensions that better capture the inherent relationships within the data.

The standard technique to perform this is known as principal component analysis, or PCA. Mathematically, we can derive PCA by using linear algebra to solve a set of linear equations. This process effectively rotates the data into a new set of dimensions, and by ranking the importance of the new dimensions, we can leverage fewer dimensions in subsequent machine learning algorithms. PCA is demonstrated in the following figure from Wikipedia, which shows a two-dimensional Gaussian distribution. In the original space the data are widely spread; by rotating into a coordinate system aligned with the Gaussian shape, however, we obtain one primary dimension and a secondary dimension with much less spread.
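
To make the linear-algebra view a bit more concrete, the following sketch (not part of the original analysis) performs the same rotation by hand on the data array defined above: center the data, form the covariance matrix, and take its eigenvectors. The eigenvalues rank how much variance each new dimension captures, which is the ranking PCA uses.

In [ ]:
# PCA "by hand": eigendecomposition of the covariance matrix
centered = data - data.mean(axis=0)

# Covariance matrix of the four original dimensions
cov = np.cov(centered, rowvar=False)

# Eigenvectors define the rotation into PCA space; eigenvalues measure
# the variance captured along each new dimension.
eig_vals, eig_vecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reverse to put the
# most important dimension first.
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Fraction of the total variance explained by each principal component
print(eig_vals / eig_vals.sum())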

We can easily implement PCA by using scikit-learn. The PCA model requires one tunable parameter that specifies the target number of dimensions. This value can be arbitrarily selected, perhaps based on prior information, or it can be iteratively determined. After the model is created, we fit the model to the data and then create our new, rotated data set. This is demonstrated in the next code cell.



In [4]:
# Principal Component Analysis
from sklearn.decomposition import PCA

# First create our PCA model
# For now we assume two components, to make plotting easier.
pca = PCA(n_components=2)

# Fit model to the data
pca.fit(data)

# Compute the transformed data (rotation to PCA space)
data_reduced = pca.transform(data)

# Need to modify to match number of PCA components
cols = ['PCA1', 'PCA2', 'Species']

# For example, if n_components = 3
# cols = ['PCA1', 'PCA2', 'PCA3', 'Species']

# Now create a new DataFrame to hold the results
# First a temporary np.array
tmp_d = np.concatenate((data_reduced, iris['species'].values.reshape(-1, 1)), axis=1)

iris_pca = pd.DataFrame(tmp_d, columns=cols)

# Concatenating floats with strings yields an object array, so restore
# numeric dtypes for the PCA columns before plotting.
iris_pca[cols[:-1]] = iris_pca[cols[:-1]].astype(float)
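
As noted above, the number of components can also be determined from the data rather than fixed in advance. One simple approach (a sketch, not part of the original notebook) is to fit a PCA with all four components and keep just enough of them to cross a chosen threshold of cumulative explained variance; the 95% threshold below is an arbitrary example.

In [ ]:
# Fit a full PCA and examine how much variance each component captures
pca_full = PCA(n_components=4).fit(data)
print(pca_full.explained_variance_ratio_)

# Keep the smallest number of components whose cumulative explained
# variance reaches the (arbitrary) 95% threshold.
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.95)) + 1
print(n_keep, "components explain a fraction", cum_var[n_keep - 1], "of the variance")

Recent versions of scikit-learn also accept a float between 0 and 1 for n_components (for example, PCA(n_components=0.95)), which performs this selection automatically.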

Given the two new dimensions, we can first see how they are related to the original four dimensions (this isn't just a rotation, since we also reduced the number of dimensions). We can also display the original type-tagged data in a scatter plot that is now drawn by using the principal components.



In [5]:
# We can print out the rotation matrix

c_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for row in pca.components_:
    print(r" + ".join("{0:6.3f} * {1:s}".format(val, name) for val, name in zip(row, c_names)))


 0.361 * sepal_length + -0.085 * sepal_width +  0.857 * petal_length +  0.358 * petal_width
-0.657 * sepal_length + -0.730 * sepal_width +  0.173 * petal_length +  0.075 * petal_width

In [6]:
# Display the original data in the new space
splot_data('PCA1', 'PCA2', iris_pca, 'Species', 'First PCA', 'Second PCA', (-4.2, 4.6), (-1.8, 1.6))
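
Because we kept only two of the four components, some information is necessarily discarded. As a final sketch (not part of the original analysis), we can quantify this loss by mapping the reduced data back into the original space with the model's inverse_transform method and measuring the reconstruction error.

In [ ]:
# Map the two-component representation back to the original four dimensions
data_restored = pca.inverse_transform(data_reduced)

# Mean squared reconstruction error: the detail carried by the two
# discarded components.
mse = np.mean((data - data_restored) ** 2)
print("Mean squared reconstruction error: {0:.4f}".format(mse))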