Title: Feature Extraction With PCA
Slug: feature_extraction_with_pca
Summary: Feature extraction with PCA using scikit-learn.
Date: 2017-09-13 12:00
Category: Machine Learning
Tags: Feature Engineering
Authors: Chris Albon
Principal Component Analysis (PCA) is a common feature extraction method in data science. Technically, PCA finds the eigenvectors of the data's covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions. Practically, PCA converts a matrix of n features into a new dataset of (hopefully) fewer than n features. That is, it reduces the number of features by constructing a new, smaller set of variables that capture a significant portion of the information found in the original features. However, the goal of this tutorial is not to explain the concept of PCA, which is done very well elsewhere, but rather to demonstrate PCA in action.
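For readers who want to see the eigenvector description in concrete terms, here is a minimal NumPy sketch of that procedure. It is an illustration under stated assumptions, not how scikit-learn implements PCA internally; the toy data and the names X_toy, k, and X_projected are made up for this example.

```python
import numpy as np

# Hypothetical toy data: 200 observations, 3 features,
# where the first two features are strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_toy = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center each feature at zero
X_centered = X_toy - X_toy.mean(axis=0)

# Covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix;
# eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalues come first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the k eigenvectors with the highest eigenvalues
# and project the centered data onto them
k = 2
X_projected = X_centered @ eigenvectors[:, :k]

print(X_projected.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why keeping the eigenvectors with the highest eigenvalues retains the most variance.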
In [1]:
# Import packages
import numpy as np
from sklearn import decomposition, datasets
from sklearn.preprocessing import StandardScaler
In [2]:
# Load the breast cancer dataset
dataset = datasets.load_breast_cancer()
# Load the features
X = dataset.data
Notice that the original data contains 569 observations and 30 features.
In [3]:
# View the shape of the dataset
X.shape
Out[3]:
(569, 30)
Here is what the data looks like.
In [4]:
# View the data
X
Out[4]:
PCA is sensitive to the scale of the features, so first standardize each feature to zero mean and unit variance.
In [5]:
# Create a scaler object
sc = StandardScaler()
# Fit the scaler to the features and transform
X_std = sc.fit_transform(X)
Notice that PCA takes one key parameter, the number of components (n_components). This is the number of output features, and it usually needs to be tuned.
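One common way to tune the number of components is to fit PCA with every component kept and look at how much variance the first k components explain. The sketch below assumes a 95% threshold, which is an arbitrary choice for illustration; the names pca_full, cumulative, threshold, and n_needed are not part of the original tutorial.

```python
import numpy as np
from sklearn import datasets, decomposition
from sklearn.preprocessing import StandardScaler

# Load and standardize the same breast cancer features
X = datasets.load_breast_cancer().data
X_std = StandardScaler().fit_transform(X)

# Fit PCA keeping all components (the default when n_components is not set)
pca_full = decomposition.PCA()
pca_full.fit(X_std)

# Cumulative share of total variance explained by the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the
# threshold (0.95 is an assumed cutoff, not a rule)
threshold = 0.95
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative[:2])  # variance share captured by 1 and by 2 components
print(n_needed)        # components needed to reach the threshold
```

The first two values of cumulative show how much of the variance survives when only two components are kept, as in the rest of this tutorial.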
In [6]:
# Create a PCA object with 2 components
pca = decomposition.PCA(n_components=2)
# Fit the PCA and transform the data
X_std_pca = pca.fit_transform(X_std)
After PCA, the data has been reduced to two features, with the same number of rows as the original data.
In [7]:
# View the new feature data's shape
X_std_pca.shape
Out[7]:
(569, 2)
In [8]:
# View the new feature data
X_std_pca
Out[8]: