00 - Principal Component Analysis

A hands-on walkthrough of Principal Component Analysis (PCA) on random and correlated datasets.


In [ ]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

from codefiles.datagen import random_xy, x_plus_noise, data_3d
from codefiles.dataplot import plot_principal_components, plot_3d, plot_2d
# %matplotlib inline
%matplotlib notebook
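
The codefiles helpers aren't shown in this notebook. As a rough orientation, here is a minimal sketch of what random_xy and x_plus_noise might look like (an assumption, not the actual implementation):


In [ ]:
# Hypothetical sketches of the data generators (the real ones live in codefiles.datagen)
def random_xy_sketch(num_points=100, seed=None):
    rng = np.random.default_rng(seed)
    # x and y drawn independently -> no correlation between the axes
    return pd.DataFrame({"x": rng.uniform(0, 1, num_points),
                         "y": rng.uniform(0, 1, num_points)})

def x_plus_noise_sketch(num_points=100, slope=1, noise=0.1, seed=None):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, num_points)
    # y follows a line of the given slope plus Gaussian noise -> correlated data
    return pd.DataFrame({"x": x, "y": slope * x + rng.normal(0, noise, num_points)})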

PCA with Random 2D Data

Totally random data: generate a 2D dataset with no relationship between x and y.


In [ ]:
data_random = random_xy(num_points=100)
plot_2d(data_random)

Initialize PCA. Recall that we don't need any target column, since this is an unsupervised technique.


In [ ]:
pca_random = PCA()  # n_components defaults to None, i.e. keep all components

Now, let's give it the random data.


In [ ]:
pca_random.fit(data_random)

And evaluate the variance of the data. Will some axis have significantly more variance than the other?


In [ ]:
pca_random.explained_variance_

As we can see, there is not a huge difference in variance between the two axes, which is expected for random data. If we increase num_points in random_xy(), the two values will get even closer together.
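
On a relative scale, explained_variance_ratio_ (a standard sklearn PCA attribute) tells the same story: for uncorrelated data, each of the two components should capture roughly half of the total variance.


In [ ]:
# Fraction of the total variance captured by each component; both values
# should hover around 0.5 for truly random 2D data
pca_random.explained_variance_ratio_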

Correlated 2D Data

We'll now assess a correlated dataset, where PCA should find one direction that captures most of the variance.


In [ ]:
# Correlated data
data_correlated = x_plus_noise(slope=1)
plot_2d(data_correlated)

Initialize PCA and fit it on the correlated data.


In [ ]:
pca_correlated = PCA()
pca_correlated.fit(data_correlated)

In [ ]:
pca_correlated.explained_variance_

Now we can see one principal component with significantly higher variance than the other. This is useful knowledge: if we have to, we can keep only that component and work with one-dimensional data without losing much information.
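
As a quick sketch of that idea, we can keep only the first component and then map the points back to 2D; the reconstruction error should be small because the discarded component carries little variance.


In [ ]:
pca_1d = PCA(n_components=1)
reduced = pca_1d.fit_transform(data_correlated)   # project onto the first component
restored = pca_1d.inverse_transform(reduced)      # map back to the original 2D space
# Mean squared reconstruction error: small if one component suffices
np.mean((np.asarray(data_correlated) - restored) ** 2)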

Hint: check x_plus_noise() with slope=-1.
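
Trying that hint out (assuming x_plus_noise accepts the same arguments as before): the first component should still dominate, but its direction flips to follow the descending trend.


In [ ]:
pca_neg = PCA()
pca_neg.fit(x_plus_noise(slope=-1))
# The first value dwarfs the second; components_ shows the flipped direction
pca_neg.explained_variance_, pca_neg.components_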