In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import scipy.stats
import sklearn.datasets
import sklearn.preprocessing
%matplotlib notebook
plt.xkcd()
None
A data visualization goes a long way in helping you understand the underlying dataset. If the data is highly dimensional, you can use Singular Value Decomposition (SVD) to find a reduced-rank approximation of the data that can be visualized easily.
We start off with the Iris flower dataset. The data is multivariate, with 150 measurements of 4 features (the length and width, in cm, of both the sepal and the petal) across 3 distinct Iris species. Of the 150 measurements, there are 50 each for Iris setosa, Iris versicolor, and Iris virginica.
Scikit-learn's datasets
module includes the Iris dataset, so let's load that up and start exploring.
In [2]:
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
print('Iris dataset has {} rows and {} columns\n'.format(*df_iris.shape))
print('Here are the first 5 rows of the data:\n\n{}\n'.format(df_iris.head(5)))
print('Some simple statistics on the Iris dataset:\n\n{}\n'.format(df_iris.describe()))
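As a quick aside, we can also confirm the class balance described above. Here is a minimal sketch that maps the integer labels in iris.target (returned by load_iris along with the feature data) to species names and counts them:
In [ ]:
# iris.target holds the integer class labels (0, 1, 2); each species should
# appear 50 times.
print(pd.Series(iris.target).map(dict(enumerate(iris.target_names))).value_counts())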
As we are exploring the dataset, it would be nice to view the data in order to get an idea of how the 3 species might be distributed with respect to one another in terms of their features. Perhaps we are interested in finding clusters, or maybe we would like to find a way to make class predictions?
However, since the data has 4 dimensions, we would be hard-pressed to come up with a good way to graph the data in 4D that we could easily understand.
But what if we could reduce or compress the data so that we could work in 3 dimensions or less?
Singular value decomposition lets us do just that.
Singular value decomposition factorizes a matrix $X \in \mathbb{R}^{m \times n}$ into an orthogonal matrix $U \in \mathbb{R}^{m \times m}$ whose columns are the left singular vectors, a diagonal matrix $\Sigma \in \mathbb{R}^{m \times n}$ whose diagonal entries are the singular values, and an orthogonal matrix $V \in \mathbb{R}^{n \times n}$ whose columns are the right singular vectors, such that
\begin{align} X &= U \, \Sigma \, V^{\intercal} \end{align}
We can use numpy.linalg.svd
to factorize the Iris data matrix into the three components $U$, $\Sigma$, and $V^{\intercal}$.
In [3]:
U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris)
In [4]:
print('matrix U has {} rows, {} columns\n'.format(*U_iris.shape))
print('{}'.format(pd.DataFrame(U_iris).head(5)))
numpy.linalg.svd
actually returns $V^{\intercal}$ instead of $V$, so it is the columns of $V^{\intercal}$ that correspond to the columns of the original data matrix $X$. Hence, the rows of $V^{\intercal}$ are the set of ordered, orthonormal eigenvectors of $X^{\intercal} \, X$.
In [5]:
print('matrix Vt has {} rows, {} columns\n'.format(*Vt_iris.shape))
print('{}'.format(pd.DataFrame(Vt_iris).head()))
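As a quick sanity check (a minimal sketch reusing the variables from the cells above), the rows of Vt_iris should indeed be orthonormal and should diagonalize $X^{\intercal} \, X$, with the squared singular values appearing on the diagonal:
In [ ]:
# The rows of Vt_iris are orthonormal, and conjugating X^T X by Vt_iris
# should recover a diagonal matrix of squared singular values.
XtX = df_iris.values.T @ df_iris.values
print(np.allclose(Vt_iris @ Vt_iris.T, np.eye(Vt_iris.shape[0])))
print(np.allclose(Vt_iris @ XtX @ Vt_iris.T, np.diag(np.square(S_iris))))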
The elements $\sigma_{i}$ of the diagonal matrix $\Sigma$ are the non-zero singular values of matrix $X$, which are really just the square roots of the non-zero eigenvalues of $X^{\intercal} \, X$ (and also of $X \, X^{\intercal}$). These singular values can be used to determine how much of the variance in the original data matrix $X$ is explained by a reduced-rank approximation $X^{\prime}$.
\begin{align} X^{\prime}_{k} &= U_{k} \, \Sigma_{k} \, V^{\intercal}_{k} \\ &\approx X_{r} & \text{ where } rank(X^{\prime}) = k \lt rank(X) = r \end{align}
The amount of variance that the reduced-rank approximation $X^{\prime}_{k}$ explains of $X_{r}$ is
\begin{align} \text{cum. variance explained} &= \frac{\sum_{j=1}^{k} \sigma_{j}^{2}}{\sum_{i=1}^{r} \sigma_{i}^{2}} \end{align}
NOTE: numpy.linalg.svd
does not return $\Sigma$ as a diagonal matrix; it returns a 1-D array containing just the diagonal entries, i.e. the singular values.
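To make that concrete, here is a minimal sketch (reusing the factors computed above) that places the 1-D array S_iris back on the diagonal of a full-size $\Sigma$ and verifies the factorization:
In [ ]:
# Rebuild the (150 x 4) diagonal matrix Sigma from the 1-D array S_iris and
# confirm that U Sigma V^T reproduces the original Iris data matrix.
Sigma_iris = np.zeros((U_iris.shape[1], Vt_iris.shape[0]))
np.fill_diagonal(Sigma_iris, S_iris)
print(np.allclose(U_iris @ Sigma_iris @ Vt_iris, df_iris.values))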
In [6]:
num_sv_iris = np.arange(1, S_iris.size+1)
cum_var_explained_iris = [np.sum(np.square(S_iris[0:n])) / np.sum(np.square(S_iris)) for n in num_sv_iris]
Let's have a look at the cumulative variance explained visually as a function of the number of singular values used when reducing rank to find a lower-ranked matrix $X^{\prime}$ to approximate $X$. This will inform us as to how many dimensions we should use.
In [7]:
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.plot(num_sv_iris,
cum_var_explained_iris,
color='#2171b5',
label='variance explained',
alpha=0.65,
zorder=1000)
plt.scatter(num_sv_iris,
sklearn.preprocessing.normalize(S_iris.reshape((1,-1))),
color='#fc4e2a',
label='singular values (normalized)',
alpha=0.65,
zorder=1000)
plt.legend(loc='center right', scatterpoints=1, fontsize=8)
ax.set_xticks(num_sv_iris)
ax.set_xlim(0.8, 4.1)
ax.set_ylim(0.0, 1.1)
ax.set_xlabel(r'Number of singular values used')
ax.set_ylabel('Variance in data explained')
ax.set_title('Iris dataset: cumulative variance explained & singular values',
fontsize=14,
y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.8, zorder=1)
plt.tight_layout()
Judging from the curve representing cumulative variance explained in the figure above, we can see that the first singular value alone already accounts for the vast majority of the variance, and that using the first 2 singular values captures nearly 99.8% of the variance in the data.
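Here is a minimal sketch (reusing the SVD factors computed earlier; the variable names are just for illustration) that reconstructs the rank-2 approximation $X^{\prime}_{2}$ and reports the share of variance it retains:
In [ ]:
# Keep only the 2 largest singular values to form the rank-2 approximation.
k = 2
X_prime_2 = U_iris[:, :k] @ np.diag(S_iris[:k]) @ Vt_iris[:k, :]
print('first 5 rows of the rank-2 approximation:\n\n{}\n'.format(
    pd.DataFrame(X_prime_2, columns=iris.feature_names).head()))
print('variance explained: {:.4f}'.format(
    np.sum(np.square(S_iris[:k])) / np.sum(np.square(S_iris))))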
Since graphing the Iris dataset in 1D wouldn't be all that interesting (just dots on a line segment), let's use the first 2 columns of $U$, which correspond to the 2 largest singular values, to represent the data on the $x$- and $y$-axes, respectively.
In [8]:
idx_setosa = np.where(iris.target==0)[0]
idx_versicolor = np.where(iris.target==1)[0]
idx_virginica = np.where(iris.target==2)[0]
setosa_x = U_iris[idx_setosa, 0]
setosa_y = U_iris[idx_setosa, 1]
versicolor_x = U_iris[idx_versicolor, 0]
versicolor_y = U_iris[idx_versicolor, 1]
virginica_x = U_iris[idx_virginica, 0]
virginica_y = U_iris[idx_virginica, 1]
We will use different marker shapes and colors to represent the three Iris species on our 2D graph.
In [9]:
fig = plt.figure(figsize=(7.0,5.5))
ax = fig.add_subplot(111)
plt.scatter(setosa_x,
setosa_y,
marker='o',
color='#66c2a5',
label='Iris-setosa',
zorder=1000)
plt.scatter(versicolor_x,
versicolor_y,
marker='D',
color='#fc8d62',
label='Iris-versicolor',
zorder=1000)
plt.scatter(virginica_x,
virginica_y,
marker='^',
color='#8da0cb',
label='Iris-virginica',
zorder=1000)
plt.legend(loc='upper left', scatterpoints=1, fontsize=8)
ax.set_xlabel(r'singular value $\sigma_{1}$')
ax.set_ylabel(r'singular value $\sigma_{2}$')
ax.set_title('2D plot of Iris dataset',
fontsize=14,
y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.6, zorder=1)
plt.tight_layout()
There!
Now that we are viewing the originally 4-dimensional data in 2 dimensions, using the first 2 columns of the left singular vector matrix $U$, we can see that there is a very clear separation between the Iris setosa class and the other two. The demarcation between Iris versicolor and Iris virginica, on the other hand, is not as clear cut.
Nevertheless, since this 2D reduced-rank matrix representation $X^{\prime}$ explains nearly 99.8% of the variance in the original dataset, we can be pretty certain that clustering and classification should be possible.
In the previous example, we were exploring the Iris dataset which is a matrix $X \in \mathbb{R}^{150 \times 4}$. Singular value decomposition helped us to find a reduced-rank matrix $X^{\prime} \in \mathbb{R}^{150 \times 2}$ that accurately approximated the original data matrix and let us visualize the 4-dimensional data using 2 dimensions.
Let's now consider another dataset where the values are not in $\mathbb{R}$, but are categorical.
For this example, we explore a fictional survey of 1000 respondents from each of five countries (Canada, USA, England, Italy and Switzerland), asking them what their primary language is (among English, French, Spanish, German and Italian). So in our data we have categories for both country and language.
We read in the data from file using pandas.read_csv.
In [10]:
fin = os.path.join(os.getcwd(),
'data',
'country_language.csv')
df_countries = pd.read_csv(fin, dtype='category')
print("data has {} measurements for {} variables\n".format(*df_countries.shape))
print("Here are the first 10 rows...\n\n{}\n...".format(df_countries.head(10)))
Our next step in exploring the data is to break out the data in terms of the 2 categories.
Here we convert the raw observations into a contingency table $F$, with the countries as rows and the languages as columns. pandas.crosstab
will do just that.
In [11]:
countries = ['Canada', 'USA', 'England', 'Italy', 'Switzerland']
languages = ['English', 'French', 'Spanish', 'German', 'Italian']
F = pd.crosstab(df_countries.country, df_countries.language, margins=False)
F.index = countries
F.columns = languages
print("{}".format(F))
Now say that we are interested in seeing the relation between the countries and the languages.
However, we cannot blindly apply singular value decomposition to contingency table $F$ above, since the raw counts reflect the marginal totals of the rows and columns as much as any relationship between the two categories.
Since we are working with 2 distinct categories, we can apply correspondence analysis to transform the contingency table into a form where we can use singular value decomposition. At that point, we should be able to find a reduced-rank matrix that approximates the original data, which in turn lets us graphically represent the relations between countries and languages.
The idea is to use the $\chi^{2}$ distances between rows and columns as our basis for singular value decomposition, since the $\chi^{2}$ statistic lets us assess the independence of qualitative variables.
In [12]:
P = F / F.sum().sum()
print('correspondence matrix P:\n\n{}'.format(P))
Using correspondence matrix $P$, we next obtain the row centroid $p_{i+}$. The row centroid can be interpreted as the marginal frequency distribution over the sum of the countries (rows), and it reflects the fact that there were exactly 1000 respondents per country in our fictional survey.
The row centroid $p_{i+}$ is derived from correspondence matrix $P$ with
\begin{align} p_{i+} &= \sum_{j=1}^{J} p_{ij} \end{align}
In [13]:
row_centroid = P.sum(axis=1)
print('row centroid (marginal frequency distribution over countries):\n\n{}'.format(row_centroid))
Similarly, we obtain the column centroid $p_{+j}$ from $P$. The column centroid can be interpreted as the marginal frequency distribution over the sum of the languages (columns).
The column centroid $p_{+j}$ is derived from correspondence matrix $P$ with
\begin{align} p_{+j} &= \sum_{i=1}^{I} p_{ij} \end{align}
In [14]:
col_centroid = P.sum(axis=0)
print('column centroid (marginal frequency distribution over languages):\n\n{}'.format(col_centroid))
So rather than using the contingency table $F$ as the basis for singular value decomposition, we can look at the $\chi^{2}$ distances between rows and columns for visualizing the relation between countries and languages.
The $\chi^{2}$ statistic is given by
\begin{align} \chi^{2} &= \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}} \\ &= N \, \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(p_{ij} - \mu_{ij})^{2}}{\mu_{ij}} \\ &= N \, \Lambda^{2} \\ \\ \Rightarrow \Lambda^{2} &= \frac{\chi^{2}}{N} \\ \Lambda &= \left[ \frac{p_{ij} - \mu_{ij}}{\sqrt{\mu_{ij}}} \right] = \left[ \frac{p_{ij} - p_{i+}\,p_{+j}}{\sqrt{p_{i+}\,p_{+j}}} \right] \end{align}
Inertia matrix $\Lambda$ measures the distribution of the individual profiles (rows/columns) around the average profile (centroids). Thus inertia represents the observed deviation from independence.
Through its relation with the statistic $\chi^{2}$, inertia $\Lambda$ (a matrix of standardized residuals) provides the basis for using singular value decomposition.
In [15]:
Mu_ij = row_centroid.values.reshape((P.index.size,1)) * col_centroid.values.reshape((1,P.columns.size))
Lambda = (P - Mu_ij) / np.sqrt(Mu_ij)
print('inertia Lambda:\n\n{}'.format(Lambda))
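Before decomposing $\Lambda$, we can sanity-check the relation $\chi^{2} = N \, \Lambda^{2}$ against a direct computation. This is a minimal sketch using scipy.stats.chi2_contingency (scipy.stats was imported at the top) on contingency table $F$:
In [ ]:
# N times the sum of squared entries of Lambda should equal the chi-squared
# statistic of the contingency table (no continuity correction is applied
# for tables larger than 2x2).
N = F.values.sum()
chi2_from_inertia = N * np.sum(np.square(Lambda.values))
chi2_direct, p_val, dof, expected = scipy.stats.chi2_contingency(F)
print('chi^2 from inertia matrix Lambda: {:.4f}'.format(chi2_from_inertia))
print('chi^2 from scipy.stats.chi2_contingency: {:.4f}'.format(chi2_direct))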
In [16]:
U_lambda, S_lambda, Vt_lambda = np.linalg.svd(Lambda)
num_sv_lambda = np.arange(1, S_lambda.size+1)
cum_var_explained_lambda = [np.sum(np.square(S_lambda[0:n])) / np.sum(np.square(S_lambda)) for n in num_sv_lambda]
Once again, we look at the cumulative variance explained visually as a function of the number of singular values used when reducing rank to find a lower-ranked inertia matrix $\Lambda^{\prime}$ to approximate $\Lambda$.
In [17]:
fig = plt.figure(figsize=(7.0, 5.5))
ax = fig.add_subplot(111)
plt.plot(num_sv_lambda,
cum_var_explained_lambda,
color='#2171b5',
label='variance explained',
alpha=0.65,
zorder=1000)
plt.scatter(num_sv_lambda,
sklearn.preprocessing.normalize(S_lambda.reshape((1,-1))),
color='#fc4e2a',
label='singular values (normalized)',
alpha=0.65,
zorder=1000)
plt.legend(loc='lower left', scatterpoints=1, fontsize=8)
ax.set_xticks(num_sv_lambda)
ax.set_xlim(0.9, 5.1)
ax.set_ylim(-0.1, 1.1)
ax.set_xlabel(r'Number of singular values used')
ax.set_ylabel('Variance in data explained')
ax.set_title('Countries/languages dataset: cumulative var. explained & singular values',
fontsize=14,
y=1.03)
ax.set_facecolor('0.98')
plt.grid(alpha=0.8, zorder=1)
plt.tight_layout()
Judging from the curve representing cumulative variance explained with respect to the number of singular values used, we see that most of the variance is already captured by the first few singular values, so a rank-3 approximation should serve well for visualizing the data.
To mix things up a bit, let's try visualizing the countries and languages in 3D.
For countries, we will take the first 3 columns of $U$ as the $x$-, $y$- and $z$-coordinates.
But since numpy.linalg.svd
returns $V^{\intercal}$ instead of $V$, we will take the first 3 rows of $V^{\intercal}$ for the $x$-, $y$- and $z$-coordinates for the languages.
In [18]:
country_x = U_lambda[:, 0]
country_y = U_lambda[:, 1]
country_z = U_lambda[:, 2]
lang_x = Vt_lambda[0, :]
lang_y = Vt_lambda[1, :]
lang_z = Vt_lambda[2, :]
That was a bit of work moving from raw data to contingency table, to correspondence matrix, to the inertia matrix, and then finally to singular value decomposition, but we are now ready to see how the categories of country and language relate to one another in 3 dimensions.
In [19]:
import pylab
from mpl_toolkits.mplot3d import Axes3D, proj3d
fig = pylab.figure(figsize=(7.5,5.5))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(country_x, country_y, country_z, marker='s', s=50, c='#2171b5')
cntry_labels = []
for i, (x, y, z) in enumerate(zip(country_x, country_y, country_z)):
    x2, y2, _ = proj3d.proj_transform(x, y, z, ax.get_proj())
    label = pylab.annotate(Lambda.index[i],
                           xy=(x2, y2),
                           xytext=(-2, 2),
                           textcoords='offset points',
                           ha='right',
                           va='bottom',
                           color='#2171b5',
                           alpha=0.9)
    cntry_labels.append(label)
ax.scatter(lang_x, lang_y, lang_z, marker='o', s=50, c='#fc4e2a')
lang_labels = []
for i, (x, y, z) in enumerate(zip(lang_x, lang_y, lang_z)):
    x2, y2, _ = proj3d.proj_transform(x, y, z, ax.get_proj())
    label = pylab.annotate(Lambda.columns[i],
                           xy=(x2, y2),
                           xytext=(-2, 2),
                           textcoords='offset points',
                           ha='right',
                           va='bottom',
                           color='#fc4e2a',
                           alpha=0.4)
    lang_labels.append(label)
def update_position(e):
    for i, (x, y, z) in enumerate(zip(country_x, country_y, country_z)):
        x2, y2, _ = proj3d.proj_transform(x, y, z, ax.get_proj())
        cntry_labels[i].xy = x2, y2
    for i, (x, y, z) in enumerate(zip(lang_x, lang_y, lang_z)):
        x2, y2, _ = proj3d.proj_transform(x, y, z, ax.get_proj())
        lang_labels[i].xy = x2, y2
    fig.canvas.draw()
fig.canvas.mpl_connect('button_release_event', update_position)
ax.set_xlabel(r'singular value $\sigma_{1}$')
ax.set_xticks([-0.5, 0.0, 0.5])
ax.set_ylabel(r'singular value $\sigma_{2}$')
ax.set_yticks([-0.5, 0.0, 0.4])
ax.set_zlabel(r'singular value $\sigma_{3}$')
ax.set_zticks([-0.5, 0.0, 0.5])
ax.set_title('3D plot of Countries/Languages dataset',
fontsize=16,
y=1.1)
plt.tight_layout()
pylab.show()
And there we are!
You can see how the anglophone countries Canada, USA and England are in close proximity to English and to each other, with Spanish close to the USA and French closer to Canada. German is close to Switzerland, with French somewhat nearby. Italian, however, sits out close to Italy, with both located largely in isolation from the other countries and languages.
This should match up with your intuition from contingency table $F$.
Principal components analysis (PCA) and singular value decomposition are closely related, and you may often hear both these terms used in the same breath.
Here is a quick mathematical treatment explaining how PCA and SVD are related.
Consider data matrix $X \in \mathbb{R}^{m \times n}$ where $m > n$, and all $x_{ij}$ are centered about the column means. With principal components analysis, we have
\begin{align} \text{covariance matrix } C &= \frac{1}{m} \, X^{\intercal} \, X & \text{from PCA} \\ &= \frac{1}{m} \, V \, \Gamma \, V^{\intercal} & \text{by eigendecomposition of } X^{\intercal} \, X \\ \\ \text{ but } X &= U \, \Sigma \, V^{\intercal} & \text{from SVD} \\ \\ \Rightarrow C &= \frac{1}{m} \, V \, \Sigma \, U^{\intercal} \, U \, \Sigma \, V^{\intercal} \\ &= \frac{1}{m} \, V \, \Sigma^{2} \, V^{\intercal} \\ \end{align}
So we see that the principal component directions from PCA are the right singular vectors $V$ from the SVD of the mean-centered data, and that the eigenvalues $\Gamma$ of $X^{\intercal} \, X$ are the squared singular values, $\gamma_{i} = \sigma_{i}^{2}$. Let's verify this numerically.
In [20]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(iris.data)
# don't forget to mean-center the data before SVD
X = iris.data - np.mean(iris.data, axis=0)
U, S, Vt = np.linalg.svd(X)
In [21]:
Cov_pca = pca.get_covariance()
print('eigenvalues from PCA:\n{}\n'.format(np.linalg.eigvals(Cov_pca * X.shape[0])))
print('squared singular values from SVD:\n{}'.format(np.square(S)))
In [22]:
print('covariance matrix C derived from PCA:\n{}\n'.format(Cov_pca))
Cov_svd = (1. / X.shape[0]) * Vt.T.dot(np.diag(np.square(S))).dot(Vt)
print('covariance matrix using S and Vt from SVD:\n{}\n'.format(Cov_svd))
allclose = np.allclose(Cov_pca, Cov_svd, atol=1e-1)
print('Are these matrices equivalent (element-wise closeness comparison)?\n{}'.format(allclose))
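To round out the comparison, the principal component directions themselves should line up with the right singular vectors. Here is a minimal sketch comparing pca.components_ with the rows of Vt; the two can only agree up to a sign flip, since singular vectors are defined up to sign:
In [ ]:
# Each principal axis from PCA should equal the corresponding row of Vt,
# possibly with the overall sign flipped.
for i, (pc, v) in enumerate(zip(pca.components_, Vt)):
    print('principal axis {} matches row {} of Vt (up to sign): {}'.format(
        i, i, np.allclose(pc, v) or np.allclose(pc, -v)))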
In [ ]: