To make biological survival possible, Mind at Large has to be funnelled through the reducing valve of the brain and nervous system. What comes out at the other end is a measly trickle of the kind of consciousness which will help us to stay alive on the surface of this particular planet. -- Aldous Huxley, The Doors of Perception
In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.decomposition
import sklearn.manifold
import sklearn.preprocessing
%matplotlib inline
plt.rcParams["figure.figsize"] = (13, 13)
sns.set(context = "paper", font = "monospace")
This is DEFRA data on the consumption, in grams per person per week, of 17 different types of food, measured and averaged across the four countries of the UK in 1997.
In [2]:
df_raw = pd.DataFrame(
    [
        ["alcoholic drinks"  , 375, 135, 458, 475],
        ["beverages"         , 57, 47, 53, 73],
        ["carcase meat"      , 245, 267, 242, 227],
        ["cereals"           , 1472, 1494, 1462, 1582],
        ["cheese"            , 105, 66, 103, 103],
        ["confectionery"     , 54, 41, 62, 64],
        ["fats and oils"     , 193, 209, 184, 235],
        ["fish"              , 147, 93, 122, 160],
        ["fresh fruit"       , 1102, 674, 957, 1137],
        ["fresh potatoes"    , 720, 1033, 566, 874],
        ["fresh Veg"         , 253, 143, 171, 265],
        ["other meat"        , 685, 586, 750, 803],
        ["other veg."        , 488, 355, 418, 570],
        ["processed potatoes", 198, 187, 220, 203],
        ["processed veg."    , 360, 334, 337, 365],
        ["soft drinks"       , 1374, 1506, 1572, 1256],
        ["sugars"            , 156, 139, 147, 175]
    ],
    columns = [
        "foods",
        "England",
        "Northern Ireland",
        "Scotland",
        "Wales"
    ]
)
df = df_raw
df = df.set_index("foods")
df = df.transpose()
df.index.name = "countries"
df.columns.name = None
df
Out[2]:
In [3]:
# min-max scaler: rescale each feature to the range [-1, 1]
scaler = sklearn.preprocessing.MinMaxScaler(feature_range = (-1, 1))
df_scaled = pd.DataFrame(scaler.fit_transform(df), index = df.index, columns = df.columns)
df_scaled
Out[3]:
Principal component analysis (PCA) is a technique used to emphasize variation and make strong patterns in data apparent. It is often used to make data easier to explore and visualize.
Consider a dataset in two dimensions, such as height and weight. The data can be plotted as points on a plane. However, to make the variation more apparent, PCA finds a new coordinate system in which every point has a new (x, y) value. The axes do not correspond to physical measurements; they are combinations of height and weight, called "principal components", chosen so that one axis captures as much of the variation as possible.
PCA is useful for eliminating dimensions. If the data is going to be viewed along only one dimension, it may be beneficial to make that dimension the principal component with the most variation.
With three dimensions, PCA is more useful because it is difficult to see through a cloud of data. With data plotted in 3D, a projection to 2D involves a transformation that is conceptually identical to finding a good camera angle from which to view the cloud of data; the axes are rotated to find the best angle. The PCA transformation ensures that the horizontal axis has the greatest variation, the vertical axis the second greatest and the third axis the least. The axis with the least variation is dropped.
So, in PCA, the features projected onto the principal components retain the important information (the axes with maximum variance) and the axes with small variance are dropped. It is a simple and popular linear transformation technique.
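As a minimal sketch of this idea (not part of the original analysis; the height/weight numbers below are synthetic and purely illustrative), PCA can be fit to a correlated 2D dataset to check how much of the variance the first principal component captures before projecting onto that single axis.
# synthetic, correlated "height/weight"-style data, for illustration only
rng = np.random.RandomState(0)
heights = rng.normal(170, 10, 200)                    # cm
weights = 0.9 * heights - 85 + rng.normal(0, 5, 200)  # kg, correlated with height
data_2d = np.column_stack([heights, weights])
pca_2d = sklearn.decomposition.PCA(n_components = 2).fit(data_2d)
print(pca_2d.explained_variance_ratio_)               # the first component should dominate
projection_1d = pca_2d.transform(data_2d)[:, 0]       # keep only the axis with the most variation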
PCA yields the directions (principal components) that maximize the variance of the data. As an alternative, Linear Discriminant Analysis (LDA) aims to find the directions that maximize the separation (or discrimination) between different classes. In other words, PCA projects the entire dataset onto a different feature subspace and LDA tries to determine a suitable feature subspace in order to distinguish between patterns that belong to different classes.
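The contrast can be sketched on a labelled toy dataset; the DEFRA table has no class labels, so the standard iris data is borrowed here purely for illustration. PCA ignores the labels, while LDA uses them.
import sklearn.datasets
import sklearn.discriminant_analysis
iris = sklearn.datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
# PCA: unsupervised, picks directions of maximum variance, labels ignored
X_iris_pca = sklearn.decomposition.PCA(n_components = 2).fit_transform(X_iris)
# LDA: supervised, picks directions that best separate the three classes
X_iris_lda = sklearn.discriminant_analysis.LinearDiscriminantAnalysis(
    n_components = 2
).fit_transform(X_iris, y_iris)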
Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement scales of the original features. Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it was measured on different scales.
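A quick, optional check of that point, using the df and df_scaled frames built above (this comparison is an addition, not part of the original notebook): without rescaling, the directions found are driven by whichever features happen to have the largest raw variance.
# compare how much variance the leading components explain with and without scaling
pca_raw    = sklearn.decomposition.PCA().fit(df.values)
pca_scaled = sklearn.decomposition.PCA().fit(df_scaled.values)
print(pca_raw.explained_variance_ratio_[:2])
print(pca_scaled.explained_variance_ratio_[:2])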
In [4]:
names_features = df_raw["foods"].values
y_positions = np.arange(len(names_features))
variances_features = []
for name_feature in names_features:
    variances_features.append(
        df_raw[df_raw["foods"] == name_feature][["England", "Northern Ireland", "Scotland", "Wales"]].values[0].var()
    )
plt.rcParams["figure.figsize"] = (13, 6)
plt.bar(y_positions, variances_features, align = "center")
plt.xticks(y_positions, names_features, rotation = 90)
plt.ylabel("variance")
plt.show()
In [5]:
# split the data table into feature matrix X and class labels y
X = df[df.columns].values
y = df.index.values
# with only 4 samples, at most 4 principal components exist; keep the first 2 for plotting
Y = sklearn.decomposition.PCA(n_components = 2).fit_transform(X)
plt.rcParams["figure.figsize"] = (13, 13)
for point, label in zip(Y, y):
    plt.scatter(point[0], point[1])
    plt.annotate(label, (point[0], point[1]))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
t-SNE is a tool for data visualization. It reduces the dimensionality of data to 2 or 3 dimensions so that it can be plotted and interpreted easily by humans. It can create compelling maps from data with hundreds or even thousands of dimensions. t-SNE converts distances between data points in the original space to probabilities. t-SNE can help to indicate whether classes are separable in some linear or nonlinear representation.
The goal is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space. The algorithm is nonlinear and adapts to the underlying data, performing different transformations on different regions. The t-SNE algorithm adapts its notion of "distance" to regional density variations in the data. As a result, it expands dense clusters and contracts sparse ones, evening out cluster sizes. This density equalization happens by design and is a predictable feature of t-SNE. The actual distances between clusters might not mean anything.
So, t-SNE reorganises a dataset so that it preserves local similarity. It shows which pieces of data are locally connected in the manifold/embedding. Similar data are clustered together.
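As a rough sketch of the "distances become probabilities" step (an illustration only, not the full algorithm: real t-SNE tunes a per-point Gaussian bandwidth to match the chosen perplexity and then symmetrises these conditional probabilities), pairwise squared distances can be turned into row-normalised Gaussian affinities:
import scipy.spatial.distance
# a handful of random high-dimensional points, for illustration only
points = np.random.RandomState(0).normal(size = (5, 10))
squared_distances = scipy.spatial.distance.squareform(
    scipy.spatial.distance.pdist(points, "sqeuclidean")
)
sigma = 1.0  # real t-SNE picks a separate sigma per point from the perplexity
affinities = np.exp(-squared_distances / (2 * sigma ** 2))
np.fill_diagonal(affinities, 0)  # a point is not its own neighbor
p_conditional = affinities / affinities.sum(axis = 1, keepdims = True)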
A feature of t-SNE is a tuneable parameter called "perplexity", which describes how to balance attention between local and global aspects of the data. The parameter is, in a sense, a guess about the number of close neighbors each point has. The perplexity value has a complex effect on the resulting pictures. The perplexity should really be smaller than the number of points. A plausible approach is to iterate until a stable configuration is reached.
One rough approach could be to set the perplexity to about 5% of the dataset size. So, for a dataset with 100K cases, an initial perplexity of ~5000 could be set, or at least ~1000 if a high performance computer isn't available.
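As a small sketch of how one might explore this (the sample size and perplexity values here are arbitrary choices for illustration, not recommendations), the same data can be embedded at several perplexities and the resulting maps compared:
import sklearn.datasets
# 300 points from the S-curve used later in this notebook; every perplexity stays below 300
X_demo, color_demo = sklearn.datasets.make_s_curve(300, random_state = 0)
embeddings = {
    perplexity: sklearn.manifold.TSNE(
        n_components = 2,
        perplexity = perplexity,
        init = "pca",
        random_state = 0
    ).fit_transform(X_demo)
    for perplexity in (5, 30, 100)
}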
In [6]:
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d
import matplotlib.ticker
import sklearn.manifold
import sklearn.datasets
# importing mpl_toolkits.mplot3d registers the "3d" projection used below
# data: sample 1000 points from a 3D S-curve
n_points = 1000
X, color = sklearn.datasets.make_s_curve(n_points)
n_components = 2
fig = plt.figure(figsize = (15, 8))
# 3D plot
ax = fig.add_subplot(1, 2, 1, projection = "3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c = color, cmap = plt.cm.Spectral)
ax.view_init(4, -72)
ax.title.set_text("3D S-curve data")
# t-SNE
Y = sklearn.manifold.TSNE(n_components = n_components, init = "pca").fit_transform(X)
ax = fig.add_subplot(1, 2, 2)
plt.scatter(Y[:, 0], Y[:, 1], c = color, cmap = plt.cm.Spectral)
ax.title.set_text("t-SNE")
ax.xaxis.set_major_formatter(matplotlib.ticker.NullFormatter())
ax.yaxis.set_major_formatter(matplotlib.ticker.NullFormatter())
plt.axis("tight")
plt.show()
In [7]:
# split the data table into feature matrix X and class labels y
X = df[df.columns].values
y = df.index.values
# the barnes_hut method only supports n_components <= 3, and the perplexity
# has to stay below the number of samples (4 here), so embed into 2D with a
# small perplexity
Y = sklearn.manifold.TSNE(
    n_components = 2,
    perplexity = 2,
    init = "pca",
    method = "barnes_hut"
).fit_transform(X)
plt.rcParams["figure.figsize"] = (13, 13)
for point, label in zip(Y, y):
    plt.scatter(point[0], point[1])
    plt.annotate(label, (point[0], point[1]))
plt.show()