Much of statistical learning relies on the ability to represent data observations (measurements) in a space comprising relevant dimensions (features). Often, the number of relevant dimensions is quite small: if you were trying to discern a model that described the area of a rectangle, observations of only two features (length and width) would be all you needed. Fisher's well-known iris dataset comprises 150 measurements of just four features 📊
In some cases - particularly with text analysis - the dimensionality of the space can grow much larger. In many approaches to text analysis, the process of getting from a text corpus to numerical feature vectors involves a few steps. Just as an example, one way to do this is to:
1. collect a corpus of documents,
2. tokenize the text of each document, and
3. convert each document's tokens into a numerical feature vector.
One way to accomplish the final step is to consider each token (i.e. each word) as a unique dimension, and the count of each word in a document as the magnitude along the corresponding dimension. There are certainly other ways to define each of these steps (and more subtle details to consider within each), but for now, we'll stick with this simple approach.
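Here's a minimal, purely illustrative sketch of that counting idea in plain Python (the tiny docs list is made up just for this example; we'll use scikit-learn's CountVectorizer for the real thing later):
In [ ]:
from collections import Counter

# toy corpus (hypothetical, for illustration only)
docs = ['the dog likes cats', 'the cat likes the dog']

# tokenize naively on whitespace
tokenized = [doc.split() for doc in docs]

# the vocabulary defines the dimensions of the vector space
vocab = sorted(set(tok for doc in tokenized for tok in doc))

# each document becomes a vector of per-token counts along those dimensions
vectors = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

print(vocab)
for v in vectors:
    print(v)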
Using exactly this approach, constructing a vector space from just a few minutes of Tweets (each Tweet considered a document) leads to a space with hundreds of thousands of features! In this high-dimensional vector space, it becomes easy to be misled by our intuition for statistical learning approaches in more "human" dimensions, e.g. one-, two-, and three-dimensional spaces. At this point, many people will cite the "curse of dimensionality."
There are multiple phenomena referred to by this name in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining, and databases. The common theme is that as the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance: to obtain a statistically sound and reliable result, the amount of data needed often grows exponentially with the dimensionality. Organizing and searching data also often relies on detecting areas where objects form groups with similar properties; in high-dimensional data, however, all objects appear sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient.
This "curse of dimensionality" refers to a few related, but distinct, challenges with data for statistical learning in high dimensions:
- the available data become sparse relative to the volume of the space,
- points tend to sit far from the origin, near the "edges" of the space, and
- pairwise distances become more and more similar, making it harder to identify meaningful clusters.
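To make the sparsity point a little more concrete, here's a rough numerical sketch: the ball inscribed in the cube [-1, 1]^d occupies a vanishing fraction of the cube's volume as d grows, so uniformly sampled points essentially never land "near the middle" of the space.
In [ ]:
import numpy as np

# rough illustration of the sparsity/volume problem: the fraction of the cube
# [-1, 1]^d occupied by the inscribed unit ball collapses as d grows
n_samples = 100000
for d in [1, 2, 3, 5, 10, 20]:
    pts = np.random.uniform(-1, 1, size=(n_samples, d))
    inside = (np.linalg.norm(pts, axis=1) <= 1.0).mean()
    print('d={:>2}: fraction of cube volume inside the inscribed ball ~ {:.5f}'.format(d, inside))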
I wanted to build more intuition around thinking about, visualizing, and generally being more aware of how these phenomena affect our typical kinds of analyses. This notebook is a first step, primarily focused on building an intuition for inspecting (and thinking about ways to inspect) spaces when we can no longer just plot them.
Along the way, I learned a number of new things, and I aim to explore them in follow-up pieces.
Note: Beware that there are a lot of reused variable names in this notebook. If you get an unexpected result, or an error, be sure to check that the appropriate data generation step was run!
These are the two high-level objectives we'll aim for:
1. build better intuition for how data (and our usual analysis techniques) behave as the number of dimensions grows
2. identify ways to inspect high-dimensional data that stay computationally manageable - roughly ~O(d) - as the dimension d grows
In [ ]:
import copy
try:
import ujson as json
except ImportError:
import json
import math
import operator
import random
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
from numpy.linalg import norm as np_norm
import matplotlib.pyplot as plt
import pandas as pd
from scipy.spatial import distance as spd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
sns.set_style('whitegrid')
%matplotlib inline
In [ ]:
# number of data points
n = 1000
# array of continuous values, randomly drawn from standard normal in two dimensions
X = np.array(np.random.normal(size=(n,2)))
# seaborn plays really nicely with pandas
df = pd.DataFrame(X, columns=['x0','x1'])
df.tail()
We have a 2-dimensional feature space containing 1000 data points. The coordinate axes are orthogonal, and we can equivalently think of each data point as a vector from the origin [ (0,0) in 2-dimensional space ] to the point defined by [x0, x1].
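As a quick sanity check on that picture, the length of one of these vectors is just the familiar Pythagorean distance from the origin (using the np_norm and math imports above):
In [ ]:
# the norm of the first data point's vector equals its euclidean distance from (0, 0)
v = X[0]
print(np_norm(v))
print(math.sqrt(v[0]**2 + v[1]**2))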
Since we only have two dimensions, we can look at the bivariate distribution quite easily using jointplot. Seaborn also gives us some handy tools for looking at the univariate distributions at the same time 🙌🏼
In [ ]:
sns.jointplot(data=df, x='x0', y='x1', alpha=0.5)
Another distribution that can provide some hints about the structure of data in a multi-dimensional vector space is the distribution of pairwise inter-point distances for all points in the data. Here's a function that makes this a little cleaner.
In [ ]:
def in_sample_dist(X, n):
"""Create a histogram of pairwise distances in array X, using n bins."""
plt.figure(figsize=(15,6))
# use scipy's pairwise distance function for efficiency
plt.hist(spd.pdist(X), bins=n, alpha=0.6)
plt.xlabel('inter-sample distance')
plt.ylabel('count')
In [ ]:
in_sample_dist(X,n)
In unsupervised statistical learning, we're often interested in the existence of "clusters" in data. Our intuition in low dimensions can be helpful here. In order to identify and label a grouping of points as being unique from some other grouping of points, there needs to be a similarity or "sameness" metric that we can compare. One such measure is simply the distance between all of the points. If a group of points are all qualitatively closer to each other than another group of points, then we might call those two groups unique clusters.
If we look at the distribution of inter-point distances above, we see a relatively smooth distribution, suggesting that no group of points is notably closer to (or farther from) the rest than any other group. We'll come back to this idea shortly. (The inspiration for this approach is found here: pdf)
Above, the bivariate jointplot works great for displaying our data when it's in two dimensions, but you can probably imagine that even in just d=3 dimensions, looking at this distribution of data will be really hard. So, I want to create a metric that gives us a feel for where the data is located in the vector space. There are many ways to do this; for now, I'm going to consider the euclidean distance cumulative distribution function*. Remember that the euclidean distance is the $L_{2}$ norm $dist(p,q) = \sqrt{ \sum_{i=1}^{d} (q_{i}-p_{i})^{2} }$, where $d$ is the dimensionality of the space. (Wiki)
*in fact, even in the course of developing this notebook, I learned that this is not a terribly great choice. But, hey, you have to start somewhere! ¯\_(ツ)_/¯
In [ ]:
def radius(vector):
"""Calculate the euclidean norm for the given coordinate vector."""
origin = np.zeros(len(vector))
# use scipy's distance functions again!
return spd.euclidean(origin, vector)
In [ ]:
# use our function to create a new 'r' column in the dataframe
df['r'] = df.apply(radius, axis=1)
df.head()
There are a couple of ways that I want to visualize this radial distance. First, I'd like to see the univariate distribution (from 0 to max(r)), and second, I'd like to see how much of the data is at a radius less than or equal to a particular value of r. To do this, I'll define a plotting function that takes a dataframe as shown above, and returns plots of these two distributions as described.
There's a lot of plotting hijinks in this function, so first just look at the output and see if it makes some sense. Then we can come back and dig through the plotting function.
In [ ]:
def kde_cdf_plot(df, norm=False, vol=False):
"""Display stacked KDE and CDF plots."""
assert 'r' in df, 'This method only works for dataframes that include a radial distance in an "r" column!'
if norm:
# overwrite df.r with normalized version
df['r'] = df['r'] / max(df['r'])
fig, (ax1, ax2) = plt.subplots(2,1,
sharex=True,
figsize=(15,8)
)
# subplot 1
sns.distplot(df['r'],
hist=False,
rug=True,
ax=ax1
)
ax1.set_ylabel('KDE')
ax1.set_title('n={} in {}-d space'.format(len(df), df.shape[1] - 1) )
# subplot 2
if vol:
raise NotImplementedError("Didn't finish implementing this volume normalization!")
dim = df.shape[1] - 1
df['r'].apply(lambda x: x**dim).plot(kind='hist',
cumulative=True,
normed=1,
bins=len(df['r']),
histtype='step',
linewidth=2,
ax=ax2
)
ax2.set_ylabel('CDF')
plt.xlim(0, .99*max(df['r'])**dim)
xlab = 'volume fraction'
else:
df['r'].plot(kind='hist',
cumulative=True,
normed=1,
bins=len(df['r']),
histtype='step',
linewidth=2,
ax=ax2
)
ax2.set_ylabel('CDF')
plt.xlim(0, .99*max(df['r']))
xlab = 'radial distance'
if norm:
xlab += ' (%)'
plt.xlabel(xlab)
Now, let's see these distributions for the 2-dimensional array we created earlier.
In [ ]:
kde_cdf_plot(df)
As a reminder: the seaborn docs have a nice illustration of this technique. The ticks along the bottom are called a "rug plot", and mark the individual values of the data (the values of r).
In [ ]:
# data points, dimensions, blob count
n = 1000
dims = 2
blobs = 5
# note: default bounding space is +/- 10.0 in each dimension
X, y = make_blobs(n_samples=n, n_features=dims, centers=blobs)
In [ ]:
# convert np arrays to a df, auto-label the columns
X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
X_df.head()
In [ ]:
sns.jointplot(data=X_df, x='x0', y='x1')
In [ ]:
X_df['r'] = X_df.apply(radius, axis=1)
#X_df.head()
This time, we'll incorporate one extra kwarg in the kde_cdf_plot function: norm=True displays the x axis (radial distance) as a fraction of the maximum value. This will be helpful when we're comparing spaces of varying radial magnitude.
In [ ]:
kde_cdf_plot(X_df, norm=True)
As a start, notice that the radius CDF for this data has shifted to the right. At larger r, we're closer to the "edge" of the space containing our data. The graph will vary with iterations of the data generation, but should consistently be shifted to the right relative to the 0-centered standard normal distribution.
Now let's look at the inter-sample distance distribution. Remember that this data is explicitly generated by a mechanism that includes clusters, so we should not expect the smooth, single-peaked distribution we saw for the standard normal data.
In [ ]:
in_sample_dist(X,n)
Sure enough, we can see that there are in fact some peaks in the inter-sample distance distribution. This makes sense, because we know that the data generation process encoded that exact idea. Since we're intentionally using a data generation process that builds in clusters, we'll always see a peak on the low end of the x axis: each cluster is created with a low (and similar) intra-cluster distance. The other, larger peaks illustrate the relationships between the clusters.
We may not see precisely the same number of peaks as were specified in the blob creation, though, because sometimes the blobs will sit on top of each other and "look" like one cluster. Compare the peaks of this distribution with the jointplot we created from the same data.
Let's increase the dimension count by one, to 3, just about the limit of our intuition's abilities. To make the data generation process a bit more reusable, we'll use a function to get the data array and corresponding dataframes.
In [ ]:
def make_blob_df(n_points=1000, dims=2, blobs=5, bounding_box=(-10.0, 10.0)):
"""Function to automate the np.array blob => pd.df creation and r calculation."""
# nb: default bounding space is +/- 10.0 in each dimension
X, y = make_blobs(n_samples=n_points, n_features=dims, centers=blobs, center_box=bounding_box)
# make a df, auto-label the columns
X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
X_df_no_r = copy.deepcopy(X_df)
# add a radial distance column
X_df['r'] = X_df.apply(radius, axis=1)
return X, X_df, X_df_no_r, y
In [ ]:
n = 1000
dims = 3
blobs = 5
X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)
X_df.head()
#X_df_no_r.head()
In [ ]:
fig = plt.figure(figsize=(12,7))
ax = fig.add_subplot(111, projection='3d')
ax.plot(X_df['x0'],X_df['x1'],X_df['x2'],'o', alpha=0.3)
ax.set_xlabel('x0'); ax.set_ylabel('x1'); ax.set_zlabel('x2')
In [ ]:
sns.pairplot(X_df_no_r, plot_kws=dict(alpha=0.3), diag_kind='kde')
In [ ]:
kde_cdf_plot(X_df, norm=True)
Again, compare this CDF to the 2-d case above; note that the data is closer to the "edge" of the space.
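A back-of-the-envelope calculation points at the same effect: for points spread uniformly through a d-dimensional ball, the fraction of the volume lying within a thin outer shell is 1 - (1 - eps)^d, which races toward 1 as d grows. (Our Gaussian blobs aren't uniform, but the qualitative crowding toward the edge is the same.)
In [ ]:
# fraction of a d-ball's volume lying in a thin outer shell
# (here, the outer 10% of the radius): 1 - (1 - eps)**d
eps = 0.1
for d in [1, 2, 3, 10, 100]:
    shell_fraction = 1 - (1 - eps)**d
    print('d={:>3}: {:.3f} of the volume is within {:.0%} of the surface'.format(d, shell_fraction, eps))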
In [ ]:
in_sample_dist(X,n)
In [ ]:
n = 1000
dims = 10
blobs = 5
X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)
X_df.head()
In [ ]:
# this starts to take a few seconds when d~10
sns.pairplot(X_df_no_r, diag_kind='kde', plot_kws=dict(alpha=0.3))
In [ ]:
kde_cdf_plot(X_df, norm=True)
In [ ]:
in_sample_dist(X,n)
Having seen the way these plots vary individually, let's compare, side-by-side, a similar data generation process (same number of points and clusters) in a range of dimensions.
In [ ]:
n_points = 1000
dim_range = [2, 100, 10000]
blob_count = 5
fig, (ax1, ax2) = plt.subplots(2,1, sharex=True, figsize=(15,8))
for d in dim_range:
## data generation
# random gaussian blobs in d-dims
X, y = make_blobs(n_samples=n_points, n_features=d, centers=blob_count)
##
## calculation
# create a labeled df from X
X_df = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
# add an 'r' column
#X_df_no_r = copy.deepcopy(X_df)
X_df['r'] = X_df.apply(radius, axis=1)
# normalize r value to % of max?
X_df['r'] = X_df['r'] / max(X_df['r'])
##
## plotting
# subplot 1 - KDE
sns.distplot(X_df['r'],
kde=True,
hist=False,
rug=True,
ax=ax1,
label='{}-dims'.format(d)
)
# subplot 2 - CDF
X_df['r'].plot(kind='hist',
cumulative=True,
normed=1,
bins=len(X_df['r']),
histtype='step',
linewidth=2,
ax=ax2
)
##
ax1.set_ylabel('KDE')
ax1.set_title('n={} in {}-d space'.format(len(X_df), dim_range) )
ax2.set_ylabel('CDF')
plt.xlim(0, .999*max(X_df['r']))
plt.xlabel('radial distance (%)')
In [ ]:
fig, (ax1, ax2, ax3) = plt.subplots(3,1, figsize=(15,9))
for i,d in enumerate(dim_range):
X, y = make_blobs(n_samples=n_points, n_features=d, centers=blob_count)
# loop through the subplots
plt.subplot('31{}'.format(i+1))
# plot the data
plt.hist(spd.pdist(X), bins=n_points, alpha=0.6)
plt.ylabel('count (d={})'.format(d))
ax3.set_xlabel('inter-sample distance')
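One rough way to boil each of those histograms down to a single number is to compare the spread of pairwise distances to the smallest distance observed (loosely inspired by the "relative contrast" idea of Beyer et al.): as d grows, the nearest and farthest pairs become far more comparable. Here's a quick sketch using the same blob generator, with fewer points to keep pdist cheap.
In [ ]:
# single-number summary of the histograms above: how large is the spread of
# pairwise distances relative to the smallest one?
for d in dim_range:
    X_tmp, _ = make_blobs(n_samples=200, n_features=d, centers=blob_count)
    dists = spd.pdist(X_tmp)
    print('d={:>5}: (max - min) / min = {:.1f}'.format(d, (dists.max() - dists.min()) / dists.min()))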
In [ ]:
small_corpus = [
'The dog likes cats.',
'The blue cat eats brown sharks.',
'Why not, blue?'
]
In [ ]:
vec = CountVectorizer()
X = vec.fit_transform(small_corpus)
X.todense()
In [ ]:
vec.vocabulary_
It's good to remember how to map the matrix-like data onto the words that go into it...
In [ ]:
terms = [x for x,_ in sorted(vec.vocabulary_.items(), key=operator.itemgetter(1))]
text_df = pd.DataFrame(X.todense(), columns=terms)
text_df
In [ ]:
text_df['r'] = text_df.apply(radius, axis=1)
text_df
In [ ]:
kde_cdf_plot(text_df, norm=True)
With a tiny little corpus, these plots aren't very useful. Let's use a bigger one: this text file (not included in the repo, sorry visitors!) is about a 10-minute, 10% sample of Tweet body text from the Firehose. It has a little under 400,000 Tweets.
In [ ]:
text_array = []
with open('twitter_2016-04-06_2030.jsonl.body.txt', 'r') as infile:
for line in infile:
text_array.append(line.replace('\n', ' '))
print( len(text_array) )
print( text_array[0] )
In [ ]:
vec = CountVectorizer(
#binary=1,
## add dimensionality reduction?
#stop_words='english',
#lowercase=True,
#min_df=10
)
dtm = vec.fit_transform(text_array)
dtm
In [ ]:
# what fraction of the feature space is full?
3051924 / ( 374941*523498 )
We have to do the radius math slightly differently now, because we're dealing with a scipy CSR matrix instead of a dense numpy array.
In [ ]:
# radius per document: square element-wise, sum each row, flatten the (n,1)
# matrix to a 1-d array (.A1), then take the square root
dtm_r = dtm.multiply(dtm).sum(axis=1).A1**0.5
#print(len(dtm_r))
#print(dtm_r)
#print(min(dtm_r), np.median(dtm_r), max(dtm_r))
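As an optional sanity check on that math, recent versions of scipy ship a sparse-aware norm that computes the same row-wise euclidean norms directly (this assumes a scipy new enough to provide scipy.sparse.linalg.norm with an axis argument):
In [ ]:
# verify the CSR radius math against scipy's sparse-aware norm
from scipy.sparse.linalg import norm as sparse_norm
dtm_r_check = sparse_norm(dtm, axis=1)
print(np.allclose(dtm_r, dtm_r_check))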
In [ ]:
s = pd.Series(dtm_r)
plt.figure(figsize=(15,6))
s.plot(kind='hist',
cumulative=True,
normed=1,
bins=len(dtm_r),
histtype='step',
linewidth=2
)
plt.ylabel('CDF')
#plt.xlim(0, .99*max(dtm_r))
plt.xlim(0, 6)
plt.xlabel('radial distance')
In [ ]:
# This is a super interesting side note: some tweets can totally throw off your distribution.
# This one Tweet had 114 repetitions of a single character. If you swap the xlim() commands
# above, you'll see that r extends to over 100. This is why:
#text_array[ s[ s > 114 ].index[0] ]
<record-stopping screeching noise>
Ok, so I spent some time working with this data, and I'll be honest: I expected this distribution to be much more skewed to large r! In fact, I thought it would be more exaggerated than the blob examples above.
Since I didn't have enough time to dig any deeper for this session, let's keep this observation in the back of our minds, and come back to it in another session.
We can round out today's discussion with one more relevant topic...
Before we end this session, we'll consider one more facet of high-dimensional spaces: reducing them to lower dimension. For now, we'll illustrate the effect of principal component analysis using the same inspection techniques we've been using all along.
If we try to densify the 500k+ dimension document term matrix above, we'll run out of RAM. So, let's use a synthetic data set.
First, we look at our metrics in 10,000 dimensions, then after PCA to bring them down to 3.
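(As an aside, if you did want to reduce the sparse document-term matrix directly, scikit-learn's TruncatedSVD accepts CSR input without densifying it. Here's a minimal sketch, assuming the dtm from the Twitter cells above is still in memory.)
In [ ]:
# aside: TruncatedSVD (often called LSA in text settings) reduces a sparse
# matrix directly, with no densifying; assumes `dtm` from above is in memory
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=3)
dtm_3d = svd.fit_transform(dtm)
print(dtm_3d.shape)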
In [ ]:
n = 2000
dims = 10000
blobs = 10
X, X_df, X_df_no_r, y = make_blob_df(n, dims, blobs)
#X_df_no_r.head()
In [ ]:
kde_cdf_plot(X_df, norm=True)
plt.xlim(0,1)
In [ ]:
in_sample_dist(X,n)
Now, we know that the data generation process built in the notion of identifiable clusters. Let's see if we can surface that information by projecting our high-dimensional data down into a smaller number of dimensions using principal component analysis.
In [ ]:
# now apply PCA and reduce the dimension down to 3
pca = PCA(n_components=3)
X_df_3d = pd.DataFrame(pca.fit_transform(X_df_no_r), columns=['x0','x1','x2'])
In [ ]:
# add in that radial distance column
X_df_3d['r'] = X_df_3d.apply(radius, axis=1)
X_df_3d.head()
In [ ]:
# add in the labels so we can color by them
X_df_3d['y'] = y
In [ ]:
# nb: using the vars kwarg seems to remove the ability to include KDE
sns.pairplot(X_df_3d,
vars=['x0','x1','x2'],
hue='y',
palette="husl",
diag_kind='kde',
plot_kws=dict(s=50, alpha=0.7)
)
In [ ]:
kde_cdf_plot(X_df_3d, norm=True)
In [ ]:
#in_sample_dist(X_df_3d[['x0','x1','x2']],n)
Given the two plots just above, it seems like we've both done a good job of representing the underlying clusters in our lower-dimensional space, and moved the data away from the extreme edges of the feature space. We should expect that our algorithms can both run more efficiently (faster) and achieve a higher level of significance.
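One quick, quantitative way to back up that impression is to ask how much of the original variance the three retained components capture, via the fitted PCA's explained_variance_ratio_ attribute (this assumes the pca object from a few cells above is still around):
In [ ]:
# how much of the total variance did the 3 retained components capture?
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())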
In future installments, I look forward to:
In [ ]: