Suppose we have a $p$-dimensional dataset containing $N$ observations that we've normalized so that all points are contained in the $p$-dimensional unit cube. Fix a target data point $x_0$, and say we send out a hypercubical neighborhood $C(x_0)$ about $x_0$ to capture a fraction $r$ of the observations. In other words, we want to consider a cubical neighborhood of $x_0$ with volume equal to $r$, where $0 < r < 1$ (see image below for an example with $x_0 = 0$).
The volume of a $p$-dimensional cube with side-length $l$ is $l^p$, so we will have to choose the side-length of our smaller cube $C(x_0)$ to be $r^\frac{1}{p}$ for its volume to equal $r$.
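To see how quickly this bites, here is a small numerical sketch (the fraction $r = 0.1$ and the dimensions shown are illustrative choices, not from the text):

```python
# Edge length r**(1/p) a sub-cube of the unit cube needs in order to have
# volume r (and hence, under a uniform distribution, to capture a fraction r
# of the data). The value of r and the dimensions are illustrative choices.
r = 0.1
for p in [1, 2, 10, 100]:
    edge = r ** (1 / p)
    print(f"p = {p:>3}: edge length needed to capture {r:.0%} of the data = {edge:.3f}")
```

Already in 10 dimensions the neighborhood must span most of each coordinate's range to capture just 10% of the data, so it is hardly "local" anymore.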
If we further assume that the points in our dataset are uniformly distributed, we can compute the median distance from the origin to the closest data point as
$$ d(p, N) = \left( 1 - \frac{1}{2^{\frac{1}{N}}}\right)^{\frac{1}{p}}. $$

(This formula is derived for points uniformly distributed in the unit hypersphere; an analogous result follows for the hypercube.)
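As a sanity check, here is a quick simulation sketch: it samples $N$ points uniformly from the $p$-dimensional unit ball (the setting the formula describes), records the distance from the origin to the nearest point, and compares the median over many repetitions with $d(p, N)$. The sampling recipe (Gaussian direction, radius $\sim U^{1/p}$) and the specific values of $p$, $N$, and the number of repetitions are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_median_nearest(p, N, trials=500):
    """Median (over trials) of the distance from the origin to the nearest of
    N points drawn uniformly from the p-dimensional unit ball."""
    g = rng.standard_normal((trials, N, p))
    directions = g / np.linalg.norm(g, axis=2, keepdims=True)
    radii = rng.uniform(size=(trials, N, 1)) ** (1 / p)   # uniform in the ball
    points = directions * radii
    return np.median(np.linalg.norm(points, axis=2).min(axis=1))

def d(p, N):
    return (1 - 0.5 ** (1 / N)) ** (1 / p)

N = 500
for p in [2, 5, 10]:
    print(f"p = {p:>2}, N = {N}: simulated {simulated_median_nearest(p, N):.3f}, formula {d(p, N):.3f}")
```

The takeaway is the same as in the formula: as $p$ grows, even the nearest neighbor sits closer to the boundary of the ball than to the origin.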
Suppose $Y$ and $Z$ are $p$-dimensional random vectors where each entry has been drawn from the standard normal distribution $\mathcal{N}(0, 1)$. In other words
$$ Y_i \sim \mathcal{N}(0, 1) \quad \text{and} \quad Z_i \sim \mathcal{N}(0, 1) \quad \forall i. $$

It turns out that as $p \to \infty$, the dot product of the normalized vectors $\widehat{Y} = Y / \|Y\|$ and $\widehat{Z} = Z / \|Z\|$ tends to zero. In other words, $\widehat{Y}$ and $\widehat{Z}$ become approximately orthogonal in high dimension. Our next goal is to verify this empirically.
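Here is one way to run that check (a minimal sketch; the dimensions and the number of random pairs are arbitrary choices): draw many independent pairs $(Y, Z)$, normalize them, and watch the typical dot product shrink as $p$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 200   # number of (Y, Z) pairs per dimension

for p in [2, 10, 100, 1000, 10000]:
    Y = rng.standard_normal((trials, p))
    Z = rng.standard_normal((trials, p))
    # Dot products of the normalized vectors Y/||Y|| and Z/||Z||
    cos = np.sum(Y * Z, axis=1) / (np.linalg.norm(Y, axis=1) * np.linalg.norm(Z, axis=1))
    print(f"p = {p:>5}: mean |Y_hat . Z_hat| = {np.abs(cos).mean():.4f}")
```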
As we saw above, high dimensional data can cause quite a few problems. We'll now explore a few ways of reducing dimensionality.
Suppose again that we have a $p$-dimensional dataset for large $p$, and we want to reduce the number of dimensions to $n$ for $n \ll p$. One way of approaching this problem is via random projections. A random projection $M$ in this case would be a $p \times n$ matrix whose entries are chosen at random. If $u \in \mathbb{R}^p$ is a point in our dataset, we compute its projection $\widehat{u} \in \mathbb{R}^n$ with the following matrix multiplication:
$$ \widehat{u} = u^T \cdot M. $$

However, we don't want to choose just any random matrix $M$. Ideally, we would like our projection to preserve distances between points in our dataset. We can do this by choosing the matrix $M$ to be one of the following kinds:
A Gaussian random projection is given by drawing the entries of $M$ from the normal distribution $\mathcal{N}(0, 1/n)$ (recall $n$ is the dimension of the target space, also referred to as the number of components of the projection).
An Achlioptas random projection is given by drawing the entries $M_{ij}$ of $M$ according to

$$ M_{ij} = \begin{cases} +\sqrt{3/n} & \text{with probability } 1/6, \\ 0 & \text{with probability } 2/3, \\ -\sqrt{3/n} & \text{with probability } 1/6. \end{cases} $$

Both constructions are sketched in code below.
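As a rough sketch of both constructions (the dimensions $p$ and $n$ below are illustrative, and the $\pm\sqrt{3/n}$ scaling for the Achlioptas matrix is chosen so that the entries have variance $1/n$, matching the Gaussian case):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5000, 300   # original and target dimensions (illustrative)

# Gaussian random projection: entries drawn i.i.d. from N(0, 1/n)
M_gauss = rng.normal(0.0, np.sqrt(1 / n), size=(p, n))

# Achlioptas random projection: entries +sqrt(3/n), 0, -sqrt(3/n)
# with probabilities 1/6, 2/3, 1/6
M_achl = rng.choice([np.sqrt(3 / n), 0.0, -np.sqrt(3 / n)],
                    size=(p, n), p=[1 / 6, 2 / 3, 1 / 6])

u = rng.standard_normal(p)            # a point in the original space
print(np.linalg.norm(u),              # the original norm ...
      np.linalg.norm(u @ M_gauss),    # ... is roughly preserved by either
      np.linalg.norm(u @ M_achl))     #     projection u^T M
```

The Achlioptas matrix is mostly zeros, which makes the projection cheaper to compute and store than the dense Gaussian one.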
If $u, v \in \mathbb{R}^p$ are points in our dataset, then projections like those above satisfy, with high probability, bounds of the form

$$ (1 - \epsilon) \| u - v \|^2 \leq \| u^T M - v^T M \|^2 \leq (1 + \epsilon) \|u - v\|^2. $$

This is the content of the Johnson–Lindenstrauss lemma. The relationship between the dimension $n$ of the target space, the number of samples in our dataset, and $\epsilon$ is given by
$$ n \geq \frac{4\log(\text{number of samples})}{\epsilon^2/2 - \epsilon^3 /3}. $$

Typically the number of samples is given, so we simply compute the smallest admissible $n$ as a function of $\epsilon$.
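A minimal end-to-end check, assuming scikit-learn is available (its `johnson_lindenstrauss_min_dim` helper computes the bound above, and `GaussianRandomProjection` builds the matrix $M$; the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5000))   # 300 samples in p = 5000 dimensions
eps = 0.2

# Smallest n with n >= 4 log(number of samples) / (eps^2/2 - eps^3/3)
n = int(johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps))
print("target dimension n =", n)

X_proj = GaussianRandomProjection(n_components=n, random_state=0).fit_transform(X)

# Compare one pairwise squared distance before and after projecting
d_orig = np.sum((X[0] - X[1]) ** 2)
d_proj = np.sum((X_proj[0] - X_proj[1]) ** 2)
print(f"ratio = {d_proj / d_orig:.3f}  (should, with high probability, lie in [{1 - eps}, {1 + eps}])")
```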
In [ ]: