In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
"At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily, the most important factor is the features used. If you have many independent features that each correlate well with the class, learning is easy. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition and creativity are as important as the technical stuff."
-- Pedro Domingos, A Few Useful Things to Know about Machine Learning
"High dimensional datasets are at the risk of being very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has [i.e., the more attributes], the greater the risk of overfitting it."
-- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, p.207
In [2]:
# Average distance and spread (max - min) of distances between pairs of randomly chosen points in a unit hypercube
import numpy as np

def dist(n_points, n_dims):
    # Generate a random set of n_points to fill the hypercube
    rows1 = np.random.rand(n_points, n_dims)
    # Generate another random set of n_points to fill the hypercube
    rows2 = np.random.rand(n_points, n_dims)
    # Euclidean distance between each corresponding pair of points
    distances = [np.linalg.norm(rows1[i] - rows2[i]) for i in range(len(rows1))]
    # Return the average distance and the spread (max - min) of the distances
    return [np.average(distances), (np.max(distances) - np.min(distances))]
In [3]:
# Try large numbers with caution
dims = [10, 50, 100, 500, 1000, 5000, 10000]
%time d = [dist(10000,x) for x in dims]
d
Out[3]:
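As a sanity check (an addition here, not part of the original analysis): for two points drawn uniformly at random in a unit hypercube, each coordinate contributes an expected squared difference of $1/6$, so the average distance grows roughly like $\sqrt{n/6}$. A minimal sketch comparing that approximation with the simulated averages in d:
In [ ]:
# Sanity-check sketch (not part of the original analysis): for two independent
# uniform points in the unit hypercube, E[(x_i - y_i)^2] = 1/6 per coordinate,
# so the average distance is roughly sqrt(n_dims / 6) for large n_dims.
for n, (avg, spread) in zip(dims, d):
    print(f"n_dims = {n:>5}   simulated average = {avg:8.2f}   sqrt(n/6) = {np.sqrt(n / 6):8.2f}")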
In [4]:
avg_dists = [pair[0] for pair in d]    # average distance for each dimensionality
dist_ranges = [pair[1] for pair in d]  # max - min spread for each dimensionality
In [5]:
avg_dists
Out[5]:
In [6]:
dist_ranges
Out[6]:
In [7]:
# Average distance of any two randomly chosen points as a function of the number of features
fig = plt.figure(1, figsize=(12, 8))
# Create an axes instance
ax = fig.add_subplot(111)
plt.title('Average Distance Between Any 2 Randomly Chosen Points in an N-Dimensional Hypercube')
plt.xlabel('Number of Features')
plt.ylabel('Average Distance')
# Create the plot
plt.plot(dims, avg_dists, marker='o');
In [8]:
# Spread (max - min) of the distances between any two randomly chosen points as a function of the number of features
fig = plt.figure(1, figsize=(9, 6))
# Create an axes instance
ax = fig.add_subplot(111)
plt.title('Max - Min Distance Between Any 2 Randomly Chosen Points in an N-Dimensional Hypercube')
plt.xlabel('Number of Features')
plt.ylabel('Max - Min Distance')
# Create the plot
plt.plot(dims, dist_ranges, marker='o', color='r');
"Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about $10^{-18}$ of the input space. This is what makes machine learning both necessary and hard."
-- Pedro Domingos, A Few Useful Things to Know about Machine Learning
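A quick back-of-the-envelope check of that $10^{-18}$ figure (an illustration added here, assuming the 100 attributes are boolean, as in Domingos's example):
In [ ]:
# Illustration (assumes 100 boolean attributes, as in Domingos's example):
# the input space then contains 2**100 possible examples, and a trillion
# training examples cover only a vanishing fraction of it.
input_space_size = 2 ** 100      # ~1.3e30 possible examples
training_set_size = 10 ** 12     # one trillion examples
print(training_set_size / input_space_size)   # ~8e-19, i.e. on the order of 1e-18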
The idea: Find the axis (or more generally, the hyperplane) that captures the largest amount of variance. Then find the axis, orthogonal to the first, that captures the second-largest amount of variance, and so on. These axes are called principal components.
There are as many principal components as there are features in the dataset.
(Image from Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, p.212)
In a space of $n$ dimensions, PCA will find $n$ principal components. These principal components form another set of axes for the dataset, chosen so that the data, when projected onto each axis in turn, retains as much variance as possible.
Variance is, strictly speaking, the statistical term -- but for intuition, we can think of it as the spread (roughly, the difference between the maximum and minimum values) of the data points projected onto a principal component axis.
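As a concrete sketch (an illustrative aside using scikit-learn, which isn't used elsewhere in this notebook): fit PCA to a small correlated 2-D dataset and inspect the component directions and the fraction of variance each one captures.
In [ ]:
# Illustrative sketch (aside, using scikit-learn): fit PCA to correlated 2-D data
# and look at the principal component directions and explained variance.
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
x = rng.normal(size=500)
y = 0.6 * x + 0.2 * rng.normal(size=500)   # y is strongly correlated with x
X = np.column_stack([x, y])

pca = PCA(n_components=2)
pca.fit(X)
print("Principal component directions:\n", pca.components_)
print("Fraction of variance captured by each component:", pca.explained_variance_ratio_)

# Keeping only the first principal component reduces the data to 1 dimension
X_reduced = PCA(n_components=1).fit_transform(X)
print("Reduced shape:", X_reduced.shape)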
Let's see how it's visualized in http://setosa.io/ev/principal-component-analysis/
Caution: It's not always the case that reducing dimensions simplifies a problem.
(Image from Aurélien Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, p.211)
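A rough sketch of that caution (my own example, not the book's code): for Swiss-roll-like data, a linear projection onto the top two principal components cannot "unroll" the manifold, so points that are far apart along the roll can land close together in the reduced representation.
In [ ]:
# Rough sketch (not from the book): project a Swiss roll onto its first two
# principal components. A linear projection cannot unroll the manifold, so
# points far apart along the roll can end up close together in 2-D.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA

X, position_along_roll = make_swiss_roll(n_samples=1000, noise=0.1, random_state=42)
X_2d = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(9, 6))
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=position_along_roll, cmap='viridis', s=10)
plt.title('Swiss Roll Projected onto Its First 2 Principal Components')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component');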
(For a simple introduction to the manifold hypothesis, see https://nbviewer.jupyter.org/github/jsub10/In-Progress/blob/master/How-is-Learning-Possible%3F.ipynb)
Let's see how PCA and random forests look in Orange...
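(For readers without Orange handy, here is a rough scripted parallel -- my sketch, not an Orange workflow -- chaining PCA into a random forest with scikit-learn on a built-in toy dataset.)
In [ ]:
# Rough scripted parallel to the Orange workflow (a sketch, not Orange itself):
# reduce the digits dataset with PCA, then classify with a random forest.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

digits = load_digits()
# Keep enough components to explain ~95% of the variance, then classify
pipeline = make_pipeline(PCA(n_components=0.95),
                         RandomForestClassifier(n_estimators=100, random_state=42))
scores = cross_val_score(pipeline, digits.data, digits.target, cv=5)
print("Cross-validated accuracy:", scores.mean())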
In [ ]: