Suppose we didn't know how many different species there were in the Iris dataset. How can we approximately infer this information from the data?
One possible solution would be to plot the data as a scatter plot and visually identify distinct groups. The Iris dataset, however, has four features, so it can't be plotted directly; at best we can look at pairs of features at a time (a pairwise view is sketched after the data is loaded below).
To visualize the complete dataset as a single 2D scatterplot, we can use dimensionality reduction techniques to project the data down to two dimensions without losing too much structural information.
In [1]:
import pandas as pd
iris = pd.read_csv('../datasets/iris_without_classes.csv')  # Load the Iris measurements with the class column removed
In [2]:
# Print the first entries using the head() method to check that there is no Class information anymore
iris.head()
Out[2]:
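As a quick way to act on the pairs-of-features idea mentioned above, pandas ships a scatter_matrix helper that draws one scatter plot per pair of columns. A minimal sketch:
# Sketch: pairwise scatter plots of the four features
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

scatter_matrix(iris, figsize=(8, 8), diagonal='hist')  # histograms on the diagonal
plt.show()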
We'll use scikit-learn's PCA algorithm to reduce the number of dimensions to two in our dataset.
In [3]:
# Use PCA's fit_transform() method to reduce the dataset to two dimensions
from sklearn.decomposition import PCA
RANDOM_STATE = 1234
pca = PCA(n_components=2, random_state=RANDOM_STATE)  # Create a PCA object with two components
iris_2d = pca.fit_transform(iris)  # Fit the PCA model and project the original data into two dimensions
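To quantify "without losing too much structural information", the fitted PCA object exposes explained_variance_ratio_, the fraction of the total variance captured by each component. A quick check:
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)        # per-component ratios
print(pca.explained_variance_ratio_.sum())  # total variance retained in 2D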
In [4]:
# Create a scatterplot of the two dimensions of the reduced dataset
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(iris_2d[:, 0], iris_2d[:, 1])
# Show the scatterplot
plt.show()
How many distinct groups can you see?
The problem presented above can be framed as a clustering problem. Clustering involves finding groups of examples that are similar to other examples in the same group but different from examples that belong to other groups.
In this example, we'll use scikit-learn's KMeans algorithm to find clusters in our data.
One limitation of KMeans is that it takes the expected number of clusters as an input, so you must either have some domain knowledge to guess a reasonable number of groups or try different values for the number of clusters and see which one works best (one common way to do this, the elbow method, is sketched after the next cell).
In [5]:
# Create two KMeans models: one with two clusters and another with three clusters
# Store the labels predicted by the KMeans models using two and three clusters
from sklearn.cluster import KMeans
model2 = KMeans(n_clusters=2, random_state=RANDOM_STATE).fit(iris) # Create a KMeans model expecting two clusters
labels2 = model2.predict(iris) # Predict the cluster label for each data point using predict()
model3 = KMeans(n_clusters=3, random_state=RANDOM_STATE).fit(iris) # Create a KMeans model expecting three clusters
labels3 = model3.predict(iris) # Predict the cluster label for each data point using predict()
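The elbow method mentioned above fits KMeans for a range of cluster counts and looks for the point where the within-cluster sum of squares (KMeans' inertia_ attribute) stops dropping sharply. A minimal sketch:
# Sketch: elbow method - fit KMeans for several cluster counts and plot inertia
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 8)
inertias = [KMeans(n_clusters=k, random_state=RANDOM_STATE).fit(iris).inertia_ for k in ks]
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()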
In [6]:
# Plot the two-cluster assignment on the reduced dataset, using a different color for each cluster
plt.scatter(iris_2d[labels2 == 0, 0], iris_2d[labels2 == 0, 1], color='red')
plt.scatter(iris_2d[labels2 == 1, 0], iris_2d[labels2 == 1, 1], color='blue')
# Show the scatterplot
plt.show()
In [7]:
# Plot the three-cluster assignment on the reduced dataset, using a different color for each cluster
plt.scatter(iris_2d[labels3 == 0, 0], iris_2d[labels3 == 0, 1], color='red')
plt.scatter(iris_2d[labels3 == 1, 0], iris_2d[labels3 == 1, 1], color='blue')
plt.scatter(iris_2d[labels3 == 2, 0], iris_2d[labels3 == 2, 1], color='green')
# Show the scatterplot
plt.show()
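As an aside, matplotlib can also map an array of integer labels to colors directly via the c argument, which avoids one scatter call per cluster:
# Compact alternative: color points by their cluster label in a single call
plt.scatter(iris_2d[:, 0], iris_2d[:, 1], c=labels3)
plt.show()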
Techniques such as silhouette analysis can automatically suggest a suitable number of clusters for a dataset; the scikit-learn documentation includes a worked example of silhouette analysis with KMeans.
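As a minimal sketch of that idea, scikit-learn's silhouette_score computes the average silhouette coefficient for a clustering (higher is better), so candidate cluster counts can be compared directly:
# Compare average silhouette scores for the two clusterings computed above
from sklearn.metrics import silhouette_score

print('k=2:', silhouette_score(iris, labels2))
print('k=3:', silhouette_score(iris, labels3))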