Introduction to Python for Data Sciences
Franck Iutzeler, Fall 2018
Outline
a) Clustering
b) Dimension reduction
c) Exercises
Clustering is the task of grouping data points into a given number of classes without using labels. The K-Means algorithm is one of the best-known methods: it clusters the data by minimizing the squared distance of the points in each cluster to the cluster mean.
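Before turning to Scikit-Learn, here is a minimal NumPy sketch of the two alternating steps of the algorithm (assign each point to the nearest mean, then recompute each mean); the function kmeans_steps and its random initialization are illustrative choices, not Scikit-Learn's implementation.
In [ ]:
import numpy as np

def kmeans_steps(X, k, n_iter=10, seed=0):
    # illustrative K-Means loop (not scikit-learn's code)
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # random initial centers
    for _ in range(n_iter):
        # assignment step: index of the closest center for each point
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers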
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
#import seaborn as sns
#sns.set()
In [2]:
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1])
Out[2]:
As before, we proceed by selecting a KMeans model and fitting it to the data.
In [3]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
Out[3]:
From the fitted model, one can get the label of each data point with the attribute labels_ and the cluster centers with cluster_centers_.
In [4]:
print(kmeans.labels_)
print(kmeans.cluster_centers_)
In [5]:
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='r', s=100, marker='*')
Out[5]:
The different clusters have visibly been recovered. Note that from the cluster centers one can define Voronoi regions (the set of points closer to a given center than to any other); these are exactly the regions over which K-Means predicts each label.
In [6]:
from scipy.spatial import Voronoi, voronoi_plot_2d
vor = Voronoi(kmeans.cluster_centers_)
voronoi_plot_2d(vor)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='r', s=100, marker='*')
plt.show()
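To check this correspondence numerically, we can compare the predictions of the fitted model on a few new points with a direct nearest-center computation; the test points below are just an illustration.
In [ ]:
new_points = np.random.RandomState(2).uniform(X.min(0), X.max(0), size=(5, 2))   # a few new points in the data range
print(kmeans.predict(new_points))                                                # labels predicted by the model
dists = ((new_points[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(-1)
print(dists.argmin(axis=1))                                                      # nearest-center rule gives the same labels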
Reducing the dimension of the features is important both for learning on a more compact representation and for visualization; many such methods are implemented in Scikit-Learn's decomposition module (sklearn.decomposition).
One of the most standard methods is Principal Component Analysis (PCA), which projects the feature matrix onto the subspace spanned by its top $n$ singular vectors (this was used for image compression in the NumPy notebook).
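As a sanity check of this description, the projection can be written directly from the SVD of the centered feature matrix with NumPy; the function pca_project below is only a sketch of the underlying linear algebra, not Scikit-Learn's implementation (the signs of the directions may differ).
In [ ]:
import numpy as np

def pca_project(X, n):
    # project the centered data onto its top n right singular vectors
    Xc = X - X.mean(axis=0)                           # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n].T                              # coordinates along the n principal directions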
In [7]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
In [8]:
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal');
In [9]:
from sklearn.decomposition import PCA
pca = PCA(2)
pca.fit(X)
Out[9]:
The fitted PCA model exposes components_, the top singular vectors (principal directions), and explained_variance_, the amount of variance captured along each of them (which grows with the corresponding singular value).
In [10]:
print(pca.components_)
print(pca.explained_variance_)
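To choose how many components to keep, it is common to look at explained_variance_ratio_, the fraction of the total variance carried by each component; a minimal check on the fitted model:
In [ ]:
ratio = pca.explained_variance_ratio_    # fraction of the variance per component
print(ratio, ratio.cumsum())             # cumulative fraction retained with the first components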
This illustration comes from the Python Data Science Handbook by Jake VanderPlas. The longer axis (larger explained variance) is the more informative direction; for a reduction to one dimension, the second component would be dropped.
In [11]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->',
                      linewidth=2,
                      shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');
In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
iris = pd.read_csv('data/iris.csv')
classes = pd.DataFrame(iris["species"])
features = iris.drop(["species"], axis=1)
lenc = LabelEncoder()
num_classes = np.array(classes.apply(lenc.fit_transform))
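If the file data/iris.csv is not at hand, a similar features/classes pair can be built from the copy of the dataset shipped with Scikit-Learn; this alternative setup (the names features_alt and num_classes_alt are just for illustration) uses the same column names as the CSV.
In [ ]:
from sklearn.datasets import load_iris
iris_sk = load_iris()
features_alt = pd.DataFrame(iris_sk.data,
                            columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
num_classes_alt = iris_sk.target.reshape(-1, 1)    # integer labels, same shape as num_classes above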
In [13]:
features.head()
Out[13]:
In [14]:
pca = PCA(2)
pca.fit(features)
Out[14]:
In [15]:
reduction = pd.DataFrame(pca.components_)
reduction.columns = ["sepal_length","sepal_width","petal_length","petal_width"]
reduction["---- Variance ----"] = pca.explained_variance_
reduction.index = ["vec. 1", "vec. 2"]
reduction
Out[15]:
We notice that the first vector dominates the variance and combines the 4 features (sepal_width seems to matter the least).
We can now project the data onto these two vectors and plot the result to see if the classes are recognizable in this reduced space.
In [16]:
projected = pca.transform(features)
plt.scatter(projected[:, 0], projected[:, 1], c=num_classes)
plt.xlabel('vec. 1')
plt.ylabel('vec. 2')
Out[16]:
We see that the classes are much more separable in this projected space than with only 2 of the original features (see before). Moreover, the first vector can almost be used alone to separate them.
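To see how far the first vector alone goes, one can look at the distribution of the first coordinate of the projection for each class; this is just a quick inspection (the histogram bins are arbitrary).
In [ ]:
proj1 = projected[:, 0]                                       # coordinate along the first principal vector
for lab, name in enumerate(np.unique(classes["species"])):    # LabelEncoder encodes classes in sorted order
    plt.hist(proj1[num_classes.ravel() == lab], alpha=0.5, label=name)
plt.legend()
plt.xlabel('vec. 1')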
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.cm as cm
#### IMAGE
img = mpimg.imread('img/flower.png')
img_gray = 0.2989 * img[:,:,0] + 0.5870 * img[:,:,1] + 0.1140 * img[:,:,2] # standard luma coefficients (ITU-R BT.601) for converting RGB to grayscale
####
print(img_gray.shape)
plt.figure()
plt.xticks([]); plt.yticks([])
plt.title("Original")
plt.imshow(img_gray, cmap = cm.Greys_r)
plt.show()
In [ ]:
pixels = img_gray.flatten().reshape(-1, 1)
pixels.shape
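One possible direction for this exercise (a sketch, not the expected solution): cluster the grayscale intensities with K-Means and replace each pixel by its cluster center, which quantizes the image to a few gray levels; n_clusters=4 below is an arbitrary choice.
In [ ]:
from sklearn.cluster import KMeans
km_img = KMeans(n_clusters=4, random_state=0).fit(pixels)                    # cluster the pixel intensities
quantized = km_img.cluster_centers_[km_img.labels_].reshape(img_gray.shape)  # each pixel -> its cluster's gray level
plt.figure()
plt.xticks([]); plt.yticks([])
plt.title("K-Means quantized")
plt.imshow(quantized, cmap=cm.Greys_r)
plt.show()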
In [ ]:
Package check and Styling
In [ ]:
import lib.notebook_setting as nbs
packageList = ['IPython', 'numpy', 'scipy', 'matplotlib', 'cvxopt', 'pandas', 'seaborn', 'sklearn', 'tensorflow']
nbs.packageCheck(packageList)
nbs.cssStyling()