by Karen Belita
Clustering is a type of unsupervised machine learning that can find relationships in unlabeled data.
This notebook shows how to get and prepare data for exploring clustering methods, using scikit-learn for the machine learning steps.
Zillow publishes real estate data at different geographic levels. This notebook explores some clustering methods with the Zillow Home Value Index (ZHVI) data set at the City level.
The data can be found here.
In [1]:
import os
import urllib.request
import warnings

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import MinMaxScaler, scale

matplotlib.style.use('ggplot')
%matplotlib inline
In [2]:
url = "http://files.zillowstatic.com/research/public/City/City_Zhvi_Summary_AllHomes.csv"
def csv_download():
    path = os.getcwd()  # current working directory
    file_name = os.path.join(path, "ZHVICity.csv")
    if not os.path.isfile(file_name):  # skip the download if the file already exists
        f = urllib.request.urlopen(url)
        data = f.read()
        with open(file_name, "wb") as f:
            f.write(data)
    # return file_name to print the download location

csv_download()
In [3]:
file_name = os.path.join(os.getcwd(), "ZHVICity.csv")
df = pd.read_csv(file_name)
Look at the structure of the data.
In [4]:
df.head()
Out[4]:
Rename the column "RegionName" to "City", and create a column that combines each city's name with its state for readability.
Also remove irrelevant columns.
In [5]:
# rename
df.rename(columns={"RegionName": "City"}, inplace=True)
# add a new column combining city and state
df['City-State'] = df['City'] + "-" + df['State']
df.head(1)
# move the new column to the front
cols = df.columns.tolist()
cols.insert(0, cols.pop(cols.index('City-State')))  # move to position 0
df = df.reindex(columns=cols)
# drop irrelevant columns
df.drop(df.columns[[0, 1, 6, 18]], axis=1, inplace=True)
df.head(2)
Out[5]:
Pandas can also help describe the data for analysis, for example by summarizing it by state (average ZHVI per state).
In [6]:
statedf = df.groupby("State")["Zhvi"].mean().sort_values(ascending = False)
statedf.head()
Out[6]:
Select the columns that will be used as features for machine learning.
In [7]:
featcol = ['Zhvi', 'MoM', 'QoQ', 'YoY', '5Year', '10Year', 'PeakZHVI', 'PctFallFromPeak']
x = df[featcol]
One way to deal with missing values is to remove the rows that contain them...
In [8]:
# check the number of rows
print("original number of rows: %d" % len(x.index))
# remove rows with missing values
x1 = x.dropna()
print("new number of rows: %d" % len(x1.index))
Another is to impute missing values with the interpolate function from pandas.
In [9]:
# make sure all feature columns are numeric, coercing anything unparseable to NaN
x = x.apply(pd.to_numeric, errors='coerce')
# interpolate fills each NaN from the surrounding values (linear by default)
x = x.interpolate()
x.isnull().any().any()  # check whether any missing data remains
Out[9]:
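To make the imputation concrete, here is a minimal sketch on a toy Series (hypothetical values, not the ZHVI data) showing how interpolate fills gaps linearly from their neighbors:
s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])
s.interpolate()  # fills the NaNs with 2.0, 5.0, and 7.0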
Prepare the features by converting the DataFrame into a NumPy array.
In [10]:
# features into array
features = x.values
To see the variance of the features, scale them to a common range with MinMaxScaler (from scikit-learn) and visualize them with a boxplot (from Seaborn).
In [11]:
min_max_scaler = MinMaxScaler()
fmm = min_max_scaler.fit_transform(features)
fmX = pd.DataFrame(fmm)
ax = sns.boxplot(data=fmX)
ax
Out[11]:
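For reference, MinMaxScaler rescales each feature independently to [0, 1] via x' = (x - min) / (max - min), which is why features with very different units can share one plot. A minimal sketch on a toy column (hypothetical numbers):
demo = np.array([[1.0], [3.0], [5.0]])
MinMaxScaler().fit_transform(demo)  # returns [[0.0], [0.5], [1.0]]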
Clustering can assign labels to the unlabeled real estate data.
The scikit-learn documentation includes a summary of the parameters used by each clustering algorithm; read more about Clustering with Scikit-Learn.
Start with K-means since it's simple.
Read more about K-Means.
The main parameter that K-Means takes is the number of clusters, k.
Pick a number of clusters k, fit K-Means, and check its performance with the silhouette score, a metric that compares how close each object is to its own cluster versus the nearest other cluster. Scores closer to 1 are better.
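For intuition, the silhouette coefficient of a single sample is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points in the nearest other cluster. A minimal sketch on toy 2-D points (hypothetical data, not the ZHVI features):
toy = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
toy_labels = KMeans(n_clusters=2, n_init=10).fit_predict(toy)
metrics.silhouette_score(toy, toy_labels)  # near 1: tight, well-separated clusters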
In [12]:
k = 4
cluster = KMeans(init='k-means++', n_clusters=k, n_init=12)
cluster.fit(features)
metrics.silhouette_score(features, cluster.labels_)
Out[12]:
Model selection can be done by looping through a range of k. The loop can also cover different clustering methods that take the number of clusters k as a parameter.
The loop below benchmarks K-Means along with the following clustering methods: MiniBatchKMeans and agglomerative clustering with ward, average, and complete linkage.
The result of the loop is a ranking of the silhouette scores for every method and k explored.
Pick a range of k to explore.
In [13]:
# adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
range_n_clusters = range(8, 11)
names, ks, scores = [], [], []

def bench_clustering(estimator, name, data, n_clusters):
    # fit the estimator and record its name, k, and silhouette score
    estimator.fit(data)
    names.append(name)
    ks.append(n_clusters)
    scores.append(metrics.silhouette_score(data, estimator.labels_))

for n_clusters in range_n_clusters:
    bench_clustering(KMeans(init='k-means++', n_clusters=n_clusters, n_init=12),
                     name="K-Means", data=features, n_clusters=n_clusters)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        bench_clustering(MiniBatchKMeans(init='k-means++', n_clusters=n_clusters, n_init=12,
                                         max_no_improvement=10, verbose=0, random_state=0),
                         name="MiniBatchKMeans", data=features, n_clusters=n_clusters)
    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='ward'),
                     name="Ward", data=features, n_clusters=n_clusters)
    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='average'),
                     name="Average", data=features, n_clusters=n_clusters)
    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='complete'),
                     name="Complete", data=features, n_clusters=n_clusters)

d = pd.DataFrame()
d['method'] = names
d['k'] = ks
d['silhouette_score'] = scores
d = d.sort_values(['silhouette_score'], ascending=False)
print(d)
Silhouette plot analysis can also aid model selection: it lets the relationship of objects within each cluster be assessed visually.
Pick a range of k to explore. (The example below uses K-Means, but it works with any of the clustering methods above that take k as a parameter.)
In [14]:
# adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py
range_n_clusters = range(8, 11)

for n_clusters in range_n_clusters:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1 to 1, but in this example
    # all values lie within [-0.1, 1].
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters + 1) * 10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(features) + (n_clusters + 1) * 10])

    # Initialize the clusterer with the n_clusters value.
    clusterer = KMeans(init='k-means++', n_clusters=n_clusters, n_init=12)
    cluster_labels = clusterer.fit_predict(features)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective on the density and separation of the formed clusters.
    silhouette_avg = silhouette_score(features, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette score for each sample.
    sample_silhouette_values = silhouette_samples(features, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them.
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle.
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot.
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line marks the average silhouette score of all the values.
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(features[:, 0], features[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at the cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
Visualizing Clusters - the example below shows the clusters and their centroids. Seeing the shape of the clusters and the location of the centroids can help with further analysis.
Pick a k to explore. (The example below uses K-Means, but it works with any of the clustering methods above that take k as a parameter.)
In [15]:
# adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py
n_clusters = 10
f_scaled = scale(features)
reduced_data = PCA(n_components=2).fit_transform(f_scaled)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max] x [y_min, y_max]

# Plot the decision boundary. For that, assign a color to each point in the mesh.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in the mesh. Use the last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the Zillow ZHVI dataset (PCA-reduced data)\n'
          'Centroids are marked with a white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
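Reducing eight features to two principal components discards some information; explained_variance_ratio_ reports how much variance each component retains. A quick check, reusing f_scaled from above:
pca = PCA(n_components=2).fit(f_scaled)
print(pca.explained_variance_ratio_.sum())  # fraction of the total variance kept in 2-D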
Try adding more features from Zillow's data.
There are also many other clustering methods in scikit-learn to explore here.
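For example, DBSCAN groups points by density and does not require choosing k up front. A minimal sketch (the eps and min_samples values are illustrative, not tuned for this data):
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=5).fit(f_scaled)
n_found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("clusters found:", n_found)  # label -1 marks noise points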