CLUSTERING WITH REAL ESTATE DATA

    by Karen Belita 

Clustering is a type of Unsupervised Machine Learning, which can determine relationships of unlabeled data.
This notebook will show how to get and prepare data for exploration of clustering methods.
This notebook will use scikit-learn for machine learning processes.

Data information

Zillow has real estate data at different geographic levels. This notebook will explore some clustering methods with the Zillow Home Value Index data set at the City level.

The data can be found here.

Dependencies


In [1]:
import pandas as pd
import csv
import os 
import numpy as np
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
#from pandas.tools.plotting import scatter_matrix
from __future__ import print_function
import urllib.request
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn import preprocessing
matplotlib.style.use('ggplot')
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans

from sklearn.preprocessing import MinMaxScaler, StandardScaler,RobustScaler, Normalizer
import seaborn as sns
import matplotlib.pyplot as plt
from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
%matplotlib inline 
import matplotlib.cm as cm
from sklearn.cluster import AgglomerativeClustering
import warnings
from sklearn.metrics import silhouette_samples, silhouette_score

Getting Data

Use urllib to download the csv file from the site.


In [2]:
url = "http://files.zillowstatic.com/research/public/City/City_Zhvi_Summary_AllHomes.csv"

def csv_download():
    path = os.getcwd() ## current location
    file_name = path + "/" + "ZHVICity.csv"
    
    if os.path.isfile(file_name): # makes sure that no file duplicates
        pass
    else:
        f = urllib.request.urlopen(url)
        data = f.read()
        with open(file_name, "wb") as f:
            f.write(data)
    # return file_name to print location

csv_download()

Preparing Data

Use pandas to prepare data for machine learning.


In [3]:
file_name = os.path.join(os.getcwd(), "ZHVICity.csv")

df = pd.read_csv(file_name)

Look at the structure of the data.


In [4]:
df.head()


Out[4]:
Date RegionID RegionName State Metro County SizeRank Zhvi MoM QoQ YoY 5Year 10Year PeakMonth PeakQuarter PeakZHVI PctFallFromPeak LastTimeAtCurrZHVI
0 2017-03-31 6181 New York NY New York Queens 0 650000 0.013408 0.032402 0.129060 0.070806 0.023315 2017-03 2017-Q1 650000 0.000000 2017-03
1 2017-03-31 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles 1 622900 0.007766 0.020144 0.087655 0.106821 0.005033 2017-03 2017-Q1 622900 0.000000 2017-03
2 2017-03-31 17426 Chicago IL Chicago Cook 2 220900 0.004091 0.026487 0.078613 0.058426 -0.011106 2007-01 2007-Q1 247300 -0.106753 2005-01
3 2017-03-31 13271 Philadelphia PA Philadelphia Philadelphia 3 136100 0.005913 0.021005 0.088800 0.039377 0.014545 2017-03 2017-Q1 136100 0.000000 2017-03
4 2017-03-31 40326 Phoenix AZ Phoenix Maricopa 4 208600 0.005786 0.019052 0.097317 0.136332 -0.014783 2006-07 2006-Q3 245700 -0.150997 2005-07

Rename the column "RegionName" to "City" and create a column that combines the name of city with its state for readability purposes.
Also remove irrelevant columns.


In [5]:
#rename
df.rename(columns={"RegionName" : "City"},
          inplace = True) 
# add new column
df['City-State'] = df['City'] + "-" + df['State']
df.head(1)
##move new column to the front
cols = df.columns.tolist() 
cols.insert(0, cols.pop(cols.index('City-State'))) #move to position 0
#drop columns
df = df.reindex(columns = cols)
df.drop(df.columns[[0,1,6,18]], axis = 1, inplace = True)
df.head(2)


Out[5]:
RegionID City State Metro SizeRank Zhvi MoM QoQ YoY 5Year 10Year PeakMonth PeakQuarter PeakZHVI PctFallFromPeak
0 6181 New York NY New York 0 650000 0.013408 0.032402 0.129060 0.070806 0.023315 2017-03 2017-Q1 650000 0.0
1 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim 1 622900 0.007766 0.020144 0.087655 0.106821 0.005033 2017-03 2017-Q1 622900 0.0

Pandas can also help with describing the Data which can help for analysis.
Such as Summarizing data by state (Average ZHVI data per state)...


In [6]:
statedf = df.groupby("State")["Zhvi"].mean().sort_values(ascending = False)
statedf.head()


Out[6]:
State
CA    612531
HI    561834
DC    559200
MA    385613
NJ    377342
Name: Zhvi, dtype: int64

Select columns that will be used as features for Machine Learning.


In [7]:
featcol= [
    'Zhvi','MoM','QoQ','YoY','5Year','10Year','PeakZHVI','PctFallFromPeak']
x = df[featcol]

Dealing with missing values can be done by removing rows with missing data....


In [8]:
#check number of rows
print ("original number of rows:  %d" % (len(x.index)))

#remove rows
x1 = x.dropna()
print ("new number of rows:  %d" % (len(x1.index)))


original number of rows:  10893
new number of rows:  9571

Or imputating missing values with the interpolate function from pandas.


In [9]:
## all Nans to white space
x = x.replace(np.nan, ' ', regex = True)
x = x.replace(np.nan, 'NaN', regex = True)
## convert to all floats
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    x = x.convert_objects(convert_numeric = True)
x = x.interpolate()
x.isnull().any().any() ## to check if any missing data


Out[9]:
False

Prepare features by converting the dataframe into an array.


In [10]:
# features into array
features = x.values

To see variance of features: boxplot (from Seaborn) can be used with the MinMaxscaler (from scikit-learn) to visualize this.


In [11]:
min_max_scaler = MinMaxScaler()
fmm = min_max_scaler.fit_transform(features)
fmX = pd.DataFrame(fmm)

ax = sns.boxplot(data=fmX)
ax


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x103ba1c88>

Unsupervised Machine Learning with Clustering

Clustering can label the unlabeled Real Estate Data.

Below is a summary of the parameters used by the clustering algorithms from scikit-learn:

  • K-Means: number of clusters
  • Affinity propagation: damping, sample preference
  • Means-shift: bandwidth
  • Spectral Clustering: number of clusters
  • Ward hierachical clustering: number of clusters
  • DBSCAN: neighborhood size
  • Gaussian mixtures: there are many to choose from
  • Birch: branching factor, threshold, optional global clusterer

Read more about Clustering with Scikit-Learn.

Start with K-means since it's simple.
Read more about K-Means.
The parameter that K-Means utilizes is the number of clusters or k.

Pick a number of clusters or k for K-Means, and check its performance by using the silhouette score, the metric that assesses the variance of objects within clusters. Score closer to 1 is best.


In [12]:
k = 4
cluster = KMeans(init='k-means++', n_clusters=k, n_init=12)
cluster.fit(features)
metrics.silhouette_score(features, cluster.labels_)


Out[12]:
0.67514490830721008

Model selection can be done by looping through a range of k. This could also include looping through different types of clustering methods that uses k or number of clusters as a parameter.

The loop includes K-Means and the following clustering methods:

  • MiniBatch K-Means is the same as K-means but uses mini-batches to reduce computation time
  • Agglomerative Clustering performs heirarchical clustering and can be used with 3 three different merge or linkage strategies :
    • Ward
    • Complete
    • Average

The result of the loop is a ranking of the silhouette scores of the different methods and k explored.

Pick a range of k to explore.


In [13]:
# adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
range_n_clusters = range(8,11)
List1 = []
List2 = []
List3 = []
 
for n_clusters in range_n_clusters:
    def bench_clustering(estimator, name, data):
        estimator.fit(data)
        v1 = name 
        v2 = n_clusters
        v3 = metrics.silhouette_score(data, estimator.labels_)

        List1.append(v1)
        List2.append(v2)
        List3.append(v3)

    bench_clustering(KMeans(init='k-means++', n_clusters=n_clusters, n_init=12),
                  name="K-Means", data=features)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        bench_clustering(MiniBatchKMeans(init='k-means++', n_clusters=n_clusters, n_init=12,max_no_improvement=10, verbose=0,random_state=0),
                     name="MiniBatchKMeans", data=features)
    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='ward'),
                  name="Ward", data=features)   

    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='average'),
                  name="Average", data=features)  
    
    bench_clustering(AgglomerativeClustering(n_clusters=n_clusters, linkage='ward'),
                  name="Complete", data=features) 
d = pd.DataFrame()
d['method'] = List1 
d['k'] = List2
d['silhouette_score'] = List3
d = d.sort_values(['silhouette_score'], ascending = False)
print (d)


             method   k  silhouette_score
3           Average   8          0.869141
8           Average   9          0.850578
13          Average  10          0.737894
0           K-Means   8          0.569374
2              Ward   8          0.542509
4          Complete   8          0.542509
12             Ward  10          0.539720
14         Complete  10          0.539720
7              Ward   9          0.539631
9          Complete   9          0.539631
5           K-Means   9          0.534215
10          K-Means  10          0.522625
6   MiniBatchKMeans   9          0.481316
11  MiniBatchKMeans  10          0.474331
1   MiniBatchKMeans   8          0.451215

Silhouette plot analysis can also aid with model selection. The relationship of objects within each cluster can be assessed visually.

Pick a range of k to explore. (Below is used with K-Means, but can be used with the other clustering methods that use k as a parameter, as mentioned above.)


In [14]:
#adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py

range_n_clusters = range(8,11)

for n_clusters in range_n_clusters:
    
    cluster = KMeans(init='k-means++', n_clusters=n_clusters, n_init=12)
    cluster.fit(features)
    metrics.silhouette_score(features, cluster.labels_)

    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(features) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = cluster
    cluster_labels = clusterer.fit_predict(features)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(features, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(features, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhoutte score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(features[:, 0], features[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors)

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1],
                marker='o', c="white", alpha=1, s=200)

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1, s=50)

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

    plt.show()


For n_clusters = 8 The average silhouette_score is : 0.569374256213
For n_clusters = 9 The average silhouette_score is : 0.530796596903
For n_clusters = 10 The average silhouette_score is : 0.515566366619

Visualizing Clusters - The example below shows the clusters and their centroids. Seeing the shape of the clusters and the location of the centroids can help with further analysis.

Pick a k to explore. (Below is used with K-Means, but can be used with the other clustering methods that use k as a parameter, as mentioned above.)


In [15]:
#adapted from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

n_clusters = 10
f_scaled = scale(features)
reduced_data = PCA(n_components=2).fit_transform(f_scaled)
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02     # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means clustering on the Zillow ZHVI dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()


Not getting good silhouette scores?

Try adding more features from Zillow/data.
There are also many other clustering methods with scikit-learn to explore here.


In [ ]: