Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k one should select for their problem?
This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closest sales in its area.
The end of these show notes also demonstrates a slightly more canonical use of the Elbow Method on a clustering problem, which is the more typical use case.
In [9]:
%matplotlib inline
import os
import json
import pickle
import geocoder
import math
import random
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt
In [3]:
addresses = {}
In [4]:
root = '../data/'
allfiles = []
dirs = os.listdir(root)
for d in dirs:
    files = os.listdir(root + d)
    for f in files:
        allfiles.append(root + d + '/' + f)
In [544]:
random.shuffle(allfiles)
len(allfiles)
Out[544]:
In [45]:
def persist(addresses):
    save = {}
    for key in addresses.keys():
        addr = addresses[key]
        val = addr.latlng
        val.append(addr.accuracy)
        save[key] = val
    # cache the geocoder results to disk
    h = open('addresses.pkl', 'wb')
    pickle.dump(save, h)
    h.close()
In [120]:
def flatten(sarr):
    if type(sarr) == str or type(sarr) == unicode:
        return sarr
    res = ''
    for s in sarr:
        s = s.strip()
        if len(s) > 0:
            if len(res) > 0:
                res += ', ' + s
            else:
                res = s
    return res
In [314]:
def numClean(x):
    if x is None:
        return None
    if x == '':
        return None
    if type(x) == float:
        return x
    try:
        return float(x.replace('$', '').replace(',', ''))
    except ValueError:
        return None
In [306]:
def clean_details(details, blockLLMap):
    details['Legals'] = flatten(details['Legals'])
    if details.has_key('Improvements'):
        di = details['Improvements']
        if len(di) > 0:
            imps = di[0]
            for key in imps.keys():
                val = imps[key]
                details['improve_' + key] = val
        del details['Improvements']
    convertFields = ['SalePrice', 'FIXEVEV', 'FIXTVAL', 'HOEVAL', 'IMPROVAL', 'LANDVAL', 'PPEXEVAL', 'PPVALUE', 'REEXEVAL', 'improve_BDLBATH', 'improve_BDLBDRM', 'improve_BDLEFFYR', 'improve_BDLSQMAIN', 'improve_BDLUNITS', 'improve_BDLYRBLT']
    for cf in convertFields:
        if details.has_key(cf):
            details[cf] = numClean(details[cf])
    head = int(details['AIN']) / 1000 * 1000 + 1
    try:
        ll = blockLLMap[head]
        details['latitude'] = ll.lat
        details['longitude'] = ll.lng
    except KeyError:
        return details
    return details
I'm far from being an expert on these AIN numbers, but it's clear there's some order to them. For my application, I'm comfortable saying that every parcel on the same block can be considered to have a distance of 0. This will save me the pain of trying to geocode hundreds of thousands of addresses. I might later work on putting some offset in based on the data I already have. For now, let's grab every AIN that appears to be at the "head" of a block.
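To make that assumption concrete, here is a minimal sketch of how a block "head" AIN can be derived; the function name and the example number in the comment are purely hypothetical and not part of the original notebook.

# Minimal sketch of the block "head" assumption: round an AIN down to the nearest
# thousand and add 1, so every parcel on a block maps to one geocoded point.
def block_head(ain):
    # e.g. a hypothetical AIN 1234567890 would map to head 1234567001
    return int(ain) / 1000 * 1000 + 1  # Python 2 integer division, as in clean_details above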
In [152]:
errcount = 0
c = 0
blockLLMap = {}
alldetails = []
for f in allfiles:
    try:
        data = json.loads(pickle.load(open(f, 'r')).text)
        details = data['results']['ParcelDetails']
        if details:
            if int(details['AIN']) % 1000 == 1:
                addr = details['Address1'] + ' ' + details['Address2']
                try:
                    loc = addresses[addr]
                    c += 1
                except KeyError:
                    print 'Lookup', addr
                    loc = geocoder.google(addr)
                    addresses[addr] = loc
                blockLLMap[int(details['AIN'])] = loc
                if c % 1000 == 0:
                    print 'Completed', 100.0 * c / len(allfiles), errcount
            alldetails.append(details)
    except:
        print 'Error on: ', f
        errcount += 1
        if errcount > 10:
            break
# Since I did a bunch of slow API lookups, let's cache this in case I want to use it again in the future.
persist(addresses)
In [214]:
# Do data cleaning
errcount = 0
c = 0
all2 = []
for details in alldetails:
    details = clean_details(details, blockLLMap)
    all2.append(details)
df = pd.DataFrame(all2)
In [254]:
# See how many latitudes are missing given my assumption
df['latitude'].apply(lambda x: math.isnan(x)).sum()
Out[254]:
There are enough lat/lngs missing that I want to fill in some of the blanks. While there are a lot of them, it's manageable with a long-running script.
In [ ]:
c = 0
gaps = df[df['latitude'].apply(lambda x: math.isnan(x))]
print gaps.shape
for r in range(gaps.shape[0]):
    row = gaps.iloc[r]
    addr = row['Address1'] + ' ' + row['Address2']
    try:
        loc = addresses[addr]
        c += 1
    except KeyError:
        print 'Lookup', addr
        loc = geocoder.google(addr)
        addresses[addr] = loc
    blockLLMap[int(row['AIN'])] = loc  # use the current row's AIN, not a stale variable
    if c % 1000 == 0:
        print 'Completed', 100.0 * r / gaps.shape[0], errcount
In [225]:
df['AIN'] = df['AIN'].astype(int)
In [226]:
# save df to csv
df.to_csv('details.csv', sep='\t', index=False, encoding='utf-8')
In [ ]:
# all parcels on the same block are considered to be at distance 0
In [545]:
# Sample of the data
df.head()
Out[545]:
In [266]:
df.shape
Out[266]:
In [546]:
# Filter down to those I have a price for
f1 = df['SalePrice'] > 0
f2 = df['latitude'].apply(lambda x: not(math.isnan(x)))
df2 = df[f1 & f2]
print 'Remaining after filter:', 1.0 * df2.shape[0] / df.shape[0]
print df2.shape[0]
I have a disappointingly small subset, but let's see if we can make the most of it.
So I'm using Euclidean distance, which is strictly incorrect. I should be using the geodesic distance, since the curvature of the Earth, especially at the latitude of Los Angeles, will introduce significant error into the Euclidean approximation. Since I'm looking for the nearest neighbors, this error might be small, but I'm not sure it will be trivial. Still, I'm skipping this step for now, since the objective of this particular mini-episode is to talk about the Elbow Method, not about distance measures.
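As a point of reference only (none of the analysis below uses it), here is a hedged sketch of the haversine great-circle distance one could substitute for the planar approximation; the function name and the Earth-radius constant are my own choices, not from the original notebook.

# Hedged sketch: haversine great-circle distance in kilometers, for comparison
# against treating (latitude, longitude) degrees as planar coordinates.
import math

def haversine_km(lat1, lng1, lat2, lng2):
    # convert degrees to radians
    lat1, lng1, lat2, lng2 = map(math.radians, [lat1, lng1, lat2, lng2])
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # mean Earth radius ~6371 km

# At Los Angeles' latitude (~34 degrees N), a degree of longitude covers only about
# cos(34) ~ 0.83 of the ground distance of a degree of latitude, so the planar
# approximation stretches east-west separations relative to north-south ones.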
In [344]:
# load to k-d tree
lat = df2['latitude']
lng = df2['longitude']
df2.index = np.arange(df2.shape[0]) # Just to make sure the index is useful to us
k=2
kdt = cKDTree(zip(lat,lng))
In [540]:
maxk = 25
In [541]:
# iterate over k; for each home, record its sale price, the average distance to its
# k nearest neighbors, and the root sum of squared differences from their sale prices
results = []
for k in range(1, maxk):
    for r in np.arange(df2.shape[0]):
        row = df2.iloc[r]
        sp = row['SalePrice']
        distances, neighbors = kdt.query((row['latitude'], row['longitude']), k=k)
        if type(distances) == float:
            distances = pd.Series(distances)
        if type(neighbors) == int:
            neighbors = pd.Series(neighbors)
        if sum(neighbors == r):
            # drop the home itself if it shows up as its own neighbor
            selfidx = np.where(neighbors == r)[0][0]
            neighbors = pd.Series(neighbors)
            neighbors.drop(selfidx, inplace=True)
        avg_dist = distances.mean()
        #avg_sp = df2.iloc[neighbors]['SalePrice'].mean()
        avg_sp = math.sqrt(sum((sp - df2.iloc[neighbors]['SalePrice']).apply(lambda x: math.pow(x, 2))))
        result = [sp, k, avg_dist, avg_sp, len(neighbors)]
        results.append(result)
In [542]:
rdf = pd.DataFrame(results)
rdf.columns = ['SalePrice', 'k', 'adist', 'avgNeighborSalePrice', 'numNeighbors']
# There are some extreme outliers, so let's trim the top and bottom
low, high = rdf['SalePrice'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['SalePrice'] > low]
rdf = rdf[rdf['SalePrice'] < high]
rdf['delta'] = rdf['avgNeighborSalePrice'] - rdf['SalePrice']
rdf['sq_err'] = rdf['delta'].apply(lambda x: abs(x))
low, high = rdf['delta'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['delta'] > low]
rdf = rdf[rdf['delta'] < high]
gb = rdf.groupby('k')['sq_err'].mean()
In [543]:
plt.figure(figsize=(10,5))
plt.bar(gb.index, gb)
plt.xlabel('k=')
plt.gca().xaxis.grid(False)
plt.title('Mean square error ')
plt.xticks(gb.index + 0.4, gb.index)
plt.xlim(.75, maxk)
plt.show()
The figure above plots the mean squared error between the price of a home and the prices of its k nearest neighbors. Naturally, this grows as k increases, because we're searching further and further away from the home for comparables, which means homes that are less comparable in terms of location.
Yet we find a small minimum at k=3, which would imply that the optimal number of nearest neighbors to use when estimating a home's price from its neighbors' sale prices alone is three.
The reader should be a bit skeptical of this result, because we have some unstated assumptions about the data. First, we assume the data source is good, yet that assumption is questionable. I got the data by crawling an unpublicized API of the assessor's office. It clearly has large gaps (homes without recent sale prices). I have no explanation for these gaps. If they were random omissions, that would not be so bad, but I suspect they're introducing a bias into the dataset.
I also don't account for trend or seasonality. Because my dataset ended up rather small compared to the total population of homes in LA, I used all the data. In reality, more recent data would be better (perhaps the last 3 years?). Additionally, home sales have seasonal components that I didn't account for. Thus, the nearest neighbors might have all sold during a more favorable time in the market and are therefore not exactly comparable to the home I'm estimating.
I'm also comparing only on price and location: Euclidean distance (which has its own problems, stated above) is the only similarity metric, and no property features are considered. A one-bedroom condo might be compared to a four-bedroom single-family detached home by this method, if it happens to be closest. This is clearly not ideal and will introduce a significant amount of variance into my results.
So where is the elbow? Arguably, it's at k=3. One might say this is not actually an "elbow method" problem, because the method is typically used for monotonically decreasing trade-offs. In that sense, the more common use in k-means clustering would have been a better example. For a discussion of the more ideal use cases, please listen to this episode.
Lastly, consider these show notes a demonstration of technology and technique over actual result. There are many things to be skeptical of in this analysis, but I learned a few things along the way that will go into our upcoming home purchase decisions.
In [1]:
from sklearn.cluster import KMeans
from sklearn import datasets
iris = datasets.load_iris()
In [60]:
X = iris.data
rsqs = []
ks = range(2,20)
for k in ks:
    km = KMeans(n_clusters=k)
    fit = km.fit(X)
    # Calculate the R-squared for the clustering: 1 minus within-cluster variance over total variance
    labels = fit.predict(X)
    y = pd.DataFrame(labels, columns=['cluster'])
    x = pd.DataFrame(X, columns=['x0','x1','x2','x3'])
    df = pd.concat([x, y], axis=1)
    centroids = df.groupby(['cluster']).mean()
    centroids.columns = ['c0', 'c1', 'c2', 'c3']
    mcentroid = df[['x0', 'x1', 'x2', 'x3']].mean()
    df.set_index('cluster', inplace=True)
    df = df.join(centroids, how='outer')
    # sum of squared error within each cluster
    df['sse'] = (df['x0']-df['c0']).apply(lambda x: math.pow(x, 2)) \
        + (df['x1']-df['c1']).apply(lambda x: math.pow(x, 2)) \
        + (df['x2']-df['c2']).apply(lambda x: math.pow(x, 2)) \
        + (df['x3']-df['c3']).apply(lambda x: math.pow(x, 2))
    # sum of squared error total
    df['m0'] = mcentroid['x0']
    df['m1'] = mcentroid['x1']
    df['m2'] = mcentroid['x2']
    df['m3'] = mcentroid['x3']
    df['tse'] = (df['x0']-df['m0']).apply(lambda x: math.pow(x, 2)) \
        + (df['x1']-df['m1']).apply(lambda x: math.pow(x, 2)) \
        + (df['x2']-df['m2']).apply(lambda x: math.pow(x, 2)) \
        + (df['x3']-df['m3']).apply(lambda x: math.pow(x, 2))
    rsq = 1 - df['sse'].sum() / df['tse'].sum()
    rsqs.append(rsq)
In [61]:
plt.plot(ks, rsqs)
plt.xlabel('Number of clusters')
plt.ylabel('R-Squared')
plt.xlim(1,20)
plt.show()
The above example of an elbow plot for clusters found using k-means is what one would typically use to apply the elbow method, and thus select the k parameter. To my eye, there is no obvious elbow in this example, highlighting one of my complaints expressed in the episode.
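As a follow-up sketch (not part of the episode's analysis), one simple heuristic for picking an elbow when none is visually obvious is to choose the point farthest from the straight line joining the two ends of the curve; the helper below is hypothetical and only illustrates the idea.

# Hedged sketch: pick the "elbow" as the point with maximum perpendicular distance
# from the chord connecting the first and last points of the curve.
import numpy as np

def elbow_index(xs, ys):
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    p1 = np.array([xs[0], ys[0]])
    p2 = np.array([xs[-1], ys[-1]])
    direction = (p2 - p1) / np.linalg.norm(p2 - p1)  # unit vector along the chord
    rel = np.column_stack([xs, ys]) - p1
    # perpendicular distance of each point from the chord (2-D cross product)
    dists = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])
    return int(np.argmax(dists))

# In practice one might rescale xs and ys to [0, 1] first so both axes contribute
# comparably. Example use with the curve computed above:
# best_k = ks[elbow_index(ks, rsqs)]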