# The Elbow Method

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k. A user of these algorithms is required to select this value, which raises the questions: what is the "best" value of k that one should select to solve their problem?

This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closests sales in it's area.

The end of these show nows also shows a slightly more canonical use of The Elbow Method for a clustering problem, which is a more typical use case.

``````

In :

%matplotlib inline
import os
import json
import pickle
import geocoder
import math
import random
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree
import matplotlib.pyplot as plt

``````
``````

/usr/local/lib/python2.7/dist-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
warnings.warn(self.msg_depr % (key, alt_key))

``````
``````

In :

``````
``````

In :

root = '../data/'
allfiles = []
dirs = os.listdir(root)
for d in dirs:
files = os.listdir(root + d)
for f in files:
allfiles.append(root + d + '/' + f)

``````
``````

In :

random.shuffle(allfiles)
len(allfiles)

``````
``````

Out:

166212

``````
``````

In :

save = {}
save[key] = val
#
pickle.dump(save, h)
h.close()

``````
``````

In :

def flatten(sarr):
if type(sarr) == str or type(sarr) == unicode:
return sarr
res = ''
for s in sarr:
s = s.strip()
if len(s) > 0:
if len(res) > 0:
res += s + ', '
else:
res = s
return res

``````
``````

In :

def numClean(x):
if x is None:
return None
if x == '':
return None
if type(x) == float:
return x
try:
return float(x.replace('\$', '').replace(',', ''))
except ValueError:
return None

``````
``````

In :

def clean_details(details, blockLLMap):
c = 0
details['Legals'] = flatten(details['Legals'])
if details.has_key('Improvements'):
di = details['Improvements']
if len(di)>0:
imps = di
for key in imps.keys():
val = imps[key]
details['improve_' + key] = val
del details['Improvements']
convertFields = ['SalePrice', 'FIXEVEV', 'FIXTVAL', 'HOEVAL', 'IMPROVAL', 'LANDVAL', 'PPEXEVAL', 'PPVALUE', 'REEXEVAL', 'improve_BDLBATH', 'improve_BDLBDRM', 'improve_BDLEFFYR', 'improve_BDLSQMAIN', 'improve_BDLUNITS', 'improve_BDLYRBLT']
for cf in convertFields:
if details.has_key(cf):
details[cf] = numClean(details[cf])
head = int(details['AIN'])/1000 * 1000 + 1
try:
details['latitude'] = ll.lat
details['longitude'] = ll.lng
except KeyError:
return details
return details

``````

I'm far from being an expert on these AIN numbers, but it's clear there's some order to them. For my application, I'm comfortable saying that every parcel on the same block can be considered to have a distance of 0. This will save me the pain of trying to geocode hundreds of thousands of addresses. I might later work on putting some offset in based on the data I already have. For now, let's grab every AIN that appears to be at the "head" of a block.

``````

In :

errcount = 0
c = 0
blockLLMap = {}
alldetails = []
for f in allfiles:
try:
details = data['results']['ParcelDetails']
if details:
if int(details['AIN']) % 1000 == 1:
try:
c += 1
except KeyError:
blockLLMap[int(details['AIN'])] = loc
if c % 1000 == 0:
print 'Completed', 100.0 * c / len(allfiles), errcount
alldetails.append(details)
except:
print 'Error on: ', f
errcount += 1
if errcount>10:
break

# Since I did a bunch of slow API lookups, let's cache this in case I want to use it again in the future.

``````
``````

Completed 0.601641277405 1
Completed 1.20328255481 1
Completed 1.80492383221 1

``````
``````

In :

# Do data cleaning
errcount = 0
c = 0
all2 = []
for details in alldetails:
details = clean_details(details, blockLLMap)
all2.append(details)

df = pd.DataFrame(all2)

``````
``````

In :

# See how many latitudes are missing given my assumption
df['latitude'].apply(lambda x: math.isnan(x)).sum()

``````
``````

Out:

63045

``````

There are enough lat/lngs missing that I want to fill in some of the blanks. Luckily, there's a lot but it's manageable with a long running script.

``````

In [ ]:

c=0
gaps = df[df['latitude'].apply(lambda x: math.isnan(x))]
print gaps.shape
for r in range(gaps.shape):
row = gaps.iloc[r]
try:
c += 1
except KeyError:
blockLLMap[int(details['AIN'])] = loc
if c % 1000 == 0:
print 'Completed', 100.0 * r / gaps.shape, errcount

``````
``````

In :

df['AIN'] = df['AIN'].astype(int)

``````
``````

In :

# save df to csv
df.to_csv('details.csv', sep='\t', index=False, encoding='utf-8')

``````
``````

In [ ]:

# all on block are considered 0

``````
``````

In :

# Sample of the data

``````
``````

Out:

AIN
Assr_Index_Map
Assr_Map
CLUSTER
FIXEVEV
FIXTVAL
FormattedAIN
HOEVAL
...
UseType_Label
improve_BDLBATH
improve_BDLBDRM
improve_BDLEFFYR
improve_BDLSQMAIN
improve_BDLUNITS
improve_BDLYRBLT
improve_ImprovementNum
latitude
longitude

0
4252019010
3741 VETERAN AVE
LOS ANGELES CA 90034
None
None
09156
0
0
4252-019-010
7000
...
Single Family Residential
1
2
1950
905
1
1950
1
NaN
NaN

1
4330028006
400 S BEVERLY DR
BEVERLY HILLS CA 90212
None
None
23620
0
0
4330-028-006
0
...
Commercial / Industrial
0
0
1970
45326
0
1957
1
NaN
NaN

2
4277013052
933 21ST ST, APT 0013
SANTA MONICA CA 90403
None
None
07398
0
0
4277-013-052
7000
...
Condominium
3
2
1974
1254
1
1974
1
34.036635
-118.486177

3
4317009025
2006 THAYER AVE
LOS ANGELES CA 90025
None
None
07153
0
0
4317-009-025
7000
...
Single Family Residential
3
4
1960
2111
1
1925
1
34.049540
-118.425438

4
4249026288

None
None
22811
0
0
4249-026-288
0
...
Vacant Land
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN

5 rows × 36 columns

``````
``````

In :

df.shape

``````
``````

Out:

(148557, 36)

``````
``````

In :

# Filter down to those I have a price for

f1 = df['SalePrice'] > 0
f1 = df['SalePrice'] > 0
f2 = df['latitude'].apply(lambda x: not(math.isnan(x)))
df2 = df[f1 & f2]
print 'Remaining after filter:', 1.0 * df2.shape / df.shape
print df2.shape

``````
``````

Remaining after filter: 0.0380594653904
5654

``````

I have a disappointingly small subset, but let's see if we can make the most of it.

So I'm using Euclydian distance, which is strictly incorrect. I should be using the geodesic distance, since the curvature of the Earth, especially at the longitude of Los Angeles, will introduce significant error in the Euclydian distance approximation. Since i'm looking for the nearest neighbors, this error might be minimized, but I'm not sure if it will be trivial. Yet, I'm skipping this step for now, since the objective in this particular mini-episode is to talk about the Elbow Method, not about distance measures.

``````

In :

lat = df2['latitude']
lng = df2['longitude']
df2.index = np.arange(df2.shape) # Just to make sure the index is useful to us
k=2
kdt = cKDTree(zip(lat,lng))

``````
``````

In :

maxk = 25

``````
``````

In :

# iterate over k, calculate average price, store mean and accuracy
results = []
for k in range(1, maxk):
for r in np.arange(df2.shape):
row = df2.iloc[r]
sp = row['SalePrice']
distances, neighbors = kdt.query((row['latitude'], row['longitude']), k=k)
if type(distances) == float:
distances = pd.Series(distances)
if type(neighbors) == int:
neighbors = pd.Series(neighbors)
if sum(neighbors==r):
selfidx = np.where(neighbors == r)
neighbors = pd.Series(neighbors)
neighbors.drop(selfidx, inplace=True)
avg_dist = distances.mean()
#avg_sp = df2.iloc[neighbors]['SalePrice'].mean()
avg_sp = math.sqrt(sum((sp - df2.iloc[neighbors]['SalePrice']).apply(lambda x: math.pow(x, 2))))
result = [sp, k, avg_dist, avg_sp, len(neighbors)]
results.append(result)

``````
``````

In :

rdf = pd.DataFrame(results)
rdf.columns = ['SalePrice', 'k', 'adist', 'avgNeighborSalePrice', 'numNeighbors']

# There are some extreme outliers, so let's trim the top and bottom
low, high = rdf['SalePrice'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['SalePrice'] > low]
rdf = rdf[rdf['SalePrice'] < high]
rdf['delta'] = rdf['avgNeighborSalePrice'] - rdf['SalePrice']
rdf['sq_err'] = rdf['delta'].apply(lambda x: abs(x))
low, high = rdf['delta'].quantile([.05, .95]).tolist()
rdf = rdf[rdf['delta'] > low]
rdf = rdf[rdf['delta'] < high]
gb = rdf.groupby('k')['sq_err'].mean()

``````
``````

In :

plt.figure(figsize=(10,5))
plt.bar(gb.index, gb)
plt.xlabel('k=')
plt.gca().xaxis.grid(False)
plt.title('Mean square error ')
plt.xticks(gb.index + 0.4, gb.index)
plt.xlim(.75, maxk)
plt.show()

``````
``````

``````

The figure above plots the mean square error of the price of a home to it's k nearest neighbors. Naturally, this diverges as k increases, because it means we're searching further and further away from the home for comparables, which means homes that are less comparable in terms of location.

Yet, we find a small minimium with k=3, which would imply that the optimal number of nearest neighbors to use to estimate the price of a home based strictly on price alone, is three.

The reader should be a bit skeptical of this result, because we have some unstated assumptions about the data. First, we assume the data source is good. Yet, it's a questionable usage. I got it by crawling an unpublicized API of the assessor's office. It clearly has large gaps (homes without recent sale prices). I have no explanation for these gaps. If they were random ommisions, that would not be so bad, but I suspect their introducing a bias to the dataset.

I also don't account for trend or seasonality. Because my dataset got down to be rather small compared to the total population of homes in LA, I used all the data. In reality, more recent data would be better (perhaps the last 3 years?). Additionally, home sales have seasonal components that I didn't account for. Thus, the nearest neigbors might have all sold during a more favorable time in the market, and are thus, not exactly comparable to the home I'm estimating.

I'm also only comparing on price. Euclydian distance (which has it's own problems stated above) is the only metric. A one bedroom condo might be compared to a 4 bedroom single family dettached home by this method, if it happens to be closest. This is clearly not ideal and will introduce a significant amount of variance into my results.

So where is the elbow? Well, it's arguable at k=3. One might say this is not actually an "elbow method" problem because this method is typically used for monotonically decreasing trade-offs. In this way the more common use in k-means clustering would have been a better example. For a discussion of the more ideal use cases, please listen to this episode.

Lastly, consider these show notes an demonstration of technology and technique over actual result. There are many things to be skeptical of in this analysis, but I learned a few things a long the way that will go into our upcoming home purchase decisions.

## Clustering Example

Let's dust off the old iris dataset and see how The Elbow Method performs in it's more typical use case of identifying the k for k-means clustering.

``````

In :

from sklearn.cluster import KMeans
from sklearn import datasets

``````
``````

In :

X = iris.data
rsqs = []
ks = range(2,20)
for k in ks:
km = KMeans(n_clusters=k)
fit = km.fit(X)
# Calculate the R-squared for the cluster, which is explained (within variance) over total variance

labels = fit.predict(X)
y = pd.DataFrame(labels, columns=['cluster'])
x = pd.DataFrame(X, columns=['x0','x1','x2','x3'])
df = pd.concat([x, y], axis=1)
centroids = df.groupby(['cluster']).mean()
centroids.columns = ['c0', 'c1', 'c2', 'c3']
mcentroid = df[['x0', 'x1', 'x2', 'x3']].mean()
df.set_index('cluster', inplace=True)
df = df.join(centroids, how='outer')

# sum of squared error within each cluster
df['sse'] = (df['x0']-df['c0']).apply(lambda x: math.pow(x, 2)) \
+ (df['x1']-df['c1']).apply(lambda x: math.pow(x, 2)) \
+ (df['x2']-df['c2']).apply(lambda x: math.pow(x, 2)) \
+ (df['x3']-df['c3']).apply(lambda x: math.pow(x, 2))

# sum of squared error total
df['m0'] = mcentroid['x0']
df['m1'] = mcentroid['x1']
df['m2'] = mcentroid['x2']
df['m3'] = mcentroid['x3']
df['tse'] = (df['x0']-df['m0']).apply(lambda x: math.pow(x, 2)) \
+ (df['x1']-df['m1']).apply(lambda x: math.pow(x, 2)) \
+ (df['x2']-df['m2']).apply(lambda x: math.pow(x, 2)) \
+ (df['x3']-df['m3']).apply(lambda x: math.pow(x, 2))

rsq = 1 - df['sse'].sum() / df['tse'].sum()
rsqs.append(rsq)

``````
``````

In :

plt.plot(ks, rsqs)
plt.xlabel('Number of clusters')
plt.ylabel('R-Squared')
plt.xlim(1,20)
plt.show()

``````
``````

``````

The above example of an elbow plot for clustesr found using k-means is what one would typically use to inform the elbow method, and thus select your k parameter. To my eye, there is no obvious elbow in this example, highlighting one of my complaints expressed in the episode.