Start Spark
In [1]:
%cd ~/Documents/W261/hw10/
In [3]:
import os
import sys
spark_home = os.environ['SPARK_HOME'] = \
'/Users/davidadams/packages/spark-1.5.1-bin-hadoop2.6/'
if not spark_home:
raise ValueError('SPARK_HOME environment variable is not set')
sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home,'python/pyspark/shell.py'))
HW 10.0: Short answer questions
What is Apache Spark and how is it different to Apache Hadoop?
Fill in the blanks: Spark API consists of interfaces to develop applications based on it in Java, ...... languages (list languages).
Using Spark, resource management can be done either in a single server instance or using a framework such as Mesos or ????? in a distributed manner.
What is an RDD and show a fun example of creating one and bringing the first element back to the driver program.
What is lazy evaluation? Give an intuitive example of lazy evaluation and comment on the massive computational savings to be had from lazy evaluation.
Answers
Apache Spark is a framework for parallel computation over big data that builds optimized general execution graphs over RDDs. It differs from Apache Hadoop by keeping intermediate data in memory rather than writing it to disk between stages, which makes it much faster. Spark programs also typically require 2-5 times less code than their Hadoop equivalents, and Spark provides an interactive read-eval-print loop (REPL), which Hadoop does not.
Spark API consists of interfaces to develop applications based on it in Java, Scala, Python, and R languages.
Using Spark, resource management can be done either in a single server instance or using a framework such as Mesos or YARN in a distributed manner (a minimal configuration sketch follows these answers).
A Resilient distributed data set (RDD) is a distributed collection of elements, which are automatically distributed across the cluster for parallel computations. RDDs can also be recomputed from the execution graph providing fault tolerance.
Lazy evaluation means that transformations are not computed immediately, but only when an action is performed on the transformed RDD. An example of lazy evaluation is reading the first line of a file. If creating an RDD from a text file were not lazy, the entire file would be read when the RDD was created. With lazy evaluation, if the only action we then perform is examining the first line, only the first line needs to be read. Lazy evaluation means that values are computed only if they are actually required, potentially resulting in significant computational savings (a small sketch follows the RDD example cell below).
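To make the resource-manager answer concrete, here is a minimal sketch of pointing a SparkConf at a local, YARN, or Mesos master. The master URLs are standard Spark 1.x options, but the application name and the Mesos host/port are placeholders for illustration; the notebook above already gets its sc from pyspark/shell.py, so the SparkContext line is left commented out.
from pyspark import SparkConf, SparkContext

# Single machine: Spark manages the local cores itself.
conf = SparkConf().setAppName("hw10-demo").setMaster("local[*]")

# Distributed resource management (pick one):
# conf = SparkConf().setAppName("hw10-demo").setMaster("yarn-client")          # YARN (Spark 1.5 syntax)
# conf = SparkConf().setAppName("hw10-demo").setMaster("mesos://host:5050")    # Mesos (placeholder host)

# sc = SparkContext(conf=conf)  # not run here; `sc` already exists in this notebook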
In [110]:
''' Example of creating an RDD and bringing the first element back to the driver'''
import numpy as np
dataRDD = sc.parallelize(np.random.random_sample(1000))
data2X= dataRDD.map(lambda x: x*2)
dataGreaterThan1 = data2X.filter(lambda x: x > 1.0)
print dataGreaterThan1.take(1)
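As a small follow-up to the lazy-evaluation answer above, the sketch below (reusing the HW10-Public/MIDS-MLS-HW-10.txt file from HW 10.1) shows that creating the RDD and chaining a transformation do no work until an action runs; the optional cache() call also illustrates the in-memory reuse mentioned in the first answer.
# Nothing is read from disk yet: textFile() and map() only record the lineage.
linesRDD = sc.textFile("HW10-Public/MIDS-MLS-HW-10.txt")
upperRDD = linesRDD.map(lambda line: line.upper())

# Optionally keep the materialized data in memory for reuse across later actions.
upperRDD.cache()

# Only this action triggers computation, and it reads just enough input to return one line.
print upperRDD.first()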
HW 10.1:
In Spark, write the code to count how often each word appears in a text document (or set of documents). Please use this homework document as the example document to run an experiment.
Report the following: provide a sorted list of tokens in decreasing order of frequency of occurrence.
In [154]:
def hw10_1():
# create RDD from text file and split at spaces to get words
rdd = sc.textFile("HW10-Public/MIDS-MLS-HW-10.txt")
words = rdd.flatMap(lambda x: x.strip().split(" "))
# count words and sort
sortedcounts = words.map(lambda x: (x, 1)) \
.reduceByKey(lambda x, y: x + y) \
.map(lambda (x,y): (y, x)) \
.sortByKey(False) \
.map(lambda (x,y): (y, x))
for line in sortedcounts.collect():
print line
return None
hw10_1()
HW 10.1.1 Modify the above word count code to count words that begin with lower case letters (a-z) and report your findings. Again sort the output words in decreasing order of frequency.
In [155]:
def hw10_1_1():
def isloweraz(word):
'''
check if the word starts with a lower case letter
'''
lowercase = 'abcdefghijklmnopqrstuvwxyz'
try:
return word[0] in lowercase
except IndexError:
return False
# create RDD from text file
rdd = sc.textFile("HW10-Public/MIDS-MLS-HW-10.txt")
# get words and filter for those that start with a lowercase letter
words = rdd.flatMap(lambda x: x.strip().split(" ")) \
.filter(isloweraz)
# count words and sort
sortedcounts = words.map(lambda x: (x, 1)) \
.reduceByKey(lambda x, y: x + y) \
.map(lambda (x,y): (y, x)) \
.sortByKey(False) \
.map(lambda (x,y): (y, x))
for line in sortedcounts.collect():
print line
return None
hw10_1_1()
HW 10.2: KMeans a la MLLib
Using the MLlib-centric KMeans code snippet below
NOTE: kmeans_data.txt is available here https://www.dropbox.com/s/q85t0ytb9apggnh/kmeans_data.txt?dl=0
Run this code snippet, list the clusters that you find, and compute the Within Set Sum of Squared Errors for the found clusters. Comment on your findings.
In [75]:
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
# NOTE kmeans_data.txt is available here https://www.dropbox.com/s/q85t0ytb9apggnh/kmeans_data.txt?dl=0
data = sc.textFile("HW10-Public/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
# Save and load model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")
In [58]:
for i,ctr in enumerate(clusters.centers):
print("Cluster %i: %.1f, %.1f, %.1f" % (i, ctr[0],ctr[1],ctr[2]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
HW 10.3:
Download the following KMeans notebook:
https://www.dropbox.com/s/3nsthvp8g2rrrdh/EM-Kmeans.ipynb?dl=0
Generate 3 clusters with 100 (one hundred) data points per cluster (using the code provided). Plot the data.
Then run MLlib's KMeans implementation on this data and report your results as follows:
-- plot the resulting clusters after 1 iteration, 10 iterations, 20 iterations, and 100 iterations.
-- in each plot, report the Within Set Sum of Squared Errors for the found clusters. Comment on the progress of this measure as the KMeans algorithm runs for more iterations.
In [59]:
%matplotlib inline
import numpy as np
import pylab
import json
size1 = size2 = size3 = 100
samples1 = np.random.multivariate_normal([4, 0], [[1, 0],[0, 1]], size1)
data = samples1
samples2 = np.random.multivariate_normal([6, 6], [[1, 0],[0, 1]], size2)
data = np.append(data,samples2, axis=0)
samples3 = np.random.multivariate_normal([0, 4], [[1, 0],[0, 1]], size3)
data = np.append(data,samples3, axis=0)
# Shuffle the data points into a random order
data = data[np.random.permutation(size1+size2+size3),]
np.savetxt('data.csv',data,delimiter = ',')
pylab.plot(samples1[:, 0], samples1[:, 1],'*', color = 'red')
pylab.plot(samples2[:, 0], samples2[:, 1],'o',color = 'blue')
pylab.plot(samples3[:, 0], samples3[:, 1],'+',color = 'green')
pylab.show()
In [91]:
'''
Then run MLlib's KMeans implementation on this data
and report your results as follows:
-- plot the resulting clusters after 1, 10, 20, and 100 iterations
-- in each plot please report the Within Set Sum of Squared Errors
for the found clusters. Comment on the progress of this measure as
the KMeans algorithm runs for more iterations
'''
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
# Load and parse the data
data = sc.textFile("data.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(',')]))
In [97]:
import numpy as np
#Calculate which class each data point belongs to
def nearest_centroid(line):
x = np.array([float(f) for f in line.split(',')])
closest_centroid_idx = np.sum((x - centroids)**2, axis=1).argmin()
return (closest_centroid_idx,(x,1))
#plot centroids and data points for each iteration
def plot_iteration(means):
pylab.plot(samples1[:, 0], samples1[:, 1], '.', color = 'blue')
pylab.plot(samples2[:, 0], samples2[:, 1], '.', color = 'blue')
pylab.plot(samples3[:, 0], samples3[:, 1],'.', color = 'blue')
pylab.plot(means[0][0], means[0][1],'*',markersize =10,color = 'red')
pylab.plot(means[1][0], means[1][1],'*',markersize =10,color = 'red')
pylab.plot(means[2][0], means[2][1],'*',markersize =10,color = 'red')
pylab.show()
In [112]:
from time import time
numIters = [1, 10, 20, 100]
for i in numIters:
clusters = KMeans.train(parsedData, k=3, maxIterations=i,
initializationMode = "random")
if i==1:
print("Centroids after %d iteration:" % i)
else:
print("Centroids after %d iterations:" % i)
for centroid in clusters.centers:
print centroid
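# Note: error() from HW 10.2 reads the global `clusters`, which now refers to the model just trained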
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
plot_iteration(clusters.centers)
The WSSSE decreases as the number of iterations increases from 1 to 20. After about 20 iterations the centroids converge and the WSSSE is stable.
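To back up this observation, here is a small optional sketch that records the WSSSE for a range of maxIterations values and plots the curve. It assumes parsedData from the cells above is still in scope; the fixed seed is an arbitrary choice, used only to keep the runs comparable across iteration counts.
from pyspark.mllib.clustering import KMeans
from math import sqrt
import pylab

iter_counts = [1, 5, 10, 20, 50, 100]
wssse_curve = []
for i in iter_counts:
    model = KMeans.train(parsedData, k=3, maxIterations=i,
                         initializationMode="random", seed=42)
    # sum over all points of the distance to the assigned cluster center
    wssse = parsedData.map(lambda p: sqrt(sum((p - model.centers[model.predict(p)]) ** 2))) \
                      .reduce(lambda a, b: a + b)
    wssse_curve.append(wssse)

pylab.plot(iter_counts, wssse_curve, 'o-')
pylab.xlabel('maxIterations')
pylab.ylabel('WSSSE')
pylab.show()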
HW 10.4: Using the KMeans code (homegrown code) provided, repeat the experiments in HW10.3. Comment on any differences between the results in HW10.3 and HW10.4. Explain.
In [153]:
from numpy.random import rand
#Calculate which class each data point belongs to
def nearest_centroid(line):
x = np.array([float(f) for f in line.split(',')])
closest_centroid_idx = np.sum((x - centroids)**2, axis=1).argmin()
return (closest_centroid_idx,(x,1))
def error_p4(line, centroids):
point = np.array([float(f) for f in line.split(',')])
closest_centroid_idx = np.sum((point - centroids)**2, axis=1).argmin()
center = centroids[closest_centroid_idx]
return sqrt(sum([x**2 for x in (point - center)]))
K = 3
D = sc.textFile("./data.csv").cache()
numIters = [1, 10, 20, 100]
for n in numIters:
# randomly initialize centroids
centroids = rand(K, 2)*5
iter_num = 0
for i in range(n):
res = D.map(nearest_centroid).reduceByKey(lambda x,y : (x[0]+y[0],x[1]+y[1])).collect()
res = sorted(res,key = lambda x : x[0]) #sort based on cluster ID
centroids_new = np.array([x[1][0]/x[1][1] for x in res]) #divide by cluster size
if np.sum(np.absolute(centroids_new-centroids))<0.01:
break
iter_num = iter_num + 1
centroids = centroids_new
if n==1:
print("Centroids after %d iteration:" % n)
else:
print("Centroids after %d iterations:" % n)
print centroids
WSSSE = D.map(lambda line: error_p4(line, centroids)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
plot_iteration(centroids)
These results are very similar to those for problem 10.3, with the centroids converging after about 10 iterations to a WSSSE of 365.94.
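As a quick sanity check on this comparison, here is a hedged sketch that prints the two sets of centers side by side. It assumes the final homegrown centroids from the cell above and the last MLlib model clusters from HW 10.3 are both still in scope, and it aligns the clusters simply by sorting on the first coordinate.
import numpy as np

# Align the two sets of centers by sorting on the first coordinate before comparing.
mllib_centers = np.array(sorted(clusters.centers, key=lambda c: c[0]))
own_centers = np.array(sorted(centroids, key=lambda c: c[0]))

print "MLlib centers:\n", mllib_centers
print "Homegrown centers:\n", own_centers
print "Max absolute coordinate difference:", np.max(np.abs(mllib_centers - own_centers))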