In [51]:
%%javascript
/**********************************************************************************************
Known Mathjax Issue with Chrome - a rounding issue adds a border to the right of mathjax markup
https://github.com/mathjax/MathJax/issues/1300
A quick hack to fix this based on stackoverflow discussions:
http://stackoverflow.com/questions/34277967/chrome-rendering-mathjax-equations-with-a-trailing-vertical-line
**********************************************************************************************/
$('.math>span').css("border-left-color","transparent")
In [52]:
%reload_ext autoreload
%autoreload 2
Course Lead: Dr James G. Shanahan (email Jimi via James.Shanahan AT gmail.com)
Name: Your Name Goes Here
Class: MIDS w261 (Section Your Section Goes Here, e.g., Summer 2016 Group 1)
Email: Your UC Berkeley Email Goes Here@iSchool.Berkeley.edu
Week: 10
Prepare a single Jupyter notebook; please include the questions and question numbers in both the questions and the responses. Submit your homework notebook via the following form:
What is Apache Spark, and how does it differ from Apache Hadoop?
Fill in the blanks: the Spark API consists of interfaces for developing applications in Java and _____ (list the languages).
Using Spark, resource management can be done either in a single server instance or in a distributed manner using a framework such as Mesos or ?????.
What is an RDD? Show a fun example of creating one and bringing its first element back to the driver program.
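A minimal sketch of the idea, assuming a live SparkContext named `sc` (as in a pyspark shell or notebook); the list contents are just an illustration:

```python
def first_element(sc, items):
    """Parallelize a Python list into an RDD and bring its first
    element back to the driver with the .first() action."""
    rdd = sc.parallelize(items)   # distribute the list as an RDD
    return rdd.first()            # action: ships one element to the driver

# Inside a Spark session:
#   first_element(sc, ["hello", "w261", "spark"])   # -> 'hello'
```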
HW10.1 WordCount plus sorting
Back to Table of Contents
The following notebooks will be useful to jumpstart this collection of Homework exercises:
In Spark, write the code to count how often each word appears in a text document (or set of documents). Please use this homework document (with no solutions in it) as the example document for your experiment. Report the following:
OPTIONAL: Feel free to do a secondary sort where words with the same frequency are sorted alphanumerically increasing. Please refer to the following notebook for examples of secondary sorts in Spark. Please provide the following: [top 20 most frequent terms only] and [bottom 10 least frequent terms].
NOTE: [Please incorporate all referenced notebooks directly into this master notebook as cells for HW submission. I.e., HW submissions should comprise just one notebook]
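As a starting sketch: the tokenizer below (lower-casing, then splitting on runs of letters and apostrophes) is an assumption — adapt it to your own definition of a "word":

```python
import re

def tokenize(line):
    """Lower-case a line and split it into word tokens (letters and
    apostrophes only; an assumed tokenization)."""
    return re.findall(r"[a-z']+", line.lower())

def word_counts(sc, path):
    """Return an RDD of (word, count) pairs for the text file at `path`,
    sorted by decreasing frequency."""
    return (sc.textFile(path)
              .flatMap(tokenize)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda wc: -wc[1]))

# Inside a Spark session:
#   counts = word_counts(sc, "HW10.txt")
#   counts.take(20)   # top 20 most frequent terms
```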
In [53]:
## Code goes here
In [54]:
## Drivers & Runners
In [55]:
## Run Scripts, S3 Sync
HW10.1.1
Back to Table of Contents
Modify the above word count code to count only words that begin with lower-case letters (a-z) and report your findings. Again, sort the output words in decreasing order of frequency.
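One possible sketch: keep the raw tokens and filter on the first character (the whitespace tokenization here is an assumption):

```python
def starts_lowercase(word):
    """True when the token begins with a lower-case ASCII letter a-z."""
    return len(word) > 0 and 'a' <= word[0] <= 'z'

def lowercase_word_counts(sc, path):
    """(word, count) pairs for tokens beginning with a-z, sorted by
    decreasing frequency."""
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())  # whitespace tokens (an assumption)
              .filter(starts_lowercase)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .sortBy(lambda wc: -wc[1]))
```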
In [56]:
## Code goes here
In [57]:
## Drivers & Runners
In [58]:
## Run Scripts, S3 Sync
HW10.2: MLlib-centric KMeans
Back to Table of Contents
Using the following MLlib-centric KMeans code snippet:
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt

# Load and parse the data
# NOTE: kmeans_data.txt is available here:
# https://www.dropbox.com/s/q85t0ytb9apggnh/kmeans_data.txt?dl=0
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=10,
                        initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

# Save and load model
clusters.save(sc, "myModelPath")
sameModel = KMeansModel.load(sc, "myModelPath")
NOTE
The kmeans_data.txt is available here https://www.dropbox.com/s/q85t0ytb9apggnh/kmeans_data.txt?dl=0
TASKS
In [59]:
## Code goes here
In [60]:
## Drivers & Runners
In [61]:
## Run Scripts, S3 Sync
HW10.3: Homegrown KMeans in Spark
Back to Table of Contents
Download the following KMeans notebook.
Generate 3 clusters with 100 (one hundred) data points per cluster (using the code provided). Plot the data. Then run MLlib's KMeans implementation on this data and report your results as follows:
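A sketch of the shape of the experiment, with hypothetical cluster centers (the provided notebook's generator and parameters take precedence):

```python
import numpy as np

def make_clusters(n_per_cluster=100, centers=((0, 0), (5, 5), (10, 0)), seed=0):
    """Generate Gaussian 2-D clusters around the given centers.
    The centers and unit variance are illustrative assumptions."""
    rng = np.random.RandomState(seed)
    points = [rng.randn(n_per_cluster, 2) + np.array(c) for c in centers]
    return np.vstack(points)

# Inside a Spark session one could then (sketch):
#   from pyspark.mllib.clustering import KMeans
#   rdd = sc.parallelize(make_clusters())
#   model = KMeans.train(rdd, 3, maxIterations=20)
#   model.clusterCenters   # compare against the generating centers
```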
In [62]:
## Code goes here
In [63]:
## Drivers & Runners
In [64]:
## Run Scripts, S3 Sync
HW10.4: KMeans Experiments
Back to Table of Contents
Using this provided homegrown KMeans code, repeat the experiments in HW10.3. Explain any differences between the results in HW10.3 and HW10.4.
In [65]:
## Code goes here
In [66]:
## Drivers & Runners
In [67]:
## Run Scripts, S3 Sync
HW10.4.1: Making Homegrown KMeans more efficient
Back to Table of Contents
The above provided homegrown KMeans implementation is not the most efficient. How can you make it more efficient? Make this change in the code, show that it works, and comment on the gains you achieve.
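One common answer, sketched below: a homegrown loop typically recomputes the input RDD on every iteration and ships the centroid list inside every task closure. Caching the points and broadcasting the centroids addresses both. The function names and sampling seed here are illustrative, not the provided notebook's code:

```python
import numpy as np

def closest_centroid(point, centroids):
    """Index of the nearest centroid to `point` (squared Euclidean)."""
    return int(np.argmin([np.sum((point - c) ** 2) for c in centroids]))

def kmeans(sc, points_rdd, k, n_iter=10):
    """Homegrown KMeans sketch with two efficiency fixes: cache the
    points RDD (it is reused every iteration) and broadcast the
    centroids instead of capturing them in each closure."""
    points = points_rdd.cache()                 # avoid recomputing the input
    centroids = points.takeSample(False, k, seed=42)
    for _ in range(n_iter):
        b = sc.broadcast(centroids)             # ship centroids once per iteration
        sums = (points
                .map(lambda p: (closest_centroid(p, b.value), (p, 1)))
                .reduceByKey(lambda a, c: (a[0] + c[0], a[1] + c[1]))
                .collectAsMap())
        centroids = [s / n for (s, n) in sums.values()]
    return centroids
```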
In [68]:
## Code goes here
In [69]:
## Drivers & Runners
In [70]:
## Run Scripts, S3 Sync
HW10.5: OPTIONAL Weighted KMeans
Back to Table of Contents
Using this provided homegrown KMeans code, modify it to do a weighted KMeans and repeat the experiments in HW10.3. Explain any differences between the results in HW10.3 and HW10.5.
NOTE: Weight each example as follows using the inverse vector length (Euclidean norm):
weight(X) = 1/||X||,
where ||X|| = SQRT(X.X) = SQRT(X1^2 + X2^2)
and X is a vector made up of two components, X1 and X2.
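As a small concrete check of the weighting (a numpy-based sketch):

```python
import numpy as np

def weight(x):
    """Inverse Euclidean norm: weight(X) = 1 / ||X|| = 1 / sqrt(X . X)."""
    return 1.0 / np.sqrt(np.dot(x, x))

# e.g. X = (3, 4): ||X|| = 5, so weight(X) = 0.2.
# In the weighted KMeans update each centroid becomes
#   sum_i weight(x_i) * x_i / sum_i weight(x_i)
# instead of the unweighted mean, so points far from the origin count less.
```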
[Please incorporate all referenced notebooks directly into this master notebook as cells for HW submission. I.e., HW submissions should comprise just one notebook]
In [71]:
## Code goes here
In [72]:
## Drivers & Runners
In [73]:
## Run Scripts, S3 Sync
HW10.6 OPTIONAL Linear Regression
Back to Table of Contents
HW10.6.1 OPTIONAL Linear Regression
Back to Table of Contents
Using this linear regression notebook:
Generate 2 sets of data with 100 data points using the data generation code provided and plot each in separate plots. Call one the training set and the other the testing set.
Using MLlib's LinearRegressionWithSGD, train a linear regression model with the training dataset and evaluate it with the testing set. What is a good number of iterations for training the linear regression model? Justify with plots (e.g., plot MSE as a function of the number of iterations) and words.
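A sketch of the evaluation loop; the step size 0.01 and the iteration grid are assumptions to tune, and the MLlib import is deferred into the function so the helper stands alone:

```python
def mse(pairs):
    """Mean squared error over (label, prediction) pairs."""
    errs = [(y - yhat) ** 2 for (y, yhat) in pairs]
    return sum(errs) / len(errs)

def mse_vs_iterations(train_rdd, test_rdd, iteration_grid):
    """Train LinearRegressionWithSGD for each iteration count and return
    (iterations, test MSE) pairs, ready to plot MSE vs. iterations.
    train_rdd/test_rdd are RDDs of LabeledPoint."""
    from pyspark.mllib.regression import LinearRegressionWithSGD
    results = []
    for n in iteration_grid:
        model = LinearRegressionWithSGD.train(train_rdd, iterations=n, step=0.01)
        pairs = test_rdd.map(
            lambda lp: (lp.label, model.predict(lp.features))).collect()
        results.append((n, mse(pairs)))
    return results

# Inside a Spark session (sketch):
#   mses = mse_vs_iterations(train, test, [10, 50, 100, 200, 400])
```

Plotting the result (iterations on the x-axis, test MSE on the y-axis) gives the justification plot the question asks for.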
HW10.6.2 OPTIONAL Linear Regression
Back to Table of Contents
In the notebook provided above, in the cell labeled "Gradient descent (regularization)", fill in the blanks and get the code to work for LASSO and RIDGE linear regression.
Using the data from HW10.6.1, tune the hyperparameters of your LASSO and RIDGE regression. Report your findings with words and plots.
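For the "Gradient descent (regularization)" blanks, the change amounts to adding the regularizer's gradient to the MSE gradient. A sketch (the provided notebook's conventions, e.g. whether the bias term is regularized, may differ):

```python
import numpy as np

def reg_gradient(w, lam, kind):
    """Gradient of the regularization term only.
    ridge: d/dw of lam * ||w||^2   = 2 * lam * w
    lasso: d/dw of lam * ||w||_1   = lam * sign(w)  (a subgradient)"""
    if kind == "ridge":
        return 2.0 * lam * w
    if kind == "lasso":
        return lam * np.sign(w)
    raise ValueError("kind must be 'ridge' or 'lasso'")
```

For tuning, sweep the regularization strength `lam` over a grid (e.g. powers of ten) and plot test MSE against it; with MLlib, the analogous knob is `regParam` in `LassoWithSGD` / `RidgeRegressionWithSGD`.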
In [74]:
## Code goes here
In [75]:
## Drivers & Runners
In [76]:
## Run Scripts, S3 Sync
HW10.7 OPTIONAL Error surfaces
Back to Table of Contents
Here is a link to R code with one test driver that plots the linear regression model in model space and in the domain space:
https://www.dropbox.com/s/3xc3kwda6d254l5/PlotModelAndDomainSpaces.R?dl=0
Here is a sample output from this script:
https://www.dropbox.com/s/my3tnhxx7fr5qs0/image%20%281%29.png?dl=0
Please use this as inspiration and code an equivalent error surface and heatmap (with isolines) in Spark, showing the trajectory taken during learning (after each n iterations of gradient descent):
Using Spark and Python (using the above R script as inspiration), plot the error surface for the linear regression model using a heatmap and contour plot. Also plot the current model in the original domain space for every 10th iteration. Plot them side by side if possible for each iteration: the left-hand plot is the model space (w0 and w1) and the right-hand plot is the domain space (plot the corresponding model and training data in the problem domain space), with a final pair of graphs showing the entire trajectory in the model and domain spaces. Make sure to label your plots with iteration numbers, function, model space versus original domain space, MSE on the training data, etc.
Also plot the MSE as a function of the iteration number (possibly every 10th iteration). Don't forget to label both axes and to title the graph. [Please incorporate all referenced notebooks directly into this master notebook as cells for HW submission. I.e., HW submissions should comprise just one notebook]
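A numpy sketch of the error-surface computation for a one-feature model y_hat = w0 + w1*x; the plotting calls are left as comments and the names are illustrative:

```python
import numpy as np

def mse_surface(x, y, w0_grid, w1_grid):
    """MSE of the model y_hat = w0 + w1*x over a grid of (w0, w1)
    values, for a heatmap/contour plot of the error surface."""
    W0, W1 = np.meshgrid(w0_grid, w1_grid)
    # Broadcast residuals: one per training sample per grid point.
    resid = y[:, None, None] - (W0[None] + W1[None] * x[:, None, None])
    return W0, W1, (resid ** 2).mean(axis=0)

# Plot sketch (matplotlib):
#   import matplotlib.pyplot as plt
#   W0, W1, Z = mse_surface(x, y, np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
#   plt.contourf(W0, W1, Z)                  # heatmap
#   plt.contour(W0, W1, Z, colors="k")       # isolines
#   plt.plot(*zip(*trajectory), marker="o")  # gradient-descent path in model space
```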
In [77]:
## Code goes here
In [78]:
## Drivers & Runners
In [79]:
## Run Scripts, S3 Sync
In [ ]: