The purpose of this assignment is to develop some basic Spark skills. The assignment is composed of two mandatory tasks and one challenge. Only PhD students are required to complete the challenge part; however, completing it will also improve the final grade for Master's students.
Warning: all of the tasks must be solved using the Spark RDD API, so that the computations are distributed across the Spark workers.
GitHub is a web-based software development collaboration tool. It is very important for your career that you become familiar with it. This is why Q&A for this assignment will be based on GitHub issues. If you encounter any problems setting up the environment, or you need some additional pointers to solve the tasks, please open an issue here: https://github.com/SNICScienceCloud/LDSA-Spark/issues.
Before you begin the assignment, you need to set up the Docker-based Spark deployment that we saw in the lecture. You can either install the environment on the SNIC cloud or on your local computer. For Linux and Mac users, installing the environment locally should be straightforward, while on Windows machines the installation could be slightly more challenging. If you are unsure how to proceed, the best option is to deploy the environment on SNIC.
Set up the environment on the SNIC cloud (recommended)
Start an instance from TheSparkBox image, using:
- the scc.large flavor
- the Spark security group

Then, associate a floating IP to the newly created instance and SSH into it:
ssh core@<your-floating-ip>
Deploy Spark and Jupyter by running:
export SPARK_PUBLIC_DNS="<your-floating-ip>"
export TSB_JUPYTER_TOKEN="<your-password>"
tsb up -d
If everything went well, within a couple of minutes you should be able to access the web UIs:
http://<your-floating-ip>:8080 (Spark master web UI)
http://<your-floating-ip>:8888 (Jupyter)

Set up the environment locally
If you prefer to work on your local computer and you can install Docker, you can find detailed instructions to set up the environment on your machine here: https://github.com/mcapuccini/TheSparkBox. The procedure should be straightforward on Linux and Mac; however, on Windows it could be a little more challenging.
The material for the Spark module in this course is stored in this GitHub repository: https://github.com/SNICScienceCloud/LDSA-Spark. You can import it into your deployment by following these steps:
From the Jupyter home page, open a terminal (New > Terminal) and clone the course material by running:
git clone https://github.com/SNICScienceCloud/LDSA-Spark.git
If everything went well, you should be able to open this assignment as a Jupyter notebook. Please go back to the Jupyter home page and navigate to LDSA-Spark > Lab Assignment.ipynb. The LDSA-Spark folder also includes the Lecture Examples.ipynb notebook that we saw in the lecture, so you can play with it. However, keep in mind that only one notebook at a time can send tasks to the Spark master, so make sure to shut down notebooks when you are not using them.
Please complete the tasks in this assignment by editing this notebook, both for the code implementations and for the theory questions. If you are not familiar with Jupyter, you may want to take a look at the Notebook Basics tutorial. When you are done with the tasks, please save this notebook with your solutions, download it as .ipynb (File > Download as > Notebook), and upload it to the student portal.
DNA is a molecule that carries genetic information used in the growth, development, functioning and reproduction of all known living organisms (including some viruses). The DNA information is coded in a language of 4 bases: cytosine (C), guanine (G), adenine (A) and thymine (T). The percentage of G+C bases in a DNA sequence has valuable biological meaning (wikipedia: https://goo.gl/kCLvDp); hence, it is important to be able to compute it for long DNA sequences.
Given an input DNA sequence, represented as a text file (data/dna.txt), compute the percentage of g + c occurrences in it. An example follows:
Input file:
atcg
ccgg
ttat
result: $$\frac{C_{count} + G_{count}}{C_{count} + G_{count} + A_{count} + T_{count}} = \frac{6}{12} = 0.5$$
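As a starting hint (just a sketch, not the full solution; sc.textFile is the standard RDD entry point for text files, and data/dna.txt is the path given above):

// Load the DNA file: each line becomes one String record in the RDD
val dnaLines = sc.textFile("data/dna.txt")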
Tip 1: when you load an input file as an RDD, each line will be loaded into a distinct string RDD record. In Scala, you can count the occurrences of a certain character in a string as follows:
In [3]:
"atttccgg".count(c => c == 'g')
Out[3]:
Question 1: Is the previous operation computed in parallel, or is it computed locally? Why?
Answer goes here
Tip 2: sums from different RDD records can be aggregated using the RDD reduce method. An example follows:
In [4]:
val sumsRDD = sc.parallelize(Array(3,5,2))
sumsRDD.reduce(_+_)
Out[4]:
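Putting the two tips together (a toy sketch on made-up strings, not the solution for the DNA file): per-record counts obtained with map can be summed with reduce.

// Count the total number of 'a' characters across all records of a toy RDD
val toy = sc.parallelize(Array("abc", "banana", "spark"))
toy.map(line => line.count(_ == 'a')).reduce(_ + _) // 1 + 3 + 1 = 5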
In [ ]:
{
// Implementation goes here
}
Question 2: What is an RDD in Spark?
Answer goes here
Question 3: Is reduce a transformation or an action? What are RDD transformations and RDD actions? How do they differ from each other?
Answer goes here
Question 4: What are some of the advantages of Spark over Hadoop (and MapReduce)?
Answer goes here
The Monte Carlo integration method is a way to get an approximation of the definite integral of a function $f(x)$ over an interval $[A,B]$. Given a value $Max_{f(x)}$ which $f(x)$ never exceeds, we first randomly draw $N$ uniformly distributed points $(x_1,y_1) \dots (x_N,y_N)$ such that $x_1 \dots x_N \in [A,B]$ and $y_1 \dots y_N \in [0,Max_{f(x)}]$. Then, assuming that $f(x)$ is positive over $[A,B]$, the fraction of points that fall under $f(x)$ will be roughly equal to the area under the curve divided by the total area of the rectangle in which we randomly drew the points. Hence, the definite integral of $f(x)$ over $[A,B]$ is roughly equal to:
$$(B-A)\, Max_{f(x)}\, \frac{n_{P}}{tot_{P}},$$ where $n_{P}$ is the number of points that fell under $f(x)$, and $tot_{P}$ is the total number of randomly drawn points.
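To make the method concrete, here is a minimal sketch of Monte Carlo integration with the RDD API, applied to a different, simpler function than the one in the task below: $g(x) = x^2$ over $[0,1]$, which never exceeds $1$ and whose exact integral is $1/3$.

// Monte Carlo sketch for g(x) = x^2 over [0,1] (Max = 1), NOT the f(x) of the task
val N = 1000                                 // number of random points
val under = sc.parallelize(1 to N).map { _ =>
  val x = math.random                        // x uniform in [A,B] = [0,1]
  val y = math.random                        // y uniform in [0,Max] = [0,1]
  if (y <= x * x) 1 else 0                   // 1 if the point falls under g(x)
}.reduce(_ + _)
val estimate = (1.0 - 0.0) * 1.0 * under / N // (B-A) * Max * n_P / tot_P, roughly 0.33

The same structure applies to your own $f(x)$, with the appropriate interval and maximum value plugged in.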
Write a program in Spark to approximate the definite integral of $f(x) = \frac{1 + \sin(x)}{\cos(x)}$ over $[0,1]$. This function is positive and stays below $4$ over $[0,1]$.
For the purpose of this assignment, drawing 1000 points is good enough.
Question: What does Spark's parallelize function do? What is it good for?
Answer goes here
In [ ]:
{
// Implementation goes here
}
The well-known Iris flower dataset (wiki: https://goo.gl/OQjope) contains measurements for 150 Iris flowers from 3 different species: Iris setosa, Iris virginica and Iris versicolor. For each example in the dataset, 4 measurements were performed: sepal length, sepal width, petal length and petal width.
You are given a copy of this dataset in Comma Separated Values (CSV) format: data/iris_csv.txt. In the CSV text file, each line contains the sepal length, sepal width, petal length, petal width and species name, separated by commas. The file looks something like the following example:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
Task 1: Build and evaluate a 3NN classifier for the Iris flower dataset in Spark. In order to evaluate your classifier, you can hold out 20% of the data for testing, as we did in the lecture examples. For simplicity, you are allowed to collect the test data.
N.B. Part of the challenge is to figure out how to parse the input dataset into an RDD. Google is your friend!
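For the 80/20 split mentioned in the task, one option (a sketch under assumptions: randomSplit is part of the RDD API, and the seed value here is arbitrary) is:

// Split an RDD into ~80% training and ~20% test data
// Here the split is done on the raw lines; you may prefer to parse the records first
val rawLines = sc.textFile("data/iris_csv.txt")
val Array(trainData, testData) = rawLines.randomSplit(Array(0.8, 0.2), seed = 42L)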
In [ ]:
{
// Implementation goes here
}