Spark on Cloud

How to set up and run Spark on Azure or AWS EC2 clusters.

Azure

Follow instructions provided by Microsoft.

To terminate the cluster, you have to delete it.

AWS

AWS setup is more involved. We will show how to access pyspark via ssh to an EMR cluster, as well as how to set up the Zeppelin browser-based notebook (similar to Jupyter).

References

Know your AWS public and private access keys

These will look something like

public: AKIAIOSFODNN7EXAMPLE
private: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Know your AWS EC2 key-pair

This is a name that you give - mine is cliburn-2016 and an associated PEM file - I keep mine at ~/AWS/cliburn-2016.pem.

Set the correct permissions on the PEM file.

chmod 400 xxx.pem

Install AWS command line client

pip install awscli

If you run into problems, see docs

Configure the AWS command line client

aws configure

AWS Access Key ID: <<Your public access key>>
AWS Secret Access Key: <<Your private access key>>
Default region name: us-east-1
Default output format: json

Create a cluster

Warning: You will be charged for this.

aws emr create-cluster --name "<<NAME-FOR-CLUSTER>>" --release-label  emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox  --ec2-attributes KeyName=<<Your key-pair>>> --instance-type m3.xlarge --instance-count 3 --use-default-roles

For example, I start mine with

aws emr create-cluster --name "spak-2016-d" --release-label    emr-4.5.0 --applications Name=Spark Name=Zeppelin-Sandbox  --ec2-attributes KeyName="cliburn-2016"  --instance-type m3.xlarge --instance-count 3 --use-default-role

A cluster-id should be returned

{
    "ClusterId": "j-XXXXXXXXXXXXXXX"
}

Get information about the cluster

aws emr describe-cluster --cluster-id -XXXXXXXXXXXXXXX

or just inspect the state

aws emr describe-cluster --cluster-id -XXXXXXXXXXXXXXX | grep \"State\"

Connect to the cluster via `ssh`

aws emr ssh --cluster-id -XXXXXXXXXXXXXXX --key-pair-file cliburn-2016.pem

Note the IP address that is returned

It will be something like ec2-XX-X-XX-XXX.compute-1.amazonaws.com

Run `pyspark`

Run

pyspark

And you will be in a pyspark console where you can issue Spark commands.

When you've had enough fun playing in pyspark for a while, end the session with Ctrl-D and exit to leave the ssh session.

Run the `Zepellin` notebook

Create an SSH tunnel to port 8890

ssh -i xxx.pem -L 8192:ec2-xx-xx-xx.compute-1.amazonaws.com:8192 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com -N -v

Fill in the xxx with the locatin of your PEM file, and the appropriate IP address.

Connect to `Zeppelin` notebook

Open a browser to http://localhost:8890/ - if it worked you should see this

Create notebook and run Spark within it

The default cell uses scala. For pyspark just start a cell with %pyspark.

Terminate the cluster

When you are done, remember to terminate the cluster!

aws emr terminate-clusters --cluster-id j-XXXXXXXXXXXXXXX

and confirm that it is terminating

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXXXX | grep \"State\"

You should see

                    "State": "TERMINATING"
                    "State": "TERMINATING"
            "State": "TERMINATING"

If you are paranoid, log into the AWS Management Console and click on Services | EMR and check the status of your cluster.



In [ ]: