Spark

References:

  • https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873
  • https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16
  • https://becominghuman.ai/real-world-python-workloads-on-spark-standalone-clusters-2246346c7040
  • https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec
  • https://spark.apache.org/docs/1.6.2/ec2-scripts.html
  • https://aws.amazon.com/big-data/what-is-spark/
  • https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html
  • https://stackoverflow.com/questions/38611573/how-to-launch-spark-2-0-on-ec2
  • https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
  • https://towardsdatascience.com/clean-up-your-own-model-data-without-leaving-jupyter-bdbcc9001734
  • https://medium.com/jbennetcodes/how-to-get-rid-of-loops-and-use-window-functions-in-pandas-or-spark-sql-907f274850e4

$ conda install -c conda-forge awscli
$ aws --version
aws-cli/1.16.161 Python/3.7.3 Linux/4.18.0-20-generic botocore/1.12.151

Then go to the EMR console and pick the cheapest compute-optimized instance type available in your region; mine is c5.xlarge.

Region codes and endpoints are listed at https://docs.aws.amazon.com/general/latest/gr/rande.html#apigateway_region. I am using the Stockholm region (eu-north-1).

Next you need access keys for an IAM user with the appropriate permissions. Open https://console.aws.amazon.com/iam/ and follow the instructions there.

$ aws configure
AWS Access Key ID [None]: AKIA****************
AWS Secret Access Key [None]:....
Default region name [None]: eu-north-1
Default output format [None]: json
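aws configure stores these under ~/.aws. The resulting files should look roughly like this (standard awscli layout; the values are of course your own):

# ~/.aws/credentials
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = ....

# ~/.aws/config
[default]
region = eu-north-1
output = json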

$ aws emr create-default-roles
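This should create the two roles EMR needs: EMR_DefaultRole (the service role) and EMR_EC2_DefaultRole (the instance profile the cluster nodes run under). You can verify both in the IAM console.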

$ aws ec2 describe-subnets \
>      --filters "Name=availabilityZone,Values=eu-north-1"
{
    "Subnets": []
}

The empty result is because the filter expects a full availability-zone name (eu-north-1a, eu-north-1b, ...), not the bare region; the documented filter name is availability-zone:

$ aws ec2 describe-subnets \
>      --filters "Name=availability-zone,Values=eu-north-1a,eu-north-1b,eu-north-1c"

Once a cluster is up, a Python job can be submitted as a step like this (generic example; note that the Jar is a plain s3:// path and a Name containing a space must be quoted):

aws emr add-steps --cluster-id <your-cluster-job-id> --steps Name="Python Job",Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,s3://my_bucket/path/pythonjob.py,<comma separated list of arguments for your app>],ActionOnFailure=CONTINUE

Now create the cluster:

$ aws emr create-cluster \
    --name "sparkclust" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=spark_keypair \
    --instance-groups \
        Name=EmrMaster,InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c5.xlarge \
        Name=EmrCore,InstanceGroupType=CORE,InstanceCount=2,InstanceType=c5.xlarge \
    --use-default-roles
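If you prefer to script the wait instead of polling the console, here is a minimal boto3 sketch (assumptions: boto3 is installed and picks up the same credentials as awscli; replace the cluster id with the one create-cluster returns):

import boto3

emr = boto3.client("emr", region_name="eu-north-1")
cluster_id = "j-XXXXXXXXXXXXX"  # assumption: the id printed by create-cluster

# block until the cluster is up and ready to accept work
emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)

# same information as 'aws emr describe-cluster'
cluster = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]
print(cluster["Status"]["State"], cluster["MasterPublicDnsName"])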

$ aws emr ssh --cluster-id j-3H0XTRI8687P2 --key-pair-file /home/sergiu/Downloads/spark_keypair.pem

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html

$ aws emr list-clusters
$ aws emr describe-cluster --cluster-id j-3H0XTRI8687P2

Look for "MasterPublicDnsName" in the output ("ec2-13-48-55-199.eu-north-1.compute.amazonaws.com" in my case).

Before you can ssh to the master node you have to allow inbound SSH from "My IP" (your computer's IP) in the security groups of both the master and the core nodes (see "Security groups for Master" on the cluster's Summary page).

$ ssh hadoop@ec2-13-48-55-199.eu-north-1.compute.amazonaws.com -i /home/sergiu/Downloads/spark_keypair.pem
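You should land in a shell on the master node; with the Spark application selected at cluster creation, spark-submit and the pyspark shell are already installed and on the PATH there.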


Save the following script as pitest.py on the master node:
# adapted from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
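Run it with spark-submit; the optional argument is the number of partitions to spread the sampling over (e.g. spark-submit pitest.py 10). Each sample draws a random point in the 2x2 square and checks whether it falls inside the unit circle, so count/n approximates pi/4.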

You can run cluster jobs interactively by SSH'ing into the master node, but there are many other ways to run your jobs on the Spark cluster. Complete freedom!

$ ls /usr/lib/spark/python/lib/
py4j-0.10.7-src.zip  PY4J_LICENSE.txt  py4j-src.zip  pyspark.zip

Point PYTHONPATH at these so that a plain python interpreter can import pyspark (add the exports to ~/.bashrc and source it if you want them to persist across logins):

[hadoop@ip-172-31-16-184 ~]$ export SPARK_HOME=/usr/lib/spark
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ source ~/.bashrc

$ python pitest.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/18 17:30:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:>                                                          (0 + 0) / 2]19/05/18 17:30:19 WARN TaskSetManager: Stage 0 contains a task of very large size (371 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.142540

You can also submit steps via the awscli and monitor them in the EMR console:

$ aws emr add-steps \
--cluster-id j-3H0XTRI8687P2 \
--steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit","/home/hadoop/pitest.py"]
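The same step can be submitted and waited on from Python; a boto3 sketch mirroring the CLI call above (same cluster id):

import boto3

emr = boto3.client("emr", region_name="eu-north-1")

# submit the same command-runner step as the CLI call above
response = emr.add_job_flow_steps(
    JobFlowId="j-3H0XTRI8687P2",
    Steps=[{
        "Name": "Spark Program",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "/home/hadoop/pitest.py"],
        },
    }],
)
step_id = response["StepIds"][0]

# block until the step finishes; raises WaiterError if it fails
emr.get_waiter("step_complete").wait(ClusterId="j-3H0XTRI8687P2", StepId=step_id)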

SUPER IMPORTANT STEP: terminate your cluster when you are done, otherwise it keeps running and billing you:

$ aws emr terminate-clusters --cluster-ids j-3H0XTRI8687P2

Note: verify at https://eu-north-1.console.aws.amazon.com/elasticmapreduce/ that your cluster really was terminated!
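Termination can also be confirmed from Python (same boto3 assumptions as above):

import boto3

emr = boto3.client("emr", region_name="eu-north-1")

# block until the cluster is gone, then print its final state
emr.get_waiter("cluster_terminated").wait(ClusterId="j-3H0XTRI8687P2")
print(emr.describe_cluster(ClusterId="j-3H0XTRI8687P2")["Cluster"]["Status"]["State"])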

Task:

  • Log the result in an s3 bucket (see the sketch after this list).
  • Try to load a data set and perform a basic ML task!
  • Configure JupyterHub via EMR, load giant pyspark steps from the comfort of your phone's web browser. Profit! ;)
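For the first task, a minimal sketch: append something like this to pitest.py just before spark.stop(). The bucket name my-spark-results is a placeholder for a bucket you own; on EMR, s3:// paths work out of the box via EMRFS, and saveAsTextFile fails if the destination prefix already exists.

    result = "Pi is roughly %f" % (4.0 * count / n)
    # write the result as a one-line text file to S3 (bucket name is hypothetical)
    spark.sparkContext.parallelize([result], 1) \
        .saveAsTextFile("s3://my-spark-results/pi-estimate")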