Spark

Useful links:

  • https://towardsdatascience.com/a-brief-introduction-to-pyspark-ff4284701873
  • https://medium.com/big-data-on-amazon-elastic-mapreduce/run-a-spark-job-within-amazon-emr-in-15-minutes-68b02af1ae16
  • https://becominghuman.ai/real-world-python-workloads-on-spark-standalone-clusters-2246346c7040
  • https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec
  • https://spark.apache.org/docs/1.6.2/ec2-scripts.html
  • https://aws.amazon.com/big-data/what-is-spark/
  • https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html
  • https://stackoverflow.com/questions/38611573/how-to-launch-spark-2-0-on-ec2
  • https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297
  • https://towardsdatascience.com/clean-up-your-own-model-data-without-leaving-jupyter-bdbcc9001734
  • https://medium.com/jbennetcodes/how-to-get-rid-of-loops-and-use-window-functions-in-pandas-or-spark-sql-907f274850e4

$ conda install -c conda-forge awscli
$ aws --version
aws-cli/1.16.161 Python/3.7.3 Linux/4.18.0-20-generic botocore/1.12.151

Then go to EMR and pick the cheapest compute-optimized instance type available in your region; mine is c5.xlarge.

The region codes (including Stockholm) are listed at https://docs.aws.amazon.com/general/latest/gr/rande.html#apigateway_region. I am using the Stockholm region (eu-north-1).

Next you need access keys for the IAM roles. Open https://console.aws.amazon.com/iam/ and follow the instructions there to create an access key.

$ aws configure
AWS Access Key ID [None]: AKIAYGDZBRYTTYE5Y2N6
AWS Secret Access Key [None]:....
Default region name [None]: eu-north-1
Default output format [None]: json

$ aws emr create-default-roles
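
This creates the default roles EMR_DefaultRole and EMR_EC2_DefaultRole. You can verify that they exist with (a quick sketch):

$ aws iam get-role --role-name EMR_DefaultRole
$ aws iam get-role --role-name EMR_EC2_DefaultRole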

$ aws ec2 describe-subnets \
>      --filters "Name=availabilityZone,Values=eu-north-1"
{
    "Subnets": []
}
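
If the list comes back empty like this, one likely reason is that the filter expects a specific Availability Zone rather than the region name. A sketch of the corrected call (substitute an AZ that exists in your account):

$ aws ec2 describe-subnets \
>      --filters "Name=availability-zone,Values=eu-north-1a"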

A generic template for submitting a Python job as an EMR step looks like this (placeholders left as-is):

aws emr add-steps --cluster-id <your-cluster-job-id> \
  --steps Name="Python Job",Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,s3://my_bucket/path/pythonjob.py,<comma separated list of arguments for your app>],ActionOnFailure=CONTINUE

Now create the cluster:

aws emr create-cluster \
    --name "sparkclust" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=spark_keypair \
    --instance-groups \
        Name=EmrMaster,InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c5.xlarge \
        Name=EmrCore,InstanceGroupType=CORE,InstanceCount=2,InstanceType=c5.xlarge \
    --use-default-roles
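
KeyName=spark_keypair must refer to an existing EC2 key pair in the same region. If you don't have one yet, it can be created from the CLI (a sketch; spark_keypair is simply the name used above):

$ aws ec2 create-key-pair --key-name spark_keypair \
    --query 'KeyMaterial' --output text > spark_keypair.pem
$ chmod 400 spark_keypair.pem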

aws emr ssh --cluster-id j-3H0XTRI8687P2 --key-pair-file /home/sergiu/Downloads/spark_keypair.pem

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html

$ aws emr list-clusters
$ aws emr describe-cluster --cluster-id j-3H0XTRI8687P2

Look for the "MasterPublicDnsName" field, e.g. "ec2-13-48-55-199.eu-north-1.compute.amazonaws.com".
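
You can also extract it directly with a --query expression instead of scanning the JSON (a sketch):

$ aws emr describe-cluster --cluster-id j-3H0XTRI8687P2 \
    --query 'Cluster.MasterPublicDnsName' --output text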

Before you can SSH to the master node you have to allow inbound SSH from "My IP" (your computer's IP) in the security groups of both the master and the slave nodes (see "Security groups for Master" in the cluster's Summary tab).
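
The same rule can be added from the CLI; this is only a sketch, and both the security group ID and your public IP are placeholders to fill in:

$ aws ec2 authorize-security-group-ingress \
    --group-id <master-or-slave-security-group-id> \
    --protocol tcp --port 22 \
    --cidr <your-ip>/32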

ssh hadoop@ec2-13-48-55-199.eu-north-1.compute.amazonaws.com -i /home/sergiu/Downloads/spark_keypair.pem


In [ ]:
# used from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py

from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

In [ ]:
%%bash

$ ls /usr/lib/spark/python/lib/
py4j-0.10.7-src.zip  PY4J_LICENSE.txt  py4j-src.zip  pyspark.zip
[hadoop@ip-172-31-16-184 ~]$ export SPARK_HOME=/usr/lib/spark
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ source ~/.bashrc

$ python pitest.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/18 17:30:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:>                                                          (0 + 0) / 2]19/05/18 17:30:19 WARN TaskSetManager: Stage 0 contains a task of very large size (371 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.142540        
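
The same script can also be launched through spark-submit on the master instead of plain python (a sketch; the trailing 10 is the optional partitions argument):

$ spark-submit --master yarn --deploy-mode client /home/hadoop/pitest.py 10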


aws emr add-steps \
--cluster-id j-3H0XTRI8687P2 \
--steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit","/home/hadoop/pitest.py"]
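
The step above assumes pitest.py is already on the master node; one way to get it there is scp with the same key pair (a sketch reusing the paths from earlier):

$ scp -i /home/sergiu/Downloads/spark_keypair.pem pitest.py \
    hadoop@ec2-13-48-55-199.eu-north-1.compute.amazonaws.com:/home/hadoop/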

Task:

  • Log the result in an S3 bucket (a minimal sketch follows after this list).
  • Try to load a data set and perform a basic ML task!
  • Configure JupyterHub via EMR, load giant pyspark steps from the comfort of your phone's web browser. Profit! ;)
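
For the first task, here is a minimal PySpark sketch that writes the estimate to S3 instead of printing it; "my_bucket" is a placeholder for a bucket you create yourself, and the EMR instance role needs write access to it:

In [ ]:
# minimal sketch: same Pi estimate as above, but the result is written to S3
# instead of printed; "my_bucket" is a placeholder bucket name
from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPiToS3").getOrCreate()

partitions = 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
result = "Pi is roughly %f" % (4.0 * count / n)

# a one-element RDD becomes a small text file under the given S3 prefix
spark.sparkContext.parallelize([result], 1).saveAsTextFile("s3://my_bucket/pi-result")

spark.stop()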