$ conda install -c conda-forge awscli
$ aws --version
aws-cli/1.16.161 Python/3.7.3 Linux/4.18.0-20-generic botocore/1.12.151
Then go to EMR and pick the cheapest compute-optimized instance type available in your region; mine is c5.xlarge.
I am using the Stockholm region (eu-north-1); the endpoint settings for each region are listed at https://docs.aws.amazon.com/general/latest/gr/rande.html#apigateway_region
Next you need access keys for your IAM user. Open https://console.aws.amazon.com/iam/ and follow the instructions there for creating access keys.
$ aws configure
AWS Access Key ID [None]: AKIAYGDZBRYTTYE5Y2N6
AWS Secret Access Key [None]:....
Default region name [None]: eu-north-1
Default output format [None]: json
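If you also plan to drive AWS from Python, a quick way to check that the credentials you just configured are picked up is an STS get-caller-identity call. This is an optional sketch using boto3, which is not otherwise required for this lab; it reuses the credentials and region from aws configure.

import boto3

# prints the account id and the ARN of the user whose keys were configured above
ident = boto3.client("sts").get_caller_identity()
print(ident["Account"])
print(ident["Arn"])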
$ aws emr create-default-roles
$ aws ec2 describe-subnets \
> --filters "Name=availabilityZone,Values=eu-north-1"
{
    "Subnets": []
}
Note that the availabilityZone filter expects a full zone name (e.g. eu-north-1a) or a wildcard such as eu-north-1*; filtering on the bare region name matches nothing, which is most likely why the list above is empty.
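The same lookup can be done from Python, again assuming boto3 is installed; here the filter values are full availability-zone names:

import boto3

ec2 = boto3.client("ec2", region_name="eu-north-1")
resp = ec2.describe_subnets(
    Filters=[{"Name": "availability-zone",
              "Values": ["eu-north-1a", "eu-north-1b", "eu-north-1c"]}]
)
for subnet in resp["Subnets"]:
    print(subnet["SubnetId"], subnet["AvailabilityZone"], subnet["CidrBlock"])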
Later you will be able to submit Python jobs to the cluster as steps, for example:

aws emr add-steps --cluster-id <your-cluster-job-id> --steps Name="Python Job",Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,s3://my_bucket/path/pythonjob.py,<comma separated list of arguments for your app>],ActionOnFailure=CONTINUE
Now create the cluster:
aws emr create-cluster \
    --name "sparkclust" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=spark_keypair \
    --instance-groups \
        Name=EmrMaster,InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c5.xlarge \
        Name=EmrCore,InstanceGroupType=CORE,InstanceCount=2,InstanceType=c5.xlarge \
    --use-default-roles
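If you prefer to script cluster creation from Python instead of the shell, a rough boto3 equivalent of the command above could look like the sketch below. boto3 is not used elsewhere in this lab, so treat this as optional; the role names are the defaults created by aws emr create-default-roles.

import boto3

emr = boto3.client("emr", region_name="eu-north-1")
response = emr.run_job_flow(
    Name="sparkclust",
    ReleaseLabel="emr-5.23.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "Ec2KeyName": "spark_keypair",
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up after steps finish
        "InstanceGroups": [
            {"Name": "EmrMaster", "InstanceRole": "MASTER",
             "InstanceCount": 1, "InstanceType": "c5.xlarge"},
            {"Name": "EmrCore", "InstanceRole": "CORE",
             "InstanceCount": 2, "InstanceType": "c5.xlarge"},
        ],
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default roles from create-default-roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # the j-... cluster id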
aws emr ssh --cluster-id j-3H0XTRI8687P2 --key-pair-file /home/sergiu/Downloads/spark_keypair.pem
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
$ aws emr list-clusters
$ aws emr describe-cluster --cluster-id j-3H0XTRI8687P2
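If you would rather wait for the cluster programmatically than poll list-clusters by hand, boto3 ships a cluster_running waiter. This is again an optional Python sketch, not part of the original workflow:

import boto3

emr = boto3.client("emr", region_name="eu-north-1")
# blocks until the cluster reaches a running/waiting state (or raises if it fails)
emr.get_waiter("cluster_running").wait(ClusterId="j-3H0XTRI8687P2")

cluster = emr.describe_cluster(ClusterId="j-3H0XTRI8687P2")["Cluster"]
print(cluster["Status"]["State"])
print(cluster["MasterPublicDnsName"])  # the host you will ssh to below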
Before you can SSH to the master node you have to allow inbound SSH from "My IP" (your computer's IP) in both the master and slave security groups (see Summary: Security groups for Master in the EMR console).
ssh hadoop@ec2-13-48-55-199.eu-north-1.compute.amazonaws.com -i /home/sergiu/Downloads/spark_keypair.pem
Save the following script as pitest.py on the master node (for example in /home/hadoop):
# used from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession
if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
You can run cluster jobs interactively by SSHing to the master node, but there are many other ways to run your jobs on the Spark cluster. Complete freedom!
$ ls /usr/lib/spark/python/lib/
py4j-0.10.7-src.zip PY4J_LICENSE.txt py4j-src.zip pyspark.zip
Set SPARK_HOME and PYTHONPATH on the master node so that plain python can find the bundled PySpark (add the exports to ~/.bashrc if you want them to persist across sessions):

[hadoop@ip-172-31-16-184 ~]$ export SPARK_HOME=/usr/lib/spark
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ source ~/.bashrc
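A quick sanity check that the environment variables actually point Python at the bundled PySpark (if the import fails, re-check the paths above):

# run on the master node after exporting SPARK_HOME and PYTHONPATH
import pyspark
print(pyspark.__version__)  # should print the Spark version bundled with this EMR release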
$ python pitest.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/18 17:30:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:> (0 + 0) / 2]19/05/18 17:30:19 WARN TaskSetManager: Stage 0 contains a task of very large size (371 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.142540
You can also add steps via the AWS CLI and monitor them in the EMR console:
$ aws emr add-steps \
--cluster-id j-3H0XTRI8687P2 \
--steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit",/home/hadoop/pitest.py]
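The same step can be added from Python with boto3's add_job_flow_steps; this is an optional sketch equivalent to the CLI call above:

import boto3

emr = boto3.client("emr", region_name="eu-north-1")
resp = emr.add_job_flow_steps(
    JobFlowId="j-3H0XTRI8687P2",
    Steps=[{
        "Name": "Spark Program",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "/home/hadoop/pitest.py"],
        },
    }],
)
print(resp["StepIds"])  # use these ids to monitor the step in the console or with describe-step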
aws emr terminate-clusters --cluster-ids j-3H0XTRI8687P2
Note: verify on https://eu-north-1.console.aws.amazon.com/elasticmapreduce/ that your cluster was terminated properly!
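If you script the teardown, you can also confirm termination from Python instead of the console (boto3 sketch, assuming the same cluster id):

import boto3

emr = boto3.client("emr", region_name="eu-north-1")
emr.terminate_job_flows(JobFlowIds=["j-3H0XTRI8687P2"])
# wait until termination completes, then print the final state (should be TERMINATED)
emr.get_waiter("cluster_terminated").wait(ClusterId="j-3H0XTRI8687P2")
print(emr.describe_cluster(ClusterId="j-3H0XTRI8687P2")["Cluster"]["Status"]["State"])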
Task: