First install the AWS command-line interface and check that it runs:
$ conda install -c conda-forge awscli
$ aws --version
aws-cli/1.16.161 Python/3.7.3 Linux/4.18.0-20-generic botocore/1.12.151
Then go to EMR and pick the cheapest compute-optimized instance type available in your region; mine is c5.xlarge.
I am using the Stockholm region (eu-north-1); the regional endpoints are listed at https://docs.aws.amazon.com/general/latest/gr/rande.html#apigateway_region
Next you need access keys for an IAM user. Open https://console.aws.amazon.com/iam/, follow the instructions there to create an access key, and then configure the CLI:
$ aws configure
AWS Access Key ID [None]: AKIAYGDZBRYTTYE5Y2N6
AWS Secret Access Key [None]:....
Default region name [None]: eu-north-1
Default output format [None]: json
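To confirm the credentials work, you can ask STS who you are; it returns your account id and user ARN:
$ aws sts get-caller-identity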
$ aws emr create-default-roles
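This creates the two IAM roles EMR expects, EMR_DefaultRole (for the service) and EMR_EC2_DefaultRole (for the instances); you can verify they exist with, for example:
$ aws iam get-role --role-name EMR_DefaultRole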
Note that the availability-zone filter expects a zone name (eu-north-1a, eu-north-1b, ...), not the bare region name, which is why the query below comes back empty:
$ aws ec2 describe-subnets \
>     --filters "Name=availability-zone,Values=eu-north-1"
{
    "Subnets": []
}
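The actual zone names in a region can be listed with:
$ aws ec2 describe-availability-zones --region eu-north-1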
Once the cluster is up (see below), a Python job can be submitted as a step like this:
aws emr add-steps --cluster-id <your-cluster-job-id> --steps Name="Python Job",Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,s3://my_bucket/path/pythonjob.py,<comma separated list of arguments for your app>],ActionOnFailure=CONTINUE
Now create the cluster:
aws emr create-cluster \
    --name "sparkclust" \
    --release-label emr-5.23.0 \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=spark_keypair \
    --instance-groups \
        Name=EmrMaster,InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c5.xlarge \
        Name=EmrCore,InstanceGroupType=CORE,InstanceCount=2,InstanceType=c5.xlarge \
    --use-default-roles
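create-cluster prints the new cluster id (j-...). Provisioning takes a few minutes; the CLI can block until the cluster is ready:
$ aws emr wait cluster-running --cluster-id j-3H0XTRI8687P2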
aws emr ssh --cluster-id j-3H0XTRI8687P2 --key-pair-file /home/sergiu/Downloads/spark_keypair.pem
Connecting to the master node over SSH is documented at: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
$ aws emr list-clusters
$ aws emr describe-cluster --cluster-id j-3H0XTRI8687P2
Before you can ssh to the master node you have to allow inbound SSH from "My IP" (your computer's IP) in the security groups of both the master and the core (slave) nodes (see the cluster's Summary tab, under "Security groups for Master").
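The same inbound rule can be added from the CLI; the security-group id below is a hypothetical placeholder (copy yours from the cluster Summary) and 203.0.113.7 stands in for your own IP:
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
>     --protocol tcp --port 22 --cidr 203.0.113.7/32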
ssh hadoop@ec2-13-48-55-199.eu-north-1.compute.amazonaws.com -i /home/sergiu/Downloads/spark_keypair.pem
In [ ]:
# used from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
from __future__ import print_function

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Monte Carlo estimate: sample points in the 2x2 square; the fraction
    # landing inside the unit circle approximates pi/4.
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
The next commands are run on the master node over SSH, not in the notebook. Make PySpark importable from plain Python by pointing PYTHONPATH at Spark's bundled libraries; put the exports in ~/.bashrc so they persist (sourcing ~/.bashrc only helps once the exports are in that file):
$ ls /usr/lib/spark/python/lib/
py4j-0.10.7-src.zip  PY4J_LICENSE.txt  py4j-src.zip  pyspark.zip
[hadoop@ip-172-31-16-184 ~]$ export SPARK_HOME=/usr/lib/spark
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
[hadoop@ip-172-31-16-184 ~]$ source ~/.bashrc
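A quick import check confirms the paths are set up correctly (it should print the bundled Spark version):
$ python -c 'import pyspark; print(pyspark.__version__)'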
$ python pitest.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/05/18 17:30:03 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:> (0 + 0) / 2]19/05/18 17:30:19 WARN TaskSetManager: Stage 0 contains a task of very large size (371 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.142540
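Plain python works here only because of the PYTHONPATH exports above; the more conventional route is spark-submit, e.g. with 10 partitions:
$ spark-submit --master yarn pitest.py 10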
aws emr add-steps \
--cluster-id j-3H0XTRI8687P2 \
--steps Type=CUSTOM_JAR,Name="Spark Program",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=["spark-submit","/home/hadoop/pitest.py"]
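You can follow the step's progress with list-steps, and when you are finished remember to terminate the cluster so it stops accruing charges:
$ aws emr list-steps --cluster-id j-3H0XTRI8687P2
$ aws emr terminate-clusters --cluster-ids j-3H0XTRI8687P2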
Task: