Data preparation using Hadoop MapReduce on YARN with Cloud Dataproc
Cloud Dataproc, GCP, Cloud Storage, Hadoop, YARN, Apache, MapReduce
A Kubeflow Pipeline component to prepare data by submitting an Apache Hadoop MapReduce job on Apache Hadoop YARN to Cloud Dataproc.
Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a Kubeflow Pipeline.
Argument | Description | Optional | Data type | Accepted values | Default |
---|---|---|---|---|---|
project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | ||
region | The Dataproc region to handle the request. | No | GCPRegion | ||
cluster_name | The name of the cluster to run the job. | No | String | ||
main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. | No | List | ||
main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in hadoop_job.jarFileUris. | No | String | ||
args | The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |
hadoop_job | The payload of a HadoopJob. | Yes | Dict | | None |
job | The payload of a Dataproc job. | Yes | Dict | | None |
wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |
Note: main_jar_file_uri: Example file URIs:
gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar
hdfs:/tmp/test-samples/custom-wordcount.jar
file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar
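The example URIs above use the gs, hdfs, and file schemes. As a minimal sketch, a pipeline author could check a URI before passing it as main_jar_file_uri. The helper name looks_like_example_jar_uri and the set of schemes are assumptions for illustration, taken only from the examples above; Dataproc may accept other schemes.

```python
from urllib.parse import urlparse

# Schemes seen in the example URIs above. This set is an assumption,
# not an exhaustive list of what Dataproc accepts.
EXAMPLE_SCHEMES = {"gs", "hdfs", "file"}


def looks_like_example_jar_uri(uri):
    """Return True if the URI uses one of the example schemes and ends in .jar."""
    parsed = urlparse(uri)
    return parsed.scheme in EXAMPLE_SCHEMES and parsed.path.endswith(".jar")
```

A bare local path such as /tmp/job.jar has no scheme, so this check rejects it.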
Name | Description | Type |
---|---|---|
job_id | The ID of the created job. | String |
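The wait_interval argument sets the pause between polls while the job runs. The loop below is a hedged sketch of that kind of polling, not the component's actual implementation: wait_for_job and the get_state callable are hypothetical names, and the terminal states are the DONE, ERROR, and CANCELLED values of the Dataproc job status.

```python
import time

# Dataproc job states after which no further polling is useful.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}


def wait_for_job(get_state, wait_interval=30, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state.

    get_state is a hypothetical callable that returns the current job
    state as a string; sleep is injectable so the loop can be tested.
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Pause before asking again, as wait_interval describes.
        sleep(wait_interval)
```

Passing sleep as a parameter lets you exercise the loop without actually waiting.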
To use the component, the account that runs it must have the roles/dataproc.editor role on the project.
This component creates a Hadoop job from the Dataproc submit job REST API.
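To make the shape of such a request concrete, here is a minimal sketch of a dict laid out like a Dataproc Job resource that carries a hadoopJob. The function build_hadoop_job_payload, the cluster name, and the bucket paths are hypothetical; how the component assembles its request internally is not shown in this document. The sketch also illustrates setting a job property through the properties field rather than through a -D argument.

```python
import json


def build_hadoop_job_payload(cluster_name, main_class, args, properties=None):
    """Assemble a dict shaped like a Dataproc Job resource with a hadoopJob."""
    hadoop_job = {"mainClass": main_class, "args": list(args)}
    if properties:
        # Job properties go here instead of being passed as -Dkey=value args.
        hadoop_job["properties"] = dict(properties)
    return {"placement": {"clusterName": cluster_name}, "hadoopJob": hadoop_job}


payload = build_hadoop_job_payload(
    cluster_name="my-cluster",  # hypothetical cluster name
    main_class="org.apache.hadoop.examples.WordCount",
    args=["gs://my-bucket/input.txt", "gs://my-bucket/output"],  # hypothetical paths
    properties={"mapreduce.job.reduces": "2"},
)
print(json.dumps(payload, indent=2))
```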
Follow these steps to use the component in a pipeline:
In [ ]:
%%capture --no-stderr
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.14/kfp.tar.gz'
!pip3 install $KFP_PACKAGE --upgrade
In [ ]:
import kfp.components as comp

dataproc_submit_hadoop_job_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/01a23ae8672d3b18e88adf3036071496aca3552d/components/gcp/dataproc/submit_hadoop_job/component.yaml')
help(dataproc_submit_hadoop_job_op)
Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.
Create a new Dataproc cluster (or reuse an existing one) before running the sample code.
Upload your Hadoop JAR file to a Cloud Storage bucket. In the sample, we will use a JAR file that is preinstalled in the main cluster, so there is no need to provide main_jar_file_uri.
Here is the WordCount example source code.
To package a self-contained Hadoop MapReduce application from the source code, follow the MapReduce Tutorial.
In [ ]:
PROJECT_ID = '<Please put your project ID here>'
CLUSTER_NAME = '<Please put your existing cluster name here>'
OUTPUT_GCS_PATH = '<Please put your output GCS path here>'
REGION = 'us-central1'
MAIN_CLASS = 'org.apache.hadoop.examples.WordCount'
INTPUT_GCS_PATH = 'gs://ml-pipeline-playground/shakespeare1.txt'
EXPERIMENT_NAME = 'Dataproc - Submit Hadoop Job'
In [ ]:
!gsutil cat $INTPUT_GCS_PATH
Clean up any existing output files before running the sample (optional). This is needed because the sample code requires the output folder to be empty. To continue running the sample, make sure that the service account of the notebook server has access to OUTPUT_GCS_PATH.
CAUTION: This will remove all blob files under OUTPUT_GCS_PATH.
In [ ]:
!gsutil rm $OUTPUT_GCS_PATH/**
In [ ]:
import kfp.dsl as dsl
import json

@dsl.pipeline(
    name='Dataproc submit Hadoop job pipeline',
    description='Dataproc submit Hadoop job pipeline'
)
def dataproc_submit_hadoop_job_pipeline(
    project_id = PROJECT_ID,
    region = REGION,
    cluster_name = CLUSTER_NAME,
    main_jar_file_uri = '',
    main_class = MAIN_CLASS,
    args = json.dumps([
        INTPUT_GCS_PATH,
        OUTPUT_GCS_PATH
    ]),
    hadoop_job='',
    job='{}',
    wait_interval='30'
):
    dataproc_submit_hadoop_job_op(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        main_jar_file_uri=main_jar_file_uri,
        main_class=main_class,
        args=args,
        hadoop_job=hadoop_job,
        job=job,
        wait_interval=wait_interval)
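In the pipeline above, args is passed as a JSON-encoded string rather than as a Python list. The short sketch below shows what that encoding produces and that it decodes back to the same list. The two bucket paths are hypothetical stand-ins for the input and output paths.

```python
import json

input_path = "gs://my-bucket/input.txt"  # hypothetical input path
output_path = "gs://my-bucket/output"    # hypothetical output path

# Encode the driver arguments the same way the pipeline default does.
encoded = json.dumps([input_path, output_path])
print(encoded)  # ["gs://my-bucket/input.txt", "gs://my-bucket/output"]

# Decoding restores the original list of arguments.
decoded = json.loads(encoded)
```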
In [ ]:
pipeline_func = dataproc_submit_hadoop_job_pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)
In [ ]:
#Specify pipeline argument values
arguments = {}
#Get or create an experiment and submit a pipeline run
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)
#Submit a pipeline run
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
In [ ]:
!gsutil cat $OUTPUT_GCS_PATH/*
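The WordCount example writes one word and its count per line. As a minimal sketch, assuming the default tab-separated text output, the printed result could be loaded into a dictionary for further preprocessing. The function name parse_wordcount_output is hypothetical, and the sample text is made up for illustration rather than taken from a real job run.

```python
def parse_wordcount_output(text):
    """Parse 'word<TAB>count' lines into a dict of word -> total count."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        word, _, count = line.rpartition("\t")
        # Sum counts in case the same word appears in several part files.
        counts[word] = counts.get(word, 0) + int(count)
    return counts


sample = "brave\t2\nnew\t1\nworld\t3\n"  # made-up sample, not real job output
word_counts = parse_wordcount_output(sample)
```

A line without a tab raises ValueError here, which surfaces unexpected output early instead of silently skipping it.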
By deploying or using this software you agree to comply with the AI Hub Terms of Service and the Google APIs Terms of Service. To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.