The aim of this exercise is to become familiar with Hadoop. We will show how to run Hadoop applications and how to create our own Python-based streaming application for parsing log-file data.
For the following exercise we will use two examples provided as part of the standard Hadoop distribution. We use Hortonworks HDP 2.3.2 deployed on Amazon Web Services (EC2). First we need to set two variables pointing to the jar files that contain these applications.
In [37]:
HADOOP_EXAMPLES="/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
HADOOP_STREAMING="/usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming.jar"
We will use the standard Hadoop command-line utilities to query the status of the cluster (in particular HDFS and YARN) and to submit applications. Shell commands can be executed in iPython notebook cells using the ! prefix. Use Shift-Enter or the play button in the menu to execute cells.
The -report argument provides information about the file system, disk space, number of nodes, etc.
In [32]:
!hadoop dfsadmin -report
In [33]:
!yarn node -list -all
First remove any output directories left over from previous runs:
In [34]:
!hdfs dfs -rm -r teragen teraout
In the next command we will create a dataset of 100,000 records, each of 100 bytes (for a total of 10 MB).
In [35]:
!yarn jar $HADOOP_EXAMPLES teragen 100000 teragen
The following command again uses Hadoop (yarn) to run the terasort example, which sorts the data generated above.
In [40]:
!yarn jar $HADOOP_EXAMPLES terasort teragen teraout
In [39]:
!hadoop fs -text teraout/part-r-00000 | head
Next we will run the wordcount example on the NASA web-server access log. First remove any previous output directory and take a look at the input data.
In [41]:
!hdfs dfs -rm -r wordcount-out
In [42]:
!hdfs dfs -text /data/nasa/NASA_access_log_Jul95 | head
In [43]:
!yarn jar $HADOOP_EXAMPLES wordcount /data/nasa/ wordcount-out/
Use the commands head, cat, uniq, wc, sort, find, xargs, and awk to evaluate the NASA log file:
Which page was called the most? What was the most frequent return code? How many errors occurred? What is the percentage of errors? Implement a Python version of this Unix shell script, using the script as a template (the answer can be found in mapreduce_streaming.py; a rough sketch is given below).
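As a starting point, here is a minimal sketch of such a script in the Hadoop Streaming style (read lines from stdin, emit tab-separated key/value pairs on stdout). It only counts HTTP return codes and assumes the log is in Common Log Format, with the return code as the second-to-last field of each line; the actual solution in mapreduce_streaming.py may be structured differently.

#!/usr/bin/env python
# Sketch: count HTTP return codes in the NASA access log.
# Assumption: Common Log Format, i.e. the return code is the
# second-to-last whitespace-separated field of each line.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[-2]                  # e.g. 200, 304, 404
        if code.isdigit():
            print("%s\t1" % code)          # emit key<TAB>value

def reducer():
    # Streaming delivers the mapper output sorted by key, so counts
    # can be aggregated over consecutive identical keys.
    current_key, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

if __name__ == "__main__":
    # Invoked as "python <script> map" or "python <script> reduce".
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()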
We will now run the Python script inside a Hadoop Streaming job.
In [28]:
!cat mapreduce_streaming.py
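Before submitting the job to the cluster, the script can be checked locally on a small sample of the data, for example by piping the output of hdfs dfs -text /data/nasa/NASA_access_log_Jul95 | head through python mapreduce_streaming.py map | sort | python mapreduce_streaming.py reduce, i.e. the same map and reduce invocations used in the streaming command below.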
In the next example, we'll execute mapreduce_streaming.py with YARN as a Hadoop Streaming application.
In [29]:
!yarn jar $HADOOP_STREAMING -input /data/nasa -output logs-parsed \
-file mapreduce_streaming.py \
-mapper "python mapreduce_streaming.py map" \
-reducer "python mapreduce_streaming.py reduce"
In the next two commands, we will list and inspect the output produced by the streaming job.
In [30]:
!hdfs dfs -ls logs-parsed
In [44]:
!hdfs dfs -text logs-parsed/*