Hadoop Short Course

1. Hadoop Distributed File System


HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail. To learn more about how users and administrators interact with HDFS, please refer to the HDFS User Guide.

All HDFS commands are invoked by the bin/hdfs script. Running the hdfs script without any arguments prints the description of all commands. For the full list of commands, please refer to the HDFS Commands Reference.

Start HDFS

In [ ]:
hadoop_root = '/home/ubuntu/shortcourse/hadoop-2.7.1/'
hadoop_start_hdfs_cmd = hadoop_root + 'sbin/start-dfs.sh'
hadoop_stop_hdfs_cmd = hadoop_root + 'sbin/stop-dfs.sh'

In [ ]:
# start the hadoop distributed file system
! {hadoop_start_hdfs_cmd}

In [ ]:
# show the jave jvm process summary
# You should see NamenNode, SecondaryNameNode, and DataNode
! jps

Normal file operations and data preparation for later example

list recursively everything under the root dir

In [ ]:
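This cell is left empty in the notes; a minimal sketch, assuming HDFS is already running and using the `hadoop_root` variable defined above:

```python
# recursively list everything under the HDFS root directory
! {hadoop_root + 'bin/hdfs'} dfs -ls -R /
```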

Download some files for later use. The files should already be there.

In [ ]:
# We will use three ebooks from Project Gutenberg for later examples
# Pride and Prejudice by Jane Austen: http://www.gutenberg.org/ebooks/1342.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1342.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/pride-and-prejudice.txt

# Alice's Adventures in Wonderland by Lewis Carroll: http://www.gutenberg.org/ebooks/11.txt.utf-8
! wget http://www.gutenberg.org/ebooks/11.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/alice.txt
# The Adventures of Sherlock Holmes by Arthur Conan Doyle: http://www.gutenberg.org/ebooks/1661.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1661.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/sherlock-holmes.txt

Delete existing folders under /user/ubuntu/ in HDFS

In [ ]:
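One way to fill this cell, assuming the leftovers from a previous run are the input/ and output/ folders (`-f` keeps the command from failing if a path does not exist):

```python
# remove any previous content under /user/ubuntu/ in HDFS
! {hadoop_root + 'bin/hdfs'} dfs -rm -r -f /user/ubuntu/input /user/ubuntu/output
```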

Create input folder: /user/ubuntu/input

In [ ]:
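A minimal sketch for this cell (`-p` creates parent directories as needed):

```python
# create the input folder in HDFS
! {hadoop_root + 'bin/hdfs'} dfs -mkdir -p /user/ubuntu/input
```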

Copy the three books to the input folder in HDFS. Similar to the normal bash command:

cp /home/ubuntu/shortcourse/data/wordcount/* /user/ubuntu/input/

but copying to HDFS.

In [ ]:
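A sketch of the copy, using `hdfs dfs -put` (the HDFS counterpart of the cp command above):

```python
# copy the three books from the local fs to the HDFS input folder
! {hadoop_root + 'bin/hdfs'} dfs -put /home/ubuntu/shortcourse/data/wordcount/* /user/ubuntu/input/
```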

Check that the files are there.

In [ ]:
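This can be verified with a plain listing of the input folder:

```python
# list the uploaded books; expect the three .txt files
! {hadoop_root + 'bin/hdfs'} dfs -ls /user/ubuntu/input
```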

2. WordCount Example

Let's count the single-word frequencies in the three uploaded books.

Start YARN, the resource manager for Hadoop.

In [ ]:
! {hadoop_root + 'sbin/start-yarn.sh'}

Test mapper.py and reducer.py locally

In [ ]:
# wordcount1 scripts
# Map: /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py
# Test the map script locally
! echo "go gators gators beat everyone go glory gators" | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py

In [ ]:
# Reduce: /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py
# Test the reduce script locally
! echo "go gators gators beat everyone go glory gators" | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py | \
  sort -k1,1 | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py

In [ ]:
# run them with Hadoop against the uploaded three books
cmd = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar ' + \
    '-input input ' + \
    '-output output ' + \
    '-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
    '-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py'

! {cmd}

List the output

In [ ]:
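A sketch for this cell; on success the output folder should contain an empty _SUCCESS marker and a part-00000 result file:

```python
# list the job output folder in HDFS
! {hadoop_root + 'bin/hdfs'} dfs -ls output
```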

Download the output file (part-00000) to local fs.

In [ ]:
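One way to fill this cell; the local file name wc1-part-00000 is an assumption, chosen to mark it as the wordcount1 result:

```python
# download the result file from HDFS to the local fs
! {hadoop_root + 'bin/hdfs'} dfs -get output/part-00000 /home/ubuntu/shortcourse/data/wordcount/wc1-part-00000
```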

In [ ]:
# Let's see what's in the output file
# (replace THE_DOWNLOADED_FILE with the path of the downloaded part-00000)
! tail -n 20 $(THE_DOWNLOADED_FILE)

3. Exercise: WordCount2

Count the single-word frequencies, but only for the words given in a pattern file.

For example, given pattern.txt file, which contains:

"a b c d"

And the input file is:

"d e a c f g h i a b c d". 

Then the output should be:

"a 1
 b 1
 c 2
 d 2"

Please copy mapper.py and reducer.py from the first wordcount example to the folder "/home/ubuntu/shortcourse/notes/scripts/wordcount2/". The pattern file is given in the wordcount2 folder with the name "wc2-pattern.txt".


Hints:

  1. pass the pattern file using the "-file" option and use "-cmdenv" to pass the file name as an environment variable
  2. in the mapper, read the pattern file into a set
  3. only print the words that exist in the set

In [ ]:
# 1. go to wordcount2 folder, modify the mapper
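The exact mapper is left as an exercise; one sketch following the hints above, assuming the environment variable is named PATTERN_FILE (set later via -cmdenv):

```python
#!/usr/bin/env python
# Sketch of a wordcount2 mapper: emit "<word>\t1" only for words that
# appear in the pattern file. PATTERN_FILE is an assumed env-var name.
import os
import sys

def load_patterns(path):
    # read the whitespace-separated pattern words into a set
    with open(path) as f:
        return set(f.read().split())

def map_words(lines, patterns):
    # yield (word, 1) for every word that appears in the pattern set
    for line in lines:
        for word in line.split():
            if word in patterns:
                yield word, 1

if __name__ == '__main__':
    # Hadoop streaming ships the -file'd pattern file into the task's
    # working directory, so the env var holds just the file name.
    pattern_path = os.environ.get('PATTERN_FILE')
    if pattern_path:
        for word, count in map_words(sys.stdin, load_patterns(pattern_path)):
            print('%s\t%d' % (word, count))
```

With the example input above, `map_words` keeps only the seven occurrences of a, b, c, d; the reducer then aggregates them to a 2, b 1, c 2, d 2.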

In [ ]:
# 2. test locally if the mapper is working
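A local test might look like the following, assuming the mapper reads the pattern file name from a PATTERN_FILE environment variable:

```python
# test the wordcount2 mapper locally; only pattern words should be emitted
! echo "d e a c f g h i a b c d" | \
  PATTERN_FILE=/home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt \
  /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py
```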

In [ ]:
# 3. run with hadoop streaming. Input is still the three books, output to 'output2'
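One way to fill this cell, mirroring the wordcount1 streaming command; the streaming-jar location and the PATTERN_FILE variable name are assumptions:

```python
# run wordcount2 with Hadoop streaming; -file ships the pattern file to the
# tasks and -cmdenv tells the mapper its name via an environment variable
cmd = hadoop_root + 'bin/hadoop jar ' + \
    hadoop_root + 'share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar ' + \
    '-cmdenv PATTERN_FILE=wc2-pattern.txt ' + \
    '-input input ' + \
    '-output output2 ' + \
    '-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt'

! {cmd}
```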

Verify Results

  1. Copy the output file to the local fs
  2. Run the following command and compare with the downloaded output:

    sort -nrk 2,2 wc1-part-00000 | head -n 20

where wc1-part-00000 is the downloaded output of the previous wordcount (wordcount1).

In [ ]:
# 1. list the output, download the output to local, and cat the output file
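A sketch for this cell; the local file name wc2-part-00000 is an assumption:

```python
# list the wordcount2 output, download it, and show it
! {hadoop_root + 'bin/hdfs'} dfs -ls output2
! {hadoop_root + 'bin/hdfs'} dfs -get output2/part-00000 /home/ubuntu/shortcourse/data/wordcount/wc2-part-00000
! cat /home/ubuntu/shortcourse/data/wordcount/wc2-part-00000
```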

In [ ]:
# 2. use bash cmd to find out the most frequently used 20 words from the previous example, 
#    and compare the results with this output
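A sketch of the comparison, assuming the wordcount1 result was downloaded earlier as wc1-part-00000:

```python
# top-20 most frequent words from wordcount1, to compare against wordcount2
! sort -nrk 2,2 /home/ubuntu/shortcourse/data/wordcount/wc1-part-00000 | head -n 20
```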

In [ ]:
# stop dfs and yarn
!{hadoop_root + 'sbin/stop-yarn.sh'}
# don't stop hdfs for now, later use
# !{hadoop_stop_hdfs_cmd}
