Hadoop Distributed File System (HDFS)
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail. To learn more about how users and administrators interact with HDFS, please refer to the HDFS User Guide.
All HDFS commands are invoked by the bin/hdfs script. Running the hdfs script without any arguments prints the description of all commands. For the full list of commands, please refer to the HDFS Commands Reference.
In [ ]:
hadoop_root = '/home/ubuntu/shortcourse/hadoop-2.7.1/'
hadoop_start_hdfs_cmd = hadoop_root + 'sbin/start-dfs.sh'
hadoop_stop_hdfs_cmd = hadoop_root + 'sbin/stop-dfs.sh'
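As mentioned above, all HDFS commands go through the bin/hdfs script; for example, running it without any arguments prints the usage summary:
In [ ]:
# running bin/hdfs with no arguments prints the list of available commands
! {hadoop_root + 'bin/hdfs'}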
In [ ]:
# start the hadoop distributed file system
! {hadoop_start_hdfs_cmd}
In [ ]:
# show the Java JVM process summary
# You should see NameNode, SecondaryNameNode, and DataNode
! jps
Normal file operations and data preparation for later examples
List everything under the root directory recursively.
In [ ]:
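# one way to list everything under the HDFS root recursively
! {hadoop_root + 'bin/hdfs'} dfs -ls -R /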
Download some files for later use. The files should already be there.
In [ ]:
# We will use three ebooks from Project Gutenberg for the later examples
# Pride and Prejudice by Jane Austen: http://www.gutenberg.org/ebooks/1342.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1342.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/pride-and-prejudice.txt
# Alice's Adventures in Wonderland by Lewis Carroll: http://www.gutenberg.org/ebooks/11.txt.utf-8
! wget http://www.gutenberg.org/ebooks/11.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/alice.txt
# The Adventures of Sherlock Holmes by Arthur Conan Doyle: http://www.gutenberg.org/ebooks/1661.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1661.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/sherlock-holmes.txt
Delete any existing folders under /user/ubuntu/ in HDFS.
In [ ]:
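# one way to remove everything under /user/ubuntu/ in HDFS (-f ignores paths that do not exist)
! {hadoop_root + 'bin/hdfs'} dfs -rm -r -f '/user/ubuntu/*'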
Create input folder: /user/ubuntu/input
In [ ]:
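# create the HDFS input directory (-p creates parent directories as needed)
! {hadoop_root + 'bin/hdfs'} dfs -mkdir -p /user/ubuntu/input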
Copy the three books to the input folder in HDFS. This is similar to the normal bash command:
cp /home/ubuntu/shortcourse/data/wordcount/* /user/ubuntu/input/
but copies into HDFS.
In [ ]:
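# copy the three local books into the HDFS input folder
! {hadoop_root + 'bin/hdfs'} dfs -put /home/ubuntu/shortcourse/data/wordcount/* /user/ubuntu/input/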
Check that the files are there.
In [ ]:
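# list the files in the HDFS input folder
! {hadoop_root + 'bin/hdfs'} dfs -ls /user/ubuntu/input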
Let's count the frequency of each word in the three uploaded books.
Start YARN, the resource manager for Hadoop.
In [ ]:
! {hadoop_root + 'sbin/start-yarn.sh'}
Test mapper.py and reducer.py locally
In [ ]:
# wordcount 1: the scripts
# Map: /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py
# Test the map script locally
! echo "go gators gators beat everyone go glory gators" | \
/home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py
In [ ]:
# Reduce: /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py
# Test the reduce script locally (fed by the mapper output, sorted by key)
! echo "go gators gators beat everyone go glory gators" | \
/home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py | \
sort -k1,1 | \
/home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py
In [ ]:
# run the scripts with Hadoop streaming against the three uploaded books
cmd = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'hadoop-streaming-2.7.1.jar ' + \
'-input input ' + \
'-output output ' + \
'-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
'-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py ' + \
'-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
'-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py'
! {cmd}
List the output
In [ ]:
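# list the contents of the streaming job's output directory (relative to /user/ubuntu)
! {hadoop_root + 'bin/hdfs'} dfs -ls output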
Download the output file (part-00000) to the local file system.
In [ ]:
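# download the result to the local file system
# (the local name wc1-part-00000 is just a convenient choice, matching the verification step later)
! rm -f /home/ubuntu/shortcourse/data/wc1-part-00000
! {hadoop_root + 'bin/hdfs'} dfs -get output/part-00000 /home/ubuntu/shortcourse/data/wc1-part-00000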
In [ ]:
# Let's see what's in the output file
! tail -n 20 $(THE_DOWNLOADED_FILE)
Count the frequency of only those words that are listed in a pattern file.
For example, given a pattern.txt file that contains:
"a b c d"
and the input file:
"d e a c f g h i a b c d",
the output should be:
"a 2
b 1
c 2
d 2"
Please copy mapper.py and reducer.py from the first wordcount example to the folder "/home/ubuntu/shortcourse/notes/scripts/wordcount2/". The pattern file is given in the wordcount2 folder with the name "wc2-pattern.txt".
Hint:
In [ ]:
# 1. go to the wordcount2 folder and modify the mapper (a possible sketch follows below)
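One possible shape for the wordcount2 mapper, as a rough sketch (you can equally edit the copied mapper.py by hand instead of overwriting it from the notebook): it emits a word only if that word appears in wc2-pattern.txt, which sits next to the mapper locally and is shipped into the task's working directory by Hadoop streaming when passed with -file.
In [ ]:
%%writefile /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py
#!/usr/bin/env python
# wordcount2 mapper sketch: emit "word<TAB>1" only for words listed in wc2-pattern.txt
import sys

# wc2-pattern.txt is expected in the current working directory
# (locally: run from the wordcount2 folder; on Hadoop: ship it with -file)
with open('wc2-pattern.txt') as f:
    pattern = set(f.read().split())

for line in sys.stdin:
    for word in line.strip().split():
        if word in pattern:
            print('%s\t%s' % (word, 1))
In [ ]:
# the freshly written mapper needs to be executable
! chmod +x /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py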
In [ ]:
# 2. test locally if the mapper is working
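# one way to test the modified scripts locally: run from the wordcount2 folder so the
# mapper can find wc2-pattern.txt (assumes reducer.py has been copied over as described above)
! cd /home/ubuntu/shortcourse/notes/scripts/wordcount2 && \
cat /home/ubuntu/shortcourse/data/wordcount/alice.txt | ./mapper.py | sort -k1,1 | ./reducer.py | head -n 20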
In [ ]:
# 3. run with hadoop streaming. Input is still the three books, output to 'output2'
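# one possible streaming invocation (a sketch mirroring the wordcount1 command above);
# note the extra -file so wc2-pattern.txt travels with the job
cmd2 = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'hadoop-streaming-2.7.1.jar ' + \
    '-input input ' + \
    '-output output2 ' + \
    '-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt'
! {cmd2}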
Verify Results
Run the following command and compare with the downloaded output:
sort -nrk 2,2 wc1-part-00000 | head -n 20
where wc1-part-00000 is the downloaded output of the previous wordcount example (wordcount1).
In [ ]:
# 1. list the output, download the output to local, and cat the output file
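# one possible way: list output2, pull down part-00000 (local name wc2-part-00000 is an arbitrary choice), and print it
! {hadoop_root + 'bin/hdfs'} dfs -ls output2
! rm -f /home/ubuntu/shortcourse/data/wc2-part-00000
! {hadoop_root + 'bin/hdfs'} dfs -get output2/part-00000 /home/ubuntu/shortcourse/data/wc2-part-00000
! cat /home/ubuntu/shortcourse/data/wc2-part-00000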
In [ ]:
# 2. use a bash command to find the 20 most frequently used words from the previous example,
#    and compare the results with this output
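# one way to get the 20 most frequent words from the wordcount1 output for comparison
# (assumes the wordcount1 result was downloaded as wc1-part-00000 above)
! sort -nrk 2,2 /home/ubuntu/shortcourse/data/wc1-part-00000 | head -n 20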
In [ ]:
# stop yarn (keep hdfs running for later use)
!{hadoop_root + 'sbin/stop-yarn.sh'}
# don't stop hdfs for now, we will use it later
# !{hadoop_stop_hdfs_cmd}
In [ ]: