Introduction to Hadoop MapReduce

A Python Jupyter notebook supports the execution of Linux commands inside notebook cells. This is done by adding ! to the beginning of the command line. It should be noted that each command beginning with ! spawns a new bash shell, and that shell is closed once the execution is done. As a consequence:

  • The full path to files and programs is required
  • Temporary results and environment variables are lost between commands (the example after this list illustrates this)
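
For example, the two ! lines below each run in their own shell, so the cd in the first line has no effect on the second (/tmp is just an arbitrary directory chosen for the demonstration):

!cd /tmp
!pwd

The pwd still prints the notebook's working directory, because the shell that executed cd /tmp has already exited. This is why full paths are needed and why environment changes do not carry over between ! commands.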

Upload data into Hadoop

Download gutenberg-shakespeare.txt from the text directory of the course repository and upload it into the intro-to-hadoop directory on HDFS.
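
The commands below assume intro-to-hadoop already exists in your HDFS home directory; if it does not, the standard hdfs dfs -mkdir command creates it:

$ hdfs dfs -mkdir intro-to-hadoop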

$ wget https://raw.githubusercontent.com/linhbngo/Distributed-and-Cluster-Computing/master/text/gutenberg-shakespeare.txt
$ hdfs dfs -put gutenberg-shakespeare.txt intro-to-hadoop/
$ hdfs dfs -ls intro-to-hadoop

Run a sample MapReduce program

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-3.1.1.3.0.1.0-187.jar wordcount intro-to-hadoop/gutenberg-shakespeare.txt output-wordcount
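
Note that MapReduce will not overwrite an existing output directory: rerunning the job with the same output-wordcount argument fails unless that directory is removed first with the standard HDFS remove command:

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples-3.1.1.3.0.1.0-187.jar wordcount intro-to-hadoop/gutenberg-shakespeare.txt output-wordcount
$ hdfs dfs -rm -r output-wordcount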

Output

18/11/01 17:03:38 INFO client.RMProxy: Connecting to ResourceManager at clnode188.clemson.cloudlab.us/130.127.133.197:8050
18/11/01 17:03:39 INFO client.AHSProxy: Connecting to Application History server at clnode195.clemson.cloudlab.us/130.127.133.204:10200
18/11/01 17:03:39 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/lngo/.staging/job_1541104508981_0008
18/11/01 17:03:39 INFO input.FileInputFormat: Total input files to process : 1
18/11/01 17:03:39 INFO mapreduce.JobSubmitter: number of splits:1
18/11/01 17:03:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541104508981_0008
18/11/01 17:03:39 INFO mapreduce.JobSubmitter: Executing with tokens: []
18/11/01 17:03:39 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.1.0-187/0/resource-types.xml
18/11/01 17:03:40 INFO impl.YarnClientImpl: Submitted application application_1541104508981_0008
18/11/01 17:03:40 INFO mapreduce.Job: The url to track the job: http://clnode188.clemson.cloudlab.us:8088/proxy/application_1541104508981_0008/
18/11/01 17:03:40 INFO mapreduce.Job: Running job: job_1541104508981_0008
18/11/01 17:03:44 INFO mapreduce.Job: Job job_1541104508981_0008 running in uber mode : false
18/11/01 17:03:44 INFO mapreduce.Job:  map 0% reduce 0%
18/11/01 17:03:50 INFO mapreduce.Job:  map 100% reduce 0%
18/11/01 17:03:54 INFO mapreduce.Job:  map 100% reduce 100%
18/11/01 17:03:54 INFO mapreduce.Job: Job job_1541104508981_0008 completed successfully
18/11/01 17:03:54 INFO mapreduce.Job: Counters: 53
        File System Counters
                FILE: Number of bytes read=973082
                FILE: Number of bytes written=2409015
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=5447902
                HDFS: Number of bytes written=713504
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=282360
                Total time spent by all reduces in occupied slots (ms)=266760
                Total time spent by all map tasks (ms)=3620
                Total time spent by all reduce tasks (ms)=1710
                Total vcore-milliseconds taken by all map tasks=3620
                Total vcore-milliseconds taken by all reduce tasks=1710
                Total megabyte-milliseconds taken by all map tasks=289136640
                Total megabyte-milliseconds taken by all reduce tasks=273162240
        Map-Reduce Framework
                Map input records=124213
                Map output records=899681
                Map output bytes=8529629
                Map output materialized bytes=973082
                Input split bytes=158
                Combine input records=899681
                Combine output records=67109
                Reduce input groups=67109
                Reduce shuffle bytes=973082
                Reduce input records=67109
                Reduce output records=67109
                Spilled Records=134218
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=705
                CPU time spent (ms)=28770
                Physical memory (bytes) snapshot=3245010944
                Virtual memory (bytes) snapshot=210530725888
                Total committed heap usage (bytes)=3762814976
                Peak Map Physical memory (bytes)=2774241280
                Peak Map Virtual memory (bytes)=70509170688
                Peak Reduce Physical memory (bytes)=470769664
                Peak Reduce Virtual memory (bytes)=140021555200
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=5447744
        File Output Format Counters
                Bytes Written=713504

View Output

$ hdfs dfs -cat output-wordcount/part-r-00000
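
According to the Reduce output records counter above, the job produced 67,109 distinct words, so part-r-00000 is long. Each line holds a word and its count separated by a tab, and piping through head shows just the beginning of the file:

$ hdfs dfs -cat output-wordcount/part-r-00000 | head -n 20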

The Hello World of Hadoop: Word Count

Example Source Code
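
The job above used the compiled Java WordCount that ships with Hadoop. As a sketch of what the source looks like in this lesson's Python environment, below is a common Hadoop Streaming formulation of word count; mapper.py and reducer.py are illustrative names, not files from the lesson, and this is one standard way to write the example rather than the lesson's exact source.

#!/usr/bin/env python
# mapper.py (illustrative name): emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py (illustrative name): sum the counts for each word.
# Hadoop Streaming delivers the mapper output sorted by key,
# so all lines for a given word arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

These scripts could then be submitted through the Hadoop Streaming jar, along these lines (the jar path varies by installation; the one shown follows the HDP layout used earlier, and output-streaming is an arbitrary output directory name):

$ yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
      -input intro-to-hadoop/gutenberg-shakespeare.txt \
      -output output-streaming \
      -file mapper.py -mapper mapper.py \
      -file reducer.py -reducer reducer.py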