Hadoop Short Course

1. Hadoop Distributed File System

Hadoop Distributed File System (HDFS)

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists primarily of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. The HDFS Architecture Guide describes HDFS in detail. To learn more about how users and administrators interact with HDFS, please refer to the HDFS User Guide.

All HDFS commands are invoked by the bin/hdfs script. Running the hdfs script without any arguments prints a description of all commands. For the full list, please refer to the HDFS Commands Reference.
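
For example, file system operations go through the dfs subcommand. A few representative invocations (all standard commands; the paths are illustrative, and this assumes bin/hdfs is on the PATH):

    hdfs dfs -ls /                          # list the HDFS root directory
    hdfs dfs -mkdir -p /user/ubuntu         # create a directory, including parents
    hdfs dfs -put local.txt /user/ubuntu/   # copy a local file into HDFS
    hdfs dfsadmin -report                   # report DataNode status and capacity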

Start HDFS


In [1]:
hadoop_root = '/home/ubuntu/shortcourse/hadoop-2.7.1/'
hadoop_start_hdfs_cmd = hadoop_root + 'sbin/start-dfs.sh'
hadoop_stop_hdfs_cmd = hadoop_root + 'sbin/stop-dfs.sh'

In [2]:
# start the hadoop distributed file system
! {hadoop_start_hdfs_cmd}


15/11/03 08:28:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/ubuntu/shortcourse/hadoop-2.7.1/logs/hadoop-ubuntu-namenode-ubuntu.out
localhost: starting datanode, logging to /home/ubuntu/shortcourse/hadoop-2.7.1/logs/hadoop-ubuntu-datanode-ubuntu.out
localhost: Java HotSpot(TM) Server VM warning: You have loaded library /home/ubuntu/shortcourse/hadoop-2.7.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
localhost: It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/ubuntu/shortcourse/hadoop-2.7.1/logs/hadoop-ubuntu-secondarynamenode-ubuntu.out
0.0.0.0: Java HotSpot(TM) Server VM warning: You have loaded library /home/ubuntu/shortcourse/hadoop-2.7.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
0.0.0.0: It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
15/11/03 08:28:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [3]:
# show the Java JVM process summary
# You should see NameNode, SecondaryNameNode, and DataNode
! jps


5160 NameNode
5526 SecondaryNameNode
5642 Jps
5320 DataNode

Normal file operations and data preparation for the later examples


In [4]:
# list recursively everything under the root dir
! {hadoop_root + 'bin/hdfs dfs -ls -R /'}


15/11/03 08:28:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - ubuntu supergroup          0 2015-10-11 14:39 /user
drwxr-xr-x   - ubuntu supergroup          0 2015-11-02 20:14 /user/ubuntu
drwxr-xr-x   - ubuntu supergroup          0 2015-11-02 20:12 /user/ubuntu/input
-rw-r--r--   1 ubuntu supergroup     167518 2015-11-02 20:12 /user/ubuntu/input/alice.txt
-rw-r--r--   1 ubuntu supergroup     717574 2015-11-02 20:12 /user/ubuntu/input/pride-and-prejudice.txt
-rw-r--r--   1 ubuntu supergroup     594933 2015-11-02 20:12 /user/ubuntu/input/sherlock-holmes.txt
drwxr-xr-x   - ubuntu supergroup          0 2015-11-02 20:13 /user/ubuntu/output
-rw-r--r--   1 ubuntu supergroup          0 2015-11-02 20:13 /user/ubuntu/output/_SUCCESS
-rw-r--r--   1 ubuntu supergroup     279999 2015-11-02 20:13 /user/ubuntu/output/part-00000
drwxr-xr-x   - ubuntu supergroup          0 2015-11-02 20:14 /user/ubuntu/output2

Download some files for later use.


In [5]:
# We will use three ebooks from Project Gutenberg for the later examples
# Pride and Prejudice by Jane Austen: http://www.gutenberg.org/ebooks/1342.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1342.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/pride-and-prejudice.txt

# Alice's Adventures in Wonderland by Lewis Carroll: http://www.gutenberg.org/ebooks/11.txt.utf-8
! wget http://www.gutenberg.org/ebooks/11.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/alice.txt
    
# The Adventures of Sherlock Holmes by Arthur Conan Doyle: http://www.gutenberg.org/ebooks/1661.txt.utf-8
! wget http://www.gutenberg.org/ebooks/1661.txt.utf-8 -O /home/ubuntu/shortcourse/data/wordcount/sherlock-holmes.txt


--2015-11-03 08:28:52--  http://www.gutenberg.org/ebooks/1342.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.gutenberg.org/cache/epub/1342/pg1342.txt [following]
--2015-11-03 08:28:53--  http://www.gutenberg.org/cache/epub/1342/pg1342.txt
Reusing existing connection to www.gutenberg.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 717574 (701K) [text/plain]
Saving to: ‘/home/ubuntu/shortcourse/data/wordcount/pride-and-prejudice.txt’

100%[======================================>] 717,574     1003KB/s   in 0.7s   

2015-11-03 08:28:54 (1003 KB/s) - ‘/home/ubuntu/shortcourse/data/wordcount/pride-and-prejudice.txt’ saved [717574/717574]

--2015-11-03 08:28:54--  http://www.gutenberg.org/ebooks/11.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.gutenberg.org/cache/epub/11/pg11.txt [following]
--2015-11-03 08:28:56--  http://www.gutenberg.org/cache/epub/11/pg11.txt
Reusing existing connection to www.gutenberg.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 167518 (164K) [text/plain]
Saving to: ‘/home/ubuntu/shortcourse/data/wordcount/alice.txt’

100%[======================================>] 167,518     --.-K/s   in 0.1s    

2015-11-03 08:28:57 (1.09 MB/s) - ‘/home/ubuntu/shortcourse/data/wordcount/alice.txt’ saved [167518/167518]

--2015-11-03 08:28:57--  http://www.gutenberg.org/ebooks/1661.txt.utf-8
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.gutenberg.org/cache/epub/1661/pg1661.txt [following]
--2015-11-03 08:28:58--  http://www.gutenberg.org/cache/epub/1661/pg1661.txt
Reusing existing connection to www.gutenberg.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 594933 (581K) [text/plain]
Saving to: ‘/home/ubuntu/shortcourse/data/wordcount/sherlock-holmes.txt’

100%[======================================>] 594,933     2.23MB/s   in 0.3s   

2015-11-03 08:28:58 (2.23 MB/s) - ‘/home/ubuntu/shortcourse/data/wordcount/sherlock-holmes.txt’ saved [594933/594933]


In [6]:
# delete existing folders
! {hadoop_root + 'bin/hdfs dfs -rm -R /user/ubuntu/*'}


15/11/03 08:29:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/03 08:29:03 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/ubuntu/input
15/11/03 08:29:03 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/ubuntu/output
15/11/03 08:29:03 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/ubuntu/output2

In [7]:
# create input folder
! {hadoop_root + 'bin/hdfs dfs -mkdir /user/ubuntu/input'}


15/11/03 08:29:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [8]:
# copy the three books to the input folder in HDFS
! {hadoop_root + 'bin/hdfs dfs -copyFromLocal /home/ubuntu/shortcourse/data/wordcount/* /user/ubuntu/input/'}


15/11/03 08:29:17 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [9]:
# verify that the files are there
! {hadoop_root + 'bin/hdfs dfs -ls -R'}


15/11/03 08:29:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - ubuntu supergroup          0 2015-11-03 08:29 input
-rw-r--r--   1 ubuntu supergroup     167518 2015-11-03 08:29 input/alice.txt
-rw-r--r--   1 ubuntu supergroup     717574 2015-11-03 08:29 input/pride-and-prejudice.txt
-rw-r--r--   1 ubuntu supergroup     594933 2015-11-03 08:29 input/sherlock-holmes.txt

2. WordCount Example

Let's count the frequency of each single word in the three uploaded books.

Start YARN, the resource manager for Hadoop.


In [10]:
# start yarn, the resource manager
! {hadoop_root + 'sbin/start-yarn.sh'}


starting yarn daemons
starting resourcemanager, logging to /home/ubuntu/shortcourse/hadoop-2.7.1/logs/yarn-ubuntu-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /home/ubuntu/shortcourse/hadoop-2.7.1/logs/yarn-ubuntu-nodemanager-ubuntu.out

In [11]:
# wordcount1 scripts
# Map: /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py
# Test the map script locally
! echo "go gators gators beat everyone go glory gators" | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py


go	1
gators	1
gators	1
beat	1
everyone	1
go	1
glory	1
gators	1
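
The mapper script itself is not reproduced in this notebook. A minimal sketch consistent with the output above (assuming the standard streaming wordcount pattern: read lines from stdin, emit one "word<TAB>1" pair per whitespace-separated token) might look like:

    #!/usr/bin/env python
    # hypothetical mapper.py: emit "word<TAB>1" for every token on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t%s' % (word, 1))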

In [12]:
# Reduce: /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py
# Test the reduce script locally
! echo "go gators gators beat everyone go glory gators" | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py | \
  sort -k1,1 | \
  /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py


beat	1
everyone	1
gators	3
glory	1
go	2
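
Likewise, a minimal reducer sketch (assuming the usual streaming pattern: input arrives sorted by key, so identical words are adjacent and each run of counts can be summed in one pass) might be:

    #!/usr/bin/env python
    # hypothetical reducer.py: input is sorted by key, so identical words
    # are adjacent; sum each run of counts and emit one total per word
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))

This is why the local test pipes the mapper output through sort -k1,1 before the reducer: it stands in for the shuffle-and-sort phase that Hadoop performs between the map and reduce stages.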

In [13]:
# run them with Hadoop against the three uploaded books
cmd = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'hadoop-streaming-2.7.1.jar ' + \
    '-input input ' + \
    '-output output ' + \
    '-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
    '-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py'

! {cmd}


15/11/03 08:29:47 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
15/11/03 08:29:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py, /home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py] [] /tmp/streamjob5390010843768271216.jar tmpDir=null
15/11/03 08:29:48 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/03 08:29:48 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/11/03 08:29:48 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/11/03 08:29:49 INFO mapred.FileInputFormat: Total input paths to process : 3
15/11/03 08:29:49 INFO mapreduce.JobSubmitter: number of splits:3
15/11/03 08:29:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local774667417_0001
15/11/03 08:29:50 INFO mapred.LocalDistributedCacheManager: Localized file:/home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py as file:/home/ubuntu/shortcourse/hadoop-2.7.1/hdfs-store/tmp/mapred/local/1446568189784/mapper.py
15/11/03 08:29:50 INFO mapred.LocalDistributedCacheManager: Localized file:/home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py as file:/home/ubuntu/shortcourse/hadoop-2.7.1/hdfs-store/tmp/mapred/local/1446568189785/reducer.py
15/11/03 08:29:50 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/11/03 08:29:50 INFO mapreduce.Job: Running job: job_local774667417_0001
15/11/03 08:29:50 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/11/03 08:29:50 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
15/11/03 08:29:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:29:50 INFO mapred.LocalJobRunner: Waiting for map tasks
15/11/03 08:29:50 INFO mapred.LocalJobRunner: Starting task: attempt_local774667417_0001_m_000000_0
15/11/03 08:29:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:29:50 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:29:50 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/pride-and-prejudice.txt:0+717574
15/11/03 08:29:50 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:29:50 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:29:50 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:29:50 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:29:50 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:29:50 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:29:50 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:29:50 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
15/11/03 08:29:50 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/11/03 08:29:50 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/11/03 08:29:50 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/11/03 08:29:50 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
15/11/03 08:29:50 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
15/11/03 08:29:51 INFO mapreduce.Job: Job job_local774667417_0001 running in uber mode : false
15/11/03 08:29:51 INFO mapreduce.Job:  map 0% reduce 0%
15/11/03 08:29:51 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:29:51 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:51 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:51 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:51 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:51 INFO streaming.PipeMapRed: Records R/W=2690/1
15/11/03 08:29:51 INFO streaming.PipeMapRed: R/W/S=10000/78760/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:51 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:29:51 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:29:51 INFO mapred.LocalJobRunner: 
15/11/03 08:29:51 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:29:51 INFO mapred.MapTask: Spilling map output
15/11/03 08:29:51 INFO mapred.MapTask: bufstart = 0; bufend = 950559; bufvoid = 104857600
15/11/03 08:29:51 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25716048(102864192); length = 498349/6553600
15/11/03 08:29:52 INFO mapred.MapTask: Finished spill 0
15/11/03 08:29:52 INFO mapred.Task: Task:attempt_local774667417_0001_m_000000_0 is done. And is in the process of committing
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Records R/W=2690/1
15/11/03 08:29:52 INFO mapred.Task: Task 'attempt_local774667417_0001_m_000000_0' done.
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Finishing task: attempt_local774667417_0001_m_000000_0
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Starting task: attempt_local774667417_0001_m_000001_0
15/11/03 08:29:52 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:29:52 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:29:52 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/sherlock-holmes.txt:0+594933
15/11/03 08:29:52 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:29:52 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:29:52 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:29:52 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:29:52 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:29:52 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:29:52 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:29:52 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:29:52 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:29:52 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:52 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:52 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:52 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:52 INFO streaming.PipeMapRed: Records R/W=3018/1
15/11/03 08:29:52 INFO streaming.PipeMapRed: R/W/S=10000/59465/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:52 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:29:52 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:29:52 INFO mapred.LocalJobRunner: 
15/11/03 08:29:52 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:29:52 INFO mapred.MapTask: Spilling map output
15/11/03 08:29:52 INFO mapred.MapTask: bufstart = 0; bufend = 793799; bufvoid = 104857600
15/11/03 08:29:52 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 25784268(103137072); length = 430129/6553600
15/11/03 08:29:52 INFO mapred.MapTask: Finished spill 0
15/11/03 08:29:52 INFO mapred.Task: Task:attempt_local774667417_0001_m_000001_0 is done. And is in the process of committing
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Records R/W=3018/1
15/11/03 08:29:52 INFO mapred.Task: Task 'attempt_local774667417_0001_m_000001_0' done.
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Finishing task: attempt_local774667417_0001_m_000001_0
15/11/03 08:29:52 INFO mapred.LocalJobRunner: Starting task: attempt_local774667417_0001_m_000002_0
15/11/03 08:29:52 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:29:52 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:29:52 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/alice.txt:0+167518
15/11/03 08:29:52 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:29:53 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:29:53 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:29:53 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:29:53 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:29:53 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:29:53 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:29:53 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:29:53 INFO mapreduce.Job:  map 100% reduce 0%
15/11/03 08:29:53 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:29:53 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:53 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:53 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:53 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:53 INFO streaming.PipeMapRed: Records R/W=3022/1
15/11/03 08:29:53 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:29:53 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:29:53 INFO mapred.LocalJobRunner: 
15/11/03 08:29:53 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:29:53 INFO mapred.MapTask: Spilling map output
15/11/03 08:29:53 INFO mapred.MapTask: bufstart = 0; bufend = 220464; bufvoid = 104857600
15/11/03 08:29:53 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26096556(104386224); length = 117841/6553600
15/11/03 08:29:53 INFO mapred.MapTask: Finished spill 0
15/11/03 08:29:53 INFO mapred.Task: Task:attempt_local774667417_0001_m_000002_0 is done. And is in the process of committing
15/11/03 08:29:53 INFO mapred.LocalJobRunner: Records R/W=3022/1
15/11/03 08:29:53 INFO mapred.Task: Task 'attempt_local774667417_0001_m_000002_0' done.
15/11/03 08:29:53 INFO mapred.LocalJobRunner: Finishing task: attempt_local774667417_0001_m_000002_0
15/11/03 08:29:53 INFO mapred.LocalJobRunner: map task executor complete.
15/11/03 08:29:53 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/11/03 08:29:53 INFO mapred.LocalJobRunner: Starting task: attempt_local774667417_0001_r_000000_0
15/11/03 08:29:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:29:53 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:29:53 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@166bea0
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334154944, maxSingleShuffleLimit=83538736, mergeThreshold=220542272, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/11/03 08:29:54 INFO reduce.EventFetcher: attempt_local774667417_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/11/03 08:29:54 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local774667417_0001_m_000002_0 decomp: 279388 len: 279392 to MEMORY
15/11/03 08:29:54 INFO reduce.InMemoryMapOutput: Read 279388 bytes from map-output for attempt_local774667417_0001_m_000002_0
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 279388, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->279388
15/11/03 08:29:54 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local774667417_0001_m_000000_0 decomp: 1199737 len: 1199741 to MEMORY
15/11/03 08:29:54 INFO reduce.InMemoryMapOutput: Read 1199737 bytes from map-output for attempt_local774667417_0001_m_000000_0
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1199737, inMemoryMapOutputs.size() -> 2, commitMemory -> 279388, usedMemory ->1479125
15/11/03 08:29:54 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local774667417_0001_m_000001_0 decomp: 1008867 len: 1008871 to MEMORY
15/11/03 08:29:54 INFO reduce.InMemoryMapOutput: Read 1008867 bytes from map-output for attempt_local774667417_0001_m_000001_0
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 1008867, inMemoryMapOutputs.size() -> 3, commitMemory -> 1479125, usedMemory ->2487992
15/11/03 08:29:54 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/11/03 08:29:54 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
15/11/03 08:29:54 INFO mapred.Merger: Merging 3 sorted segments
15/11/03 08:29:54 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 2487968 bytes
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: Merged 3 segments, 2487992 bytes to disk to satisfy reduce memory limit
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: Merging 1 files, 2487992 bytes from disk
15/11/03 08:29:54 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/11/03 08:29:54 INFO mapred.Merger: Merging 1 sorted segments
15/11/03 08:29:54 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 2487982 bytes
15/11/03 08:29:54 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:29:54 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./reducer.py]
15/11/03 08:29:54 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/11/03 08:29:54 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/11/03 08:29:54 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:54 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:54 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:54 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:54 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:55 INFO streaming.PipeMapRed: Records R/W=16105/1
15/11/03 08:29:55 INFO streaming.PipeMapRed: R/W/S=100000/10015/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:55 INFO streaming.PipeMapRed: R/W/S=200000/20446/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:29:55 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:29:55 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:29:55 INFO mapred.Task: Task:attempt_local774667417_0001_r_000000_0 is done. And is in the process of committing
15/11/03 08:29:55 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:29:55 INFO mapred.Task: Task attempt_local774667417_0001_r_000000_0 is allowed to commit now
15/11/03 08:29:55 INFO output.FileOutputCommitter: Saved output of task 'attempt_local774667417_0001_r_000000_0' to hdfs://localhost:54310/user/ubuntu/output/_temporary/0/task_local774667417_0001_r_000000
15/11/03 08:29:55 INFO mapred.LocalJobRunner: Records R/W=16105/1 > reduce
15/11/03 08:29:55 INFO mapred.Task: Task 'attempt_local774667417_0001_r_000000_0' done.
15/11/03 08:29:55 INFO mapred.LocalJobRunner: Finishing task: attempt_local774667417_0001_r_000000_0
15/11/03 08:29:55 INFO mapred.LocalJobRunner: reduce task executor complete.
15/11/03 08:29:56 INFO mapreduce.Job:  map 100% reduce 100%
15/11/03 08:29:56 INFO mapreduce.Job: Job job_local774667417_0001 completed successfully
15/11/03 08:29:56 INFO mapreduce.Job: Counters: 35
	File System Counters
		FILE: Number of bytes read=4990001
		FILE: Number of bytes written=12117697
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=4990131
		HDFS: Number of bytes written=279999
		HDFS: Number of read operations=33
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=6
	Map-Reduce Framework
		Map input records=30213
		Map output records=261582
		Map output bytes=1964822
		Map output materialized bytes=2488004
		Input split bytes=330
		Combine input records=0
		Combine output records=0
		Reduce input groups=25985
		Reduce shuffle bytes=2488004
		Reduce input records=261582
		Reduce output records=25985
		Spilled Records=523164
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=200
		Total committed heap usage (bytes)=1325137920
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1480025
	File Output Format Counters 
		Bytes Written=279999
15/11/03 08:29:56 INFO streaming.StreamJob: Output directory: output
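
Note the first warning in the log: the -file option is deprecated in favor of the generic -files option. A roughly equivalent invocation (a sketch; -files takes a comma-separated list and, being a generic option, must precede the streaming options, and shipped files land in each task's working directory so they can be referenced by bare name) would be:

    cmd = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'hadoop-streaming-2.7.1.jar ' + \
        '-files /home/ubuntu/shortcourse/notes/scripts/wordcount1/mapper.py,' + \
        '/home/ubuntu/shortcourse/notes/scripts/wordcount1/reducer.py ' + \
        '-input input ' + \
        '-output output ' + \
        '-mapper mapper.py ' + \
        '-reducer reducer.py'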

In [14]:
# list the output
! {hadoop_root + 'bin/hdfs dfs -ls -R output'}


15/11/03 08:30:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 ubuntu supergroup          0 2015-11-03 08:29 output/_SUCCESS
-rw-r--r--   1 ubuntu supergroup     279999 2015-11-03 08:29 output/part-00000

In [15]:
# Let's see what's in the output file
# delete any previously copied local results first
! rm -rf /home/ubuntu/shortcourse/tmp/*
! {hadoop_root + 'bin/hdfs dfs -copyToLocal output/part-00000 /home/ubuntu/shortcourse/tmp/wc1-part-00000'}
! tail -n 20 /home/ubuntu/shortcourse/tmp/wc1-part-00000


15/11/03 08:30:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/03 08:30:44 WARN hdfs.DFSClient: DFSInputStream has been closed already
yourself--and	1
yourself.	4
yourself."	6
yourself.'	2
yourself;	1
yourself?	1
yourself?"	1
yourselves	4
yourselves,	1
yourselves?	1
youth	9
youth,	9
youth,'	3
youth?"	1
youths	1
zero,	1
zero-point,	1
zest	1
zigzag	1
zigzag,	1

3. Exercise: WordCount2

Count the frequency of each single word, where the words of interest are given in a pattern file.

For example, given a pattern.txt file which contains:

"a b c d"

and an input file which contains:

"d e a c f g h i a b c d"

then the output should be:

"a 2
 b 1
 c 2
 d 2"

Please copy mapper.py and reducer.py from the first wordcount example to the folder "/home/ubuntu/shortcourse/notes/scripts/wordcount2/" and modify them. The pattern file is provided in the wordcount2 folder as "wc2-pattern.txt".

Hints (a mapper sketch follows this list):

  1. Pass the pattern file using the -file option, and use -cmdenv to pass its file name as an environment variable.
  2. In the mapper, read the pattern file into a set.
  3. Only print the words that exist in the set.
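
A minimal mapper sketch following those hints (hypothetical; it assumes the pattern file is shipped with -file so it appears in the task's working directory, and that its name arrives in the PATTERN_FILE environment variable set via -cmdenv, as in the cell below). The reducer can stay identical to wordcount1:

    #!/usr/bin/env python
    # hypothetical wordcount2 mapper.py: count only words listed in the pattern file
    import os
    import sys

    # the pattern file name is passed via -cmdenv PATTERN_FILE=...
    with open(os.environ['PATTERN_FILE']) as f:
        patterns = set(f.read().split())

    for line in sys.stdin:
        for word in line.split():
            if word in patterns:
                print('%s\t%s' % (word, 1))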

In [16]:
# exercise: count, across the three books, only the words that appear in the given pattern file

cmd = hadoop_root + 'bin/hadoop jar ' + hadoop_root + 'hadoop-streaming-2.7.1.jar ' + \
    '-cmdenv PATTERN_FILE=wc2-pattern.txt ' + \
    '-input input ' + \
    '-output output2 ' + \
    '-mapper /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-reducer /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py ' + \
    '-file /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt'

! {cmd}


15/11/03 08:31:24 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
15/11/03 08:31:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
packageJobJar: [/home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py, /home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py, /home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt] [] /tmp/streamjob6344597239144833252.jar tmpDir=null
15/11/03 08:31:25 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/03 08:31:25 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/11/03 08:31:25 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/11/03 08:31:26 INFO mapred.FileInputFormat: Total input paths to process : 3
15/11/03 08:31:26 INFO mapreduce.JobSubmitter: number of splits:3
15/11/03 08:31:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1743492606_0001
15/11/03 08:31:27 INFO mapred.LocalDistributedCacheManager: Localized file:/home/ubuntu/shortcourse/notes/scripts/wordcount2/mapper.py as file:/home/ubuntu/shortcourse/hadoop-2.7.1/hdfs-store/tmp/mapred/local/1446568287169/mapper.py
15/11/03 08:31:27 INFO mapred.LocalDistributedCacheManager: Localized file:/home/ubuntu/shortcourse/notes/scripts/wordcount2/reducer.py as file:/home/ubuntu/shortcourse/hadoop-2.7.1/hdfs-store/tmp/mapred/local/1446568287170/reducer.py
15/11/03 08:31:27 INFO mapred.LocalDistributedCacheManager: Localized file:/home/ubuntu/shortcourse/notes/scripts/wordcount2/wc2-pattern.txt as file:/home/ubuntu/shortcourse/hadoop-2.7.1/hdfs-store/tmp/mapred/local/1446568287171/wc2-pattern.txt
15/11/03 08:31:27 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/11/03 08:31:27 INFO mapreduce.Job: Running job: job_local1743492606_0001
15/11/03 08:31:27 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/11/03 08:31:27 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
15/11/03 08:31:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:31:27 INFO mapred.LocalJobRunner: Waiting for map tasks
15/11/03 08:31:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1743492606_0001_m_000000_0
15/11/03 08:31:27 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:31:27 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:31:27 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/pride-and-prejudice.txt:0+717574
15/11/03 08:31:27 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:31:27 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:31:27 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:31:27 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:31:27 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:31:27 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:31:27 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:31:27 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
15/11/03 08:31:27 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/11/03 08:31:27 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
15/11/03 08:31:27 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/11/03 08:31:27 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
15/11/03 08:31:27 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
15/11/03 08:31:27 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:31:27 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:27 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:27 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:27 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:27 INFO streaming.PipeMapRed: Records R/W=2690/1
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=10000/23842/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:31:28 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:31:28 INFO mapred.LocalJobRunner: 
15/11/03 08:31:28 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:31:28 INFO mapred.MapTask: Spilling map output
15/11/03 08:31:28 INFO mapred.MapTask: bufstart = 0; bufend = 204177; bufvoid = 104857600
15/11/03 08:31:28 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26065732(104262928); length = 148665/6553600
15/11/03 08:31:28 INFO mapred.MapTask: Finished spill 0
15/11/03 08:31:28 INFO mapred.Task: Task:attempt_local1743492606_0001_m_000000_0 is done. And is in the process of committing
15/11/03 08:31:28 INFO mapred.LocalJobRunner: Records R/W=2690/1
15/11/03 08:31:28 INFO mapred.Task: Task 'attempt_local1743492606_0001_m_000000_0' done.
15/11/03 08:31:28 INFO mapred.LocalJobRunner: Finishing task: attempt_local1743492606_0001_m_000000_0
15/11/03 08:31:28 INFO mapred.LocalJobRunner: Starting task: attempt_local1743492606_0001_m_000001_0
15/11/03 08:31:28 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:31:28 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:31:28 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/sherlock-holmes.txt:0+594933
15/11/03 08:31:28 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:31:28 INFO mapreduce.Job: Job job_local1743492606_0001 running in uber mode : false
15/11/03 08:31:28 INFO mapreduce.Job:  map 100% reduce 0%
15/11/03 08:31:28 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:31:28 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:31:28 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:31:28 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:31:28 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:31:28 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:31:28 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:31:28 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: Records R/W=3018/1
15/11/03 08:31:28 INFO streaming.PipeMapRed: R/W/S=10000/17217/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:28 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:31:28 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:31:28 INFO mapred.LocalJobRunner: 
15/11/03 08:31:28 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:31:28 INFO mapred.MapTask: Spilling map output
15/11/03 08:31:28 INFO mapred.MapTask: bufstart = 0; bufend = 172324; bufvoid = 104857600
15/11/03 08:31:28 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26087308(104349232); length = 127089/6553600
15/11/03 08:31:28 INFO mapred.MapTask: Finished spill 0
15/11/03 08:31:29 INFO mapred.Task: Task:attempt_local1743492606_0001_m_000001_0 is done. And is in the process of committing
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Records R/W=3018/1
15/11/03 08:31:29 INFO mapred.Task: Task 'attempt_local1743492606_0001_m_000001_0' done.
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Finishing task: attempt_local1743492606_0001_m_000001_0
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Starting task: attempt_local1743492606_0001_m_000002_0
15/11/03 08:31:29 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:31:29 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:31:29 INFO mapred.MapTask: Processing split: hdfs://localhost:54310/user/ubuntu/input/alice.txt:0+167518
15/11/03 08:31:29 INFO mapred.MapTask: numReduceTasks: 1
15/11/03 08:31:29 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
15/11/03 08:31:29 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
15/11/03 08:31:29 INFO mapred.MapTask: soft limit at 83886080
15/11/03 08:31:29 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
15/11/03 08:31:29 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
15/11/03 08:31:29 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
15/11/03 08:31:29 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./mapper.py]
15/11/03 08:31:29 INFO mapred.LineRecordReader: Found UTF-8 BOM and skipped it
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: Records R/W=3022/1
15/11/03 08:31:29 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:31:29 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:31:29 INFO mapred.LocalJobRunner: 
15/11/03 08:31:29 INFO mapred.MapTask: Starting flush of map output
15/11/03 08:31:29 INFO mapred.MapTask: Spilling map output
15/11/03 08:31:29 INFO mapred.MapTask: bufstart = 0; bufend = 44784; bufvoid = 104857600
15/11/03 08:31:29 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26181860(104727440); length = 32537/6553600
15/11/03 08:31:29 INFO mapred.MapTask: Finished spill 0
15/11/03 08:31:29 INFO mapred.Task: Task:attempt_local1743492606_0001_m_000002_0 is done. And is in the process of committing
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Records R/W=3022/1
15/11/03 08:31:29 INFO mapred.Task: Task 'attempt_local1743492606_0001_m_000002_0' done.
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Finishing task: attempt_local1743492606_0001_m_000002_0
15/11/03 08:31:29 INFO mapred.LocalJobRunner: map task executor complete.
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Waiting for reduce tasks
15/11/03 08:31:29 INFO mapred.LocalJobRunner: Starting task: attempt_local1743492606_0001_r_000000_0
15/11/03 08:31:29 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/03 08:31:29 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
15/11/03 08:31:29 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@1ea672d
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334154944, maxSingleShuffleLimit=83538736, mergeThreshold=220542272, ioSortFactor=10, memToMemMergeOutputsThreshold=10
15/11/03 08:31:29 INFO reduce.EventFetcher: attempt_local1743492606_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
15/11/03 08:31:29 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1743492606_0001_m_000001_0 decomp: 235872 len: 235876 to MEMORY
15/11/03 08:31:29 INFO reduce.InMemoryMapOutput: Read 235872 bytes from map-output for attempt_local1743492606_0001_m_000001_0
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 235872, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->235872
15/11/03 08:31:29 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1743492606_0001_m_000002_0 decomp: 61056 len: 61060 to MEMORY
15/11/03 08:31:29 INFO reduce.InMemoryMapOutput: Read 61056 bytes from map-output for attempt_local1743492606_0001_m_000002_0
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 61056, inMemoryMapOutputs.size() -> 2, commitMemory -> 235872, usedMemory ->296928
15/11/03 08:31:29 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1743492606_0001_m_000000_0 decomp: 278513 len: 278517 to MEMORY
15/11/03 08:31:29 INFO reduce.InMemoryMapOutput: Read 278513 bytes from map-output for attempt_local1743492606_0001_m_000000_0
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 278513, inMemoryMapOutputs.size() -> 3, commitMemory -> 296928, usedMemory ->575441
15/11/03 08:31:29 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
15/11/03 08:31:29 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: finalMerge called with 3 in-memory map-outputs and 0 on-disk map-outputs
15/11/03 08:31:29 INFO mapred.Merger: Merging 3 sorted segments
15/11/03 08:31:29 INFO mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 575429 bytes
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: Merged 3 segments, 575441 bytes to disk to satisfy reduce memory limit
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: Merging 1 files, 575441 bytes from disk
15/11/03 08:31:29 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
15/11/03 08:31:29 INFO mapred.Merger: Merging 1 sorted segments
15/11/03 08:31:29 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 575433 bytes
15/11/03 08:31:29 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:31:29 INFO streaming.PipeMapRed: PipeMapRed exec [/home/ubuntu/shortcourse/notes/./reducer.py]
15/11/03 08:31:29 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
15/11/03 08:31:29 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=1000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:29 INFO streaming.PipeMapRed: R/W/S=10000/0/0 in:NA [rec/s] out:NA [rec/s]
15/11/03 08:31:30 INFO streaming.PipeMapRed: Records R/W=77075/1
15/11/03 08:31:30 INFO streaming.PipeMapRed: MRErrorThread done
15/11/03 08:31:30 INFO streaming.PipeMapRed: mapRedFinished
15/11/03 08:31:30 INFO mapred.Task: Task:attempt_local1743492606_0001_r_000000_0 is done. And is in the process of committing
15/11/03 08:31:30 INFO mapred.LocalJobRunner: 3 / 3 copied.
15/11/03 08:31:30 INFO mapred.Task: Task attempt_local1743492606_0001_r_000000_0 is allowed to commit now
15/11/03 08:31:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1743492606_0001_r_000000_0' to hdfs://localhost:54310/user/ubuntu/output2/_temporary/0/task_local1743492606_0001_r_000000
15/11/03 08:31:30 INFO mapred.LocalJobRunner: Records R/W=77075/1 > reduce
15/11/03 08:31:30 INFO mapred.Task: Task 'attempt_local1743492606_0001_r_000000_0' done.
15/11/03 08:31:30 INFO mapred.LocalJobRunner: Finishing task: attempt_local1743492606_0001_r_000000_0
15/11/03 08:31:30 INFO mapred.LocalJobRunner: reduce task executor complete.
15/11/03 08:31:30 INFO mapreduce.Job:  map 100% reduce 100%
15/11/03 08:31:30 INFO mapreduce.Job: Job job_local1743492606_0001 completed successfully
15/11/03 08:31:30 INFO mapreduce.Job: Counters: 35
	File System Counters
		FILE: Number of bytes read=1167375
		FILE: Number of bytes written=3775289
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=4990131
		HDFS: Number of bytes written=172
		HDFS: Number of read operations=33
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=6
	Map-Reduce Framework
		Map input records=30213
		Map output records=77075
		Map output bytes=421285
		Map output materialized bytes=575453
		Input split bytes=330
		Combine input records=0
		Combine output records=0
		Reduce input groups=20
		Reduce shuffle bytes=575453
		Reduce input records=77075
		Reduce output records=20
		Spilled Records=154150
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=178
		Total committed heap usage (bytes)=1031274496
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=1480025
	File Output Format Counters 
		Bytes Written=172
15/11/03 08:31:30 INFO streaming.StreamJob: Output directory: output2

Verify Results

  1. Copy the output file to the local file system.
  2. Run the following command on the wordcount1 output and compare it with the wordcount2 result:

    sort -nrk 2,2 part-00000 | head -n 20

wc1-part-00000 below is the local copy of the previous wordcount's output (wordcount1).


In [17]:
! rm -rf /home/ubuntu/shortcourse/tmp/wc2-part-00000
! {hadoop_root + 'bin/hdfs dfs -copyToLocal output2/part-00000 /home/ubuntu/shortcourse/tmp/wc2-part-00000'}
! cat /home/ubuntu/shortcourse/tmp/wc2-part-00000 | sort -nrk2,2


15/11/03 08:32:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/03 08:32:05 WARN hdfs.DFSClient: DFSInputStream has been closed already
the	11273
to	7594
of	6978
and	6887
a	5182
I	4533
in	3916
was	3484
that	3204
her	2428
his	2356
you	2317
it	2284
he	2148
as	2141
had	2107
with	2097
she	2095
not	2076
be	1975

In [18]:
! sort -nr -k2,2 /home/ubuntu/shortcourse/tmp/wc1-part-00000 | head -n 20


the	11273
to	7594
of	6978
and	6887
a	5182
I	4533
in	3916
was	3484
that	3204
her	2428
his	2356
you	2317
it	2284
he	2148
as	2141
had	2107
with	2097
she	2095
not	2076
be	1975
sort: write failed: standard output: Broken pipe
sort: write error
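
The trailing "Broken pipe" messages are harmless: head exits after printing 20 lines, which closes the pipe while sort is still writing.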

In [19]:
# stop yarn
!{hadoop_root + 'sbin/stop-yarn.sh'}
# don't stop hdfs for now; it will be used later
# !{hadoop_stop_hdfs_cmd}


stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop

In [ ]: