Introduction to Hadoop MapReduce

3. Optimization

First principle of optimizing Hadoop workflow: Reduce data movement in the shuffle phase

$ hdfs dfs -mkdir intro-to-hadoop
$ cp -R /local/repository/codes .

How to compile a Java source code that utilizes the Hadoop MapReduce libraries

$ cat codes/compileMR.sh

https://github.com/linhbngo/cluster-template/blob/hadoop/codes/airlines/avgDepartureDelay.java

Run a jar file on Hadoop

$ bash ./codes/compileMR.sh codes/airlines/avgDepartureDelay
$ yarn jar codes/airlines/avgDepartureDelay.jar avgDepartureDelay /repository/airlines/data/ avg-output-1

18/11/08 11:31:38 INFO mapreduce.Job: Job job_1541649618909_0006 completed successfully
18/11/08 11:31:39 INFO mapreduce.Job: Counters: 54
        File System Counters
                FILE: Number of bytes read=1092623207
                FILE: Number of bytes written=2207795995
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=12038926861
                HDFS: Number of bytes written=157
                HDFS: Number of read operations=293
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=96
                Launched reduce tasks=1
                Data-local map tasks=94
                Rack-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=38760984
                Total time spent by all reduces in occupied slots (ms)=13601844
                Total time spent by all map tasks (ms)=717796
                Total time spent by all reduce tasks (ms)=125943
                Total vcore-milliseconds taken by all map tasks=717796
                Total vcore-milliseconds taken by all reduce tasks=125943
                Total megabyte-milliseconds taken by all map tasks=39691247616
                Total megabyte-milliseconds taken by all reduce tasks=13928288256
        Map-Reduce Framework
                Map input records=123534991
                Map output records=121232833
                Map output bytes=850157535
                Map output materialized bytes=1092623777
                Input split bytes=13440
                Combine input records=0
                Combine output records=0
                Reduce input groups=29
                Reduce shuffle bytes=1092623777
                Reduce input records=121232833
                Reduce output records=29
                Spilled Records=242465666
                Shuffled Maps =96
                Failed Shuffles=0
                Merged Map outputs=96
                GC time elapsed (ms)=88962
                CPU time spent (ms)=2512020
                Physical memory (bytes) snapshot=317483470848
                Virtual memory (bytes) snapshot=4810731032576
                Total committed heap usage (bytes)=365535166464
                Peak Map Physical memory (bytes)=3499929600
                Peak Map Virtual memory (bytes)=49104142336
                Peak Reduce Physical memory (bytes)=4383043584
                Peak Reduce Virtual memory (bytes)=97246007296
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=12038913421
        File Output Format Counters
                Bytes Written=157

What is being passed from Map to Reduce?

Common optimization approaches:

Additional combiner function
In-mapper reduction of key/value pairs

Combiner function

https://github.com/linhbngo/cluster-template/blob/hadoop/codes/airlines/avgDepartureDelayVersionTwo.java

Excercise

Compile and run avgDepatureDelayVersionTwo.java on the Hadoop cluster
Examine its output information and compare shuffle data size.

In-mapper reduction of Key/Value pairs

https://github.com/linhbngo/cluster-template/blob/hadoop/codes/airlines/avgDepartureDelayVersionThree.java

Excercise

Compile and run avgDepatureDelayVersionTwo.java on the Hadoop cluster
Examine its output information and compare shuffle data size.