Hadoop Programming

Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Commands

Command	Description
archive	create a Hadoop archive
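jar	run a jar file
classpath	print the class path needed to get the Hadoop jar and the required libraries
distcp	copy files or directories recursively
fs	run a generic filesystem user client
version	print the version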

Generic options:

Generic Option	Description
`-conf <config file>`	Specify the config file
`-D <property=value>`	Set the value of a property
`-fs <local> or <namenode:port>`	Specify a namenode
`-jt <local> or <jobtracker:port>`	Specify a job tracker
`-files <list of files>`	Specify comma-separated files to be copied to the map reduce cluster
`-libjars <list of jars>`	Specify comma-separated jar files to include in the classpath
`-archives <list of archives>`	Specify comma-separated archives to be unarchived on the compute machines

Example usage of the generic option parser

Setting the number of reduce tasks with the `-D` generic option:

hadoop jar myclass.jar MyClass -D mapred.reduce.tasks=0 /example/input.txt /example/output
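For `-D` to take effect, the program has to pass its arguments through Hadoop's generic option parser, for example by running through `ToolRunner`. The following is a simplified, Hadoop-free sketch of the idea only: generic options are consumed first, and the remaining arguments are handed to the program. The class `GenericOptionSketch` and its fields are hypothetical names, and Hadoop's real parser handles many more options than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; this is not Hadoop's GenericOptionsParser.
final class GenericOptionSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                // "-D name=value": split on the first '=' only
                String[] kv = args[++i].split("=", 2);
                properties.put(kv[0], kv.length > 1 ? kv[1] : "");
            } else {
                // Everything else is left for the program itself
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        String[] demo = {"-D", "mapred.reduce.tasks=0",
                         "/example/input.txt", "/example/output"};
        GenericOptionSketch parsed = new GenericOptionSketch(demo);
        System.out.println(parsed.properties);     // {mapred.reduce.tasks=0}
        System.out.println(parsed.remainingArgs);  // [/example/input.txt, /example/output]
    }
}
```

In this sketch the program would see only the input and output paths as its arguments, while the property ends up in the configuration.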

Writing a Mapper Class

The Context object passed to the map() function is used to write the mapper's output key/value pairs
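As an illustration, here is a minimal sketch of a mapper that emits a count of 1 for every word in a line. The class name `WordMapper` is hypothetical, and the input types assume the default text input, where the key is a `LongWritable` offset and the value is a `Text` line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example class: emits (word, 1) for every token in a line.
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            // All mapper output goes through the Context object
            context.write(word, ONE);
        }
    }
}
```

A class like this would be registered on the job with `job.setMapperClass(WordMapper.class)`.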

Hadoop Data Types for Keys & Values

WritableComparable: a Writable that can also be compared. Keys must implement it because they are sorted; it can be used for values as well
Writable: an interface for efficiently serializing objects for input and output. It is sufficient for values

Class	Description
BooleanWritable	Standard boolean writable
ByteWritable	a single byte
DoubleWritable	a double
FloatWritable	a float
IntWritable	an integer
LongWritable	a long
Text	text stored in UTF-8 format
NullWritable	Placeholder when the key or value is not needed
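To show what "serializing objects for input and output" amounts to, here is a Hadoop-free sketch. The interface `WritableSketch` is a hypothetical stand-in that declares the same two methods as Hadoop's `Writable` (`write` and `readFields`), and `WordCountPair` is an invented value type used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type: a word together with its count.
final class WordCountPair implements WritableSketch {
    String word = "";
    int count;

    WordCountPair() { }

    WordCountPair(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read back in the same order they were written
        word = in.readUTF();
        count = in.readInt();
    }

    // Serialize to a byte array and read the bytes back into a fresh object.
    static WordCountPair roundTrip(WordCountPair original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordCountPair copy = new WordCountPair();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountPair copy = roundTrip(new WordCountPair("yoda", 3));
        System.out.println(copy.word + " " + copy.count);  // prints "yoda 3"
    }
}

// Hypothetical stand-in with the same two methods as Hadoop's Writable.
interface WritableSketch {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
```

A type that is also meant to serve as a key would additionally need a `compareTo` method, which is what WritableComparable adds.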

MapReduce Interface

Job: represents a MapReduce job configuration. Job is typically used to specify the Mapper, combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat
Mapper class: maps input key/value pairs to a set of intermediate key/value pairs
Reducer class: reduces a set of intermediate values which share a key to a smaller set of values
Combiner class: performs local aggregation of the intermediate outputs, to cut down the cost of I/O and communication
Partitioner class: partitions the key space, i.e. decides which reduce task receives each intermediate key
Input and output: FileInputFormat indicates the set of input files, and FileOutputFormat indicates the output directory
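The roles listed above can be traced in a small in-memory word count. This is a sketch under stated assumptions, not Hadoop code: `MapReduceSketch` and its methods are hypothetical, each input string stands for one input split, and the partition function follows the hash-modulo scheme used by Hadoop's default hash partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through; no Hadoop classes are involved.
final class MapReduceSketch {

    // Map + combine: count the words of one split locally,
    // so that fewer pairs have to be sent to the reducers.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    // Partition: pick the reducer for a key by hashing the key.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reduce: each reducer sums the partial counts of the keys it receives.
    static List<Map<String, Integer>> run(List<String> splits, int numReducers) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String split : splits) {
            for (Map.Entry<String, Integer> e : mapAndCombine(split).entrySet()) {
                int target = partition(e.getKey(), numReducers);
                reducers.get(target).merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> out = run(Arrays.asList("a b a", "b c"), 2);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + out.get(i));
        }
    }
}
```

Because every occurrence of a key hashes to the same partition, all partial counts for a word meet in a single reducer, which is what makes the final sum correct.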

Create a new job

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "job name");

Set the main jar class

job.setJarByClass(Class);

Set the mapper class

job.setMapperClass(Class);

Set the combiner class

job.setCombinerClass(Class);

Set the reducer class

job.setReducerClass(Class);

Set the number of reduce tasks:

job.setNumReduceTasks(int);

Set the input file(s):

Specify input files by Path (setInputPaths replaces any paths set earlier, addInputPath appends one more):

FileInputFormat.setInputPaths(Job, Path);
FileInputFormat.addInputPath(Job, Path);

or by String (a comma-separated list of paths)

FileInputFormat.setInputPaths(Job, String);
FileInputFormat.addInputPaths(Job, String);

Set the output directory:

FileOutputFormat.setOutputPath(Job, Path);

Minimal Hadoop Program

The simplest Hadoop program contains only the minimal components of a Hadoop job: it sets the input and output paths and relies on the defaults for everything else, i.e. the default mapper (Mapper.class) and the default reducer (Reducer.class)

With the default text input format, records are read line by line. The key is the byte offset of the line from the beginning of the file (LongWritable) and the value is the line itself (Text). The default mapper (Mapper.class) passes each record through unchanged
The default reducer (Reducer.class) writes its input directly to the output as is, without any aggregation

MinimalProgram.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalProgram {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "minimal program");
        // Locate the job jar from the class that contains main()
        job.setJarByClass(MinimalProgram.class);
        // No mapper or reducer is set, so the defaults are used
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compile

hduser@ubuntu$ javac -cp [provide class path] MinimalProgram.java

Archive into a jar file

hduser@ubuntu$ jar cvf minimal.jar *.class

Run Hadoop

hduser@ubuntu$ hadoop jar minimal.jar MinimalProgram /example/input.txt /example/out

Retrieve the results from HDFS

hduser@ubuntu$ hadoop fs -cat /example/out/part-r-00000
0   Master Kenobi, you disappoint me.
34  Yoda holds you in such high esteem.
70  Surely you can do better!
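The keys 0, 34 and 70 in this output are the byte offsets at which each line starts in the input file. The following Hadoop-free sketch recomputes them; `LineOffsetSketch` is a hypothetical helper, and it assumes UTF-8 text with a single `\n` ending each line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes the byte offset at which each line starts.
final class LineOffsetSketch {
    static List<Long> lineOffsets(String fileContents) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            offsets.add(offset);
            // +1 accounts for the newline that terminates the line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = "Master Kenobi, you disappoint me.\n"
                     + "Yoda holds you in such high esteem.\n"
                     + "Surely you can do better!\n";
        System.out.println(lineOffsets(input));  // prints [0, 34, 70]
    }
}
```

The first line is 33 bytes plus a newline, so the second line starts at offset 34; the second line is 35 bytes plus a newline, so the third starts at 70.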

Understanding components of this minimal Hadoop program

Libraries: import the required libraries
Main class: the name of the public main class must be the same as the file name
Configuration: create an instance of the Configuration class and specify the desired configuration properties
Job setup: set the job name, the jar class, and other job properties
Input and output format: set the input file(s) and the output directory

Hadoop Program Configurations

Default properties are defined in mapred-default.xml
Specifying properties in a configuration directory

hadoop --config <config_dir> jar <jarfile> <class_name>

Specify the configurations explicitly within the program

Configuration conf = new Configuration();
conf.set("property1", "value1");

Modify the configuration properties with generic options
```
hadoop jar <jarfile> <class_name> -D property=value
```

Wordcount

wordcount_v1.java

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;

public class wordcount extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains any generic options parsed by ToolRunner
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(getClass());
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(IntSumReducer.class);

    // Types of the keys and values written by the reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner handles the generic options before calling run()
    int exitCode = ToolRunner.run(new wordcount(), args);
    System.exit(exitCode);
  }

}

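Built-in Mapper and Reducer classes:

TokenCounterMapper tokenizes each input value and emits every word with a count of 1

import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;

IntSumReducer sums the integer values received for each key

import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

Set the types of the reducer's output keys and values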

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
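For comparison, a hand-written reducer with the same effect as the built-in sum reducer could look roughly like the sketch below. The class name `WordSumReducer` is hypothetical; the key and value types match the output types set above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical example class: sums the counts emitted for each word.
public class WordSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // One output pair per word: (word, total count)
        context.write(word, total);
    }
}
```

Because addition is associative and commutative, a reducer of this kind can also be set as the combiner with `job.setCombinerClass(WordSumReducer.class)`.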



