Hadoop WordCount Java Example

posted on Nov 20th, 2016

Apache Hadoop

Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

The Hadoop framework operates in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Prerequisites

1) A machine with the Ubuntu 14.04 LTS operating system installed.

2) Apache Hadoop 2.6.4 pre-installed (How to install Hadoop on Ubuntu 14.04)

Hadoop WordCount Example

Step 1 - Add all Hadoop jar files to your Java project. Add the following jars:

/usr/local/hadoop/share/hadoop/common/*.jar
/usr/local/hadoop/share/hadoop/common/lib/*.jar
/usr/local/hadoop/share/hadoop/mapreduce/*.jar
/usr/local/hadoop/share/hadoop/mapreduce/lib/*.jar 
/usr/local/hadoop/share/hadoop/yarn/*.jar
/usr/local/hadoop/share/hadoop/yarn/lib/*.jar
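
If you build from the command line rather than an IDE, note that the hadoop classpath command prints this same set of jars, so you can pass it straight to javac instead of listing the paths by hand:

$ hadoop classpath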

WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: splits each input line into tokens and emits a (word, 1) pair per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums all counts received for a word and emits (word, total).
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Optional: reuse the reducer as a combiner for map-side pre-aggregation.
    //job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths are hardcoded to HDFS locations for this example.
    FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/user/hduser/input"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/user/hduser/output"));
    // Submit the job, wait for it to finish, and exit 0 on success.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
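
Note that the input and output paths are hardcoded in main(). MapReduce will not overwrite an existing output directory, so before re-running the job remove the previous output:

$ hdfs dfs -rm -r /user/hduser/output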

Step 2 - Change the directory to /usr/local/hadoop/sbin

$ cd /usr/local/hadoop/sbin

Step 3 - Start all Hadoop daemons.

$ start-all.sh
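
On Hadoop 2.x, start-all.sh is deprecated in favor of starting HDFS and YARN separately; both routes launch the same daemons:

$ start-dfs.sh
$ start-yarn.sh

You can verify the daemons are up with jps, which should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.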

Step 4 - Create an input.txt file. In my case, I have stored input.txt in the /home/hduser/Desktop/hadoop/ directory.

input.txt

Step 5 - Add the following lines to the input.txt file.

hadoop java hello pig hive sqoop hadoop
hadoop java hello pig hive sqoop hadoop
hadoop java hello pig hive sqoop hadoop
hadoop java hello pig hive sqoop hadoop
hadoop java hello pig hive sqoop hadoop
hadoop java hello pig hive sqoop hadoop

Step 6 - Make a new input directory in HDFS.

$ hdfs dfs -mkdir /user/hduser/input
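
If the parent directories do not exist yet, add the -p flag so they are created along the way:

$ hdfs dfs -mkdir -p /user/hduser/input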

Step 7 - Copy input.txt from the local file system to HDFS.

$ hdfs dfs -copyFromLocal /home/hduser/Desktop/hadoop/input.txt /user/hduser/input
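
You can confirm the file landed in HDFS with a listing:

$ hdfs dfs -ls /user/hduser/input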

Step 8 - Run your WordCount program by submitting the project jar file to Hadoop. Building the jar is left to you; a possible sketch follows the command below.

$ hadoop jar /path/wordcount.jar WordCount
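
If you need a starting point for the jar, here is a minimal command-line sketch (one possible approach, assuming the hadoop binary is on your PATH; it compiles against the jars reported by hadoop classpath and packages the resulting classes):

$ mkdir classes
$ javac -cp "$(hadoop classpath)" -d classes WordCount.java
$ jar cvf wordcount.jar -C classes .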

Step 9 - Now you can see the output file.

$ hdfs dfs -cat /user/hduser/output/part-r-00000
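
Given the input above, hadoop appears twice on each of the six lines and every other word once per line, so the output should look like this:

hadoop	12
hello	6
hive	6
java	6
pig	6
sqoop	6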

Step 10 - Don't forget to stop the Hadoop daemons.

$ stop-all.sh
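
As with starting, on Hadoop 2.x you can stop the services separately instead:

$ stop-yarn.sh
$ stop-dfs.sh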

