In this post, I would like to share how to build the jar file in IntelliJ IDEA so that we can test our program on a distributed cluster.
I am using Hadoop 2.3 (CDH 5.0.0), but this program also works on Hadoop 2.4. There is very little material on the Internet about writing Hadoop programs with IDEA. I tried several different approaches and finally found a way to run the program successfully.
First, create a new project and add the WordCount example code. Here is the program I used in this example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;
import java.util.StringTokenizer;

public class TestWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(TestWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Of course, we need to add the Hadoop libraries to the project. In this example, only three jars are needed.
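I won't guess exactly which three jars were used originally, but the classes imported in this program come from hadoop-common and hadoop-mapreduce-client-core. As a rough pointer only (the paths below are assumptions based on a CDH 5 package install and may differ on your machine), you can locate them with:

ls /usr/lib/hadoop/hadoop-common-*.jar
ls /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core-*.jar
# parcel installs keep the same jars under /opt/cloudera/parcels/CDH/lib/ instead

Once found, they can be added through File -> Project Structure -> Libraries.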
Then we need to add a new artifact. Click File -> Project Structure and select Artifacts on the left. Click the Add button -> JAR -> From modules with dependencies, and choose the module you created. If you specify the Main class here, you won't need to pass the class name in the run command later. Click OK to save the settings.
Now you can click Build -> Build Artifacts, select wordcount.jar, and click Build.
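If you kept the defaults, IDEA writes the jar under the project's out/artifacts directory. The exact folder name depends on the artifact name you chose, so the path below is just an assumption for an artifact named wordcount:jar:

ls out/artifacts/wordcount_jar/wordcount.jar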
Now we have our jar file. We can run the MapReduce program with the following command, where input is the input path and output is the output path (both in HDFS).
hadoop jar wordcount.jar TestWordCount input output
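Note that the input directory must already exist in HDFS and the output directory must not, or the job will fail. If you still need to stage some input, a sketch like the following works (sample.txt is just a placeholder file name):

hadoop fs -mkdir -p input
hadoop fs -put sample.txt input/
hadoop fs -rm -r output    # only if a previous run left this directory behind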
Be careful if you specified the Main class in the artifact settings: in that case you should omit the class name from the command.
hadoop jar wordcount.jar input output
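This works because IDEA writes the class into the jar's manifest, and hadoop jar falls back to the manifest's Main-Class entry when no class name is given. If you want to verify it (assuming unzip is installed), print the manifest:

unzip -p wordcount.jar META-INF/MANIFEST.MF
# expect a line like: Main-Class: TestWordCount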
After the job is done, you can check the result in the output path.
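Each reducer writes a part-r-NNNNN file, so the word counts can be listed and printed roughly like this:

hadoop fs -ls output
hadoop fs -cat output/part-r-00000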