Geeks With Blogs
Rahul Anand's Blog If my mind can conceive it, and my heart can believe it, I know I can achieve it.

Before we delve into the IDE and start writing code, let's understand a bit more about MapReduce.

The MapReduce computation takes a set of input key/value pairs, and produces a set of output key/value pairs.

The user of the MapReduce library expresses the computation as two functions: Map and Reduce. Map takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key (k2) and passes them to the Reduce function. The Reduce function accepts an intermediate key (k2) and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation.

The Map

A map transform converts an input row, expressed as a key/value pair, into output key/value pairs:

• map(k1,v1) -> list<k2,v2>

That is, for an input it returns a list containing zero or more (k,v) pairs:

1. The output can be a different key from the input

2. The output can have multiple entries with the same key

The Reduce

A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output.

• reduce(k2, list<v2>) -> list<v3>
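To make the two signatures concrete, here is a minimal plain-Java sketch of the classic word-count example (no Hadoop required; the method and class names are illustrative, not part of any Hadoop API). The map emits (word, 1) pairs — note the repeated keys — and the reduce sums the values grouped under one key:

```java
import java.util.*;

public class WordCountSketch {
    // map(k1, v1) -> list<k2, v2>: for each word in the line, emit (word, 1)
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce(k2, list<v2>) -> list<v3>: sum the counts for one word
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return Collections.singletonList(sum);
    }

    public static void main(String[] args) {
        // Map phase output contains repeated keys ("to" and "be" appear twice)
        List<Map.Entry<String, Integer>> mapped = map(0L, "to be or not to be");
        // Group intermediate values by key -- this is what the framework's shuffle does
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()).get(0));
        }
    }
}
```

Because each (word, list of counts) pair is reduced independently, the reduce calls could run on different machines — which is exactly the property the engine exploits.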

The MapReduce Engine

The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data.

Apache Hadoop is one such MapReduce engine.

5-step parallel and distributed computation in MapReduce:

1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
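The five steps above can be simulated in plain Java with hypothetical (year, temperature) records — a sketch of the data flow only, not of Hadoop's actual distributed machinery:

```java
import java.util.*;

public class FiveStepSimulation {
    // Steps 2-4 of the flow above: map, shuffle (group by K2), reduce (max)
    static Map<String, Integer> maxByYear(String[] records) {
        // Step 2: Map() emits (K2 = year, V2 = temperature) for each record
        // Step 3: the "shuffle" groups every value for the same K2 together
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String record : records) {
            String[] kv = record.split(",");
            shuffled.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(Integer.parseInt(kv[1]));
        }
        // Step 4: Reduce() runs once per K2 key value, keeping the maximum
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), Collections.max(e.getValue()));
        }
        return result; // Step 5: final output, sorted by K2 (TreeMap keeps keys ordered)
    }

    public static void main(String[] args) {
        // Step 1: prepare the Map() input -- hypothetical "year,temperature" records
        String[] records = {"1901,45", "1902,120", "1901,317", "1902,78"};
        for (Map.Entry<String, Integer> e : maxByYear(records).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In a real cluster, steps 2 and 4 run as many parallel tasks on different nodes; here they are simple loops, but the grouping-by-key structure is the same.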

Refer to http://en.wikipedia.org/wiki/MapReduce for more details.

For our first Hadoop program we will follow the example from the book “Hadoop: The Definitive Guide”. The sample application finds the maximum temperature for each year by analyzing the weather data provided by the National Climatic Data Center (NCDC).
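The book's mapper extracts the year and the air temperature from fixed-width NCDC records (year at columns 15–19, signed temperature at 87–92, quality code at 92). Here is a plain-Java sketch of that parsing logic, following the book's offsets; the synthetic record built in main is fabricated filler, not a real NCDC line:

```java
public class NcdcRecordParser {
    static final int MISSING = 9999;

    // Returns {year, temperature} or null when the reading is missing or suspect,
    // using the fixed-width column offsets from "Hadoop: The Definitive Guide"
    static String[] parse(String line) {
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            // skip the explicit '+' sign (pre-Java 7 parseInt rejected it)
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature == MISSING || !quality.matches("[01459]")) {
            return null;
        }
        return new String[]{year, String.valueOf(airTemperature)};
    }

    public static void main(String[] args) {
        // Synthetic record: filler except year "1901", temperature "+0317", quality "1"
        StringBuilder sb = new StringBuilder();
        while (sb.length() < 15) sb.append('0');
        sb.append("1901");
        while (sb.length() < 87) sb.append('0');
        sb.append("+0317").append('1');
        String[] parsed = parse(sb.toString());
        System.out.println(parsed[0] + "\t" + parsed[1]); // 1901 and 317 (tenths of a degree)
    }
}
```

The mapper in the sample application emits the (year, temperature) pair this parse produces; the reducer then only has to take the maximum over each year's values.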

Tip: It is important to use the same JDK version that is available on your Hadoop system. Forgetting this may cause an “Unsupported major.minor version 52.0” error when you try to run your program. You can find the Java version on the Hadoop system by running “java -version”.

Tip: It is likewise important to build against the same Hadoop version that runs on your Hadoop system. Forgetting this may cause class or method incompatibility errors (for example, NoSuchMethodError or ClassNotFoundException) when you try to run your program. You can find the Hadoop version on the Hadoop system by running “hadoop version”.

SFTP the files to a directory in your virtual Linux sandbox, and run gunzip:

 [root@sandbox rahul]# gunzip *.gz
 [root@sandbox rahul]# ll
 total 1740
 -rw-r--r-- 1 root root 888190 Jun 3 12:37 1901
 -rw-r--r-- 1 root root 888978 Jun 3 12:37 1902

Next, create a Java project and add the external Hadoop library JAR files to its build path. The Hadoop JAR files are available under HADOOP_HOME/share/hadoop. Search for all *.jar files and add them to the build path.

In general you will require the JARs from common, hdfs, and mapreduce.

Now add three class files: MaxTemperature.java, MaxTemperatureMapper.java, and MaxTemperatureReducer.java. Download the source code from the attached zip file.

MaxTemperature.java is the MapReduce job driver: it wires up the mapper and reducer functions provided in the corresponding Java files and schedules the map/reduce tasks to obtain the final output.

Export the project from Eclipse using the “Runnable JAR file” option and SFTP it to the Hadoop system (give it a name, say MaxTemperature.jar).

You can now run the jar using:

 [root@sandbox rahul]# hadoop jar MaxTemperature.jar

Since we did not provide the input and output paths, it will print the usage help “Usage: MaxTemperature <input path> <output path>”. This confirms everything so far is good.

Tip: In case you get any error, create a simple Java “Hello World” program and try to run it with both java and hadoop:

 [root@sandbox rahul]# java -jar HelloWorld.jar
 Hello World!
 [root@sandbox rahul]# hadoop jar HelloWorld.jar
 Hello World!

Copy your sample data to HDFS:

 [root@sandbox rahul]# hadoop fs -mkdir /mapred/temp
 [root@sandbox rahul]# hadoop fs -copyFromLocal 1901 /mapred/temp/1901
 [root@sandbox rahul]# hadoop fs -copyFromLocal 1902 /mapred/temp/1902

Execute the Hadoop job, providing the input and output paths:

 [root@sandbox rahul]# hadoop jar MaxTemperature.jar /mapred/temp output

And you have now successfully executed a Hadoop job! The result will be available in the user's default HDFS directory. In my case it is /user/root/output, and it can be viewed with the following command:

 [root@sandbox rahul]# hadoop fs -cat output/part-r-00000
 1901	317
