Skip to content

Run MapReduce on hadoop 1 What is mapreduce ?

chris_podorsiki edited this page Jan 1, 2018 · 1 revision

let's review what is MapReduce and the workflow on hadoop

1 Mapreduce is a algorithm which is developed by google and largely used into distributed systems for big data calculation and processing. it has two main method: Map() and reduce(). Sometimes, we will use the combiner to reduce the output from map() and further optimize the reduce() for less resource usage.

2 To better understand the mapreduce, let's make a workflow diagram to explore it further.

  • datastream push data into HDFS system
  • Mapper class (not map) will create a new abstract class called "context", this context will use MapContext interface to get the taskInputOutputContext interface which will provide the getCurrentKey、getCurrentValue、nextKeyValue, getOutputCommitter (abstract class from OutputCommitter, in order to tell how to get the Output and Output format)
  • the taskInputOutputContext interface will implement the TaskAttemptContext interface which will provide the JobContext
  • Input file will be splitted into several chunks (default is 3 and each will be 64 mb), usually it will be 3 chunks
  • JobContext interface will use job class to get the input file and output path
  • Each chunk will be formatted into a key-value format for mapper's input. usually key will be the off-set byte of the line, and value will be the line contents
  • Then each chunk will be using the mapper class once
  • Then each key-value pair of that chunk will use map() method once
  • Each map() method will output a new defined "key-value" pair
  • Each new "K-V" pair will be shuffled and sorted with the same key
  • Then Reducer class will use reduce() method to deal with the same key with different value iteration
  • Then process the value in each iteration for each key
  • Then reduce() method will give a new value for each key
  • task finished

Let's use a diagram to show the above:

Using a more detailed diagram to explain the map-reduce phase: