Sunday, August 9, 2015

MapReduce


Computers are always driven by TWO key factors

  1. Storage
  2. Process


So far, we have covered SIX tips on storage (the HDFS model) over the last two months. Let us now start exploring the processing layer: MapReduce.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been well known since at least 1995, with the Message Passing Interface standard having reduce and scatter operations.

As depicted in the diagram, a MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. The MapReduce framework orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
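To make the Map and Reduce roles concrete, here is a minimal single-machine sketch of the classic word-count example in plain Python. It is only an illustration of the programming model, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are assumptions made for this post, and a real framework would run these steps in parallel across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize the values for one key; here, sum the counts."""
    return key, sum(values)


def word_count(documents):
    """Run Map, Shuffle, and Reduce in sequence on one machine (hypothetical driver)."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    # Keys are sorted before reducing, mirroring the sort step between the phases.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    docs = ["big data big cluster", "data node"]  # made-up sample input
    print(word_count(docs))
```

Each call to map_phase depends only on its own input record, and each call to reduce_phase depends only on one key, which is what allows a framework to spread both phases across a cluster.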

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been generalized.

The Hadoop ecosystem has TWO different versions of MapReduce, namely V1 and V2.
  • V1 is the original MapReduce, which uses the TaskTracker and JobTracker daemons.
  • V2 runs on YARN. YARN splits the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.


We'll see more details on MapReduce V1 & V2 in upcoming posts.
