Sunday, August 16, 2015

MapReduce Execution

On submission of the job by the User, Hadoop initiates the Job Tracker process at Master Node.  Internally, the execution undergoes 3 major tasks/steps as below:

1. Map 
A map task is created that runs the user-supplied map function on each record. The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs. Map tasks are executed in parallel by various machines across the cluster.  

Mapper output (intermediate data) is stored on the Local file system (NOT HDFS) of each individual mapper nodes. This is typically a temporary directory location which can be setup in config by the Hadoop administrator. The intermediate data is cleaned up  after the Hadoop Job completes.

2. Shuffle and Sort
It is preformed by the reducers (reduce tasks). Each reducer is assigned one of the partitions on which it should work. This is a flurry of network copies between each reducer in the cluster so it can get the partition (intermediate key-value) data it was assigned to work on.

The output of the reduce is normally stored in HDFS for reliability. This step uses a lot of bandwidth between servers and benefits from very fast networking like 10G.

3. Reduce
After the partition data has been copied we can start performing a merge sort of the data. A merge sort takes a number of sorted items and merges them together to form a fully sorted list. 
Each reducer produces a separate output file, usually in HDFS. Each reducer output file usually named part-, where  is the number of the reduce task within the job. The output format of the file is specified by the author of the MapReduce job. The number of reducers is defined by the developer.

Attached image depicts the execution flow of Hadoop job.

Sunday, August 9, 2015


Computers are always driven by TWO key factors

  1. Storage
  2. Process

So far, we covered SIX tips on Storage - HDFS model during last 2 months.  Let us start exploring the process layer - Map Reduce.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.

As depicted in the diagram, MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. MapReduce Framework orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been generalized.

HDFS ecosystem has TWO different versions on MapReduce, namely V1 and V2.
  • V1 is the orginal MapReduce that uses TaskTracker and JobTrackerdaemons. 
  • V2 is called YARN. YARN to splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. Resource manager, application master, and node manager.

We'll see more details on Map Reduce V1 & V2 in the next blogs.

Saturday, August 8, 2015

Back to School

Go after your Dream, No matter how unattainable other think it is!  Interesting quote, isn't it?

I always dream about the top-notch premium institute. God blessed me to admit into Indian Institute of Technology as Research Scholor, since August 2015.

At IIT Madras, research is a preoccupation of around 550 faculty members, 3000+ MS and PhD research scholars, more than 800 project staff, and a good number of undergraduates as well.  IIT is always world class institute and the alumni speaks the industry value across the globe.

My blessed journey is starting with the emerging technology course on Searching and Indexing in Large Datasets.  Wishing your prayer and support to execute this life filling opportunity.

Sunday, August 2, 2015

Hadoop Command Line

All Hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.

hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs

Hadoop comes with a number of command-line tools that enable basic filesystem operations. HDFS commands are subcommands of the hadoop command-line utility. To display basic usage information, the command is:

hadoop fs  

Hadoop uses value in core-site.xml file if full url syntax is not used.  Command to list files in a dir.

hadoop fs -ls /user/myid
hadoop fs -ls hdfs://

Command to upload file with -put or -copyFromLocal which copies file form local filesystem

hadoop fs -put /etc/mytest.conf /user/myid/

To download file from HDFS using -get or -copyToLocal.

hadoop fs -get /user/myid/mytest.conf ./

Process to set a replication factor for a file or dir of files with the -R

hadoop fs -setrep 5 -R /user/myid/rep5/

fsck is command to run HDFS filesystem checking utility. Run a fsck on the files we set the rep factor on and see if it looks correct

hadoop fsck /user/myid/rep5 -files -blocks -locations

These are basic frequently used HDFS commands using command line operations.