We know that MapReduce broke the ice for the traditional computing model by introducing scale-out processing in a simple way. Google hosts a research page for MapReduce at http://research.google.com/archive/mapreduce.html . The authors are Sanjay Ghemawat & Jeff Dean from Google Inc.
As a research scholar, I liked the motivation of their research paper - large-scale data processing. In earlier days this was achievable only with supercomputing; the key difference here is parallel execution across hundreds or thousands of CPUs, on commodity boxes and in an easy way (a minimal sketch of the programming model follows the list below). Moreover, MapReduce provides:
- Automatic parallelization and distribution
- Fault-tolerance
- I/O scheduling
- Status and monitoring
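To make the model concrete, here is a minimal sketch of the map/reduce programming style using the classic word-count example. This is not Google's implementation - everything runs in a single process, and the function names (map_fn, reduce_fn, run_mapreduce) are my own illustration - but it shows the user-supplied map and reduce functions and the grouping-by-key step that the framework performs between them.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Sum all partial counts for a word."""
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)   # shuffle: group values by key
    # Reduce phase: apply reduce_fn to each key and its grouped values.
    results = {}
    for key, values in intermediate.items():
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

if __name__ == "__main__":
    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In the real system the same two user functions are all that is needed; parallelization, distribution and the shuffle are handled by the framework.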
Fault-tolerance is handled via re-execution (a simplified sketch follows this list). On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion committed through master
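Below is a simplified master-side sketch of that failure-handling behaviour, assuming hypothetical names (Master, on_heartbeat, handle_failure) and a heartbeat timeout I chose for illustration. The key point from the paper is preserved in the logic: completed map tasks must be redone because their output sits on the failed worker's local disk, while completed reduce tasks are safe because their output is already in the global file system.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat before a worker is presumed dead

class Master:
    def __init__(self):
        self.last_heartbeat = {}   # worker_id -> time of last heartbeat
        self.tasks_by_worker = {}  # worker_id -> list of (task_id, kind, state)
        self.pending_tasks = []    # tasks waiting to be (re)scheduled

    def on_heartbeat(self, worker_id):
        # Workers ping the master periodically; record when we last heard from each.
        self.last_heartbeat[worker_id] = time.time()

    def check_workers(self):
        # Detect failures: any worker silent for too long is treated as dead.
        now = time.time()
        for worker_id, last in list(self.last_heartbeat.items()):
            if now - last > HEARTBEAT_TIMEOUT:
                self.handle_failure(worker_id)

    def handle_failure(self, worker_id):
        # Re-execute all map tasks (even completed ones, since their output is
        # on the dead worker's local disk) and any in-progress reduce tasks.
        for task_id, kind, state in self.tasks_by_worker.pop(worker_id, []):
            if kind == "map" or state == "in_progress":
                self.pending_tasks.append((task_id, kind))
        del self.last_heartbeat[worker_id]
```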
Data locality optimization, skipping bad records and compression of intermediate data are a few of their refinement techniques to boost performance on large-scale data (a rough sketch of the bad-record skipping idea is shown below).
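As one example of these refinements, here is a rough sketch of the bad-record skipping idea: when a particular input record deterministically crashes the user's map function, the worker reports the record's index and later attempts skip it. The names (bad_records, report_bad_record, run_map_task) are hypothetical, and the real system only skips a record after the master has seen it fail more than once; this sketch skips after the first failure for brevity.

```python
bad_records = set()   # record indices flagged as bad

def report_bad_record(index):
    # In the real system the worker would report this to the master over RPC;
    # here we just record it locally for illustration.
    bad_records.add(index)

def run_map_task(records, user_map):
    outputs = []
    for index, record in enumerate(records):
        if index in bad_records:
            continue                      # skip records known to crash the map function
        try:
            outputs.extend(user_map(record))
        except Exception:
            report_bad_record(index)      # flag the offending record
            raise                         # let the task fail; the re-executed attempt skips it
    return outputs
```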
Their research paper also lists usage statistics for August 2004 with the metrics below:
- Number of jobs 29,423
- Average job completion time 634 secs
- Machine days used 79,186 days
- Input data read 3,288 TB
- Intermediate data produced 758 TB
- Output data written 193 TB
- Average worker machines per job 157
- Average worker deaths per job 1.2
- Average map tasks per job 3,351
- Average reduce tasks per job 55
- Unique map implementations 395
- Unique reduce implementations 269
- Unique map/reduce combinations
An amazing and game-changing methodology, made easy, as the result of great research minds at Google. Here's an opportunity for me to highlight the authors of MapReduce - Sanjay Ghemawat & Jeff Dean.