Saturday, July 11, 2015

HDFS Goals

In the last tip, we looked at Google's whitepapers on Big Data. Apache Hadoop originated from those whitepapers:

  1. Apache HDFS is derived from GFS (Google File System).
  2. Apache MapReduce is derived from Google MapReduce.
  3. Apache HBase is derived from Google BigTable.

This Tip #2 covers the goals of HDFS and its role in Big Data. So, what is HDFS?

HDFS is a distributed, scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Note that HDFS is not a POSIX-compliant filesystem.

In HDFS, each machine in a cluster stores a subset of the data blocks that make up the complete filesystem. The filesystem's metadata is stored on a centralized server (the NameNode), which acts as a directory of block data and provides a global picture of the filesystem's state.
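As a toy illustration of this split between block storage and centralized metadata (a minimal sketch, not the actual HDFS implementation — the class and variable names here are hypothetical):

```python
# Toy sketch of HDFS-style storage: a file is split into fixed-size
# blocks spread across datanodes, while a central "namenode" keeps
# only metadata (which blocks, on which nodes, make up which file).

BLOCK_SIZE = 4  # bytes, for illustration; the real HDFS default is 128 MB


class ToyDataNode:
    """Stores raw blocks; knows nothing about whole files."""
    def __init__(self):
        self.blocks = {}

    def store(self, key, block):
        self.blocks[key] = block

    def fetch(self, key):
        return self.blocks[key]


class ToyNameNode:
    """Central directory: maps filename -> ordered list of (node, block_id)."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}

    def put(self, filename, data):
        blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        locations = []
        for block_id, block in enumerate(blocks):
            # Round-robin placement; real HDFS placement is rack-aware.
            node = self.datanodes[block_id % len(self.datanodes)]
            node.store((filename, block_id), block)
            locations.append((node, block_id))
        self.metadata[filename] = locations

    def get(self, filename):
        # Reassemble the file by consulting metadata, then fetching blocks.
        return b"".join(node.fetch((filename, block_id))
                        for node, block_id in self.metadata[filename])


nn = ToyNameNode([ToyDataNode() for _ in range(3)])
nn.put("file.txt", b"hello distributed world")
assert nn.get("file.txt") == b"hello distributed world"
```

The key point the sketch captures: datanodes hold opaque blocks, and only the NameNode knows how blocks assemble into files.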

Top 5 Goals of HDFS

  1. Store millions of large files, each greater than tens of gigabytes, with filesystem sizes reaching tens of petabytes.
  2. Use a scale-out model based on inexpensive commodity servers with internal JBOD ("Just a Bunch Of Disks") rather than RAID to achieve large-scale storage.
  3. Achieve availability and high throughput through application-level replication of data.
  4. Optimize for large, streaming reads and writes rather than low-latency access to many small files.
  5. Support the functionality and scale requirements of MapReduce processing.
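Goal 3 is worth a closer look: instead of relying on RAID, HDFS replicates each block onto several datanodes (the default replication factor is 3), so reads survive the loss of a node. A minimal sketch of the idea, with hypothetical names (real HDFS placement is rack-aware, not round-robin):

```python
# Toy sketch of application-level replication: each block is written
# to several distinct datanodes, so losing any single node still
# leaves at least one live replica per block.

REPLICATION = 3  # HDFS's default replication factor


def place_replicas(block_id, datanodes, replication=REPLICATION):
    """Pick `replication` distinct datanodes for a block (simple
    round-robin; real HDFS placement also considers racks)."""
    n = len(datanodes)
    return [datanodes[(block_id + i) % n] for i in range(min(replication, n))]


datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = {b: place_replicas(b, datanodes) for b in range(5)}

# Simulate the failure of dn2: every block still has surviving replicas.
survivors = {b: [dn for dn in nodes if dn != "dn2"]
             for b, nodes in placement.items()}
assert all(survivors.values())
```

Because replication happens at the application (filesystem) level rather than inside a RAID controller, the cluster can use cheap JBOD disks and still tolerate whole-machine failures, which ties goals 2 and 3 together.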
