In the last tip, we looked at Google's whitepapers on Big Data. Apache Hadoop originated from those whitepapers:
- Apache HDFS is derived from GFS (Google File System).
- Apache MapReduce is derived from Google MapReduce.
- Apache HBase is derived from Google BigTable.
This Tip #2 covers HDFS's goals and its role in Big Data. What is HDFS?
HDFS is a distributed, scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is not a POSIX-compliant filesystem.
In HDFS, each machine in a cluster stores a subset of the data (blocks) that makes up the complete filesystem. Its metadata is stored on a centralized server (the NameNode), which acts as a directory of block data and provides a global picture of the filesystem's state.
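To make this concrete, here is a toy sketch of the idea: a file is split into fixed-size blocks, and a centralized metadata map (playing the NameNode's role) records which machines hold each block's replicas. The block size, replication factor, and node names mirror HDFS defaults and concepts, but this is an illustration of the design, not real HDFS code.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the size of each block a file of file_size bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

def place_blocks(blocks: list, datanodes: list, replication: int = REPLICATION) -> dict:
    """Toy round-robin placement: metadata maps block index -> replica hosts.
    Real HDFS placement is rack-aware and far more sophisticated."""
    metadata = {}
    for i in range(len(blocks)):
        metadata[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return metadata

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB,
# each stored on three of the four (hypothetical) DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
metadata = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), metadata[0])
```

Notice that no single DataNode holds the whole file; only the centralized metadata knows how the pieces fit together, which is exactly the global picture described above.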
Top 5 Goals of HDFS
- Store millions of large files, each greater than tens of gigabytes, with filesystem sizes reaching tens of petabytes.
- Use a scale-out model based on inexpensive commodity servers with internal JBOD ("Just a bunch of disks") rather than RAID to achieve large-scale storage.
- Accomplish availability and high throughput through application-level replication of data.
- Optimize for large, streaming reads and writes rather than low-latency access to many small files.
- Support the functionality and scale requirements of MapReduce processing.
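The "large files, not many small files" goal can be sketched with back-of-the-envelope arithmetic: every block costs a metadata entry in the NameNode's memory, so the same terabyte stored as many small files consumes far more NameNode memory than a few large ones. The ~150 bytes per block entry used below is a commonly cited rough figure, taken here as an assumption.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024     # HDFS default block size: 128 MB
BYTES_PER_BLOCK_ENTRY = 150        # rough assumption for NameNode memory per block

def namenode_metadata_bytes(num_files: int, file_size: int) -> int:
    """Approximate NameNode memory consumed by block entries alone."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * blocks_per_file * BYTES_PER_BLOCK_ENTRY

# Storing ~1 TB two ways:
large = namenode_metadata_bytes(8192, 128 * 1024 * 1024)       # 8,192 files of 128 MB
small = namenode_metadata_bytes(1024 * 1024, 1024 * 1024)      # ~1M files of 1 MB
print(large, small, small // large)
```

Under these assumptions the small-file layout costs 128x the NameNode memory for the same data, which is why HDFS is optimized for large, streaming workloads rather than many small files.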