Tuesday, July 28, 2015

Data Node


The DataNode (DN) is the third and final daemon in the HDFS architecture, and it is responsible for storing and retrieving the actual data. In the DataNode layer, data is simply written to multiple machines in a highly distributed mode.

A very large number of independently spinning disks performing huge sequential I/O operations with independent I/O queues can actually outperform RAID in the specific use case of Hadoop workloads. Typically, DataNodes have a large number of independent disks, each of which stores full blocks.

In general, blocks are nothing more than chunks of a file: binary blobs of data. The DataNode has direct local access to one or more disks (commonly called data disks) in a server on which it is permitted to store block data.

Pointing a DataNode to new disks in existing servers, or adding new servers with more disks, increases the amount of storage in the cluster. Block data is streamed to and from DataNodes directly, so bandwidth is not limited by a single node.  These blocks are then replicated across different nodes (DataNodes) in the cluster. The default replication factor is 3, i.e. there will be 3 copies of the same block in the cluster.
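To make the block-and-replica idea concrete, here is a minimal Python sketch (this is not HDFS code; the function and node names are hypothetical) that splits a file into default-sized blocks and assigns each block to three distinct DataNodes:

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, in bytes
REPLICATION = 3                 # default replication factor

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file into blocks and assign each block to `replication` distinct DataNodes."""
    num_blocks = math.ceil(file_size / block_size)
    plan = []
    for b in range(num_blocks):
        # Naive round-robin placement; real HDFS placement is rack-aware.
        targets = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        plan.append((b, targets))
    return plan

# A 300 MB file becomes 3 blocks (128 + 128 + 44 MB), each stored on 3 nodes.
plan = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
```

Real HDFS placement is rack-aware (for example, one replica on the writer's rack and two on a remote rack); the round-robin above only illustrates that each block's replicas land on distinct nodes.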

DataNodes regularly report their status to the NameNode via heartbeats. When a DataNode initially starts up, and every hour thereafter, it also sends a block report to the NameNode. A block report is simply a list of all blocks the DataNode currently has on its disks. The NameNode keeps track of all the changes.

In a nutshell, the DataNode is the actual storage platform, managing the file blocks within the node.

Wednesday, July 22, 2015

Secondary NameNode


In the series of Big Data Tips, this is the 5th post.  Today, we are going to talk about the Secondary NameNode.

Secondary NameNode IS NOT A BACKUP FOR THE NameNode! The name is horrible, but it is what it is.

The Secondary NameNode's job is to periodically merge the image file with the edits log file. This merge operation keeps the edits log from growing into a very large file. The Secondary NameNode runs on a separate machine and copies the image and edits log files periodically. The Secondary NameNode requires as much memory as the primary NameNode. If the NameNode crashes, you can use the copied image and edits log files from the Secondary NameNode to bring the primary NameNode back up. However, the state of the Secondary NameNode lags behind the primary NameNode.

The NameNode may not have the available resources (CPU or RAM) to do this merge while continuing to provide service to the cluster, so the Secondary NameNode applies the updates from the edits file to the fsimage file and sends it back to the primary. This is known as the checkpointing process.  The checkpoint file is changed only by the Secondary NameNode, never by the NameNode.

This application of the updates (checkpointing) to the fsimage file occurs every 60 minutes by default, or whenever the NameNode's edits file reaches 64 MB, whichever happens first. Newer versions of Hadoop use a defined number of transactions rather than file size to determine when to perform a checkpoint.
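The either/or trigger can be sketched in a few lines of Python (illustrative only; the constant names are made up and are not Hadoop configuration keys):

```python
CHECKPOINT_PERIOD_SECS = 60 * 60          # default: checkpoint every 60 minutes
CHECKPOINT_EDITS_SIZE = 64 * 1024 * 1024  # default: or when edits reaches 64 MB

def should_checkpoint(secs_since_last, edits_size,
                      period=CHECKPOINT_PERIOD_SECS, max_size=CHECKPOINT_EDITS_SIZE):
    """Trigger a checkpoint when EITHER threshold is crossed, whichever comes first."""
    return secs_since_last >= period or edits_size >= max_size
```

Newer Hadoop versions would swap the size condition for a transaction count, but the "whichever comes first" logic is the same.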

If the Secondary NameNode is not running at all, the edits log will grow significantly and slow the system down. The system will also stay in safe mode for an extended time on restart, since the NameNode needs to merge the edits log with the current filesystem checkpoint image.

Sunday, July 19, 2015

Name Node


The NameNode acts as the metadata manager between requesting clients and the content holders, the DataNodes.

On startup, and every hour thereafter, DataNodes send a block report to the NameNode, along with heartbeats every 3 seconds.  The NameNode keeps track of every data change.  At any given time, the NameNode has a complete view of all DataNodes in the cluster, their current health, and what blocks they have available.  The file-to-block mapping on the NameNode is stored on disk.
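As a rough sketch of how the NameNode could track liveness from heartbeats (illustrative Python, not Hadoop code; the ~10.5-minute dead-node timeout is the commonly cited HDFS default derived from the heartbeat and recheck intervals, so treat it as an assumption):

```python
HEARTBEAT_INTERVAL = 3   # seconds between DataNode heartbeats (HDFS default)
DEAD_NODE_TIMEOUT = 630  # ~10.5 min without a heartbeat -> node presumed dead

def live_datanodes(last_heartbeat, now, timeout=DEAD_NODE_TIMEOUT):
    """last_heartbeat maps DataNode id -> timestamp of its last heartbeat.
    A node is considered live if its last heartbeat falls within the timeout."""
    return {dn for dn, t in last_heartbeat.items() if now - t <= timeout}

# dn2 has missed heartbeats for 20 minutes and is presumed dead.
live = live_datanodes({"dn1": 1000, "dn2": 0}, now=1200)
```

Once a node is presumed dead, the NameNode schedules re-replication of that node's blocks so the cluster returns to the configured replication factor.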

The NameNode does not directly send requests to DataNodes; it uses replies to heartbeats to send instructions to the DataNodes.  The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node.  The NameNode stores its filesystem metadata on local filesystem disks. The two key files are (1) fsimage and (2) the edits log.

The fsimage file contains a complete snapshot of the filesystem at the inode level.  An inode is an internal representation of a file or directory's metadata and contains such information as the file's replication level, modification and access times, access permissions, block size, and the blocks the file is made up of.  Because blocks are referenced this way rather than by location, the design does not have to worry about DataNodes' hostnames or IP addresses changing.

The edits file (journal) contains only incremental modifications made to the metadata. It acts as a write-ahead log, which reduces I/O to sequential, append-only operations, avoiding costly seeks and yielding better overall performance.

On NameNode startup, the fsimage file is loaded into RAM and any changes in the edits file are replayed, bringing the in-memory view of the filesystem up to date.  NameNode filesystem metadata is served entirely from RAM. This makes it fast, but limits the amount of metadata a box can handle: roughly 1 million blocks occupy about 1 GB of heap.
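The startup sequence (load fsimage, then replay edits) can be sketched as follows. This is an illustrative Python model, not the actual NameNode implementation, and the operation names are invented:

```python
def load_namespace(fsimage, edits):
    """Rebuild the in-memory namespace: start from the fsimage snapshot,
    then replay each journal entry from the edits log in order."""
    namespace = dict(fsimage)  # path -> metadata
    for op, path, meta in edits:
        if op == "create":
            namespace[path] = meta
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "update":
            namespace[path].update(meta)
    return namespace

# Snapshot has one file; the journal creates, updates, and deletes on top of it.
ns = load_namespace(
    {"/a.txt": {"replication": 3}},
    [("create", "/b.txt", {"replication": 3}),
     ("update", "/a.txt", {"replication": 2}),
     ("delete", "/b.txt", None)],
)
```

Serving this dictionary entirely from RAM is what makes lookups fast, and it is also exactly why the heap figure above bounds how much metadata one NameNode can hold.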

Thus, the NameNode coordinates filesystem operations across the highly distributed cluster on behalf of client requests.

Sunday, July 12, 2015

HDFS Design


In principle, HDFS has a block size higher than most other file systems. The default is 128 MB, and some deployments go as high as 1 GB. Files in HDFS are write-once.
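For example, the number of blocks a file occupies follows directly from the block size (a quick sketch; `blocks_for` is a hypothetical helper, not an HDFS API):

```python
import math

def blocks_for(file_size_bytes, block_size=128 * 1024 * 1024):
    """Number of HDFS blocks a file occupies; the last block may be partial."""
    return max(1, math.ceil(file_size_bytes / block_size))

n = blocks_for(1024 * 1024 * 1024)  # a 1 GB file occupies 8 blocks of 128 MB
```

A large block size means fewer blocks per file, which keeps the NameNode's per-block metadata overhead low and favors long sequential reads over many small seeks.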

There are three daemons that make up a standard HDFS cluster.
  1. NameNode - 1 per cluster.  Centralized metadata server providing a global picture of the filesystem's state.
  2. Secondary NameNode - 1 per cluster.  Performs internal NameNode transaction log checkpointing.
  3. DataNode - Many per cluster.  Stores block data (contents of files).

The NameNode stores its filesystem metadata on local filesystem disks in a few different files, the two most important of which are fsimage and edits.  Fsimage contains a complete snapshot of the filesystem metadata, including a serialized form of all the directory and file inodes in the filesystem.  The edits file (journal) contains only incremental modifications made to the metadata and acts as a write-ahead log.

The Secondary NameNode is not a backup of the NameNode; rather, it shares the workload via the checkpointing process, in which it applies the updates from the edits file to the fsimage file and sends the result back to the primary.  Checkpointing is triggered by duration (default 60 minutes), file size, and/or transaction count of the edits file.

The daemon responsible for storing and retrieving block (chunks of a file) data is called the DataNode.  DataNodes regularly report their status to the NameNode via heartbeats (default every 3 seconds) and send a block report (a list of all usable blocks on the DataNode's disks) to the NameNode (default every 60 minutes).

The 3 key daemons of the HDFS architecture are represented in the attached diagram.

Saturday, July 11, 2015

HDFS Goals


In the last tip, we saw Google’s whitepapers on Big Data.  Apache Hadoop originated from Google’s whitepapers:

  1. Apache HDFS is derived from GFS  (Google File System).
  2. Apache MapReduce is derived from Google MapReduce
  3. Apache HBase is derived from Google BigTable.


This Tip #2 is on HDFS's goals and role in Big Data.  What is HDFS?

HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of COMMODITY hardware. HDFS is not a POSIX-compliant filesystem.

In HDFS, each machine in a cluster stores a subset of the data (blocks) that makes up the complete filesystem. The filesystem's metadata is stored on a centralized server, acting as a directory of block data and providing a global picture of the filesystem's state.

Top 5 Goals of HDFS

  1. Store millions of large files, each greater than tens of gigabytes, and filesystem sizes reaching tens of petabytes.
  2. Use a scale-out model based on inexpensive commodity servers with internal JBOD  ("Just a bunch of disks") rather than RAID to achieve large-scale storage. 
  3. Accomplish availability and high throughput through application-level replication of data.
  4. Optimize for large, streaming reads and writes rather than low-latency access to many small files. 
  5. Support the functionality and scale requirements of MapReduce processing.

Thursday, July 9, 2015

ThankYou Card


A lot of people don’t realize this, but the Pulse app was built as a class project at Stanford University in 2010. Pulse today powers a lot of content you see on LinkedIn’s homepage feed. They reached an incredible milestone for the LinkedIn publishing platform: 1 million professionals have now written a post on LinkedIn.

Over 1 million unique writers publish more than 130,000 posts a week on LinkedIn. About 45% of readers are in the upper ranks of their industries: managers, VPs, CEOs, etc. The top content-demanding industries are tech, financial services and higher education. The average post now reaches professionals in 21 industries and 9 countries.

Amid these big metrics, I'm also making a tiny contribution at Pulse.  In conjunction with the 1 million post celebration, I just received a 'Thank You' card from LinkedIn.

If you’re writing on LinkedIn or interested in starting, join the Writing on LinkedIn Group. So, what’s stopping you?  Continuous Learning & Continuous Sharing is the mantra for our industry.

Sunday, July 5, 2015

Google WhitePaper


Dear readers, I recently got a few requests to share some 'Useful Tips' on the Big Data ecosystem.

A tip is a small piece or part fitted to the end of an object.  Let me fill this quarter, Q3 2015, with Big Data tips.

For those interested in the history, the super base class of the Big Data ecosystem is Google's whitepapers.

The first, presented in 2003 by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, describes a pragmatic, scalable, distributed file system optimized for storing enormous datasets, called the "Google File System", or GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive, distributed processing applications.

The following year, 2004, another paper, titled "MapReduce: Simplified Data Processing on Large Clusters", was presented by Jeffrey Dean and Sanjay Ghemawat, defining a programming model and accompanying framework that provided automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines.

When paired, these two systems could be used to build large data processing clusters on relatively inexpensive, commodity machines. These Google whitepapers were industry breakthroughs and directly inspired the development of HDFS and Hadoop MapReduce, respectively.

Google's WhitePaper References are available at:
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Stay tuned for continuous tips & tricks to (l)earn more.

Wednesday, July 1, 2015

Netflix Migration


Last weekend, I read an interesting business use case on Big Data: the Cassandra migration at Netflix. We all know the streaming media leader Netflix; this is about their Big Data migration from traditional storage.
About Netflix:
Netflix is the world’s leading Internet television network with more than 48 million streaming members in more than 40 countries. It has successfully shifted its business model from DVDs by mail to online media and today leads the streaming media industry with more than $1.5 billion in digital revenue. Netflix dominates peak-time Internet usage, and its shares continue to skyrocket with soaring subscriber numbers.
Trigger Point:
In 2010, Netflix began moving its data to Amazon Web Services (AWS) to offer subscribers more flexibility across devices with 'Cloud-First/Mobile-First' Strategy. At the time, Netflix was using Oracle as the back-end database and was approaching limits on traffic and capacity with the ballooning workloads managed in the cloud.

Interestingly, the entire migration of more than 80 clusters and 2,500+ nodes was completed by only two engineers.
Use Case: 
Build systems that understand each person’s unique habits and preferences and bring to light products and items that a user may be unaware of and not looking for. In a nutshell, personalize viewing for over 50 million customers.
Challenges:
Challenges related to the given use case:
  • Affordable capacity to store and process immense amounts of data 
  • Data volume of more than 2.1 billion reads and 4.3 billion writes per day
  • Single point of failure with Oracle’s legacy relational architecture
  • Achieving business agility for international expansion
Solution
Cassandra delivers a persistent data store with:
  • 100% up-time and cost-effective scale across multiple data centers
  • Expert support from DataStax
  • Throughput of more than 10 million transactions per second
  • Effortless creation/management of new data clusters across various regions
  • Capture of every detail of customer viewing and log data
The attached Netflix deployment diagram is published on the Netflix Tech Blog at http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html