Wednesday, December 30, 2015

Annual Report


Here we go: the end of the year 2015, ready to welcome 2016.

An annual report is a comprehensive report on an individual's activities throughout the preceding year. Here's my annual report for 2015.

It has been quite an interesting year, as I joined my dream institute, IIT (Indian Institute of Technology). It was my aspiration, and I am enjoying each and every moment on campus.

In terms of career, I am proud to have led the large production roll-out of an enterprise data hub product for a leading financial firm, the result of a 15+ month team effort. In continuation, I learnt a lot while seeding the automated application development model for a telephony system as Solutions Architect.

I had the opportunity to take a brief Europe trip (France, Germany, Switzerland, Belgium) with my family, who are behind my effort and success.

I had three vacation trips with my friends and their families to strengthen our reunion and true friendship.

On the flip side, rain battered Chennai and shut the city down for 5 days. It was a pitiable situation around the city; Mother Nature taught us a lot. At the same time, fellow citizens around India showed great humanity. Personally, I was deeply impressed by their unconditional love.

Live every minute in a passionate way, with love, grace and gratitude. Do what you Love; Love what you Do.

Best wishes and prayers to you and your family for a Happy New Year 2016.

Saturday, December 19, 2015

Apache Flink


Apache Flink is an open source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Flink includes several APIs for creating applications that use the Flink engine:
  • DataSet API for static data embedded in Java, Scala, and Python,
  • DataStream API for unbounded streams embedded in Java and Scala (see the sketch after this list), and
  • Table API with a SQL-like expression language embedded in Java and Scala.
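
To give a feel for the DataStream API, here is a minimal streaming word-count sketch in Java. It is only an illustration under assumptions: the socket source on localhost:9999 is a stand-in for any real stream (you can feed it with "nc -lk 9999"), and the exact imports and operators may vary slightly between Flink releases.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        // Obtain the streaming execution environment (local or cluster).
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read an unbounded stream of text lines from a local socket.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = lines
            // Split each line into (word, 1) pairs.
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                }
            })
            // Group by the word (tuple field 0) and keep a running sum of field 1.
            .keyBy(0)
            .sum(1);

        counts.print();

        // Nothing runs until the dataflow is executed.
        env.execute("Streaming WordCount");
    }
}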

Flink also bundles libraries for domain-specific use cases:
  • Machine Learning library, and
  • Gelly, a graph processing API and library.

You can easily integrate Flink with other well-known open source systems, both for data input and output and for deployment.

Flink's data streaming runtime achieves high throughput rates and low latency with little configuration. The charts below show the performance of a distributed item counting task, requiring streaming data shuffles.

Flink programs can be written in Java or Scala and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment. Flink does not provide its own data storage system; input data must be stored in a distributed storage system like HDFS or HBase. For data stream processing, Flink consumes data from (reliable) message queues like Kafka.

Sunday, December 13, 2015

USB 3


In this blog, we are going to discuss USB 3.0, the third major version of the Universal Serial Bus (USB) standard for interfacing computers and electronic devices. Among other improvements, USB 3.0 adds the new SuperSpeed (SS) transfer mode, which can transfer data at up to 5 Gbit/s (625 MB/s), about ten times faster than the USB 2.0 standard.

The USB connector has been one of the greatest success stories in the history of computing, with more than 2 billion USB-connected devices sold to date. But in an age of terabyte hard drives, the once-cool throughput of 480 megabits per second that a USB 2.0 device can realistically provide just doesn't cut it any longer.

USB 3 promises to increase performance by a factor of 10, pushing the theoretical maximum throughput of the connector all the way up to 4.8 gigabits per second, or processing roughly the equivalent of an entire CD-R disc every second. USB 3.0 devices will use a slightly different connector, but USB 3.0 ports are expected to be backward-compatible with current USB plugs, and vice versa.

Following USB 3.0 (released in November 2008), USB 3.1 has been available since its release in July 2013. USB 4.0 is still under design and is expected to be released by around 2020.

USB, short for Universal Serial Bus, is an industry standard developed in the mid-1990s that defines the cables, connectors and communications protocols used in a bus for connection, communication, and power supply between computers and electronic devices. It is currently developed by the USB Implementers Forum.

Friday, December 11, 2015

IRNSS


The Indian Regional Navigation Satellite System (IRNSS) is an independent regional navigation satellite system being developed by the Indian Space Research Organisation (ISRO).

In May 2013, India decided to develop its own standards for navigation to usher in a new era in terrestrial, aerial and marine navigation services. During 2014, ISRO delivered on this plan and launched IRNSS-1C on board its PSLV-C26 rocket.

IRNSS was built to offer an alternative to the American GPS, or Global Positioning System, which is widely used to triangulate location by everyone from consumers on mobile phones to mapping giants and even the military.

It is designed to provide an accurate position information service to users in India as well as the region extending up to 1,500 km from its boundary, which is its primary service area. An extended service area lies between the primary service area and the area enclosed by the rectangle from latitude 30 deg South to 50 deg North and longitude 30 deg East to 130 deg East.

IRNSS will provide two types of services, namely,

  1. Standard Positioning Service (SPS), which is provided to all users
  2. Restricted Service (RS), which is an encrypted service provided only to authorized users.


The IRNSS System is expected to provide a position accuracy of better than 20 m in the primary service area. The space segment consists of the IRNSS constellation of seven satellites.

According to Deviprasad Karnik, director of publications and public relations at ISRO, the project is heading towards being operational by the middle of 2016. The Indian Space Research Organisation has unveiled plans to gradually make its regional satellite navigation system global, akin to powerful position-telling systems such as the American GPS and the Russian GLONASS.

It's time to move away from the American Global Positioning System (GPS) and make way for India's own desi navigation system, IRNSS, on our mobile phones.

Thursday, December 3, 2015

Li-Fi Light Fidelity


Li-Fi (Light Fidelity) is a bidirectional, high-speed and fully networked wireless communication technology similar to Wi-Fi. Li-Fi can deliver internet access 100 times faster than traditional Wi-Fi, offering speeds of up to 1 Gbps (gigabit per second). It requires a light source such as a standard LED bulb, an internet connection and a photo detector.

Li-Fi technology transfers data wirelessly through light-emitting diodes (LEDs). Li-Fi is a new paradigm for light-based wireless technology, providing unprecedented connectivity within a localized, data-centric environment. There has been a major shift in wireless technology due to the increasing demand for faster, more secure and protected data transmission.

In this technology, an LED in one corner acts as the light source, and a light sensor (photo detector) sits in the other corner. The light sensor detects light as soon as the LED starts glowing and outputs either a binary 1 or a binary 0.
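
To make the on/off idea concrete, here is a toy Java sketch of on-off keying; it is purely illustrative and not a real Li-Fi stack. The "transmitter" maps each bit of a byte to the LED being on or off, and the "receiver" maps the detected light states back into bits.

import java.util.ArrayList;
import java.util.List;

public class LiFiOnOffKeyingToy {

    // Transmit side: true = LED on (binary 1), false = LED off (binary 0).
    static List<Boolean> transmit(byte data) {
        List<Boolean> ledStates = new ArrayList<>();
        for (int bit = 7; bit >= 0; bit--) {
            ledStates.add(((data >> bit) & 1) == 1);
        }
        return ledStates;
    }

    // Receive side: the photo detector reports light/no light, which maps back to bits.
    static byte receive(List<Boolean> detectedStates) {
        byte value = 0;
        for (boolean lightDetected : detectedStates) {
            value = (byte) ((value << 1) | (lightDetected ? 1 : 0));
        }
        return value;
    }

    public static void main(String[] args) {
        byte original = 'A';                          // 0100 0001 in binary
        List<Boolean> channel = transmit(original);   // the LED on/off pattern over time
        byte decoded = receive(channel);
        System.out.println("sent=" + original + " received=" + decoded);
    }
}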

(+) One big advantage is that Li-Fi (unlike Wi-Fi) does not interfere with other radio signals, so it could be utilised on aircraft and in other places where interference is an issue.

(-) As a limitation, Li-Fi cannot be deployed outdoors in direct sunlight, because sunlight would interfere with its signal.

The Li-Fi vision is a future where data for laptops, smart phones, and tablets is transmitted through the light in a room. I'm waiting!!!

Wednesday, November 25, 2015

Apache Phoenix


Apache Phoenix is an efficient SQL skin for Apache HBase that has created a lot of buzz. Many companies are successfully using this technology, including Salesforce.com, where Phoenix first started.

Internally, Phoenix takes your SQL query, compiles it into a series of native HBase API calls, and pushes as much work as possible onto the cluster for parallel execution. It automatically creates a metadata repository that provides typed access to data stored in HBase tables. Phoenix’s direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
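
Because Phoenix is exposed through a standard JDBC driver, a plain Java snippet is enough to try it out. This is a hedged sketch: the ZooKeeper quorum, table name and columns are placeholders of my own, not anything prescribed by Phoenix.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase cluster's ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {

            // DDL: Phoenix creates/maps the underlying HBase table plus its metadata.
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS web_stat ("
                        + " host VARCHAR NOT NULL PRIMARY KEY,"
                        + " hits BIGINT)");
            }

            // Phoenix uses UPSERT instead of INSERT/UPDATE.
            try (PreparedStatement upsert =
                         conn.prepareStatement("UPSERT INTO web_stat VALUES (?, ?)")) {
                upsert.setString(1, "blog.example.com");
                upsert.setLong(2, 42L);
                upsert.executeUpdate();
            }
            conn.commit();

            // The query below is compiled into HBase scans that run in parallel on the cluster.
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT host, hits FROM web_stat ORDER BY hits DESC")) {
                while (rs.next()) {
                    System.out.println(rs.getString("host") + " -> " + rs.getLong("hits"));
                }
            }
        }
    }
}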

Despite these helpful features, Phoenix is not a drop-in RDBMS replacement. There are some limitations:

  1. Phoenix doesn’t support cross-row transactions yet.
  2. Its query optimizer and join mechanisms are less sophisticated than those of most COTS DBMSs.
  3. As secondary indexes are implemented using a separate index table, they can get out of sync with the primary table (although perhaps only for very short periods). These indexes are therefore not fully ACID-compliant.
  4. Multi-tenancy is constrained—internally, Phoenix uses a single HBase table.


Unlike Impala, however, Phoenix is intended to operate exclusively on HBase data; its design and implementation are heavily customized to leverage HBase features including coprocessors and skip scans.

Top-3 differences between Impala and Phoenix

  1. The main goal of Phoenix is to provide a high-performance relational database layer over HBase for low-latency applications. Impala’s primary focus is to enable interactive exploration of large data sets by providing high-performance, low-latency SQL queries on data stored in popular Hadoop file formats. Hive is mainly concerned with providing data warehouse infrastructure, especially for long-running batch-oriented tasks.
  2. Phoenix is a good choice, for example, in CRUD applications where you need the scalability of HBase along with the facility of SQL access. In contrast, Impala is a better option for strictly analytic workloads and Hive is well suited for batch-oriented tasks like ETL.
  3. Phoenix is comparatively lightweight since it doesn’t need an additional server.


The next Cloudera release, CDH 5.5, will be packaged with Apache Phoenix for us to leverage. I'm waiting!!!

Wednesday, November 11, 2015

Amazon Physical Store


According to a recent Wall Street Journal (WSJ) report, Amazon.com Inc. plans to open a store in the middle of New York City.

It is the first brick-and-mortar outlet in its 20-year history and an experiment to provide the type of face-to-face experience found at traditional retailers. A physical store is a contradictory strategy for the online Amazon store.

Operating stores also carries risks. Until now, Amazon largely has avoided some costs associated with retailing, including leases, paying employees and managing inventory in hundreds of stores. Those expenses could imperil the company’s already thin profit margins.

If it is successful, however, the New York location could presage a roll-out to other U.S. cities, according to the people familiar with the company’s thinking.

Amazon took some inspiration from a trial by the U.K.’s Home Retail Group PLC, allowing customers to order eBay goods online and pick them up in its Argos stores. By year’s end Argos expects to provide the service at 650 stores from 65,000 eBay sellers.

Other primarily online retailers have opened physical storefronts, including clothier Bonobos Inc., eyeglasses purveyor Warby Parker, and subscription beauty-products service Birchbox.

One philosophy strikes me: life is a circle, and it will repeat and rotate the trends.

Thursday, November 5, 2015

Virtusa acquires Polaris


Today, Virtusa announced a definitive agreement to acquire a majority interest in Polaris Consulting & Services, Ltd.

Virtusa and Polaris collectively will have approximately 18,000 employees, generating $826 million of pro forma revenue for the twelve months ended September 30, 2015.

Upon the closing of the Polaris transaction, the combination of Virtusa and Polaris would create a leading global provider of IT services and solutions to the banking and financial services industry segment. The acquisition would combine Virtusa’s deep domain expertise in consumer and retail banking with Polaris’ proven strength in corporate and investment banking. This combination would provide an end-to-end portfolio of differentiated solutions to the global banking and financial services industry segment, improving the combined entity’s competitive position, and expanding its addressable market.

Virtusa expects to realize over $100 million of cumulative revenue synergies over the next three fiscal years from the business combination.

Thursday, October 29, 2015

Cloud Based Load Testing using TFS


What is Cloud Based Load Testing?
It goes without saying that performance testing your application not only gives you the confidence that it will work under heavy levels of stress but also gives you the ability to test how scalable its architecture is. It is important to know how much is too much for your application!

Working with various clients in the industry, we have realized that the biggest barriers to Load Testing & Performance Testing adoption are:

  1. The high infrastructure and administration cost that comes with this phase of testing
  2. The time taken to procure and set up the test infrastructure
  3. Finding a use for this infrastructure investment after testing is complete


Endurance Testing
To test for endurance, Azure diagnostics was used to begin with, but we later started using Cerebrata Azure Diagnostics Manager to capture the metrics of the machine under test. Currently, the Microsoft Load Test service does not support metrics from the machine under test.


Here's a use case: a quote generation engine with an expected fixed user load, where we ran the test for a very long duration (over 48 hours) and observed the effect of the long-running test on the Azure infrastructure.

Threshold Testing
Another use case was to step-load the quote generation engine by incrementing the user load, with different variations of incremental user load per minute, until the application crashed and forced an IIS reset.



The attached image depicts this threshold testing.

There are a few limits on the usage of the Microsoft cloud-based load test service, which you can read about at http://blogs.msdn.com/b/visualstudioalm/archive/2013/06/26/load-testing-with-team-foundation-service-launching-preview-and-early-adoption-program.aspx

Monday, October 19, 2015

Dell EMC


Computer maker Dell Inc struck a deal on Monday to buy data storage company EMC Corp for $67 billion in cash and stock, at a 19% premium. The acquisition is the year's third-largest across all sectors and the largest in the technology sector.

The deal should help privately held Dell, the world's No. 3 computer maker, diversify from a stagnant consumer PC market and give it greater scale in the more profitable and faster-growing market for cloud-based data services.

The deal will be financed through a combination of new equity from Dell's owners - founder and Chief Executive Michael Dell, its investment firm MSD Partners, private equity firm Silver Lake and Singapore state-owned investor Temasek Holdings - as well as the issuance of the tracking stock, new debt and cash on hand.

Michael Dell, with the help of Silver Lake, took the PC maker private in a $25 billion deal two years ago.

Dell first approached EMC in October 2014, after a rumored deal between Hewlett-Packard and EMC collapsed and with Elliott targeting the company, the source said.

Michael Dell then met EMC Chief Executive Joe Tucci at the World Economic Forum Annual Meeting 2015 in Davos in January, the source added. Negotiations between Dell and EMC intensified in the last two months.

The transaction is expected to close between May and October 2016, the companies said.  The combined company aims to be a leading enterprise player in the following markets:
  • Servers and storage for the enterprise market;
  • Virtualization via VMware;
  • Converged infrastructure (EMC owns VCE);
  • Hybrid cloud and cloud computing;
  • Security via RSA, which is owned by EMC.

Tuesday, October 6, 2015

Airbnb Airpal


We know that Airbnb is a popular website for people to list, find, and rent lodging. It has over 1,500,000 listings in 34,000 cities and 190 countries.

Recently, Airbnb launched its Big Data tool Airpal, a web-based query execution tool that leverages Facebook's PrestoDB to make authoring queries and retrieving results simple for users doing data analysis.

Key features of Airpal:

  • optional access controls for users
  • ability to search and find tables
  • see metadata, partitions, schemas, and sample rows
  • write queries in an easy-to-read editor
  • submit queries through a web interface
  • track query progress
  • get the results back through the browser as a CSV
  • create a new Hive table based on the results of a query
  • save queries once written
  • searchable history of all queries run within the tool

Requirements are:

  • Java 7 or higher
  • MySQL database
  • Presto 0.77 or higher
  • S3 bucket (to store CSVs)
  • Gradle 2.2 or higher


In keeping with the spirit of Presto, they have tried to make Airpal simple to install by providing a local storage option for people who would like to test it out without any overhead or cost.

For more detailed information, visit the GitHub page here: https://github.com/airbnb/airpal

Wednesday, September 30, 2015

Coursera Architecture


Coursera is a venture backed for-profit educational technology company that offers massive open online courses (MOOCs).

It works with top universities and organizations to make some of their courses available online, and offers courses in physics, engineering, humanities, medicine, biology, social sciences, mathematics, business, computer science, digital marketing, data science and other subjects. It is an online educational startup with over 14 million learners across the globe, offering more than 1,000 courses from over 120 top universities.

At Coursera, Amazon Redshift is used as primary data warehouse because it provides a standard SQL interface and has fast and reliable performance. AWS Data Pipeline is used to extract, transform, and load (ETL) data into the warehouse. Data Pipeline provides fault tolerance, scheduling, resource management and an easy-to-extend API for ETL processing.

Dataduct is a Python-based framework built on top of Data Pipeline that lets users create custom reusable components and patterns to be shared across multiple pipelines. This boosts developer productivity and simplifies ETL management.

At Coursera, 150+ pipelines are executed to pull data from 15 data sources such as Amazon RDS, Cassandra, log streams, and third-party APIs. 300+ tables are loaded into Amazon Redshift every day, processing several terabytes of data. Subsequent pipelines push data back into Cassandra to power their recommendations, search, and other data products.

The attached image below illustrates the data flow at Coursera.

Monday, September 28, 2015

ScyllaDB


ScyllaDB bills itself as the world's fastest NoSQL column-store database. It is written in C++ and is fully compatible with Apache Cassandra, at 10x the throughput and with jaw-droppingly low latency.

Scylla works with existing Cassandra CQL command-line clients. However, mixed clusters of Scylla and Cassandra nodes are not supported: a Scylla node cannot join a Cassandra cluster, and a Cassandra node cannot join a Scylla cluster.
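
Since Scylla speaks the Cassandra CQL protocol, an existing CQL client should work unchanged. Here is a minimal sketch using the DataStax Java driver; the contact point, keyspace and table are placeholders of my own.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class ScyllaCqlExample {
    public static void main(String[] args) {
        // The same driver and code work against either a Cassandra or a Scylla node.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // placeholder node address
                .build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'scylla')");

            ResultSet rs = session.execute("SELECT id, name FROM demo.users");
            for (Row row : rs) {
                System.out.println(row.getInt("id") + " -> " + row.getString("name"));
            }
        }
    }
}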

To compare Scylla and Cassandra, throughput for both was evaluated on a single multi-core server with the following hardware specification:
  • 2x Xeon E5-2695v3: 2.3GHz base, 35M cache, 14 cores (28 with HT)
  • 64GB RAM
  • 2x 400GB Intel NVMe P3700 SSD
  • Intel Ethernet CNA XL710-QDA1

In terms of software, Scylla 0.8 and Cassandra 3.0 were used as the test bed.

In the attached image, average throughput for the test is presented as lines, latency as bars.

Scylla’s measured latency of less than 1 ms for the 99th percentile is significantly lower than Cassandra’s, while providing significantly higher throughput (the single client machine could not fully load the server).  The lack of garbage collection means that there are no externally imposed latency sources, so Scylla latency can be brought even lower.


Scylla’s order of magnitude improvement in performance opens a wide range of possibilities.  Instead of designing a complex data model to achieve adequate performance, use a straightforward data model, eliminate complexity, and finish your NoSQL project in less time with fewer bugs.

Thursday, September 10, 2015

MapReduce Authors


We know that MapReduce was the ice breaker for the traditional computing model, introducing scale-out technology in an easy way. Google has a dedicated research page on MapReduce at http://research.google.com/archive/mapreduce.html . The authors are Sanjay Ghemawat & Jeff Dean from Google Inc.

As a research scholar, I liked the motivation of their research paper: large-scale data processing. It was achievable with supercomputing in earlier days, but the key difference here is parallel execution across hundreds or thousands of CPUs, on commodity boxes and in an easy mode. Moreover, MapReduce provides:
  1. Automatic parallelization and distribution
  2. Fault-tolerance
  3. I/O scheduling
  4. Status and monitoring

Fault-tolerance is handled via re-execution.  On worker failure:
  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in progress reduce tasks
  • Task completion committed through master

Data locality optimization, skipping bad records, and compression of intermediate data are a few of their refinement techniques to boost performance on large-scale data.
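
As a point of comparison, Hadoop's open-source implementation exposes the intermediate-data compression refinement as plain job configuration. A hedged sketch in Java (the property names are from Hadoop 2.x; Google's internal system is not public):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedShuffleJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the map output (intermediate data) before it is shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "job with compressed intermediate data");
        // ... set the mapper, reducer, input and output paths as usual ...
    }
}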

In their research paper, usage for August 2004 was listed with the metrics below:
  • Number of jobs 29,423
  • Average job completion time 634 secs
  • Machine days used 79,186 days
  • Input data read 3,288 TB
  • Intermediate data produced 758 TB
  • Output data written 193 TB
  • Average worker machines per job 157
  • Average worker deaths per job 1.2
  • Average map tasks per job 3,351
  • Average reduce tasks per job 55
  • Unique map implementations 395
  • Unique reduce implementations 269
  • Unique map/reduce combinations 

An amazing and game-changing methodology with ease of use, the result of great minds' research at Google. Here's an opportunity for me to highlight the authors of MapReduce: Sanjay Ghemawat & Jeff Dean.

Sunday, August 16, 2015

MapReduce Execution



On submission of a job by the user, Hadoop initiates the JobTracker process at the master node. Internally, the execution goes through 3 major tasks/steps, as below:

1. Map 
A map task is created that runs the user-supplied map function on each record. The map function takes a key-value pair as input and produces zero or more intermediate key-value pairs. Map tasks are executed in parallel by various machines across the cluster.  

Mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location, which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
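
A minimal word-count mapper in Java illustrates the key-value contract described above; the class name is my own and only the map step is sketched here.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text). Output: zero or more (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit an intermediate key-value pair
            }
        }
    }
}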

2. Shuffle and Sort
It is performed by the reducers (reduce tasks). Each reducer is assigned one of the partitions on which it should work. This involves a flurry of network copies between the nodes in the cluster so that each reducer can get the partition (intermediate key-value) data it was assigned to work on.

This step uses a lot of bandwidth between servers and benefits from very fast networking such as 10GbE.

3. Reduce
After the partition data has been copied, we can start performing a merge sort of the data. A merge sort takes a number of sorted items and merges them together to form a fully sorted list. The reducer then runs the user-supplied reduce function on each key and its group of values.
   
Each reducer produces a separate output file, usually stored in HDFS for reliability. Each reducer output file is usually named part-nnnnn, where nnnnn is the number of the reduce task within the job. The output format of the file is specified by the author of the MapReduce job, and the number of reducers is defined by the developer.
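
The matching word-count reducer sketch (class name again my own): after the shuffle and sort, it receives each word together with all of its intermediate counts and writes one total per word into its part-nnnnn output file.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, [1, 1, ...]) after the shuffle and sort. Output: (word, total count).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // one output record per word
    }
}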

Attached image depicts the execution flow of Hadoop job.

Sunday, August 9, 2015

MapReduce


Computers are always driven by TWO key factors

  1. Storage
  2. Process


So far, we have covered SIX tips on storage, the HDFS model, during the last 2 months. Let us start exploring the processing layer: MapReduce.

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduce and scatter operations.

As depicted in the diagram, MapReduce program is composed of a Map procedure that performs filtering and sorting and a Reduce procedure that performs a summary operation. MapReduce Framework orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
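
In Hadoop's implementation, wiring a Map procedure and a Reduce procedure into a runnable job is done by a small driver program. Here is a hedged sketch; WordCountMapper and WordCountReducer are placeholders for user-supplied classes such as the ones sketched in the MapReduce Execution post above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The Map and Reduce procedures described above.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (key, value) output records.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait; the framework handles parallelism and fault tolerance.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}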

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been generalized.

The HDFS ecosystem has TWO different versions of MapReduce, namely V1 and V2.
  • V1 is the original MapReduce that uses the TaskTracker and JobTracker daemons.
  • V2 is called YARN. YARN splits up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons: a resource manager, an application master, and node managers.


We'll see more details on Map Reduce V1 & V2 in the next blogs.

Saturday, August 8, 2015

Back to School


Go after your dream, no matter how unattainable others think it is! Interesting quote, isn't it?

I have always dreamt about a top-notch premium institute. God blessed me with admission into the Indian Institute of Technology as a Research Scholar, starting August 2015.

At IIT Madras, research is a preoccupation of around 550 faculty members, 3000+ MS and PhD research scholars, more than 800 project staff, and a good number of undergraduates as well. IIT has always been a world-class institute, and its alumni speak to its value in the industry across the globe.

My blessed journey starts with the emerging-technology course Searching and Indexing in Large Datasets. Wishing for your prayers and support as I make the most of this life-fulfilling opportunity.

Sunday, August 2, 2015

Hadoop Command Line


All Hadoop commands are invoked by the bin/hadoop script. Running the hadoop script without any arguments prints the description for all commands.

hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs

Hadoop comes with a number of command-line tools that enable basic filesystem operations. HDFS commands are subcommands of the hadoop command-line utility. To display basic usage information, the command is:

hadoop fs  

Hadoop uses the fs.default.name value from the core-site.xml file if the full URL syntax is not used. Command to list files in a directory:

hadoop fs -ls /user/myid
or
hadoop fs -ls hdfs://NameNode.blah.com:8020/home/myid

Command to upload a file with -put or -copyFromLocal, which copies a file from the local filesystem:

hadoop fs -put /etc/mytest.conf /user/myid/

To download a file from HDFS, use -get or -copyToLocal:

hadoop fs -get /user/myid/mytest.conf ./

To set a replication factor for a file, or for a directory of files recursively with -R:

hadoop fs -setrep 5 -R /user/myid/rep5/

fsck is the command to run the HDFS filesystem checking utility. Run fsck on the files we set the replication factor on and see if it looks correct:

hadoop fsck /user/myid/rep5 -files -blocks -locations

These are the basic, frequently used HDFS commands for command-line operations.

Tuesday, July 28, 2015

Data Node


The DataNode (DN) is the third/last domain in the HDFS architecture and is responsible for storing and retrieving the actual data. In DataNodes, data is simply written to multiple machines in a highly distributed mode.

A very large number of independently spinning disks performing huge sequential I/O operations with independent I/O queues can actually outperform RAID in the specific use case of Hadoop workloads. Typically, DataNodes have a large number of independent disks, each of which stores full blocks.

In general, blocks are nothing more than chunks of a file: binary blobs of data. The DataNode has direct local access to one or more disks, commonly called data disks, in a server on which it is permitted to store block data.

Pointing DataNodes to new disks in existing servers, or adding new servers with more disks, increases the amount of storage in the cluster. Block data is streamed to and from DataNodes directly, so bandwidth is not limited by a single node. These blocks are then replicated across the different nodes (DataNodes) in the cluster. The default replication value is 3, i.e. there will be 3 copies of the same block in the cluster.

DataNodes regularly report their status to the NameNode in a heartbeat. A DataNode also sends a block report to the NameNode when it initially starts up, as well as every hour thereafter. A block report is simply a list of all blocks the DataNode currently has on its disks, and the NameNode keeps track of all the changes.

In a nutshell, the DataNode is the actual storage platform, managing the file blocks within the node.

Wednesday, July 22, 2015

Secondary NameNode


In this series of Big Data tips, we are at the 5th post. Today, we are going to talk about the Secondary NameNode.

Secondary NameNode IS NOT A BACKUP FOR THE NameNode! The name is horrible, but it is what it is.

The Secondary NameNode's job is to periodically merge the image file with the edit log file. This merge operation keeps the edit log file from growing into a large file. The Secondary NameNode runs on a separate machine and copies the image and edit log files periodically; it requires as much memory as the primary NameNode. If the NameNode crashes, you can use the copied image and edit log files from the Secondary NameNode to bring the primary NameNode up. However, the state of the Secondary NameNode lags behind that of the primary NameNode.

The NameNode may not have the available resources (CPU or RAM) to do this while continuing to provide service to the cluster, so the Secondary NameNode applies the updates from the edits file to the fsimage file and sends it back to the primary. This is known as the checkpointing process. The checkpoint file is never changed by the NameNode, only by the Secondary NameNode.

This application of the updates (checkpointing) to the fsimage file occurs every 60 minutes by default, or whenever the NameNode's edits file reaches 64 MB, whichever happens first. Newer versions of Hadoop use a defined number of transactions, rather than file size, to determine when to perform a checkpoint.

If the secondary NameNode is not running at all, the edit log will grow significantly and it will slow the system down. Also, the system will go into safemode for an extended time since the NameNode needs to combine the edit log and the current filesystem checkpoint image.

Sunday, July 19, 2015

Name Node


The NameNode acts as the data manager between the requesting client and the content holders, the DataNodes.

On startup, and every hour thereafter, DataNodes send a block report to the NameNode, along with heartbeats every 3 seconds. The NameNode keeps track of every data change. At any given time, the NameNode has a complete view of all DataNodes in the cluster, their current health, and which blocks they have available. The file-to-block mapping on the NameNode is stored on disk.

The NameNode does not directly send requests to DataNodes. It uses replies to heartbeats to send instructions to the DataNodes. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, and shut down the node. The NameNode stores its filesystem metadata on local filesystem disks; the 2 key files are (1) fsimage and (2) the edits log.

The fsimage file contains a complete snapshot of the filesystem at the inode level. An inode is an internal representation of a file or directory's metadata and contains such information as the file's replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. This design means there is no need to worry about DataNodes' hostnames or IP addresses changing.

Edits file (journal) contains only incremental modifications made to the metadata. It uses a write ahead log which reduces I/O operations to sequential, append-only operations, which avoids costly seek operations and yields better overall performance.

On NameNode startup, the fsimage file is loaded into RAM and any changes in the edits file are replayed, bringing the in-memory view of the filesystem up to date. NameNode filesystem metadata is served entirely from RAM. This makes it fast, but limits the amount of metadata a box can handle: roughly 1 million blocks occupy roughly 1 GB of heap.

Thus, the NameNode performs the filesystem operations for client requests in a highly distributed methodology.

Sunday, July 12, 2015

HDFS Design


In principle, HDFS has a larger block size than most other file systems. The default is 128 MB, and some deployments go as high as 1 GB. Files in HDFS are write-once.

There are three daemons that make up a standard HDFS cluster.
  1. NameNode - 1 per cluster. The centralized metadata server, providing a global picture of the filesystem's state.
  2. Secondary NameNode - 1 per cluster. Performs internal NameNode transaction log checkpointing.
  3. DataNode - Many per cluster.  Stores block data (contents of files).

The NameNode stores its filesystem metadata on local filesystem disks in a few different files, the two most important of which are fsimage and edits. Fsimage contains a complete snapshot of the filesystem metadata, including a serialized form of all the directory and file inodes in the filesystem. The edits file (journal) contains only incremental modifications made to the metadata, and acts as a write-ahead log.

The Secondary NameNode is not a backup of the NameNode; rather, it shares the workload via the checkpointing process, in which the Secondary NameNode applies the updates from the edits file to the fsimage file and sends it back to the primary. Checkpointing is controlled by duration (default 60 mins) and/or the file size and/or the transaction count of the edits file.

The daemon responsible for storing and retrieving block (chunks of a file) data is called the DataNode. DataNodes regularly report their status to the NameNode via heartbeats, by default every 3 seconds. Each DataNode also sends a block report (a list of all usable blocks on its disks) to the NameNode, by default every 60 mins.

The 3 key daemons of the HDFS architecture are represented in the attached diagram.

Saturday, July 11, 2015

HDFS Goals


In the last tip, we saw Google's whitepapers on Big Data. Apache Hadoop originated from Google's whitepapers:

  1. Apache HDFS is derived from GFS  (Google File System).
  2. Apache MapReduce is derived from Google MapReduce
  3. Apache HBase is derived from Google BigTable.


This Tip #2 is on HDFS goals & roles in Big Data. What is HDFS?

HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of COMMODITY hardware. HDFS is not a POSIX-compliant filesystem.

In HDFS, each machine in a cluster stores a subset of the data (blocks) that makes up the complete filesystem. The filesystem's metadata is stored on a centralized server, acting as a directory of block data and providing a global picture of the filesystem's state.

Top 5 Goals of HDFS

  1. Store millions of large files, each greater than tens of gigabytes, and filesystem sizes reaching tens of petabytes.
  2. Use a scale-out model based on inexpensive commodity servers with internal JBOD  ("Just a bunch of disks") rather than RAID to achieve large-scale storage. 
  3. Accomplish availability and high throughput through application-level replication of data.
  4. Optimize for large, streaming reads and writes rather than low-latency access to many small files. 
  5. Support the functionality and scale requirements of MapReduce processing.

Thursday, July 9, 2015

ThankYou Card


A lot of people don’t realize this, but the Pulse app was built as a class project at Stanford University in 2010. Pulse today powers a lot of content you see on LinkedIn’s homepage feed. They reached an incredible milestone for the LinkedIn publishing platform: 1 million professionals have now written a post on LinkedIn.

Over 1 million unique writers publish more than 130,000 posts a week on LinkedIn. About 45% of readers are in the upper ranks of their industries: managers, VPs, CEOs, etc. The top content-demanding industries are tech, financial services and higher education. The average post now reaches professionals in 21 industries and 9 countries.

Against this big metric, I'm also making a tiny contribution via Pulse. In conjunction with the 1 million post celebration, I just received a 'Thank You' card from LinkedIn.

If you're writing on LinkedIn or interested in starting, join the Writing on LinkedIn Group. Not sure? What's stopping you? Continuous Learning & Continuous Sharing is the mantra for our industry.

Sunday, July 5, 2015

Google WhitePaper


Dear readers, I recently got a few requests to share some 'useful tips' on the Big Data ecosystem.

A tip is a small piece or part fitted to the end of an object. Let me fill this quarter, Q3 2015, with Big Data tips.

For those interested in the history, the super base class of the Big Data ecosystem is Google's whitepapers.

The first, presented in 2003 by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, describes a pragmatic, scalable, distributed file system optimized for storing enormous datasets, called the "Google File System", or GFS. In addition to simple storage, GFS was built to support large-scale, data-intensive, distributed processing applications.

The following year, 2004, another paper, titled "MapReduce: Simplified Data Processing on Large Clusters", was presented by Jeffrey Dean and Sanjay Ghemawat, defining a programming model and accompanying framework that provided automatic parallelization, fault tolerance, and the scale to process hundreds of terabytes of data in a single job over thousands of machines.

When paired, these two systems could be used to build large data processing clusters on relatively inexpensive, commodity machines. The Google whitepapers were industry breakthroughs and directly inspired the development of HDFS and Hadoop MapReduce, respectively.

Google's WhitePaper References are available at:
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Stay tuned for continuous tips & tricks to (l)earn more.

Wednesday, July 1, 2015

Netflix Migration


Last weekend, I read an interesting business use case on Big Data: a Cassandra migration. We know about the streaming media leader Netflix; this is about their Big Data migration from traditional storage.
About Netflix:
Netflix is the world's leading Internet television network, with more than 48 million streaming members in more than 40 countries. It has successfully shifted its business model from DVDs by mail to online media and today leads the streaming media industry with more than $1.5 billion in digital revenue. Netflix dominates peak-time Internet usage, and its shares continue to skyrocket with soaring numbers of subscribers.
Trigger Point:
In 2010, Netflix began moving its data to Amazon Web Services (AWS) to offer subscribers more flexibility across devices with 'Cloud-First/Mobile-First' Strategy. At the time, Netflix was using Oracle as the back-end database and was approaching limits on traffic and capacity with the ballooning workloads managed in the cloud.

Interestingly, the entire migration of more than 80 clusters and 2500+ nodes was completed by only two engineers.
Use Case: 
Systems that understand each person's unique habits and preferences and bring to light products and items that a user may be unaware of and not looking for. In a nutshell, it personalizes viewing for over 50 million customers.
Challenges:
Challenges related to the given use case:
  • Affordable capacity to store and process immense amounts of data 
  • Data Volume more than 2.1 billion reads and 4.3 billion writes per day
  • Single point of failure with Oracle’s legacy relational architecture
  • Achieving business agility for international expansion
Solution
The Big Data solution, Cassandra, delivers a persistent data store with:
  • 100% up-time and cost effective scale across multiple data centers
  • DataStax expert support behind the results
  • It delivers a throughput of more than 10 million transactions per second
  • Effortless creation/management of new data clusters across various regions
  • Capture of every detail of customer viewing and log data
The attached Netflix deployment diagram was published on the Netflix Tech Blog at http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html

Tuesday, June 23, 2015

AWS Enterprise Summit 2015


AWS (Amazon Web Services) Enterprise Summits are designed to educate new customers about the AWS platform and to offer existing customers deep technical content to be more successful with AWS.
Today (23 Jun), I had the chance to attend the AWS Enterprise Summit at Chennai, India. In a nutshell, it covered a keynote address, a panel discussion, customer use cases on the technical track, etc. Here are the highlights of today's sessions:
Development Focus
"Moore's law" is the observation that, over the history of computing hardware, the number of transistors in a dense integrated circuit has doubled approximately every two years.
It is true and reflected in the IT industry. In 1971, Intel's first processor, the 4004, contained 2,300 embedded transistors. Now, in 2015, Intel's 18-core Xeon Haswell-EP has over 5.5 billion transistors. Amazing growth, right!!!
As hardware rapidly expands its capabilities, software's development focus has migrated in the below order:
  • Mainframe - 1970s
  • Personal Computer (PC) - 1980s
  • Data Centre (RDBMS) - 1990s
  • High Performance Computing (HPC) - 2000s
  • Cloud; Horizontal Scaling - 2010s
Amazon Web Service (AWS) platform is the key player in 2010s era.

Business Model


Comparing traditional IT with emerging IT across business dimensions:
  • Cost Reduction: large capital expenditure (CapEx) vs. low, variable, on-demand CapEx
  • Cost Model: CapEx focused vs. operational expenditure (OpEx) focused
  • Scalability: basic computing vs. a broad and deeper platform
  • Lead to Innovation: responsible for periodic upgrades vs. new features arriving daily
  • Agile: slow to roll out new features vs. rapid, ready-to-use features
  • Improved Availability: traditional DR vs. automatic DR

Costing Strategy

A cost drop is achievable through TWO key factors in business theory:
1. Large customer base
2. Better economy of scale
In alignment with this costing strategy, Amazon has made multiple historical price reductions, i.e. 48 price cuts since 2006.

Pricing Philosophy

Like a mobile plan, pricing starts from a base package and moves to advanced packages based on the user's demand. It is up to the customer to select their choice. AWS has multiple pricing plans, as below:

  • Free Tier: free usage with no commitment; for PoCs and getting started
  • On Demand: pay by the hour with no long-term commitment; for spiky or seasonal workloads
  • Reserved: low one-time payment with a significant discount; for committed utilization
  • Spot: bid for unused capacity, with prices fluctuating based on demand and supply; for time-insensitive or transient workloads
  • Dedicated: instances launched on hardware dedicated to a single customer; for highly intensive or compliance workloads

Data Growth Trend

It is interesting to observe the data growth in our industry:
  • 7.9 zettabytes of data persisted
  • 90% of the data growth happened in just the last 2 years
  • 5+ billion devices in use
  • 966 exabytes of data transfer

Big Data Platform

AWS Platform has the end-end solution for Big Data use cases with their own cloud based tools:
  1. Kinesis - allows for large data stream processing and real-time analytics
  2. Elastic MapReduce (EMR) - API styled web service that uses Hadoop ecosystem
  3. Relational Database Service (RDS) - Scalable relational database in the cloud
  4. Simple Storage Service (S3) - Opt for virtually unlimited cloud & internet storage 
  5. RedShift - Fast, fully managed, petabyte-scale data warehouse in the cloud
  6. Dynamo DB - Fully managed NoSQL database service that provides fast and predictable performance with seamless scalability

Closing Note

Statistics indicate that 6% of enterprises are into Big Data and 9% of enterprises are on cloud platforms to adapt to this data growth.
Thus, AWS (Amazon Web Services) is mature enough in each and every space of the Big Data and cloud platforms. In fact, their own line of business, amazon.com, the world's leading online store, helped to validate their solutions and products before serving them to the industry. Awesome work, AWS.