Saturday, December 14, 2013

BigData Perception

Some interesting words on Big Data perception in the current industry!

Sunday, December 8, 2013

No Shared Resource

I have been pretty busy with one of our big data projects, mostly with the move to production and related activities.

One of the key lessons learnt is not to use shared network storage such as NAS or SAN. It is highly recommended to use local storage, the key factor being 'data locality'. You can understand this concept by understanding how MapReduce and HDFS work in the Big Data world.

As you know, HDFS is a distributed file system. Think of it as RAID over the network. Suppose you have a 10 GB file and 10 servers, each with a disk that can burst at 1 GB/s, and assume the network can keep up with the combined rate. If you read the whole file from one server at 1 GB/s, it takes 10 seconds. What if the file is spread across all 10 servers and you read every piece at once? Then it takes about 1 second. In essence, that is what HDFS allows: a burst from your cluster that's bigger than the burst you could get from any individual node.
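The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is my own, and it assumes the file is split evenly across the servers and that the network never becomes the bottleneck.

```python
def read_time_seconds(file_gb: float, nodes: int, disk_gb_per_s: float) -> float:
    """Time to read a file split evenly across `nodes`, with all disks read in parallel.

    Assumption: the network can carry the combined rate of all disks.
    """
    per_node_gb = file_gb / nodes
    return per_node_gb / disk_gb_per_s


single = read_time_seconds(10, 1, 1.0)     # whole 10 GB file on one server
parallel = read_time_seconds(10, 10, 1.0)  # 1 GB piece on each of 10 servers
print(single, parallel)
```

The ratio between the two numbers is the cluster-wide burst that a single node cannot deliver on its own.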

As another point, MapReduce was built by Google on the divide and conquer concept. The given problem is broken into small independent chunks, and each chunk is sent to a node as part of the MapReduce algorithm. Once the partial answers have been calculated in parallel, the results are sent back and combined in the reduce step. If your application adds network hops and latency, with multiple nodes contending for the same resources, then you lose the high performance you chose Hadoop for by running it on a shared platform.
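To make the divide and conquer idea concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the three function names are my own, and everything runs in one process, whereas on a real cluster each map call would run on the node that already holds its chunk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk: str):
    """Map: emit a (word, 1) pair for every word in one independent chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single answer."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a chunk stored on a different node.
chunks = ["big data big cluster", "data locality matters", "big data"]
mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because the map calls share nothing, they can run side by side; shared storage reintroduces exactly the contention this design avoids.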

As the last point, the application by default has no priority control over a shared resource like NAS/SAN. So the app loses its performance benchmark even though it is running under the MapReduce concept.

The conclusion: do not use shared resources when building a high-performance, massively parallel processing solution on the MapReduce methodology.

Saturday, November 9, 2013

Facebook Presto

Potentially raising the bar on SQL scalability, Facebook has released an open source SQL query engine called Presto that was built to work with petabyte-sized data warehouses.

Currently, more than 1,000 Facebook employees use Presto daily to run 30,000 interactive queries, involving over a petabyte of processing. The company has scaled the software to run on a 1,000-node cluster.

Unlike Hive, Presto does not use MapReduce, which involves writing results back to disk. Instead, Presto compiles parts of the query on the fly and does all of its processing in memory. As a result, Facebook claims Presto is 10 times better in terms of CPU efficiency and latency than the Hive and MapReduce combo.
Now, Facebook wants other data-driven organizations to use and, it hopes, refine Presto. The company has posted the software’s source code online. The software is already being tested by a number of other large Internet services, namely AirBnB and Dropbox.

Saturday, November 2, 2013

Big Data Expo

Big Data Expo is the single most effective event for you to learn how to use your own enterprise data – processed in the cloud – most effectively to drive value for your business.

A recent Gartner report predicts that the volume of enterprise data overall will increase by a phenomenal 650% over the next five years.

These two unstoppable enterprise IT trends, Cloud Computing and Big Data, will converge in Silicon Valley at Big Data Expo 2013 Silicon Valley – being held November 4 - 7, 2013, at the Santa Clara Convention Center in Santa Clara, CA.

It offers in Silicon Valley a vast selection of technical and strategic breakout sessions, General Sessions, Industry Keynotes, our signature discussion "Power Panels" and a bustling Expo floor complete with two busy Demo Theaters so that as a delegate you can kick the tires of solutions and offerings, and discuss one-on-one with all the leading Cloud and Big Data players what they are offering and how to make use of it in your particular situation.

Just as Big Data and Cloud solutions will be side by side on the Expo floor, so they are in the conference program. We're including below a sampler of the breakouts you can look forward to in the Big Data part of the overall technical program...and if you look at the welter of company logos at the very bottom you'll get a foretaste of the hundreds of companies in attendance on the Expo floor.

The surest way to help yourself and your company obtain maximum value from your enterprise data and to understand why Big Data is going mainstream is by attending Big Data Expo 2013 Silicon Valley.

Saturday, October 19, 2013

BigData Elite

The fund, called Big Data Elite, is backed by celebrity investors like Ron Conway, one of Silicon Valley’s most prominent angel investors, and Andreessen Horowitz, the $2.5 billion venture capitalist firm, as well as Social+Capital Partnership and Anand Rajaraman, whose social media start-up was bought by Wal-Mart Stores in 2011.

In an announcement on Thursday, Big Data Elite described itself as a venture lab and early stage fund that will offer a six-month program beginning January 2014. The fund will choose 10 start-ups or individuals from a list of 20. Those chosen will work at Big Data Elite’s offices in San Francisco and will have access to advisers who work at Facebook, Zynga, Netflix, LinkedIn, Riot Games and a handful of other companies that rely heavily on data analysis.

Companies like Netflix, Google and Amazon rely heavily on the data their customers provide through their Internet searches, and over recent years they have poured large sums of money into data teams of computer scientists who sift through search and transaction data to find patterns.

“Big Data is behind every sector,” Mr. Venios said. “By 2016, the budget spend for Big Data will be larger in corporate business development than information technology,” Mr. Venios added, citing data from McKinsey & Company.

The fund will invest $150,000 to $1 million in each participant, in exchange for equity — typically 6 percent initially — and convertible notes.

Sunday, October 13, 2013


QueryIO is a Hadoop-based SQL and Big Data Analytics solution, used to store, structure, analyze and visualize vast amounts of structured and unstructured Big Data. It is especially well suited to enable users to process unstructured Big Data, give it a structure and support querying and analysis of this Big Data using standard SQL syntax.

QueryIO enables you to leverage the vast and mature infrastructure built around SQL and relational databases and utilize it for your Big Data Analytics needs.

QueryIO builds on Apache Hadoop's scalability and reliability and enhances basic Hadoop by adding data integration services, cluster management and monitoring services as well as big data querying and analysis services. It makes it easy to query and analyze big data across hundreds of commodity Compute+Store cluster nodes and petabytes of data in an easy and logical manner.

Saturday, October 5, 2013

Highly funded startup

Riding the Big Data tidal wave, MongoDB has raised $150 million in financing to accelerate its global expansion. The round was led by an unnamed financial services company and attracted industry heavy hitters such as Salesforce and Intel.

Now the most valuable startup in New York, MongoDB offers a highly scalable document-oriented database for analytical and fast-changing operational workloads. The platform has been downloaded over 5 million times, and deployed by SAP, MTV Networks, Foursquare and other big name companies.

As part of its aggressive growth strategy, MongoDB will expand its partner ecosystem and target new markets. The company also plans to double its headcount to 600 over the next 12 months.

Originally known as 10gen, the NoSQL vendor rebranded in August to “get back into alignment” with its flagship product. Shortly afterwards, rival MemSQL updated its distributed in-memory database to support JSON, a popular format for representing semi-structured data. The update also improves index scan performance and expands support for SQL.

Sunday, September 15, 2013

Hadoop on Windows

Hortonworks, a leading contributor to and provider of enterprise Apache Hadoop, today announced the general availability of Hortonworks Data Platform 1.3 (HDP) for Windows, a 100-percent open source data platform powered by Apache Hadoop.  HDP 1.3 for Windows is the only Apache Hadoop-based distribution certified to run on Windows Server 2008 R2 and Windows Server 2012, enabling Microsoft customers to build and deploy Hadoop-based analytic applications. This release is further demonstration of the deep engineering collaboration between Microsoft and Hortonworks.

New functionality in HDP 1.3 for Windows includes HBase, Flume 1.3.1, ZooKeeper 3.4.5 and Mahout 0.7.0. These new capabilities enable customers to exploit net new types of data to build new business applications as part of their modern data architecture.

Hortonworks Data Platform 1.3 for Windows is now available for download.

Friday, September 6, 2013

Intel Big Data

Intel, increasingly customizing server chips for customers, is now tuning chips for workloads in big data.

Software is becoming an important building block in chip design, and customization will help applications gather, manage and analyze data a lot quicker, said Ron Kasabian, general manager of big data solutions at Intel.

The plan includes developing accelerators or cores for big-data type workloads. For example, Intel is working with Chinese company Bocom to implement the Smart City project, which tries to solve counterfeit license plate problems in China by recognizing plates, car makes and models. The project involves sending images through server gateways, and Intel is looking to fill software gaps by enhancing the silicon.

A lot of research is also taking place at Intel labs on stream processing and graph analytics as the company designs chips and tweaks software.

Sunday, September 1, 2013

BigData Ayasdi

Ayasdi’s Insight Discovery platform highlights include automatic discoveries from complex data and operationalization of end-to-end analytic workflow tied to complex and expensive business problems. It computes across hundreds to millions of attributes to automatically find similarity amongst data points and surface hidden patterns and anomalies in the data. Users can also deploy an API to automate integration across in-house and third party applications to fully operationalize the end-to-end analytic workflow.

Topology isn't new, but using it to analyze and understand big data is a novel idea -- one that could empower business users to find value in very large data sets without having to consult data scientists or write algorithms or models.

That's the promise of Ayasdi, a Palo Alto-based startup that uses topological data analysis to quickly glean meaning from big data. The company says its approach to analyzing big data is unique. Based on the work of Stanford University mathematics professor and Ayasdi cofounder Gunnar Carlsson, Ayasdi's enterprise-focused tools apply the abstruse concepts of topology to quickly identify relevant patterns in data.

Sunday, August 25, 2013


HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written. By default, HCatalog supports RCFile, CSV, JSON, and SequenceFile formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
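The SerDe idea is easier to see in a toy example. The sketch below is plain Python, not the HCatalog or Hive API: the class names and the `read_table` helper are hypothetical, and it only illustrates how a table layer can present the same rows regardless of whether they are stored as CSV or JSON lines.

```python
import csv
import io
import json


class CsvSerDe:
    """Hypothetical SerDe: one row <-> one CSV line (values come back as strings)."""

    def __init__(self, columns):
        self.columns = columns

    def serialize(self, row: dict) -> str:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="").writerow([row[c] for c in self.columns])
        return buf.getvalue()

    def deserialize(self, line: str) -> dict:
        values = next(csv.reader([line]))
        return dict(zip(self.columns, values))


class JsonSerDe:
    """Hypothetical SerDe: one row <-> one JSON object per line."""

    def __init__(self, columns):
        self.columns = columns

    def serialize(self, row: dict) -> str:
        return json.dumps({c: row[c] for c in self.columns})

    def deserialize(self, line: str) -> dict:
        obj = json.loads(line)
        return {c: obj[c] for c in self.columns}


def read_table(lines, serde):
    """The 'table' view: callers get rows and never see the storage format."""
    return [serde.deserialize(line) for line in lines]


columns = ["user", "city"]
row = {"user": "alice", "city": "Austin, TX"}
csv_rows = read_table([CsvSerDe(columns).serialize(row)], CsvSerDe(columns))
json_rows = read_table([JsonSerDe(columns).serialize(row)], JsonSerDe(columns))
print(csv_rows == json_rows)
```

Supporting a new storage format then means writing one more class with the same two methods, which is roughly the role a custom SerDe plays.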

Wednesday, August 14, 2013

IBM Big Data training

IBM hopes to help create the next generation of “big-data” specialists through a series of partnerships with universities around the world, as well as influence the curriculum.

Nine new agreements announced Wednesday involve Georgetown University, George Washington University, Rensselaer Polytechnic Institute, the University of Missouri, and Northwestern University in the U.S. IBM is also beginning big-data programs at Dublin City University, Mother Teresa Women’s University in India, the National University of Singapore, and the Philippines’ Commission on Higher Education.

The agreements will result in a variety of programs, including a master of science degree in the business analytics track at George Washington University; an undergraduate course titled “Big Data Analytics” at the University of Missouri; and a center for business analytics at the National University of Singapore.

In its announcement, IBM cited U.S. Bureau of Labor statistics that found there will be a 24 percent rise in demand for people with “data analytics skills” over the next eight years.

While companies are managing to fill big data positions, there’s a caveat. “They are finding the candidates but a lot of what they’re doing is poaching candidates from other companies,” a spokesman said. “One of the reasons I would expect IBM is making these partnerships to make sure there’s enough engineers to meet the demand they’re seeing.”

Thursday, August 8, 2013

Ideal data scientist

FICO, a leading predictive analytics and decision management software company, today released an infographic showing the characteristics of a good data scientist — what a Harvard Business Review article called the “sexiest job of the 21st century.”

The rise of Big Data has fueled demand for data scientists. Job postings for analytic scientists reportedly jumped 15,000 percent between the summer of 2011 and 2012. McKinsey & Company predicted the U.S. will see a 50- to 60-percent shortfall in analytic scientists by 2018.

“There’s more demand than ever for data scientists, but at the same time we demand more from job candidates,” said Dr. Andrew Jennings, chief analytics officer at FICO and head of FICO Labs. “FICO has been hiring data scientists — or analysts, as we used to call them — since 1956. We’ve learned that excellent math skills alone just aren’t enough. We want someone who can solve problems for businesses, and explain their insights to people who don’t have a Ph.D. in operations research.”

The FICO infographic identifies eight characteristics of a good data scientist. These include the ability to tease out insights from data, communicate with business users and focus on the practical applications of their work.

Saturday, August 3, 2013

Social Intelligence

No matter what industry a business operates in, data is now being used more than ever before to gain an advantage. Social is only one of the newest layers in this big data bonanza, and some companies that were early adopters are starting to mature their models into Social Intelligence.

Enterprises have an average of 178 social media accounts, the report found, and an array of departments and executives are increasingly active there. However, when it comes to things like customer relationship management, analytics and market research, social data is mostly isolated. This leads to disjointed efforts across a company, and doesn’t allow for a strategic, holistic view to be put into place.

It’s becoming a roadblock as companies seek to really tap into social data insights, so companies need to develop a common framework for social data collection and integration. Not doing so could result in poorer customer experiences, and of course, missed opportunities.

Altimeter collected input from 34 enterprise organizations on how to integrate social data, and how to build holistic systems that scale for its report.

Sunday, July 28, 2013

Big Data Stream

Stream computing is a new paradigm necessitated by new data-generating scenarios, such as the ubiquity of mobile devices, location services, and sensor pervasiveness. A crucial need has emerged for scalable computing platforms and parallel architectures that can process vast amounts of generated streaming data.

In static data computation (the left-hand side of attached diagram), questions are asked of static data. In streaming data computation (the right-hand side), data is continuously evaluated by static questions.

Let me give a simple example. On a financial trading platform, applications are traditionally written to analyse historical records in batch mode. That is, the data is first preserved in the data warehouse; then, based on the user's request/query, the result is produced and returned to the consumer. This is the first use case.

With big data streaming technology, the requests (like the market trend of IT stocks) are pre-built. As the data arrives/streams in, the results are published to the prescribed subscriber/consumer. This is the second use case. Isn't it cool to get a taste of this technology?
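Here is a small Python sketch of the two use cases. The trade records, the symbols and the `StandingQuery` class are all made up for illustration; no real streaming product works exactly like this, but the inversion is the same: in batch mode the data waits for the question, in streaming mode the question waits for the data.

```python
# Hypothetical sample trades; symbols and prices are invented.
trades = [
    {"symbol": "AAA", "sector": "IT", "price": 41.0},
    {"symbol": "BBB", "sector": "Energy", "price": 90.5},
    {"symbol": "CCC", "sector": "IT", "price": 35.5},
]


# Use case 1: batch. The data is stored first; the question arrives later.
def batch_query(warehouse, sector):
    prices = [t["price"] for t in warehouse if t["sector"] == sector]
    return sum(prices) / len(prices)


# Use case 2: streaming. The question is registered first; the data arrives later.
class StandingQuery:
    def __init__(self, sector, subscriber):
        self.sector = sector
        self.subscriber = subscriber  # callback that receives each new result
        self.count = 0
        self.total = 0.0

    def on_event(self, trade):
        if trade["sector"] != self.sector:
            return
        self.count += 1
        self.total += trade["price"]
        self.subscriber(self.total / self.count)  # publish the running average


published = []
query = StandingQuery("IT", published.append)
for trade in trades:  # stands in for a live feed
    query.on_event(trade)

print(batch_query(trades, "IT"), published)
```

The subscriber sees an updated answer after every matching trade, instead of asking again each time.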

Thursday, July 25, 2013

Storm at Yahoo

Yahoo! is enhancing its web properties and mobile applications to provide its users a personalized experience based on interest profiles. To compute user interest, we process billions of events from our over 700 million users, and analyze 2.2 billion content items every day. Since users' interests change over time, we need to update user profiles to reflect their current interests.

Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of the design patterns applicable to many Yahoo! use cases.

The Yahoo! big-data platform enables Hadoop applications and Storm applications to share data via shared storage such as HBase. Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.

Sunday, July 21, 2013


Hadoop, the clear king of big-data analytics, is focused on batch processing. This model is sufficient for many cases (such as indexing the web), but other use models exist in which real-time information from highly dynamic sources is required. Solving this problem resulted in the introduction of Storm from Nathan Marz (now with Twitter by way of BackType). Storm operates not on static data but on streaming data that is expected to be continuous. With Twitter users generating 140 million tweets per day, it's easy to see how this technology is useful.

Storm is more than a traditional big-data analytics system: It's an example of a complex event-processing (CEP) system. CEP systems are typically categorized as computation and detection oriented, each of which can be implemented in Storm through user-defined algorithms. CEPs can, for example, be used to identify meaningful events from a flood of events, and then take actions on those events in real time.
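As a rough illustration of the detection-oriented side of CEP, the Python sketch below flags a key once it produces a burst of events inside a sliding time window. It does not use the Storm API; the `BurstDetector` class, the threshold and the sample events are assumptions of mine, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flags a key when it produces `threshold` events within `window_seconds`."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.recent = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Process one event as it arrives; return True if it completes a burst."""
        times = self.recent[key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


detector = BurstDetector(threshold=3, window_seconds=10)
events = [("alice", 0), ("bob", 1), ("alice", 4), ("alice", 8), ("bob", 30), ("alice", 40)]
alerts = [(key, ts) for key, ts in events if detector.on_event(key, ts)]
print(alerts)
```

Each event is handled once, on arrival, and only a small window of state is kept per key, which is what lets this style of processing keep up with a continuous stream.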

Thursday, July 4, 2013


Splunk is getting on board with a new Hadoop-based application it cheekily calls Hunk. Hunk takes Splunk’s popular analytics platform and puts it to work on data stored in Hadoop. Businesses that use Hadoop can now use this for exploration and visualization of data. Its key features are:
  • Splunk Virtual Index: Splunk virtual index technology enables the “seamless use” of the entire Splunk technology stack, including the Splunk Search Processing Language (SPL), for interactive exploration, analysis and visualization of data stored anywhere, as if it was stored in a Splunk software index.
  • Explore data in Hadoop from one place: Hunk is designed for interactive data exploration across large, diverse data sets on top of Hadoop.
  • Interactive analysis of data in Hadoop: Hunk enables users to drive deep analysis, detect patterns, and find anomalies across terabytes and petabytes of data.
  • Create custom dashboards: Hunk users can combine multiple charts, views and reports into role-specific dashboards which can be viewed and edited on laptops, tablets or mobile devices.
Splunk has 5,600 customers, including half of the Fortune 100. It says the new Hunk product will target both new and existing customers, as long as they use Hadoop.

Thursday, June 27, 2013

Citrix hypervisor open source

IT departments reevaluating their server virtualization spending have another free option to choose from.

Citrix Systems Inc. today released its XenServer 6.2 hypervisor to the open source community. All of the features previously in the XenServer Platinum Edition will be available at no charge.

Citrix had made parts of XenServer code open source in the past, as part of various cloud computing initiatives, but this marks the first time the entire product has been made available to the community.

Competitor VMware's pricing has come under fire, but Microsoft's Hyper-V -- not XenServer -- has been the biggest beneficiary. More than 22% of respondents to TechTarget's 2013 Data Center and Reader's Choice survey identified Microsoft as their primary server virtualization hypervisor -- up from 13% last year. XenServer's numbers actually dropped slightly, from 4.1% in 2012 to 3.3% in 2013, while VMware's tumbled from 73% to 60%.

Meanwhile, cloud computing has created an inflection point where organizations are reconsidering how many workloads they need to run in-house and how much money they need to spend to do so.

Sunday, June 16, 2013

Tibco Streambase

Palo Alto, Calif.-based Tibco is a midsize enterprise company that competes with giants like IBM and Oracle. It has been around since the late ’90s, and it provides infrastructure software that its clients can use in the cloud or on site.

Tibco targeted Streambase for its data analytics technology, which is used for algorithmic trading, and it’s a known brand in the capital markets. Tibco is also popular with Wall Street firms, a highly lucrative source of business.

“This combination extends our event-processing abilities and provides a terrific opportunity to address a growing number of use cases for data in motion – in financial services and beyond,” said Tibco chief technology officer Matt Quinn in a statement.

This spring, Tibco acquired Maporama, a French location analytics company, which it will roll into its Spotfire offering. Spotfire, along with Tibco’s messaging product, Tibbr, and some other cloud products, is growing rapidly.

Sunday, June 9, 2013

DataSift Intro

DataSift, an enterprise social data company, has added Instagram, Facebook Page and Google+ data to its managed data offering, and partnered with analytics and visualization vendor Tableau on data integration.

DataSift is now offering aggregate data from these popular sites through their public developer interfaces, and companies will be able to filter relevant data from them for behavior and sentiment insights. Large companies sometimes have dozens and dozens of Facebook Pages across different departments, and DataSift will allow them all to be viewed in one stream, for example.

Mining sentiment from social media with DataSift includes features like extracting topics and sentiment around what fans are posting on those fan pages. Additionally, businesses can figure out how deftly their content is picking up new fans, a good sign it is finding new customers.

The attached diagram displays the data integration from outside the enterprise with business data for maximum insight.

Wednesday, June 5, 2013

IBM 10gen Collaboration

IBM and 10gen, the MongoDB company, have announced they are collaborating on a database development standard to push MongoDB as a core NoSQL database for enterprises building web and mobile apps.

Speaking at a press event at the IBM Innovate 2013 conference here, Matt Asay, vice president of business development and corporate strategy at 10gen, said: “IBM embraces open source communities. And IBM is working with 10gen in establishing MongoDB as an industry standard for NoSQL databases. But this by no means indicates that IBM will be diminishing its investment in its own proprietary databases.” Asay was quick to note that he was not trying to speak for IBM.

The identification of a standard NoSQL database is important because millions of developers designing web and mobile apps are using popular NoSQL database technology like MongoDB, and companies need the tools to combine data from these new apps with enterprise databases like IBM’s DB2 that power organizations of all sizes today. By embracing MongoDB IBM is providing mobile developers with the ability to tap into critical data managed by DB2 systems and enable organizations to extend their business through compelling enterprise apps.

Sunday, June 2, 2013

Netflix BigData

Netflix is the big kahuna of Web media businesses, with 33 million subscribers in more than 40 countries. As Netflix's "watch now" streaming service has grown, the company has had to rethink its data and storage strategies to cope with ballooning workloads managed in the cloud.

Today, the company has nearly completed its migration from Oracle to the NoSQL database Cassandra, improving availability and essentially eliminating downtime incurred by database schema changes.

Netflix launched its streaming service in 2007, using the Oracle database as the back end. In 2010, Netflix began moving its data to Amazon Web Services. The current step is to replace its Oracle database with Apache Cassandra, an open source NoSQL database known for its scalability and enterprise-grade reliability.

With billions of reads and writes daily, Netflix relies on NoSQL database Cassandra to replace a legacy Oracle deployment.

Saturday, May 11, 2013


Business intelligence firm Alteryx has debuted Project Edition, an analytics package it has dubbed instant analytics, the kind to be deployed for a specific project and one that doesn't require assistance from IT to run.

Project Edition, as the name implies, is a scaled down version of the Alteryx Strategic Analytics 8.5 system, and the company has released it in the hopes of exposing a wider customer base to the full paid version. Project Edition, while free to use, is limited because it only allows data to be run a handful of times for a given purpose.

Data can come from a variety of sources like Excel, text files, data warehouses, cloud apps, Hadoop or social media, and can be integrated and cleansed by Alteryx. This single analytics workflow is meant to help teams crunch data for a particular assignment even if they don't have any coding or programming skills. Once the data is analyzed, it can be presented in reports, data files or Tableau, a data visualization tool.

Sunday, March 24, 2013

BigData Summit

Big data. It just keeps getting bigger. Why? Because top management’s attention is laser-focused on big data as the solution to many corporate problems. More on this at the CIO Big Data summit in New York in the first week of May 2013.

Gartner sales performance analyst Patrick Stakenas says: “There’s no question that big data is the biggest trend in business intelligence, and it will remain that way for the foreseeable future. But it’s not just an IT issue. It’s a management issue. Relying too much on big data analytics risks losing the personal approach to selling.”

The huge stores of data that companies have accumulated haven’t added true value to the enterprise yet, because data requires context in order to be useful. Context includes a clearly articulated business strategy for using the data, an understanding of competitive shifts, an understanding of the market’s perceptions about your company and your products, and much, much more.

Gartner’s Laney says, for example, that social media is a great source of data and information about customers, but it can cause real problems for executives unless it’s put into context.

Even after all the data is collected and analyzed, there’s still one more pitfall to look for, adds Adam Sarner, Gartner’s big data and CRM analyst. “The successful big data project isn’t about collecting massive amounts of this data,” Sarner says. “It’s about making the right information accessible and action-oriented for the company and the customer for core CRM.”

Friday, March 8, 2013

Big Data Splunk

The initial focus of 'big data' has been about its increasing volume, velocity and variety — the "three Vs" — with little mention of real world application. Now is the time to get down to business.

Splunk is the platform for machine data. It’s the easy, fast and resilient way to collect, analyze and secure the massive streams of machine data generated by your IT systems and technology infrastructure—whether it’s physical, virtual or in the cloud.

Splunk software collects machine data securely and reliably from wherever it’s generated. It stores and indexes the data in real time in a centralized location and protects it with role-based access controls. Splunk lets you search, monitor, report and analyze your real-time and historical data.

451 Research, and three "real world" case studies of Splunk customers handling the variety and velocity of their ever increasing unstructured data. 451 Research believes that in order to deliver value from 'big data', businesses need to look beyond the nature of the data and re-assess the technologies, processes and policies they use to engage with that data

Saturday, March 2, 2013

Oracle Big Data

Enterprise systems have long been designed around capturing, managing and analyzing business transactions, e.g. marketing, sales and support activities. However, lately, with the evolution of automation and Web 2.0 technologies like blogs, status updates and tweets, there has been an explosive growth in machine and consumer generated data. Defined as “Big Data”, this data is characterized by attributes like volume, variety, velocity and complexity and essentially represents machine and consumer interactions.

The big data analytics lifecycle includes steps like acquire, organize and analyze. The analytics process starts with data acquisition. The structure and content of big data can’t be known upfront and is subject to change in-flight, so the data acquisition systems have to be designed for flexibility and variability: no predefined data structures, and dynamic structures are the norm. The organization step entails moving the data into well-defined structures so relationships can be established and the data across sources can be combined to get a complete picture.
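The three steps can be sketched in a few lines of Python. This is a toy illustration, not Oracle's tooling: the function names, field names and sample records are all hypothetical.

```python
from collections import Counter


def acquire(raw_events):
    """Acquire: keep every record as-is; no schema is enforced up front."""
    return list(raw_events)


def organize(records):
    """Organize: project loosely structured records onto one fixed structure."""
    rows = []
    for record in records:
        rows.append({
            # Different sources name the same thing differently.
            "customer": record.get("customer") or record.get("user") or "unknown",
            "channel": record.get("channel", "unknown"),
        })
    return rows


def analyze(rows):
    """Analyze: count interactions per customer across all sources."""
    return Counter(row["customer"] for row in rows)


raw = [
    {"user": "c1", "channel": "tweet", "text": "great service"},
    {"customer": "c1", "channel": "support", "ticket": 17},
    {"customer": "c2", "channel": "sales"},
    {"device": "sensor-9", "reading": 21.5},
]
summary = analyze(organize(acquire(raw)))
print(summary)
```

Only the organize step knows about the quirks of each source, so a new or changed source does not ripple into the analysis code.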

Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships. The attached diagram helps you understand how Oracle acquires, organizes, and analyzes your big data.

Wednesday, February 20, 2013

Red Hat Big Data

Open source vendor Red Hat announces a Big Data strategy that spans the full enterprise software stack, both in the public cloud and on-premise.

Red Hat Enterprise Linux (RHEL) is arguably Raleigh, North Carolina-based Red Hat's flagship product, but the operating system arena is not by any means its only focus.  Red Hat also has big irons in the storage, cloud and developer fires, and its Big Data strategy announcement addressed all three of these.  Big Data is now a relevant factor in the entire enterprise software stack.

Red Hat rightly pointed out that the majority of Big Data projects are built on open source software (including Linux, Hadoop, and various NoSQL databases) and so it's fitting that such an important company in the open source world as Red Hat would announce its Big Data strategy.

Red Hat big data components are illustrated in the attached system diagram.

Saturday, February 2, 2013

Big Data Fourth Dimension

We know that Big Data has three pillars, namely Volume, Velocity and Variety. I recently learnt of a new (fourth) dimension: Veracity. What does it mean? Accuracy (conformity with truth or fact) or truthfulness (devotion to the truth).

1. Volume
Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.

  • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Convert 350 billion annual meter readings to better predict power consumption

2. Velocity
Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.

  • Scrutinize 5 million trade events created each day to identify potential fraud
  • Analyze 500 million daily call detail records in real-time to predict customer churn faster

3. Variety
Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

  • Monitor hundreds of live video feeds from surveillance cameras to target points of interest
  • Exploit the 80% data growth in images, video and documents to improve customer satisfaction

4. Veracity
One in three business leaders don’t trust the information they use to make decisions. How can you act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grow.

Tuesday, January 29, 2013

Polyglot Persistence

In computing, a polyglot is a computer program or script written in a valid form of multiple programming languages, which performs the same operations regardless of the language used to compile or interpret it.

In the NoSQL world, polyglot persistence means using a variety of different data storage technologies for different kinds of data in any decent-sized enterprise. Complex applications combine different types of problems, so picking the right language for each job may be more productive than trying to fit all aspects into a single language. In the Big Data era, there has been an explosion of interest in new languages, particularly functional languages like Clojure, Scala, and Erlang. The same reasoning applies to storage: in a new strategic enterprise application, persistence should no longer be assumed to be relational.

A common example is configuring an Apache Solr server to stay in sync with a SQL-based database. Then you can do scored keyword/substring/synonym/stemmed/etc queries against Solr but do aggregations in SQL.
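
The search-plus-SQL split above can be sketched as follows. To keep the example self-contained, it uses an in-memory SQLite database and a toy dictionary-based inverted index as a stand-in for a real Solr server; the `products` table, the sample rows, and the helper names are all hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQL store: the system of record, used for aggregations.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# Toy inverted index standing in for the search server: word -> set of ids.
index = defaultdict(set)


def add_product(pid, name, price):
    """Write to the SQL store, then mirror the searchable text into the index."""
    db.execute("INSERT INTO products VALUES (?, ?, ?)", (pid, name, price))
    for word in name.lower().split():
        index[word].add(pid)


def search(word):
    """Keyword lookup answered by the index, not by SQL."""
    return sorted(index.get(word.lower(), set()))


def average_price(ids):
    """Aggregation answered by SQL over the ids the search returned."""
    if not ids:
        return None
    marks = ",".join("?" * len(ids))
    query = f"SELECT AVG(price) FROM products WHERE id IN ({marks})"
    return db.execute(query, ids).fetchone()[0]


add_product(1, "Red Widget", 10.0)
add_product(2, "Blue Widget", 14.0)
add_product(3, "Red Gadget", 30.0)

red_ids = search("red")
red_avg = average_price(red_ids)
print(red_ids, red_avg)
```

The important design point is the single write path in `add_product`: every change goes to both stores, so the index never answers a query about a row the database does not have.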

Another example is to use the same datastore, but store the same data in multiple aggregated formats. For example, having a dataset that is rolled up by date (each day getting a record) can also be stored rolled up by user (each user getting a record). Depending on the query you want to run, you choose the set that will give you the best performance. If the data is large enough, the overhead of keeping the two sets synchronized more than pays for itself in increased query speed.
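
As a rough sketch of the dual rollup idea above, assuming a hypothetical stream of (date, user, value) events, both aggregates can be maintained on the same write path, and each query then reads whichever one fits:

```python
from collections import defaultdict

# Hypothetical raw events: (date, user, value).
events = [
    ("2013-01-01", "alice", 3),
    ("2013-01-01", "bob", 5),
    ("2013-01-02", "alice", 2),
]

by_date = defaultdict(int)  # one record per day
by_user = defaultdict(int)  # one record per user


def record(date, user, value):
    """Write path: update both rollups together so they stay in sync."""
    by_date[date] += value
    by_user[user] += value


def total_for_date(date):
    """Answered from the per-day rollup."""
    return by_date.get(date, 0)


def total_for_user(user):
    """Answered from the per-user rollup."""
    return by_user.get(user, 0)


for event in events:
    record(*event)

print(total_for_date("2013-01-01"), total_for_user("alice"))
```

Each query is a single key lookup instead of a scan over the raw events, which is where the extra write cost is repaid.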

Here is the attached reference architecture of polyglot persistence in a typical web app.

Sunday, January 27, 2013

Microsoft Big Data

As a big fan of Microsoft, I am learning about their Big Data strategy.

Recently, Microsoft announced HDInsight, Microsoft’s Hadoop distribution for managing, analysing and making sense out of large volumes of data. Microsoft worked with Hortonworks and the Hadoop ecosystem to broaden the adoption of Hadoop and the unique opportunity for organizations to derive new insights from Big Data.

Microsoft submitted a proposal to the Apache Software Foundation for enhancements that allow Hadoop to run on Windows Server, and is also in the process of submitting further proposals for a JavaScript framework and a Hive ODBC Driver.

The JavaScript framework simplifies programming on Hadoop by making JavaScript a first class programming language for developers to write and deploy MapReduce programs. The Hive ODBC Driver enables connectivity to Hadoop data from a wide range of Business Intelligence tools including PowerPivot for Excel.

As a result of collaboration with several leading Big Data vendors like Karmasphere, Datameer, and HStreaming, we now have Big Data solutions built on a Hadoop-based service for Windows Server and Windows Azure.

Overall, Microsoft’s strategy seems to be to offer a path of least resistance for its customers to adopt Big Data, by extending existing tools such as SQL Server and Office to work seamlessly with new data types and allowing companies to take advantage of their existing investments while making new ones.

Saturday, January 19, 2013

Big Data Business

In the current state of the IT world, you can get the answer to any question from multiple sources. Vast amounts of data are all around you. All you need to do is analyze it and know how to fetch what you want.

Until two years ago though, much of that data wasn’t structured, nor was it leveraged in the most effective manner, which is why the multi-billion dollar multinational decided to create a separate team, and assigned a dedicated leader, to spearhead a big data initiative.

Let's get into the business advantages of Big Data. There are three key business benefits:

  1. Allows companies to streamline costs.
  2. Helps introduce relevant products to the market.
  3. Drives up market share.

BloomReach, a provider of web site content optimization solutions based on Big Data, recently announced Series C funding of $25 million. GoodData, which provides dashboards for understanding sales, marketing, and other business functions, announced $25M of additional funding in July. And Predictive CRM company Lattice is backed by Sequoia Capital.

Big Data Applications (BDAs) are a hot topic, and this is reflected in the attached graph.

According to the State of the Indian CIO Survey, 40 percent of Indian IT leaders plan to implement big data analytics over the course of this year, while 16 percent say they are already in the process of implementing it.

Saturday, January 12, 2013

Big Data Characteristics

Traditionally, big data describes data that’s too large for existing systems to process. The main characteristics of big data are defined by the 3Vs, namely volume, velocity, and variety.

Volume, the original characteristic, describes the size of data relative to the processing capability. Today a large number may be a few terabytes. Overcoming the volume issue requires technologies that store vast amounts of data in a scalable fashion and provide distributed approaches to querying or finding that data.

Velocity describes the frequency at which data is generated, captured, and shared. The growth in sensor data from devices and web-based click stream analysis now creates requirements for more real-time use cases. The velocity of large data streams powers the ability to parse text, detect sentiment, and identify new patterns.

Variety describes the proliferation of data types: social, machine-to-machine, and mobile sources add new data types to traditional transactional data. Data no longer fits into neat, easy-to-consume structures. New types include content, geo-spatial, hardware data points, location based, log data, machine data, metrics, mobile, physical data points, process, RFIDs, search, sentiment, streaming data, social, text, and web. The addition of unstructured data such as speech, text, and language increasingly complicates the ability to categorize data.

In my opinion, Volume carries the least weight and Variety the most. When designing a Big Data solution, you need to prioritize the characteristics in the order Variety, Velocity, and Volume.

Sunday, January 6, 2013

Big Data

Happy New Year to all readers! Last month I was fully engaged in an enterprise Big Data project, hence the small gap in my blog.

Information Technology (IT) experts have all dropped the cloud computing buzz and hopped onto the big data bandwagon.

Is Big Data a new concept? No. The concept has been around for four decades under the name enterprise data warehouse (EDW), although the focus of an EDW is primarily on internal structured data.

The simplest view of a data warehouse is that it takes all the operational data to one place as a single point of truth for the organization, and all combinations of analytical reports are generated out of it.

A data architect (or data scientist) works to identify a set of differentiating data from a massive data set. Differentiating data is modeled and derived once the product, service, consumer, and partner trends are studied and understood. The consumer, partner, product, and economic data is unstructured and lies in uncharted territory. A massive data set in uncharted territory includes internal and external data, both structured and unstructured. This massive data set is called big data.

The objective of this blog series is to present the key concepts of big data based on my experience.