Tuesday, January 29, 2013

Polyglot Persistence

In computing, a polyglot is a program or script that is valid in multiple programming languages and performs the same operations regardless of which language is used to compile or interpret it.

In the NoSQL world, polyglot persistence means using a variety of data storage technologies for different kinds of data within any decent-sized enterprise.  Complex applications combine different types of problems, so picking the right tool for each job can be more productive than trying to fit every aspect into a single technology.  In the Big Data era there has been an explosion of interest in new languages, particularly functional languages like Clojure, Scala and Erlang.  Likewise, in a new strategic enterprise application, persistence no longer needs to be purely relational.

A common example is configuring an Apache Solr server to stay in sync with a SQL-based database. Then you can do scored keyword/substring/synonym/stemmed/etc queries against Solr but do aggregations in SQL.
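As a minimal sketch of this division of labor (the Solr core name, field names, and the `products` table are hypothetical), the application might route scored keyword queries to Solr while keeping aggregations in SQL:

```python
from urllib.parse import urlencode

# Hypothetical Solr core kept in sync with the products table in SQL.
SOLR_BASE = "http://localhost:8983/solr/products/select"

def solr_keyword_query(text, rows=10):
    """Build a scored keyword query URL for Solr."""
    params = {
        "q": f"name:{text} OR description:{text}",
        "defType": "edismax",   # relevance-scoring query parser
        "fl": "id,name,score",  # return the relevance score per hit
        "rows": rows,
    }
    return SOLR_BASE + "?" + urlencode(params)

def sql_aggregation(category):
    """Aggregations stay in the relational store that Solr mirrors."""
    return ("SELECT category, COUNT(*) AS n, AVG(price) AS avg_price "
            "FROM products WHERE category = ? GROUP BY category"), (category,)

url = solr_keyword_query("camera")
sql, args = sql_aggregation("electronics")
```

Each store answers the question it is good at: Solr handles stemming and scoring, while the database does the grouping and averaging.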

Another example is to use the same datastore, but store the same data in multiple aggregated formats. For example, having a dataset that is rolled up by date (each day getting a record) can also be stored rolled up by user (each user getting a record). Depending on the query you want to run, you choose the set that will give you the best performance. If the data is large enough, the overhead of keeping the two sets synchronized more than pays for itself in increased query speed.
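The dual-rollup idea can be sketched with an in-memory SQLite database (table and column names are hypothetical): every write updates both rollups, and each query reads from whichever rollup matches its access pattern.

```python
import sqlite3

# The same click stream kept in two rollups:
# one keyed by date, one keyed by user.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clicks_by_day  (day  TEXT PRIMARY KEY, clicks INTEGER);
    CREATE TABLE clicks_by_user (user TEXT PRIMARY KEY, clicks INTEGER);
""")

def record_click(day, user):
    """Every write updates both rollups so they stay synchronized."""
    db.execute("INSERT INTO clicks_by_day VALUES (?, 1) "
               "ON CONFLICT(day) DO UPDATE SET clicks = clicks + 1", (day,))
    db.execute("INSERT INTO clicks_by_user VALUES (?, 1) "
               "ON CONFLICT(user) DO UPDATE SET clicks = clicks + 1", (user,))

record_click("2013-01-28", "alice")
record_click("2013-01-28", "bob")
record_click("2013-01-29", "alice")

# Pick the rollup that matches the query: a single-row lookup,
# with no scan or GROUP BY at read time.
day_total  = db.execute("SELECT clicks FROM clicks_by_day  WHERE day  = ?",
                        ("2013-01-28",)).fetchone()[0]   # -> 2
user_total = db.execute("SELECT clicks FROM clicks_by_user WHERE user = ?",
                        ("alice",)).fetchone()[0]        # -> 2
```

The extra write cost (two upserts per event) is the synchronization overhead the post mentions; the payoff is that both query shapes become constant-time lookups.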

Here is the attached reference architecture of Polyglot Persistence in a typical Web App.

Sunday, January 27, 2013

Microsoft Big Data

As a big fan of Microsoft, I have been studying their Big Data strategy.

Recently, Microsoft announced HDInsight, Microsoft’s Hadoop distribution for managing, analysing and making sense of large volumes of data. Microsoft worked with Hortonworks and the Hadoop ecosystem to broaden the adoption of Hadoop and give organizations a unique opportunity to derive new insights from Big Data.

Microsoft submitted a proposal to the Apache Software Foundation for enhancements that let Hadoop run on Windows Server, and is also in the process of submitting further proposals for a JavaScript framework and a Hive ODBC Driver.

The JavaScript framework simplifies programming on Hadoop by making JavaScript a first class programming language for developers to write and deploy MapReduce programs. The Hive ODBC Driver enables connectivity to Hadoop data from a wide range of Business Intelligence tools including PowerPivot for Excel.

As a result of collaboration with several leading Big Data vendors like Karmasphere, Datameer and HStreaming, we now have Hadoop-based Big Data solutions on Windows Server & Windows Azure.

Overall, Microsoft’s strategy seems to be to offer a path of least resistance for its customers to adopt Big Data – by extending existing tools such as SQL Server and Office to work seamlessly with new data types, allowing companies to take advantage of their existing investments while making new ones.

Saturday, January 19, 2013

Big Data Business

In the current state of the IT world, you can get an answer to almost any question from multiple sources.  Vast amounts of data surround you; all you need to do is analyze it and know how to fetch what you want.

Until two years ago, though, much of that data wasn’t structured, nor was it leveraged in the most effective manner – which is why one multi-billion-dollar multinational decided to create a separate team, with a dedicated leader, to spearhead a big data initiative.

Let's get into the business advantages of Big Data. Three key business benefits:

  1. Allows companies to streamline costs.
  2. Helps introduce relevant product to the market.
  3. Drives up market share.

BloomReach, a provider of web site content optimization solutions based on Big Data, recently announced Series C funding of $25 million. GoodData, which provides dashboards for understanding sales, marketing, and other business functions, announced $25M of additional funding in July. And Predictive CRM company Lattice is backed by Sequoia Capital.

Big Data Applications (BDAs) are the hot topic, and that is reflected in the attached graph.

According to State of the Indian CIO Survey, 40 percent of Indian IT leaders plan to implement big data analytics over the course of this year—while 16 percent say they are already in the process of implementing it.

Saturday, January 12, 2013

Big Data Characters

Traditionally, big data describes data that’s too large for existing systems to process.  The main characteristics of big data are defined by the 3Vs, namely volume, velocity and variety.

Volume, the original characteristic, describes the size of data relative to processing capability. Today a large data set may be a few terabytes.  Overcoming the volume issue requires technologies that store vast amounts of data in a scalable fashion and provide distributed approaches to querying or finding that data.

Velocity describes the frequency at which data is generated, captured, and shared. The growth in sensor data from devices and in web-based clickstream analysis now creates requirements for more real-time use cases.  The velocity of large data streams powers the ability to parse text, detect sentiment, and identify new patterns.

Variety refers to the proliferation of data types from social, machine-to-machine, and mobile sources, which adds new data types to traditional transactional data.  Data no longer fits into neat, easy-to-consume structures. New types include content, geo-spatial, hardware data points, location-based, log data, machine data, metrics, mobile, physical data points, process, RFID, search, sentiment, streaming data, social, text, and web data.  The addition of unstructured data such as speech, text, and language increasingly complicates the ability to categorize data.

In my opinion, Volume carries the least weight and Variety the most.  When designing the best Big Data solution, you need to assign importance in the order of Variety, Velocity and Volume.

Sunday, January 6, 2013

Big Data

Happy New Year to all readers!  Last month I was fully engaged in an enterprise Big Data project, hence the small gap in my blog.

Information Technology (IT) experts have all dropped off the cloud computing buzz and hopped onto the big data bandwagon.

Is Big Data a new concept? – No. The concept has been around for four decades under the name enterprise data warehouse (EDW), and the focus of the EDW is primarily on internal structured data.

The simplest view of a data warehouse is to bring all the operational data into one place as the single point of truth for the organization, from which every combination of analytical reports is generated.
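This single-point-of-truth idea can be sketched with an in-memory SQLite database (the source systems, table and column names are all hypothetical): operational data from separate systems lands in one warehouse table, and reports are generated only from that table.

```python
import sqlite3

# Two hypothetical operational systems plus one consolidated warehouse table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pos_sales (product TEXT, amount REAL);   -- store system
    CREATE TABLE web_sales (product TEXT, amount REAL);   -- web system
    CREATE TABLE warehouse (source TEXT, product TEXT, amount REAL);
""")
db.executemany("INSERT INTO pos_sales VALUES (?, ?)",
               [("widget", 10.0), ("gadget", 5.0)])
db.executemany("INSERT INTO web_sales VALUES (?, ?)",
               [("widget", 7.5)])

# The load step: everything is copied into the single point of truth.
db.execute("INSERT INTO warehouse SELECT 'pos', product, amount FROM pos_sales")
db.execute("INSERT INTO warehouse SELECT 'web', product, amount FROM web_sales")

# Any analytical report is generated from the consolidated table alone.
report = db.execute("""
    SELECT product, SUM(amount) FROM warehouse
    GROUP BY product ORDER BY product
""").fetchall()
# report -> [('gadget', 5.0), ('widget', 17.5)]
```

Because every report reads from the same consolidated table, the figures are consistent across reports regardless of which operational system originally produced the rows.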

A data architect (or data scientist) works to identify a set of differentiating data within a massive data set. Differentiating data is modeled and derived as product, service, consumer and partner trends are studied and understood. The consumer, partner, product and economic data is unstructured, in uncharted territory. A massive data set in this uncharted territory includes internal and external data, both structured and unstructured. This massive data set is called big data.

The objective of this blog series is to present the key concepts of big data based on my experience.