Tuesday, November 11, 2014


Our team recently integrated Vertica with the existing DataStax Cassandra tech stack in a non-production environment.

In Day-1 pre-production end-to-end processing, the performance results were really impressive, with an average improvement of about 90%. For example, a 20-minute Hive job was cut down to a 2-minute Vertica fetch.

The technology "trade secrets" behind this are:
  1. In-memory execution rather than disk-I/O-bound processing
  2. Powerful hardware: 256 GB of RAM and 12 cores on each node
  3. Proprietary high-performance MPP appliances, in the same class as Teradata and Exadata
  4. Column-oriented (rather than traditional row-oriented) storage for fast fetches and analytics
  5. In-memory distributed processing (inspired by the MapReduce algorithm)
  6. High levels of built-in compression and encoding in the data layer
  7. High availability via a replication factor across cluster nodes
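Two of the points above, column orientation (4) and compression (6), are easy to see in miniature. The sketch below is purely illustrative, not Vertica's actual engine: it contrasts row-oriented and column-oriented layouts for an analytic aggregate, then shows why sorted, low-cardinality columns compress well with run-length encoding.

```python
# Illustrative sketch (not Vertica's actual storage engine): why columnar
# layouts help analytic queries.

rows = [
    {"user_id": 1, "region": "east", "spend": 120.0},
    {"user_id": 2, "region": "east", "spend": 80.0},
    {"user_id": 3, "region": "west", "spend": 95.0},
]

# Row-oriented: an aggregate must walk every field of every row.
total_row_store = sum(r["spend"] for r in rows)

# Column-oriented: each column lives in its own contiguous array, so the
# same aggregate reads only the one column it needs.
columns = {
    "user_id": [1, 2, 3],
    "region": ["east", "east", "west"],
    "spend": [120.0, 80.0, 95.0],
}
total_col_store = sum(columns["spend"])

assert total_row_store == total_col_store == 295.0

# A column of repeated values also compresses well, e.g. run-length encoding:
def rle(values):
    """Collapse runs of equal values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

print(rle(columns["region"]))  # [('east', 2), ('west', 1)]
```

In a real columnar database the same ideas apply at scale: a query that scans one column out of fifty touches roughly 2% of the bytes a row store would, and compression shrinks that further.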

Recently, Facebook selected the HP Vertica Analytics Platform as one component of its big data infrastructure. Ref: http://www.vertica.com/2013/12/12/welcoming-facebook-to-the-growing-family-of-hp-vertica-customers/

Choosing the right technology for the right use case is key to success on a Big Data platform. Enjoy the continuous learning on next-generation Big Data.

Saturday, November 1, 2014

Mainframes Meet Big Data

While web companies are building massive-scale data warehouses on top of Hadoop and analyzing data in every manner under the sun, Lockheed Martin is trying to help its government systems embrace a new world without breaking their mainframes.

Programs such as Social Security and food stamps still run on mainframes and COBOL; others implemented enterprise data warehouses in the 1990s that are now reaching their scalability limits, and none of it is going anywhere. In some cases, particularly for programs and applications that can't go offline, the process is like changing the engine of a train while the train is still running.

The next step is data-preparation software from a startup called Trifacta. That company, like a handful of other startups including Paxata and Tamr, uses machine learning and a relatively streamlined user experience to simplify transforming data from its raw form into something that analytic software or applications can actually use. These tools are in the same vein as legacy data-integration or ETL tools, only much easier to use and designed with big data stores like Hadoop in mind.
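The raw-to-ready transformation such tools automate can be sketched in a few lines. This is a minimal, hand-rolled example of the general idea, not Trifacta's product; the field names, formats, and cleaning rules are hypothetical.

```python
# A minimal sketch of raw-to-analysis-ready data preparation: trim strings,
# reconcile inconsistent date formats, and fill missing numeric values.
from datetime import datetime

raw_records = [
    {"name": "  Alice ", "signup": "2014-11-01", "spend": "120.50"},
    {"name": "Bob",      "signup": "11/01/2014", "spend": ""},
]

def parse_date(text):
    """Try a few known date formats; raw feeds rarely agree on one."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def clean(record):
    """Normalize one raw record into an analysis-ready row."""
    return {
        "name": record["name"].strip(),
        "signup": parse_date(record["signup"]),
        "spend": float(record["spend"]) if record["spend"] else 0.0,
    }

ready = [clean(r) for r in raw_records]
print(ready[0]["name"])   # Alice
print(ready[1]["spend"])  # 0.0
```

What the commercial tools add on top of this kind of logic is inference: suggesting the parsing and cleaning rules from the data itself rather than requiring them to be hand-coded for every feed.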