Sunday, December 21, 2014

HP Vertica on Hadoop

A couple of weeks back, I wrote about my work experience with an HP Vertica performance-improvement strategy. Here is another big data update on HP Vertica: its integration with the Hadoop ecosystem.

Hewlett-Packard is building a bridge between traditional SQL database analytics and Big Data on Hadoop systems. The result is HP Vertica for SQL on Hadoop.

HP officially unveiled the new offering — though reports about the offering (previously code-named Dragline) surfaced in May 2014.

HP said, “expanded SQL-on-Hadoop exploration and cost-optimized storage eliminates the need to move data and supports even more formats for data exploration, including Parquet, Thrift, Avro and CEF. Businesses can now ingest, explore and visualize more data more quickly and easily with their choice of Business Intelligence/data visualization environments.”

HP also confirms that the SQL-on-Hadoop support is vendor agnostic. In other words, HP’s offering will work with such leading Hadoop distributions as Apache Hadoop, Cloudera, Hortonworks and MapR.

Sunday, December 14, 2014

Google vs Microsoft

Google is really getting serious about selling Google Apps for Work, its Microsoft Office killer, to more businesses.

Currently, the standard commission it pays is 20%, sources close to Google told the Wall Street Journal. To do the math: at the low end, Google sells Apps for $50/user/year, which gives the reseller a $10/user/year commission while Google keeps $40. On the high end, it costs $120/user/year, which equates to a $48/year commission for the reseller while Google keeps $72.

For resellers that sell more, Google will now give them a bigger commission. Plus they will be encouraged to sell other services to Google Apps customers (like support, training, and customization).

This move is one of a series that Google is making to win over Microsoft.

Google hasn't publicly released details of the new commission plan. But it did post a blog entry telling resellers all the great new things it plans to do for them.

Saturday, December 6, 2014

Datastax 4.6

As I worked to roll out the final version of our product, I missed a couple of weeks of writing in November. Here is an interesting industry Big Data update: the DataStax 4.6 launch.

This week at Cassandra Summit Europe 2014, DataStax launched DSE 4.6, with an in-memory big data architecture using Spark, Shark, Solr, and more, for IoT (Internet of Things), web and mobile applications.

Highlights of this release:

  1. Seamless integration with corporate security standards, LDAP and Active Directory
  2. Apache Spark streaming analytics integrated with real-time transaction processing delivers new levels of personalization at global scale
  3. Enhanced backup and restore service sets new standards in customer data protection
  4. Advancements in OpsCenter 5.1 further enhance operational simplicity in multi-datacenter environments on-premises, in the cloud, or both

DataStax Enterprise 4.6 is currently available and DataStax OpsCenter 5.1 will be generally available in January 2015.

DataStax Download
DSE 4.6 Release Notes
Summit Ref

Tuesday, November 11, 2014


Our team recently integrated Vertica with the existing DataStax Cassandra tech stack in a non-prod region.

On Day-1 pre-prod end-to-end processing, the performance results were really impressive, with an average improvement of 90%. As an example, a 20-minute Hive job was slashed down to a 2-minute Vertica fetch.

The technology trade secrets are:
  1. In-memory execution rather than disk I/O processing
  2. Powerful in-memory capacity: 256 GB RAM with 12 cores on each node
  3. Proprietary high-performance (MPP) appliance design, as in Teradata and Exadata
  4. Column-oriented (versus traditional row-oriented) database for speedy fetches/analytics
  5. In-memory distributed processing (inspired by the MapReduce algorithm)
  6. High level of built-in compression & encoding in the data abstraction
  7. High availability using a replication factor on cluster nodes
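As a hedged illustration of points 4 and 6 above (toy data, not Vertica internals), a column-oriented layout lets an aggregate touch only one contiguous array instead of every field of every row:

```python
# Toy illustration: row-store vs. column-store layouts for one aggregate.
rows = [
    {"user": "a", "region": "us", "amount": 10},
    {"user": "b", "region": "eu", "amount": 20},
    {"user": "c", "region": "us", "amount": 30},
]

# Row-oriented: the aggregate must walk every full row.
row_total = sum(r["amount"] for r in rows)

# Column-oriented: the same data stored as one contiguous array per column.
columns = {
    "user": ["a", "b", "c"],
    "region": ["us", "eu", "us"],
    "amount": [10, 20, 30],
}

# The aggregate reads only the "amount" column: fewer bytes touched, and a
# run of identical values in a column ("us", "us") also compresses well.
col_total = sum(columns["amount"])

print(row_total, col_total)  # both 60, via very different I/O patterns
```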

Recently, Facebook selected the HP Vertica Analytics Platform as one component of its big data infrastructure.

Choosing the right technology for the right use case is key to success in a Big Data platform. Enjoy the continuous learning on the Big Data next generation.

Saturday, November 1, 2014

Mainframe at BigData

While web companies are building massive-scale data warehouses on top of Hadoop and analyzing data in every manner under the sun, Lockheed Martin is trying to help its government systems embrace a new world without breaking their mainframes.

Programs such as Social Security and food stamps still run on mainframes and COBOL; others implemented enterprise data warehouses in the 1990s that are reaching their scalability limits, and none of it is going anywhere. In some cases, particularly for programs and applications that can't go offline, the process is like changing the engine of the train while the train is still running.

As the next step, data-preparation software is coming from a startup called Trifacta. That company, like a handful of other startups including Paxata and Tamr, is using machine learning and a relatively streamlined user experience to simplify the process of transforming data from its raw form into something that analytics software or applications can actually use. They are in the same vein as legacy data-integration or ETL tools, only a lot easier and designed with big data stores like Hadoop in mind.

Saturday, October 25, 2014

Apache Tez

The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The two main design themes for Tez are:

1. Empowering end users by:

  • Expressive dataflow definition APIs
  • Flexible Input-Processor-Output runtime model
  • Data type agnostic
  • Simplifying deployment

2. Execution Performance

  • Performance gains over Map Reduce
  • Optimal resource management
  • Plan reconfiguration at runtime
  • Dynamic physical data flow decisions

By allowing projects like Apache Hive to run a complex DAG of tasks, Tez can process data that earlier took multiple MR jobs in a single Tez job, as shown in the attached image.
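As a rough sketch of the idea (the real Tez API is Java; this only shows the DAG concept), a complex job is a set of tasks executed in dependency order inside one job, instead of being chained across separate MR jobs:

```python
# Conceptual sketch of DAG-style task execution (not the actual Tez API).
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on; a DAG engine runs the
# whole graph as one job rather than several MapReduce jobs.
dag = {
    "map_a": set(),
    "map_b": set(),
    "join": {"map_a", "map_b"},
    "aggregate": {"join"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # dependencies always come before their dependents
```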

Friday, October 17, 2014

Cloudera Enterprise 5.2

This Tuesday, Cloudera launched the latest version of its big data enterprise software, Cloudera Enterprise 5.2, with a bevy of features aimed at improving analytics and integration.

With Cloudera 5.2, the focus is on building products to deliver on the promise of the enterprise data hub. In particular, new capabilities make the technology more accessible to users who are not data scientists and also increase the level of security, addressing two hurdles that can stand in the way of Hadoop adoption.

The software company provides an enterprise version of Apache Hadoop, which is widely used for big data analytics. Cloudera improved security, cloud management, and its Impala analytics database, now at version 2.0.

According to Cloudera, the latest release of its flagship software better integrates with databases, data warehouses and common enterprise applications.

The big picture for Cloudera is to integrate well into enterprises building an analytics fabric designed to crunch data. Cloudera Enterprise 5.2 is compliant with PCI security certifications to crunch sensitive data.

Friday, September 26, 2014

Apache Argus

Security on Hadoop has been catch-as-catch-can for much of the product's lifetime, but a new Apache project that entered incubation earlier this year -- Apache Argus -- addresses it in a consistent manner.

With the delivery of YARN, which powers Hadoop’s ability to run multiple workloads operating on shared data sets within a single cluster, a heightened requirement for a centralized approach to security policy definition and coordinated enforcement has surfaced.

Argus will deliver this comprehensive approach to central security policy administration across the core enterprise security requirements of authentication, authorization, accounting and data protection. It already extends baseline features for coordinated enforcement across Hadoop workloads, from batch and interactive SQL to real-time, in Hadoop. Hortonworks will also leverage the extensible architecture of this security platform to apply policies consistently against additional Hadoop ecosystem components (beyond HDFS, Hive, and HBase) including Storm, Solr, Spark, and more. It truly represents a major step forward for the Hadoop ecosystem by providing a comprehensive approach, all completely as open source.

Argus did not start as a community initiative; it's the open-sourced version of a commercial product, XA Secure, that Hortonworks acquired and transformed into an Apache-hosted project. The idea, as Hortonworks explained earlier this year, is to provide a centralized way to define and enforce security policy across Hadoop and all its components. This includes access controls down to the folder and file level in HDFS, and to the table and column level in Hive and HBase. But don't expect automatic Argus integration -- this project has a long road ahead for Hortonworks and everyone else contributing to the Hadoop ecosystem.

In May, Hortonworks acquired XA Secure and promised to contribute the technology to the Apache Software Foundation. In June, Hortonworks made it available for all to download and use from its website, and now the technology officially lives on as Apache Argus, an incubator project within the ASF.

Monday, September 22, 2014

Cassandra Summit 2014

Cassandra Summit 2014 was the single largest gathering of Cassandra users on the planet. It was successfully held September 10-12, 2014, in San Francisco, California, USA.

At this conference, participants could learn how the world's most successful companies are transforming their businesses using Apache Cassandra™, the world's fastest and most scalable distributed database management system. From best practices and how-tos to expert panels and case studies, participants had amazing opportunities to learn how to conduct business in a whole new way.

As part of this summit, Big Data professionals benefited from:

  • 60+ sessions featuring groundbreaking use cases from some of the world's hottest companies, such as Google, FedEx, Sony, Netflix, Safeway, Neiman Marcus, and eBay
  • Expert tips to succeed in today's digital economy
  • Specialized training to grow your career
  • Networking with the world's Apache Cassandra experts

Summit materials are available online, up to the 2013 sessions.

Monday, September 8, 2014

DataStax: 106 million

DataStax, the company that delivers Apache Cassandra™ to the enterprise, announced that it has secured $106 million in Series E financing. This amount, together with the $84 million investment in previous rounds, brings the total invested to date to $190 million.

The $106 million raise is proof of investor confidence in the growing global enterprise demand for distributed database management systems.

In just under a year, DataStax has experienced extremely rapid growth and now has:

  • More than 350 employees (100 percent increase since December 2013) in six global locations: Santa Clara, Austin, London, Paris, Sydney and Tokyo.
  • Customers which include 25 percent of the Fortune 100 enterprises.
  • A rapidly expanding global sales force that has grown the customer base in over 50 countries and accounts for revenue growth of more than 125 percent year over year.
  • A powerful and experienced executive team that is focused on scaling the company internationally. Key hires in the past year include known industry veterans Dennis Wolf as CFO, John Schweitzer as SVP of Field Operations, Tony Kavanagh as CMO and Clint Smith as General Counsel.

With this new round of funding, DataStax expects to accelerate growth and deliver sustainable long-­term value and success to customers and partners.

Saturday, September 6, 2014

Facebook Flux

Along similar lines to Google's Cloud Dataflow, Facebook has already developed a data flow architecture called Flux. Flux works within the Facebook messaging system. It avoids cascading effects by preventing nested updates: simply put, Flux has a single-directional data flow, meaning additional actions aren't triggered until the data layer has completely finished processing.

Flux is the application architecture that Facebook uses for building client-side web applications. It complements React's composable view components by utilizing a unidirectional data flow. It's more of a pattern rather than a formal framework, and you can start using Flux immediately without a lot of new code.
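A minimal sketch of the unidirectional pattern, with hypothetical names rather than Facebook's actual implementation: every action goes through one dispatcher, which refuses nested dispatches until the current one finishes:

```python
# Hypothetical sketch of Flux's one-way flow: action -> dispatcher -> store.
class Dispatcher:
    def __init__(self):
        self.callbacks = []
        self.dispatching = False

    def register(self, callback):
        self.callbacks.append(callback)

    def dispatch(self, action):
        # Flux forbids cascading updates: no dispatch while one is in flight.
        if self.dispatching:
            raise RuntimeError("cannot dispatch in the middle of a dispatch")
        self.dispatching = True
        try:
            for cb in self.callbacks:
                cb(action)
        finally:
            self.dispatching = False

store = {"unread": 0}
dispatcher = Dispatcher()
dispatcher.register(
    lambda a: store.update(unread=store["unread"] + 1)
    if a == "MESSAGE_RECEIVED" else None
)

dispatcher.dispatch("MESSAGE_RECEIVED")
print(store["unread"])  # 1
```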

FlumeJava, from which Cloud Dataflow evolved, likewise centers on creating easy-to-use, efficient parallel pipelines. At Flume's core are “a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies.”

Saturday, August 30, 2014

Amazon Kinesis

Last week, I wrote about Google's Cloud Dataflow design. It seems Amazon Kinesis is the competitor to Google's Cloud Dataflow. Kinesis is a managed service designed for real-time data streaming, developed by industry leader Amazon Web Services.

Kinesis allows you to write applications for processing data in real-time, and works in conjunction with other AWS products such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, or Amazon Redshift.

Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. It can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.

Three key observed Use Cases are:
1. Application and Operational Logs
2. Real-Time Clickstream Analytics
3. Machine learning-based recommendation and ranking

The attached diagram represents the second use case.
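Under the hood, Kinesis routes each record to a shard by taking an MD5 hash of its partition key and mapping it into a shard's hash-key range, which preserves per-key ordering. A simplified sketch of that routing (shard count and keys are made up):

```python
# Simplified sketch of Kinesis-style shard routing via MD5 of a partition key.
import hashlib

NUM_SHARDS = 4
SPACE = 2 ** 128  # MD5 yields a 128-bit hash key

def shard_for(partition_key: str) -> int:
    hash_key = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    # Each shard owns an equal contiguous slice of the 128-bit key space.
    return hash_key // (SPACE // NUM_SHARDS)

# Records with the same partition key always land on the same shard.
assert shard_for("user-42") == shard_for("user-42")
print({k: shard_for(k) for k in ["user-1", "user-2", "clickstream"]})
```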

Sunday, August 24, 2014

Google Cloud Dataflow

Google Cloud Dataflow is designed so the user can focus on devising proper analysis, without worrying about setting up and maintaining the underlying data piping and processing infrastructure.

It could be used for live sentiment analysis, for instance, where an organization estimates the popular sentiment around a product by scanning social networks such as Twitter. It could also be used as a security tool to watch activity logs for unusual activity, or as an alternative to commercial ETL (extract, transform and load) programs, widely used to prepare data for analysis by business intelligence software.

MapReduce's limitation is that it can only analyze data in batch mode, which means all the data must be collected before it can be analyzed. A number of new software programs have been developed to get around the limitation of batch processing, such as Twitter Storm and Apache Spark, which are both available as open source and can run on Hadoop.

Google's own approach to live data analysis uses a number of technologies built by the company, notably Flume and MillWheel. Flume aggregates large amounts of data and MillWheel provides a platform for low-latency data processing.

The service provides a software development kit that can be used to build complex pipelines and analysis. Like MapReduce, Cloud Dataflow will initially use the Java programming language. In the future, other languages may be supported.

The pipelines can ingest data from external sources and use them for a variety of things. The service provides a library to prepare and reformat data for further analysis, and users can write their own transformations.

The treated dataset can then be queried using Google's BigQuery service. Or the user can write modules to examine the data as it crosses the wire, looking for aberrant behavior or trends in real time.
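The pipeline idea described above, ingest then transform then hand off to a sink, can be sketched generically; this is not the Cloud Dataflow SDK (which at launch was Java-only), just the shape of such a pipeline with hypothetical stage names:

```python
# Generic pipeline sketch: compose small transforms over a stream of records.
def ingest(lines):
    # Normalize raw input as it enters the pipeline.
    for line in lines:
        yield line.strip().lower()

def transform(records):
    # A user-written transformation: keep only records mentioning "error".
    for r in records:
        if "error" in r:
            yield r

def run_pipeline(source):
    # Lazily chain the stages; nothing runs until the sink consumes them.
    return list(transform(ingest(source)))

log = ["INFO boot ok\n", "ERROR disk full\n", "info heartbeat\n"]
print(run_pipeline(log))  # ['error disk full']
```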

Wednesday, August 13, 2014

Hadoop in Teradata

Teradata has bought the assets of Revelytix and Hadapt in a bid to grow out its capabilities for the Hadoop big-data processing framework.

Revelytix developed Loom, a metadata management system compatible with a number of Hadoop distributions, including those from Cloudera, Hortonworks, MapR, Pivotal, IBM and Apache, according to its website. Loom is geared at helping data scientists prepare information in Hadoop for analysis more quickly.

Hadapt is known for its software that integrates the SQL programming language with Hadoop. SQL is a common skill set among database administrators, who may not be familiar with Hadoop.

Last month, Teradata added data-prep, data-management, and data-analysis capabilities by buying these two notable independents in the big data arena.

Saturday, August 9, 2014

Google Mesa

Google has found a way to stretch a data warehouse across multiple data centers, using an architecture its engineers developed that could pave the way for much larger, more reliable and more responsive cloud-based analysis systems. Google's latest big-data tool is named Mesa, and it aims for speed.

For Google, Mesa solved a number of operational issues that traditional enterprise data warehouses and other data analysis systems could not. Google also needed strong consistency for its queries, meaning a query should produce the same result from the same source each time, no matter which data center fields the query.

Mesa relies on a number of other technologies developed by the company, including the Colossus distributed file system, the BigTable distributed data storage system and the MapReduce data analysis framework. To help with consistency, Google engineers deployed Paxos, a distributed synchronization protocol.

In addition to scalability and consistency, Mesa offers another advantage in that it can be run on generic servers, which eliminates the need for specialized, expensive hardware. As a result, Mesa can be run as a cloud service and easily scaled up or down to meet the job requirements.

Mesa is the latest in a series of novel data-processing applications and architectures that Google has developed to serve its business.

Thursday, July 24, 2014

Knox Gateway

This week, the Apache Knox community announced the release of the Apache Knox Gateway (Incubator) 0.3.0.

The Apache Knox Gateway is a REST API Gateway for Hadoop with a focus on enterprise security integration.  It provides a simple and extensible model for securing access to Hadoop core and ecosystem REST APIs.

Apache Knox provides pluggable authentication to LDAP and trusted identity providers as well as service level authorization and more.  The attached diagram below shows how Apache Knox fits in a Hadoop cluster deployment.

Highlights of the recent release:

  • LDAP authentication for REST calls to Hadoop
  • Secure Hadoop cluster (i.e. Kerberos) integration
  • HBase integration (non-Kerberos)
  • Simple ACL-based service-level authorization (non-Kerberos)
  • Hive JDBC integration
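As an illustration of the gateway model, clients call Knox-style URLs rather than the cluster directly; the host, port and topology name below are hypothetical defaults, not values from this release:

```python
# Sketch of a Knox-fronted WebHDFS URL: clients hit the gateway, not the cluster.
gateway = "https://knox-host:8443/gateway"
topology = "sandbox"   # hypothetical cluster topology name

def knox_url(service_path: str) -> str:
    # The gateway maps /gateway/{topology}/{service} onto internal cluster APIs,
    # applying authentication and authorization along the way.
    return f"{gateway}/{topology}/{service_path}"

print(knox_url("webhdfs/v1/tmp?op=LISTSTATUS"))
```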

Sunday, July 20, 2014

Hortonworks Security

In mid-May 2014, Hortonworks, the leading provider of enterprise Apache Hadoop, acquired XA Secure, a leading data security company, to accelerate its delivery of a holistic and centralized approach to Hadoop security.

As a result of this acquisition, Hortonworks published a security roadmap with earlier, current, and future states. The relevant security capabilities are listed below:

Earlier State
  • Kerberos Authentication
  • HBase, Hive & HDFS authorization
  • Wire Encryption for HDFS, Shuffle & JDBC
  • Basic audit in HDFS & MR
  • ACLs for HDFS
  • Knox: Hadoop REST API Security
  • SQL-style Hive Authorization
  • Expanded Wire Encryption for HiveServer2 & WebHDFS

Current State
  • Centralized Security Administration for HDFS, HBase & Hive
  • Centralized Audit Reporting
  • Delegated Policy Administration

Future State
  • Encryption in HDFS, Hive & HBase
  • Centralized security administration for all Hadoop components
  • Expand audit to cover more operations and provide audit correlation
  • Offer additional SSO integration choices
  • Tag-based global policies

Thursday, July 10, 2014

DataStax Spark

Apache Spark is a project designed to accelerate Hadoop and other big data applications through the use of an in-memory, clustered data engine. It is a paradigm shift from the disk-based MapReduce process.

Spark is a fast and powerful engine for processing Hadoop data. It runs in Hadoop clusters through Hadoop YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

Speed is key!  Leveraging an efficient in-memory storage format, an optimistic execution engine and a cache-conscious memory layout, Apache Spark can perform up to 100x faster than Hadoop. Here is the performance metric, based on a word count program running on both platforms.
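For reference, the word count benchmark mentioned above has a simple map/reduce shape; here it is in plain Python (not PySpark), just to show the two phases:

```python
# Plain-Python shape of the classic word count: map phase, then reduce phase.
from collections import Counter

text = "to be or not to be"

# Map phase: emit (word, 1) pairs.
pairs = [(word, 1) for word in text.split()]

# Reduce phase: sum counts per key. Spark's speedup comes from keeping
# intermediate results like these in memory instead of spilling to disk.
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["to"], counts["be"])  # 2 2
```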

DataStax Apache Spark support means certified Spark software now ships with DSE 4.5, and it is supported by DataStax. DSE 4.5 (released on 3rd July) provides high-availability features for Spark that ensure resilience and fail-over.

Saturday, July 5, 2014


Dear Friends/Followers,

Social media is the social interaction among people in which they create, share or exchange information and ideas in virtual communities and networks.

Travel not for the destination, but for the joy of the journey.

In my last three years of this social media (CodeProject, Blogger) journey, I reached the benchmark of 50K+ points and 45K hits with all your support and guidance.


I appreciate your time and energy in feeding the consistency within me. I will continue to (l)earn the industry trend/reputation.

Sunday, June 29, 2014

Cloudera Gazzang

Cloudera, a leader in enterprise analytic data management powered by Apache Hadoop, early this month announced that it has acquired Gazzang, the big data security experts, to dramatically strengthen its security offerings, building on the road-map laid out last year when Cloudera first delivered Sentry.

Gazzang helps organizations protect sensitive information in Cloudera Enterprise and CDH by transparently encrypting data at rest and providing advanced key management that ensures only authorized processes can access the data. The high-performance, software-only (no appliance) architecture can secure any application or database without requiring any changes to the client environment. The encryption keys remain safe and in full compliance with HIPAA, PCI-DSS, FERPA and other data security regulations.

Last month, Cloudera's competitor Hortonworks acquired XA Secure to offer it role-based authorization, auditing, and governance.

Cloudera is continuing to invest broadly in the open source community to support and accelerate security features in Project Rhino, an open source effort founded by Intel in early 2013. Project Rhino is a broad-based open source security architecture addressing many of the major pillars of enterprise security, including perimeter security, entitlements and access control, and data protection.

Cloudera combines Apache Sentry and Intel's Project Rhino with Gazzang's encryption and key management to build the industry's end-to-end security offering for Hadoop environments.

Sunday, June 22, 2014

Big Data Governance

Dataguise's Data Governance suite enables enterprises to declare policies, discover sensitive data, redact required terms and automate it all.

Dataguise supports a range of platforms that include Oracle, IBM DB2, SQL Server, Teradata, Cloudera, Hortonworks, MapR and Pivotal HD.  The suite of functions works with DgSecure, Dataguise's flagship platform for data privacy, protection and security for sensitive data across the enterprise.

Dataguise for Data Governance features include:
  • Policy Quickstart: Select pre-defined policies for PCI, PII and HIPAA; or click-to-create custom policies with no coding or scripting;
  • Sensitive Data Discovery: Automatically find and track sensitive data in the enterprise, whether at rest or in motion, in structured or unstructured format, across heterogeneous data platforms;
  • Entitlements: View and track entitlements down to the user and data element level;
  • Auditing: View automated reports and dashboards to track who accessed what sensitive data.

Dataguise, founded in 2007, has most of its customers in financial services, health care, retail, and government environments. Its products are designed to reduce the risk of data breaches and to remain compliant with regulations.

Saturday, June 14, 2014

Deloitte BigData Guidebook

Deloitte recently completed a new study designed to improve Big Data’s strategic role in the Corner Office. It provides a good lens toward better strategic and holistic use of the work being done by the data scientists.

The result of the study is available online. This new CXO “guidebook” is designed to help leadership get their arms around Big Data.

The guidebook is valuable in conveying the findings, methodology and the areas where analytics can foster better knowledge-sharing among corporate functions. The firm calls these the “Analytics Connections.”

Saturday, May 31, 2014

Cassandra Partitioner

Last weekend, I had an interesting learning during a Big Data Cassandra app production release. The major shift was to change from RandomPartitioner to Murmur3Partitioner. Let me ink about it.

Basically, the Cassandra partitioner determines how data is distributed across the nodes in the cluster, including replicas. A partitioner is a hash function for computing the token/hash of a row key. Each row of data is uniquely identified by a row key and distributed across the cluster by the value of the token.

Both the Murmur3Partitioner and RandomPartitioner use tokens to help assign equal portions of data to each node and evenly distribute data from all the tables throughout the ring or other grouping, such as a keyspace. This is true even if the tables use different row keys, such as usernames or timestamps.

Two key differences in implementation are:
  1. Murmur3Partitioner uniformly distributes data across the cluster based on MurmurHash hash values, whereas RandomPartitioner uses MD5 hash values.
  2. When setting the partitioner in the cassandra.yaml file, Murmur3Partitioner is specified as org.apache.cassandra.dht.Murmur3Partitioner, whereas RandomPartitioner is org.apache.cassandra.dht.RandomPartitioner.
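To make difference 1 concrete, here is a sketch of RandomPartitioner-style token assignment using MD5; Murmur3Partitioner works the same way conceptually, but with MurmurHash values over a different token range. The four-node ring and its token boundaries are made up for illustration:

```python
# Sketch of RandomPartitioner-style tokens: MD5(row key) -> token -> node.
import bisect
import hashlib

def token(row_key: str) -> int:
    # RandomPartitioner derives a token from the MD5 hash of the row key.
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16)

# Hypothetical four-node ring: each node owns tokens up to its position.
ring_max = 2 ** 127
node_tokens = sorted((i + 1) * ring_max // 4 for i in range(4))
nodes = ["node1", "node2", "node3", "node4"]

def node_for(row_key: str) -> str:
    t = token(row_key) % ring_max
    # First node whose ring position is >= the token owns the row.
    i = bisect.bisect_left(node_tokens, t) % len(nodes)
    return nodes[i]

print(node_for("alice"), node_for("bob"))
assert node_for("alice") == node_for("alice")  # same key, same node
```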

Sunday, May 18, 2014

Ambient Intelligence

Microsoft announced a new hosted service, Azure ISS (Intelligent Systems Service), which promises to ease the process of managing machine data from sensors and devices connected in the so-called Internet of Things. ISS is now available as a limited public preview.

Microsoft has also released APS (Analytics Platform System), an update and expansion of what was formerly called the Parallel Data Warehouse. APS can combine query results from relational data in SQL Server databases and non-relational data captured by Hadoop.

In addition, the company has launched SQL Server 2014, the first edition of Microsoft's relational database system that includes the ability to store entire databases in the working memory of a server, which allows for faster access of the data.

All these products can help an organization make better use of its "ambient intelligence," Nadella said at a customer event in San Francisco.

Nadella defined ambient intelligence as the data that is generated by both a growing number of machines, such as sensors, as well as by people who capture experiences with their digital devices.

"You have this enormous capacity to reason over all of this digitized information," Nadella said.

In a report commissioned by Microsoft, IDC estimated that organizations could generate $1.6 trillion in additional revenue and cost savings over the next four years by better understanding their data.

The new Microsoft tools are aimed at bringing big-data-styled analysis to the enterprise.

Sunday, May 11, 2014

TIBCO Jaspersoft

TIBCO made a shrewd move when it acquired Spotfire. Now the company is hoping it can catch lightning in a bottle a second time with the $185 million acquisition of Jaspersoft at the end of April 2014.

Jaspersoft provides commercial subscription support for open source data-integration, business intelligence, and analytics software it develops and upgrades with help from a large community of more than 400,000 registered users. The company is best known for its low-cost open source ETL and reporting software, as well as for embedded BI software used by partners such as Nike, FedEx, and McGraw Hill to deliver more than 140,000 analytics-infused applications.

When TIBCO acquired Spotfire for $195 million in 2007, it was in the midst of the great consolidation of the BI market in which IBM acquired Cognos, Oracle acquired Siebel and Hyperion, and SAP acquired BusinessObjects. These were the leading business intelligence products of their day (along with MicroStrategy and InformationBuilders). But since that time, Spotfire has emerged as the third name in a hot trio that also includes Tableau Software and QlikTech.

These data-discovery products have seen the fastest growth in the category ever since, while the likes of IBM Cognos, Oracle OBIEE, SAP BusinessObjects, and MicroStrategy have seen flat to slow sales growth. But reporting, ad-hoc query tools, and embeddable BI software remain necessary. Here's where TIBCO is betting that Jaspersoft will complement Spotfire and give it a complete portfolio.

Pairing open source Jaspersoft with top-selling Spotfire, TIBCO hopes to disrupt the business intelligence and embedded analytics market.

Thursday, May 8, 2014

Big Data Top 5

We’re on the cusp of a real turning point for big data. Its applications are becoming clearer, its tools are getting easier and its architectures are maturing in a hurry. It’s no longer just about log files, clickstreams and tweets. It’s not just about Hadoop and what’s possible (or not) with MapReduce.
With each passing day, big data is becoming more about creativity — if someone can think of an application, they can probably build it. That makes the concept of big data a lot more tangible and a lot more useful to a lot more companies, and it makes the market for big data a lot more lucrative.
Here are five technologies helping spur a shift in thinking from “Why would I want to use some technology that Yahoo built? And how?” to “We have a problem that needs solving. Let's find the right tool to solve it.” They are Shark, Spark, MLlib, GraphX, and SparkR.

Saturday, April 19, 2014

Mainframe Big Data

Mainframes are reliable and highly automated, often running for years with virtually no human intervention. Mainframe analyst Josh Krischer tells a story about an Eastern European airline that ran its core systems on an IBM mainframe for five years without ever touching the machine after its mainframe IT guy retired.

However, that data often stays locked in the mainframe, because sorting and transforming it to perform complex data analytics has been expensive and robs the core applications of CPU cycles. Still, around 85% of corporate systems are running on mainframes.

The good news is that Big Data technologies are making it easier and less costly to export that data. One option is to use JCL batch workloads to move the data to Hadoop, where it can be processed, combined with other appropriate data and, for instance, moved to a NoSQL database to support forward-looking analysis for business decisions. The challenge is the lack of well-established native connectivity between mainframes and Hadoop.

Saturday, April 12, 2014

BigData Tech Gain

Recent funding in Hadoop vendors underscores how venture capitalists see big bucks in managing Big Data. Last month, Hadoop providers Cloudera Inc., Hortonworks Inc. and Platfora Inc. received a collective $1 billion from investors convinced that they are onto something big.

Hadoop is a storage and processing system that ingests large amounts of data from servers and breaks it into manageable chunks. Programmers structure the data, move it into a relational database, and study it with an analytical application. Companies supplement their relational databases with Hadoop because it organizes and processes data faster and more cheaply, running on a series of commodity servers. Hadoop is also better at processing text, photos, and images than relational databases, which store data in tables and rows. And Hadoop’s architecture allows developers to collect information and figure out what to do with it later; relational systems require developers to carefully design and store data with a schema planned in advance.
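The map-shuffle-reduce flow that Hadoop distributes across commodity servers can be illustrated with a toy, single-process word count in plain Python (a sketch of the programming model, not Hadoop's actual API):

```python
from collections import defaultdict

# Toy single-process model of Hadoop's map -> shuffle -> reduce flow.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)            # emit one key/value pair per word

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:           # group values by key, as Hadoop
        groups[key].append(value)      # does between map and reduce
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big tools", "big wins"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 3, 'data': 1, 'tools': 1, 'wins': 1}
```

Hadoop's value is that it runs the same three phases across thousands of machines, handling partitioning, data movement, and machine failures for you.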

Stay tuned to the emerging technology trend-Big Data.

Saturday, April 5, 2014

LucidWorks Solr

Solr is based on Apache's own Lucene project and adds many options not found in the original that ought to appeal to those building next-generation data-driven apps -- for example, support for geospatial search.

The end-user advantages of Solr lie in how it makes a broader variety of Hadoop searches possible for both less technical and more technical users. Queries can be constructed in a natural-language fashion or through more precise key/value pairs. Another implication of Solr being able to return search results across a Hadoop cluster is that more data can be kept in Hadoop and not pre-transformed for the sake of analytics. This means not having to anticipate the questions to ask before you load the data.
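Both query styles ultimately reduce to Solr's `q` parameter on an HTTP select endpoint. A minimal sketch of constructing the two kinds of queries, assuming a hypothetical Solr collection named `logs` on the default port (the request itself is not sent here):

```python
from urllib.parse import urlencode

# Hypothetical Solr collection; the default Solr port is 8983.
SOLR = "http://localhost:8983/solr/logs/select"

def solr_query_url(q: str, rows: int = 10) -> str:
    """Build a Solr select URL for a given query string."""
    params = {"q": q, "rows": rows, "wt": "json"}
    return SOLR + "?" + urlencode(params)

# Natural-language style: free text searched against the default field.
print(solr_query_url("failed login attempts"))

# Precise key/value style: field:value pairs with boolean operators.
print(solr_query_url("level:ERROR AND host:web01"))
```

Either URL can then be fetched with any HTTP client; Solr returns JSON results because of the `wt=json` parameter.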

Solr will be rolled into HDP via a multistep process. The first phase involves making Solr available for customers on June 1 within a sandbox. After that, Solr will be integrated directly into the next release of HDP, although no release schedule for that has been announced yet. Later on, Hortonworks plans to do some work on hooking up Solr to Ambari, the management and monitoring component for Hadoop, for easier control of indexing speeds and alerting, among other aspects.

LucidWorks has also produced a version of Solr that's meant to join the ever-growing parade of open source or lower-priced products designed to steal some of Splunk's log-search thunder. 

Saturday, March 22, 2014

Apache Shark

Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.

Shark uses the powerful Apache Spark engine to speed up computations. It runs Hive queries up to 100x faster in memory, or 10x faster on disk.

Shark reuses the Hive frontend and metastore, giving you full compatibility with existing Hive data, queries, and UDFs. You simply install it alongside Hive. By running on Spark, Shark can call complex analytics functions such as machine learning right from SQL.

Unlike other interactive SQL engines, Shark supports mid-query fault tolerance, letting it scale to large jobs. In terms of scalability, Shark uses the same engine for both short and long queries.


Saturday, March 8, 2014

Apache Spark

Apache Spark is an in-memory data-processing framework that is increasingly replacing MapReduce in next-generation big data applications. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.

It is an open-source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms. Spark can run on Hadoop 2's YARN cluster manager and can read any existing Hadoop data.
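The caching primitive is the heart of that speedup: load a dataset into memory once, then run many queries against it without touching disk again. A toy model of the idea (deliberately not the real PySpark API, just an illustration of cache-and-requery):

```python
# Toy model of Spark's core idea: cache a dataset in memory once,
# then run repeated queries without re-reading from storage.
class MiniRDD:
    def __init__(self, load_fn):
        self._load = load_fn
        self._cache = None
        self.loads = 0                  # counts how often we hit "disk"

    def _data(self):
        if self._cache is not None:     # cached: serve from memory
            return self._cache
        self.loads += 1
        return self._load()

    def cache(self):
        self._cache = self._load()      # materialize once, in memory
        self.loads += 1
        return self

    def filter(self, pred):
        return [x for x in self._data() if pred(x)]

rdd = MiniRDD(lambda: list(range(10))).cache()
evens = rdd.filter(lambda x: x % 2 == 0)    # query 1
big = rdd.filter(lambda x: x > 7)           # query 2
print(evens, big, rdd.loads)  # [0, 2, 4, 6, 8] [8, 9] 1
```

Iterative workloads such as machine learning benefit most, since each pass over the data reuses the in-memory copy instead of paying the storage-read cost again, as a MapReduce job would.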

Apache Spark™ is a fast and general engine for large-scale data processing. The Apache Software Foundation recently announced that Spark has graduated from the Apache Incubator to become a top-level Apache project, signifying that the project’s community and products have been well governed under the ASF’s meritocratic process and principles. This is a major step for the community as Spark completes its move to Apache.

Saturday, March 1, 2014

Qlik View

A recent CITO Research report explains exactly what big data is, why it matters to you, and how to put it to work for your business. You’ll also see how big data is being used to win elections, reduce crime, and literally change the world. Make sure your business isn’t left behind.

With QlikView, users can run reports and create dashboards quickly to detect market changes and product sales in real time. This allows salespeople to immediately respond to new opportunities and improve business performance.

Big Data’s value can be unleashed for business users by condensing it and intelligently presenting only what is relevant and contextual to the problem at hand, whether it's an executive wanting summary data across the company’s product lines or a manager wanting more detail, but only for the areas that he or she oversees. IT professionals are challenged not only to provide the infrastructure but also to help give meaning to the Big Data.

Friday, February 14, 2014

Intel Hadoop

Intel is competing with the likes of Hortonworks, Cloudera, and others in the commercial Hadoop market. The rise of such vendors underscores the fact that Hadoop is really at a crossroads. Linux only took off once companies began investing in hardening its features and also coalesced around keeping it open.

Customers who choose Intel's Hadoop distribution over others can benefit from what Intel describes as significant performance improvements thanks to optimizations for Intel's Xeon processors, solid state storage and networking.

Intel also offers speedier data encryption and decryption within Hadoop thanks to its AES-NI technology. Along with security and reliability upgrades, the company says its Data Platform features capabilities for streaming data processing, iterative analytics, and graph processing.


Sunday, February 9, 2014


Qubole is a Big Data as a Service (BDaaS) platform running on leading cloud offerings like AWS.

Qubole Data Service (QDS) has more than a dozen data connectors for importing and exporting data to and from the platform. The core is Qubole’s managed Hadoop platform, including Hive, Sqoop, Pig, and Oozie, plus an SDK for building applications in Python.

It recently added Presto, an open-source project created by Facebook that supports SQL-style analytics on Big Data; Facebook says Presto is orders of magnitude faster than Hive, returning query results in milliseconds. Hadoop clusters include auto-scaling, and MapReduce jobs and SQL-style queries can be created in an interactive GUI.

The benefits of a service like QDS are the elimination of capital expenditures on hardware and of the challenges of hiring and retaining scarce Big Data practitioners.

Saturday, February 1, 2014

VMware vCenter Log Insight

Virtualization administrators have already been forced to branch out from their original comfort zone of server administration into storage, network, and even desktop administration. The next technological leap for these individuals will come in the form of big data and analytics, and, as in other areas of the server virtualization world, companies like VMware are building an app for that in order to make the transition easier.

VMware vCenter Log Insight debuted back in June 2013, as a result of VMware's August 2012 acquisition of Pattern Insight. Common use cases for the product include security and compliance auditing, as well as monitoring and troubleshooting vSphere and other servers, storage, and networking devices.

Earlier in the month of Jan 2014, VMware released vCenter Log Insight 1.5. The bulk of the work done on this release was to make it a more enterprise-ready product. As an example, VMware added authentication support for Microsoft's Active Directory for easier integration into an enterprise environment. This eliminates the need for multiple logins/passwords and allows for seamless integration into an organization's pre-existing identity management architecture.

Saturday, January 25, 2014


Today, a small shift in my weekly tech blog. Yes, it's about my college's golden jubilee and my department's silver jubilee.

Last week, I was honored to be part of the department alumni meet. It was one of the key days in my life: spending time with my old UG friends and the teachers who groomed me 25 years back. We ran a few interactive and interesting sessions with lots of fun and knowledge sharing.

Payback is a keyword in our life. As God transitioned me from matchbox child labor to Wall Street, it is time to pay back the ladders in my life: motivating friends, molding teachers, and a sacrificing institute. Without their commitment and dedication, I would be nothing; they come right after my parents.

During this event, a short film was showcased to motivate the students. Enjoy it at

Saturday, January 4, 2014


Dear readers, Happy New Year 2014. Let us look back at 2013, the year Big Data work began widely across the industry.

The biggest accomplishment of the Apache Hadoop Community, as a whole, is the delivery and acceptance of Hadoop 2.0. It takes Hadoop beyond a single-use data platform for batch processing to a multi-use platform that enables batch, interactive, online and stream processing.

Hortonworks was first out of the gate with its HDP 2 platform that leverages Hadoop 2, and there’s no doubt that the company’s growing list of partners helped build that momentum.

Cloudera announced Cloudera Enterprise 5, which is fundamental to its newly announced “Data Hub” strategy. The market had strong reactions to that.

Microsoft made a huge Hadoop-related proclamation, announcing not only that the company now had its own Hadoop distribution, but also its plan to deliver Big Data to 1 billion users.

And MapR kept marching on its mission to make its Hadoop distro faster, safer and more secure for its customers.