Saturday, August 30, 2014

Amazon Kinesis


Last week, I wrote about Google's Cloud Dataflow design. Amazon's Kinesis appears to be the competitor to Google's Cloud Dataflow. Kinesis is a managed service for real-time data streaming, developed by Amazon Web Services.

Kinesis allows you to write applications that process data in real time, and it works in conjunction with other AWS products such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Amazon Redshift.

Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. It can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real time from sources such as website click-streams, marketing and financial information, manufacturing instrumentation, social media, operational logs, and metering data.
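As a rough sketch of what feeding a stream looks like, here is a minimal producer using the AWS SDK for Java. The stream name "clickstream" and the sample record are assumptions for illustration, and the stream is assumed to already exist.

    import java.nio.ByteBuffer;
    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.PutRecordRequest;

    public class ClickstreamProducer {
        public static void main(String[] args) {
            // Credentials come from the default AWS provider chain.
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            PutRecordRequest request = new PutRecordRequest();
            request.setStreamName("clickstream");   // assumed, pre-existing stream
            request.setPartitionKey("user-42");     // determines the target shard
            request.setData(ByteBuffer.wrap("page=/home&ts=1409400000".getBytes()));

            kinesis.putRecord(request);
        }
    }

Records with the same partition key land on the same shard, which preserves their ordering; overall throughput is scaled by adding shards to the stream.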

Three key observed use cases are:
1. Application and Operational Logs
2. Real-Time Clickstream Analytics
3. Machine learning-based recommendation and ranking

The attached diagram represents the second use case.
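Continuing the clickstream example, a consumer sketch using the low-level GetRecords API might look like the following. The stream name and shard id are assumptions, and a production application would more likely use the Kinesis Client Library to manage shards and checkpoints.

    import com.amazonaws.services.kinesis.AmazonKinesisClient;
    import com.amazonaws.services.kinesis.model.GetRecordsRequest;
    import com.amazonaws.services.kinesis.model.GetRecordsResult;
    import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
    import com.amazonaws.services.kinesis.model.Record;

    public class ClickstreamReader {
        public static void main(String[] args) throws InterruptedException {
            AmazonKinesisClient kinesis = new AmazonKinesisClient();

            GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest()
                    .withStreamName("clickstream")             // assumed stream
                    .withShardId("shardId-000000000000")       // assumed shard
                    .withShardIteratorType("TRIM_HORIZON");    // start at the oldest record

            String iterator = kinesis.getShardIterator(iteratorRequest).getShardIterator();

            while (iterator != null) {
                GetRecordsResult result = kinesis.getRecords(
                        new GetRecordsRequest().withShardIterator(iterator).withLimit(100));
                for (Record record : result.getRecords()) {
                    System.out.println(new String(record.getData().array()));
                }
                iterator = result.getNextShardIterator();
                Thread.sleep(1000);   // stay under the per-shard read rate
            }
        }
    }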

Sunday, August 24, 2014

Google Cloud Dataflow

Google Cloud Dataflow is designed so the user can focus on devising the proper analysis, without worrying about setting up and maintaining the underlying data-piping and processing infrastructure.

It could be used for live sentiment analysis, for instance, where an organization estimates the popular sentiment around a product by scanning social networks such as Twitter. It could also be used as a security tool that watches activity logs for unusual activity, or as an alternative to commercial ETL (extract, transform and load) programs, which are widely used to prepare data for analysis by business intelligence software.

MapReduce's limitation is that it can only analyze data in batch mode, which means all the data must be collected before it can be analyzed. A number of new software programs have been developed to get around the limitation of batch processing, such as Twitter Storm and Apache Spark, which are both available as open source and can run on Hadoop.

Google's own approach to live data analysis uses a number of technologies built by the company, notably Flume and MillWheel. Flume aggregates large amounts of data and MillWheel provides a platform for low-latency data processing.

The service provides a software development kit that can be used to build complex pipelines and analysis. Like MapReduce, Cloud Dataflow will initially use the Java programming language. In the future, other languages may be supported.

The pipelines can ingest data from external sources and transform it in a variety of ways. The service provides a library to prepare and reformat data for further analysis, and users can write their own transformations.
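To make the pipeline model concrete, here is a minimal sketch written against the early Cloud Dataflow Java SDK; the bucket paths and the normalization step are hypothetical, and the exact SDK surface may differ.

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;

    public class CleanLogs {
        public static void main(String[] args) {
            Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

            p.apply(TextIO.Read.from("gs://my-bucket/logs/*"))       // hypothetical input
             .apply(ParDo.of(new DoFn<String, String>() {
                 @Override
                 public void processElement(ProcessContext c) {
                     // A user-written transformation: normalize each log line.
                     c.output(c.element().trim().toLowerCase());
                 }
             }))
             .apply(TextIO.Write.to("gs://my-bucket/cleaned/out"));  // hypothetical output

            p.run();
        }
    }

The same pipeline shape is meant to cover both batch and streaming runs; only the source, sink, and runner configuration change.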

The resulting dataset can be queried using Google's BigQuery service, or the user can write modules that examine the data as it crosses the wire, looking for aberrant behavior or trends in real time.

Wednesday, August 13, 2014

Hadoop in Teradata


Teradata has bought the assets of Revelytix and Hadapt in a bid to build out its capabilities for the Hadoop big-data processing framework.

Revelytix developed Loom, a metadata management system compatible with a number of Hadoop distributions, including those from Cloudera, Hortonworks, MapR, Pivotal, IBM and Apache, according to its website. Loom is geared toward helping data scientists prepare information in Hadoop for analysis more quickly.

Hadapt is known for its software that integrates SQL with Hadoop. SQL is a common skill set among database administrators, who may not be familiar with Hadoop.

Last month, Teradata added data-prep, data-management, and data-analysis capabilities by buying these two notable independents in the big-data arena.

Saturday, August 9, 2014

Google Mesa


Google has found a way to stretch a data warehouse across multiple data centers, using an architecture its engineers developed that could pave the way for much larger, more reliable and more responsive cloud-based analysis systems. Google's latest big-data tool, named Mesa, aims for speed.

For Google, Mesa solved a number of operational issues that traditional enterprise data warehouses and other data analysis systems could not. Google also needed strong consistency for its queries, meaning a query should produce the same result from the same source each time, no matter which data center fields the query.
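Mesa itself is internal to Google, but the versioning idea behind that consistency guarantee can be illustrated with a toy sketch: updates are applied as numbered batches, and a query pinned to a committed version number computes the same answer on any replica that has applied those batches. All names below are hypothetical, not Mesa's actual design.

    import java.util.Map;
    import java.util.TreeMap;

    // Toy illustration of multi-versioned updates, not Mesa's actual design.
    public class VersionedStore {
        // version -> (key -> delta), applied strictly in version order
        private final TreeMap<Long, Map<String, Long>> batches = new TreeMap<>();
        private volatile long committedVersion = 0;

        public synchronized void apply(long version, Map<String, Long> deltas) {
            batches.put(version, deltas);
            committedVersion = version;   // versions commit in order
        }

        public long committedVersion() { return committedVersion; }

        // Aggregates deltas up to the given version; replicas that have applied
        // the same committed versions return identical results for that version.
        public long query(String key, long version) {
            long total = 0;
            for (Map<String, Long> deltas : batches.headMap(version, true).values()) {
                total += deltas.getOrDefault(key, 0L);
            }
            return total;
        }
    }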

Mesa relies on a number of other technologies developed by the company, including the Colossus distributed file system, the BigTable distributed data storage system and the MapReduce data analysis framework. To help with consistency, Google engineers deployed an implementation of Paxos, a distributed consensus protocol.

In addition to scalability and consistency, Mesa offers another advantage in that it can be run on generic servers, which eliminates the need for specialized, expensive hardware. As a result, Mesa can be run as a cloud service and easily scaled up or down to meet job requirements.

Mesa is the latest in a series of novel data-processing applications and architectures that Google has developed to serve its business.