
Monday, May 17, 2021

Data Prediction


The Indian Institute of Technology (IIT) Kanpur built a COVID-19 prediction model for India's second wave.

Its scientists have been working on the SUTRA model, a mathematical model for charting the trajectory of COVID-19 and predicting the spread of the virus.

It is important to note that a mathematical model can only predict the future with some certainty so long as the virus dynamics and its transmissibility do not change substantially over time. Mathematical models can also provide a mechanism for predicting alternate scenarios corresponding to various policy decisions, such as non-pharmaceutical interventions.
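
To make the idea of such a model concrete, here is a minimal, loudly hedged sketch of a classic SIR compartmental model. It is only an illustration with assumed parameters, not the SUTRA model itself.

    # Illustration only: a classic SIR compartmental model, NOT the SUTRA model.
    # All parameter values are assumptions chosen for demonstration.
    import numpy as np
    from scipy.integrate import odeint

    def sir(y, t, beta, gamma):
        S, I, R = y
        N = S + I + R
        dS = -beta * S * I / N               # susceptibles becoming infected
        dI = beta * S * I / N - gamma * I    # infections grow, then recover at rate gamma
        dR = gamma * I
        return [dS, dI, dR]

    N = 1_000_000                            # assumed population
    y0 = [N - 100, 100, 0]                   # susceptible, infected, recovered at day 0
    t = np.linspace(0, 180, 181)             # simulate 180 days
    beta, gamma = 0.3, 0.1                   # assumed transmission / recovery rates

    S, I, R = odeint(sir, y0, t, args=(beta, gamma)).T
    print(f"Projected peak of {int(I.max())} active infections around day {int(t[I.argmax()])}")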

The SUTRA projections have been quite accurate and in sync with the data to date. Ref: https://www.sutra-india.in/

Sunday, January 24, 2021

AWS Personalize


Amazon Personalize enables developers to build applications with the same machine learning (ML) technology used by Amazon.com for real-time personalized recommendations – no ML expertise required.

It is a fully managed machine learning service that goes beyond rigid, static, rule-based recommendation systems; it trains, tunes, and deploys custom ML models to deliver highly customized recommendations to customers across industries such as retail, media, and entertainment.

Amazon Personalize provisions the necessary infrastructure and manages the entire ML pipeline, including processing the data, identifying features, using the best algorithms, and training, optimizing, and hosting the models. You will receive results via an Application Programming Interface (API) and only pay for what you use, with no minimum fees or upfront commitments.
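
As a rough sketch of what "results via an API" looks like, the snippet below calls a hypothetical Personalize campaign through boto3; the campaign ARN, user ID, and region are placeholders.

    import boto3

    # Runtime client used to fetch recommendations from a deployed campaign.
    personalize_rt = boto3.client("personalize-runtime", region_name="us-east-1")

    response = personalize_rt.get_recommendations(
        campaignArn="arn:aws:personalize:us-east-1:123456789012:campaign/demo-campaign",  # placeholder
        userId="user-123",   # the user to personalize for
        numResults=10,
    )

    for item in response["itemList"]:
        print(item["itemId"], item.get("score"))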

All data is encrypted to be private and secure, and is only used to create recommendations for your users.

Sunday, April 26, 2020

AI Prediction


I am always passionate about, and proud to be associated with, computers and their evolution.

In the realm of computing, a few emerging technologies (like Artificial Intelligence, Big Data prediction, and data visualization) are amazing. They are not only of academic interest but also beneficial to human life. Here is a recent and relevant use case for the entire world.

This week, the Singapore University of Technology and Design (SUTD) built a data-driven AI prediction and visualization portal estimating when COVID-19 will end.

Browse it at https://ddi.sutd.edu.sg/when-will-covid-19-end/

Saturday, December 8, 2018

DataStax Enterprise 6.7


This week, DSE 6.7 has been launched with multi-workload support for operational analytics, geospatial search, increased data protection in the cloud, better performance insights, Docker production support, and connectivity to Apache Kafka.

The top five improvements in DataStax Enterprise 6.7 include:
  1. Production-ready Kafka and Docker integration
  2. Easier, more scalable operational analytics for today’s cloud applications
  3. Simplified enterprise search for geospatial applications
  4. Improved data protection with smart cloud backup/restore support
  5. Improved performance diagnostics with new insights engine and third-party integration

DSE 6.7 and updated versions of OpsCenter, Studio, and DSE Drivers are available for download, as is updated documentation to guide installation and upgrading.


Saturday, November 10, 2018

Talend Stitch


Five years back, I came to know about Talend as an open source ETL product, alongside Pentaho. It was part of an earlier assignment to evaluate open source ETL products against Informatica. Since then, Talend has been expanding its business base with greater focus.

Yesterday, industry news broke that Talend is buying Stitch, a two-year-old Philadelphia-based spinoff of RJMetrics, for $60 million in cash.

Stitch offers a cloud-based, self-service product that automates data ingestion pipelines into the cloud. It is an emerging space where the closest competitors are Alooma and Fivetran, but where Confluent and StreamSets also play. In its two years, Stitch has already built a customer base exceeding 1,000 customers.

Talend has not yet announced a closing date for the deal.

Sunday, February 25, 2018

Redshift Spectrum


Redshift Spectrum lets you run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift, without loading the data or resizing the Amazon Redshift cluster as data volumes grow.

Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance.  It scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. It is possible to query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.
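
As a hedged sketch of that workflow, the snippet below registers an external schema backed by the data catalog and then queries Parquet files sitting in S3 through a normal Redshift connection; the cluster endpoint, IAM role, and table names are made up.

    import psycopg2  # connect to the Redshift cluster endpoint as usual

    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical endpoint
        port=5439, dbname="dev", user="awsuser", password="...",
    )
    conn.autocommit = True
    cur = conn.cursor()

    # Register an external (Spectrum) schema backed by the Glue/Athena data catalog.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrumdb'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # Query Parquet files in S3 as if they were a local table -- no loading step.
    cur.execute("SELECT event_date, count(*) FROM spectrum.clickstream GROUP BY event_date LIMIT 10;")
    for row in cur.fetchall():
        print(row)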

Top 3 performance features are:
  1. Short Query Acceleration - speed up execution of queries such as reports, dashboards, and interactive analysis
  2. Results Caching - deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools
  3. Late Materialization - reduce the amount of data scanned for queries with predicate filters by batching and factoring in the filtering of predicates before fetching data blocks in the next column
AWS Summit video at https://www.youtube.com/watch?v=gchd2sDhSuY

Monday, July 3, 2017

Amazon Athena

As illustrated in the architecture diagram above, any changes made to the items in DynamoDB will be captured and processed using DynamoDB Streams.

Next, a Lambda function will be invoked by a trigger that is configured to respond to events in DynamoDB Streams.

The Lambda function processes the data before pushing it to Amazon Kinesis Firehose, which outputs to Amazon S3.
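
A minimal sketch of such a Lambda function, assuming a Firehose delivery stream named my-delivery-stream; it forwards each record's new image from the DynamoDB Streams event to Kinesis Firehose.

    import json
    import boto3

    firehose = boto3.client("firehose")

    def handler(event, context):
        """Triggered by a DynamoDB Streams event source mapping."""
        records = []
        for rec in event.get("Records", []):
            new_image = rec.get("dynamodb", {}).get("NewImage")
            if new_image:
                # Firehose expects bytes; newline-delimit so Athena can read the S3 output.
                records.append({"Data": (json.dumps(new_image) + "\n").encode("utf-8")})
        if records:
            # put_record_batch accepts up to 500 records per call.
            firehose.put_record_batch(
                DeliveryStreamName="my-delivery-stream",  # hypothetical stream name
                Records=records,
            )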

Finally, you use Amazon Athena to analyze the streaming data landing in Amazon S3. The result can be explored and visualized in Amazon QuickSight for your company’s business analytics.

Friday, April 14, 2017

Hadoop 2 vs 3


After Doug Cutting, working with Yahoo, released Hadoop, the Big Data industry reached a turning point and adoption became easy and fast. Even after a decade, Hadoop remains widely adopted across the industry, with several major releases.

Here is an interesting learning material to compare Hadoop 2.x and 3.x by DataFlair.  Ref: http://data-flair.training/blogs/comparison-difference-between-hadoop-2-x-vs-hadoop-3-x/

Tuesday, March 7, 2017

Making Time


One famous quote is ringing back in my mind: “It is not about having time; it is about making time.”

College of Engineering, Guindy (CEG) is a top engineering institute in my state of Tamil Nadu, India. It is not only one of the brightest but also one of the oldest engineering colleges, founded in 1794.

CEG conducts a national-level technical symposium during the first quarter of every year. This year, it was scheduled between March 1 and March 3 at the CEG campus of Anna University, Chennai.

This well-renowned national-level event is a platform for the student and research communities to gain insight into contemporary technology. My strong belief is that industry and institutes must collaborate more closely on day-to-day business; it provides an opportunity to thrive in this fast-paced, disruptive digital world.

As every engineering department hosted sessions in its own areas, the Computer Science (CS) department ran sessions for IT geeks to take a plunge into IT innovation, creating opportunities around demanding industry needs. The CS department held a two-day workshop on IoT (Internet of Things), an emerging technology in the current industry.

Back to me. I have been crazy about premier institutes since my childhood days, and CEG is one of them. With a high degree of aspiration and passion, I became a CEG alumnus through my degree between 2008 and 2010. Today, time promoted me to be a chief guest at this great institute, with a high degree of blessings. Thanks, Almighty!

Being a hard-core IT product engineer, I have been blessed with multiple opportunities and challenges to work on emerging disruptive technologies like IoT and Big Data over the last five years, following my earlier .NET work. With a great amount of learning from my beloved colleagues and mentors, I drafted a practical, hands-on presentation entitled “Big Data weds IoT”. The CEG session was well received by the enthusiastic researchers, which was reflected in several sharp volleys of questions. Happy learning to me!

With this nostalgic feeling, it is time to appreciate the time, environment, and mentors behind my transition. Closing note: “Time is always with the people who have the courage to fly.”

Thursday, October 6, 2016

Digital India

As an ordinary citizen, I always wonder and worry about the clean functioning of my government. I have tons of questions about the benefits of digitization for common citizens.

Digital India
Digital India is an initiative of the Government of India to integrate the government departments and the people of India. The primary aim is to make government services available to citizens electronically, reducing paperwork. It also aims to connect rural areas with high-speed internet networks.

The project is slated for completion by 2019. The initiative is to offer public healthcare, education, and judicial services across all ministries, built on the nine pillars depicted above.

Use Case
Here is one of the best use cases: electric power utilization through Digital India.
The Union power ministry intends to provide energy-efficient lighting across the entire country by 2019 and to distribute 77 crore LED bulbs under the Unnat Jyoti by Affordable LEDs for All (UJALA) scheme. It targets an annual reduction in Indian consumers' electricity bills of over Rs 42,000 crore.

Technology Benefits
As a blessed tech geek, I was astonished by the below top-5 factors when browsing the Indian government power website:
  • Transparency
  • Technology driven
  • Data driven
  • Visibility
  • Real-time updates
Enjoy at http://www.ujala.gov.in/

Closing Note
To be honest, a lot of IT firms are struggling to adopt and leverage emerging technology. An awesome achievement by the Indian Government!

Monday, June 27, 2016

Zeppelin


Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results.

Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. Zeppelin uses the Spark settings on your cluster and can use Spark’s dynamic allocation of executors to let YARN estimate the optimal resource consumption.

To run the prediction analysis, we created notebooks that generate prediction percentages and are scheduled to run daily. As part of the prediction analysis, we needed to connect to multiple data sources, such as MySQL and Vertica, for data ingestion and error-rate generation. This enabled us to aggregate data across multiple dimensions, exposing underlying issues and anomalies at a glance.
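
A hedged sketch of what one such notebook paragraph might look like in PySpark; the JDBC URL, credentials, and table names are placeholders.

    # %pyspark  (a Zeppelin paragraph using the PySpark interpreter)
    # 'spark' and 'z' (the ZeppelinContext) are injected by Zeppelin.
    # Pull a slice of the MySQL source over JDBC, aggregate it with Spark SQL,
    # and let Zeppelin's built-in charting render the result.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/metrics")   # hypothetical source
          .option("dbtable", "daily_events")
          .option("user", "reporter")
          .option("password", "...")
          .load())

    df.createOrReplaceTempView("daily_events")

    z.show(spark.sql("""
        SELECT event_date, dimension, count(*) AS events
        FROM daily_events
        GROUP BY event_date, dimension
        ORDER BY event_date
    """))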

Using Zeppelin, we applied many A/B models by replaying our raw data in AWS S3 to generate different prediction reports, which in turn helped us move in the right direction and provide better forecasting.

Zeppelin helps us to turn the huge amounts of raw data, often from across different data stores, into consumable information with useful insights.

Slide share reference is available at http://www.slideshare.net/prajods/big-data-visualization-with-apache-spark-and-zeppelin

Monday, March 28, 2016

BI Capabilities

As the result of a weekend's reading, here is an interesting Gartner update on "Critical Capabilities for BI (Business Intelligence) and Analytics Platforms".

Antiquity
The very first use of what we now mostly call business intelligence was in 1951, with the Lyons Electronic Office, powered by over 6,000 vacuum tubes. BI is about “meeting business needs through actionable information”. The linear growth of BI is depicted in the attachment.

The BI and analytic platform market has undergone a fundamental shift. During the past 10 years, BI platform investments have mostly been in IT-led consolidation and standardization projects for large-scale system-of-record reporting.

Current Trend
As demand from business users for pervasive access to data discovery capabilities grows, IT wants to deliver on this requirement without sacrificing governance — in a managed or governed data discovery mode.

The business analytics of tomorrow is focused on the future (predictive) and tries to answer (prescriptive) the questions: what will happen, and how can we make it happen?

Predictive analytics encompasses a variety of techniques from statistics, data mining, and game theory that analyze current and historical facts to make predictions about future events.

Findings

  • BI has passed a tipping point as it shifts away from IT-centric, reporting-based platforms
  • Early entrants to the data discovery market may have strong capabilities in interactive visual data discovery
  • Vendors score higher on differentiation by offering emerging capabilities such as search, embedded analytics, collaboration, self-service data preparation, and big data.


Road Map
A few predictions define the BI and analytics road map:

  • By 2018, data discovery and data management evolution will drive most organizations to augment centralized analytic architectures with decentralized approaches.
  • By 2018, smart, governed, Hadoop-based, search-based and visual-based data discovery will converge into a single set of next-generation components.
  • By 2020, 80% of all enterprise reporting will be based on modern business intelligence and analytics platforms; the remaining 20% will still be on IT-centric, reporting-based platforms, because the risk of change outweighs the value.


Closure Note
As per the Gartner report, the BI market has shifted to more user-driven, agile development of visual, interactive dashboards with data from a broader range of sources. With my rich experience building a financial enterprise data hub, I can sense the breadth and depth of that "broader range of sources".

Thursday, March 24, 2016

Altiscale Insight Cloud


Hadoop-as-a-Service (HaaS) vendor Altiscale is moving up the stack with a new service called Altiscale Insight Cloud, which sits on top of its existing service, Altiscale Data Cloud. How does it work?

Ingest services consist of a user interface over jobs that run on Apache Oozie, and allow the definition of validation rules on the ingested data. Analysis functionality is provided by an OEM'd implementation of Alation, a product that acts as a data catalog. Underneath Alation, Altiscale has configured the Hadoop cluster so that Hive and Spark SQL point to exactly the same data files, and either technology can be used to satisfy queries.

Insight Cloud nicely finishes off the raw infrastructure of Altiscale Data Cloud with some basic functionality to make the combination of Hadoop and Spark more usable, but without reinventing the wheels that BI and Big Data analytics players have in-market already.

Altiscale says Insight Cloud is a Hadoop/Spark offering that is very BI tool-ready, so that users of Tableau, Excel or other common self-service tools can more readily attach to and analyze Big Data.

Pricing is consumption-driven, starting at $9,000/month for 20 TB of storage and 10,000 "task hours". Having Insight Cloud in-market makes Altiscale more competitive with fellow HaaS provider Qubole.

Saturday, February 6, 2016

NATS


With two decades of industry experience, I am nostalgically recollecting my initial days of procedural 'C' coding at Bell Labs, all the way up to today's highly distributed 'Spark' coding. To me, high performance at scale is critically important for anybody building highly distributed systems today.

Why is communication vital?
There are three buzzwords in today's architecture world:
  1. API (Application Programming Interface) or Micro Service model 
  2. Highly distributed Cloud Infra
  3. IoT (Internet of Things) network of all devices

Today, application components are built in a Service-Oriented Architecture (SOA), now often called an API style. As per cloud theory, pieces of any given service might be spread across physical or virtual infrastructure. IoT networks strive to comprise thousands or even millions of devices.

But to the end user, they need to operate seamlessly, as if they were one entity. This requires extremely fast (performance), lightweight (portability), always-on (availability) communication.

What is NATS?
Open Source NATS is an extremely lightweight and massively scalable Publish/Subscribe (PubSub) messaging system for Cloud Native applications, IoT device messaging, and more. Home Page: http://nats.io/

History of NATS
NATS was originally created by Derek Collison as the messaging layer inside Cloud Foundry, when he was designing that product. The original version of NATS was written in Ruby but was later ported to Go. The Go implementation of the NATS server is called gnatsd, and it immediately offered performance well in excess of Ruby-nats. A brief video is at: https://blogs.msdn.microsoft.com/dotnet/2016/02/01/on-net-1262016-nats-with-brian-flannery-and-colin-sullivan/

NATS Objective
NATS offers clustered mode, auto-pruning of the interest graph, and a text-based protocol. It is not intended as a traditional enterprise messaging system; you can think of it more as an ephemeral nervous system that is always on and always available.

By sticking to the core tenets of simplicity and speed, NATS - much like Go - provides an excellent foundation for delivering modern distributed systems at scale.

Core Principles
NATS supports 3 key messaging models as listed below:

1. PubSub
NATS implements a publish-subscribe (PubSub) messaging model as a fire-and-forget messaging system. This means that if a subscriber is not listening on the subject (no subject match), or is not active when the message is sent, the message is not received.

2. RequestReply
NATS supports two flavors of request-reply messaging: point-to-point and one-to-many. In a request-reply exchange, the request operation publishes a message with a reply subject and expects a response on that reply subject. You can also request and automatically wait for a response inline.

3. Queuing
Queue subscribers can be asynchronous, in which case the message handler callback function processes the delivered message. Synchronous queue subscribers must build in logic to process the message.
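
The sketch below exercises all three models with the nats-py client against a local server; the subject names and server URL are assumptions.

    # pip install nats-py; assumes a NATS server running on localhost.
    import asyncio
    import nats

    async def main():
        nc = await nats.connect("nats://127.0.0.1:4222")

        # 1. PubSub: fire-and-forget; only subscribers listening right now receive it.
        async def on_update(msg):
            print("update:", msg.data.decode())
        await nc.subscribe("orders.updates", cb=on_update)
        await nc.publish("orders.updates", b"order 42 shipped")

        # 2. RequestReply: publish with a reply subject and wait inline for one answer.
        async def on_ping(msg):
            await msg.respond(b"pong")
        await nc.subscribe("service.ping", cb=on_ping)
        reply = await nc.request("service.ping", b"ping", timeout=1)
        print("reply:", reply.data.decode())

        # 3. Queuing: subscribers in the same queue group share the work.
        async def worker(msg):
            print("worker got:", msg.data.decode())
        await nc.subscribe("jobs", queue="workers", cb=worker)
        await nc.publish("jobs", b"job-1")

        await nc.drain()

    asyncio.run(main())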

Benchmark
A performance comparison (between NATS, Kafka, ActiveMQ, Redis, NSQ, RabbitMQ, etc.) is depicted, with throughput per payload size, in the attached chart.

As the most performant cloud native messaging platform, NATS can send up to 6 million messages per second. Are you ready to leverage throughput benefit from NATS?

Monday, February 1, 2016

Hadoop 10th Year



Doug Cutting, the father of Hadoop, is a well-known person in the Big Data industry.
2016 marks the 10th Anniversary of Hadoop.

In his note, Doug mentioned that ten years ago, digital business was limited to a few sectors, like e-commerce and media. Since then, we have seen digital technology become essential to nearly every industry.

Every industry is becoming data driven, built around its information systems. Big data tools like Hadoop enable industries to best benefit from all the data they generate. Hadoop did not cause digital transformation, but it is a critical component of this larger story.

Enjoy the exclusive video by Doug at https://youtu.be/XHz_R33QnsI

Happy two-digit birthday, Hadoop!!!

Saturday, December 19, 2015

Apache Flink


Apache Flink is an open source platform for distributed stream and batch data processing.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Flink includes several APIs for creating applications that use the Flink engine:
  • DataSet API for static data embedded in Java, Scala, and Python,
  • DataStream API for unbounded streams embedded in Java and Scala, and
  • Table API with a SQL-like expression language embedded in Java and Scala.

Flink also bundles libraries for domain-specific use cases:
  • Machine Learning library, and
  • Gelly, a graph processing API and library.

You can integrate Flink easily with other well-known open source systems both for data input and output as well as deployment.
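
As a minimal sketch, assuming a recent Flink release with the Python API installed (pip install apache-flink), the snippet below uses the DataStream API for a word count over an in-memory collection.

    from pyflink.common.typeinfo import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # A bounded, in-memory source stands in for a real stream such as Kafka.
    lines = env.from_collection(
        ["to be or not to be", "that is the question"],
        type_info=Types.STRING(),
    )

    counts = (
        lines
        .flat_map(lambda line: [(w, 1) for w in line.split()],
                  output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
        .key_by(lambda pair: pair[0])                  # group by word
        .reduce(lambda a, b: (a[0], a[1] + b[1]))      # running count per word
    )

    counts.print()
    env.execute("word_count")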

Flink's data streaming runtime achieves high throughput rates and low latency with little configuration. The charts below show the performance of a distributed item counting task, requiring streaming data shuffles.

Flink programs can be written in Java or Scala and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment. Flink does not provide its own data storage system; input data must be stored in a distributed storage system like HDFS or HBase. For data stream processing, Flink consumes data from (reliable) message queues like Kafka.

Wednesday, November 25, 2015

Apache Phoenix


Apache Phoenix is an efficient SQL skin for Apache HBase that has created a lot of buzz. Many companies are successfully using this technology, including Salesforce.com, where Phoenix first started.

Internally, Phoenix takes your SQL query, compiles it into a series of native HBase API calls, and pushes as much work as possible onto the cluster for parallel execution. It automatically creates a metadata repository that provides typed access to data stored in HBase tables. Phoenix’s direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
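
A hedged sketch using the phoenixdb Python client against the Phoenix Query Server; the server URL and the table are assumptions.

    from datetime import datetime
    import phoenixdb  # pip install phoenixdb; talks to the Phoenix Query Server

    conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
    cur = conn.cursor()

    # Phoenix maps this DDL onto an HBase table plus typed metadata.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS web_stat (
            host VARCHAR NOT NULL,
            ts   TIMESTAMP NOT NULL,
            hits BIGINT
            CONSTRAINT pk PRIMARY KEY (host, ts)
        )
    """)

    # UPSERT is Phoenix's insert/update statement.
    cur.execute("UPSERT INTO web_stat VALUES (?, ?, ?)", ("app-01", datetime.now(), 42))

    cur.execute("SELECT host, SUM(hits) FROM web_stat GROUP BY host")
    print(cur.fetchall())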

Despite these helpful features, Phoenix is not a drop-in RDBMS replacement. There are some limitations:

  1. Phoenix doesn’t support cross-row transactions yet.
  2. Its query optimizer and join mechanisms are less sophisticated than those of most COTS DBMSs.
  3. As secondary indexes are implemented using a separate index table, they can get out of sync with the primary table (although perhaps only for very short periods.) These indexes are therefore not fully-ACID compliant.
  4. Multi-tenancy is constrained—internally, Phoenix uses a single HBase table.


Unlike Impala, however, Phoenix is intended to operate exclusively on HBase data; its design and implementation are heavily customized to leverage HBase features including coprocessors and skip scans.

Top-3 differences between Impala and Phoenix

  1. The main goal of Phoenix is to provide a high-performance relational database layer over HBase for low-latency applications. Impala’s primary focus is to enable interactive exploration of large data sets by providing high-performance, low-latency SQL queries on data stored in popular Hadoop file formats. Hive is mainly concerned with providing data warehouse infrastructure, especially for long-running batch-oriented tasks.
  2. Phoenix is a good choice, for example, in CRUD applications where you need the scalability of HBase along with the facility of SQL access. In contrast, Impala is a better option for strictly analytic workloads and Hive is well suited for batch-oriented tasks like ETL.
  3. Phoenix is comparatively lightweight since it doesn’t need an additional server.


The next Cloudera release, CDH 5.5, will package Apache Phoenix, ready to leverage. I'm waiting!!!

Tuesday, October 6, 2015

Airbnb Airpal


We know that Airbnb is a popular website for people to list, find, and rent lodging. It has over 1,500,000 listings in 34,000 cities and 190 countries.

Recently, Airbnb launched the Big Data tool Airpal, a web-based query execution tool that leverages Facebook's PrestoDB to make authoring queries and retrieving results simple for users.

Key features of Airpal:

  • optional access controls for users
  • ability to search and find tables
  • see metadata, partitions, schemas, and sample rows
  • write queries in an easy-to-read editor
  • submit queries through a web interface
  • track query progress
  • get the results back through the browser as a CSV
  • create new Hive table based on the results of a query
  • save queries once written
  • searchable history of all queries run within the tool

Requirements are:

  • Java 7 or higher
  • MySQL database
  • Presto 0.77 or higher
  • S3 bucket (to store CSVs)
  • Gradle 2.2 or higher


In keeping with the spirit of Presto, they have tried to make Airpal simple to install by providing a local storage option for people who would like to test it out without any overhead or cost.

For more detailed information, visit the GitHub page here: https://github.com/airbnb/airpal

Wednesday, September 30, 2015

Coursera Architecture


Coursera is a venture backed for-profit educational technology company that offers massive open online courses (MOOCs).

It works with top universities and organizations to make some of their courses available online, offering courses in physics, engineering, humanities, medicine, biology, social sciences, mathematics, business, computer science, digital marketing, data science, and other subjects. It is an online educational startup with over 14 million learners across the globe, offering more than 1,000 courses from over 120 top universities.

At Coursera, Amazon Redshift is used as the primary data warehouse because it provides a standard SQL interface and fast, reliable performance. AWS Data Pipeline is used to extract, transform, and load (ETL) data into the warehouse. Data Pipeline provides fault tolerance, scheduling, resource management, and an easy-to-extend API for ETL processing.

Dataduct is a Python-based framework built on top of Data Pipeline that lets users create custom reusable components and patterns to be shared across multiple pipelines. This boosts developer productivity and simplifies ETL management.

At Coursera, 150+ pipelines were executed to pull the data from 15 data sources such as Amazon RDS, Cassandra, log streams, and third-party APIs. 300+ tables are loaded every day into Amazon Redshift, processing several terabytes of data. Subsequent pipelines push data back into Cassandra to power our recommendations, search, and other data products.

The attached image below illustrates the data flow at Coursera.

Monday, September 28, 2015

ScyllaDB


ScyllaDB bills itself as the world's fastest NoSQL column-store database; it is written in C++. It is fully compatible with Apache Cassandra, at up to 10x the throughput and with jaw-droppingly low latency.

Scylla will work with existing Cassandra command line CQL clients. However, mixed clusters of Scylla and Cassandra nodes are not supported. A Scylla node cannot join a Cassandra cluster, and a Cassandra node cannot join a Scylla cluster.
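
Because Scylla speaks CQL, the standard Cassandra Python driver can be pointed at it unchanged; here is a minimal sketch (contact point, keyspace, and table are assumptions).

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])       # point at a Scylla node exactly as you would at Cassandra
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.set_keyspace("demo")

    session.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            sensor_id text, ts timestamp, value double,
            PRIMARY KEY (sensor_id, ts)
        )
    """)

    session.execute(
        "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
        ("sensor-1", 23.5),
    )

    for row in session.execute("SELECT sensor_id, value FROM readings LIMIT 5"):
        print(row.sensor_id, row.value)

    cluster.shutdown()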

To compare Scylla and Cassandra, throughput was evaluated on a single multi-core server with the following hardware specification:
  • 2x Xeon E5-2695v3: 2.3 GHz base, 35 MB cache, 14 cores (28 threads with Hyper-Threading)
  • 64 GB RAM
  • 2x 400 GB Intel NVMe P3700 SSD
  • Intel Ethernet CNA XL710-QDA1

On the software side, Scylla 0.8 and Cassandra 3.0 were used as the test bed.

In the attached image, average throughput for the test is presented as lines, latency as bars.

Scylla’s measured latency of less than 1 ms for the 99th percentile is significantly lower than Cassandra’s, while providing significantly higher throughput (the single client machine could not fully load the server).  The lack of garbage collection means that there are no externally imposed latency sources, so Scylla latency can be brought even lower.


Scylla’s order of magnitude improvement in performance opens a wide range of possibilities.  Instead of designing a complex data model to achieve adequate performance, use a straightforward data model, eliminate complexity, and finish your NoSQL project in less time with fewer bugs.