Monday, June 27, 2016


Apache Zeppelin is an open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results.

Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. Zeppelin uses the Spark settings on your cluster and can use Spark’s dynamic allocation of executors to let YARN estimate the optimal resource consumption.

To run the prediction analysis, you need to create notebooks that generate prediction % and are scheduled to run daily. As part of the prediction analysis, we needed to connect to multiple data sources, like MySQL and Vertica for data ingestion and error rate generation. This enabled us to aggregate data across multiple dimensions, thus exposing underlying issues and anomalies at a glance.

Using Zeppelin, we applied many A/B models by replaying our raw data in AWS S3 to generate different prediction reports, which in turn helped us move in the right direction and provide better forecasting.

Zeppelin helps us to turn the huge amounts of raw data, often from across different data stores, into consumable information with useful insights.

Slide share reference is available at

No comments:

Post a Comment