Wednesday, September 30, 2015

Coursera Architecture

Coursera is a venture-backed, for-profit educational technology company that offers massive open online courses (MOOCs).

It works with top universities and organizations to make some of their courses available online, offering courses in physics, engineering, humanities, medicine, biology, social sciences, mathematics, business, computer science, digital marketing, data science, and other subjects. It is an online educational startup with over 14 million learners across the globe and more than 1,000 courses from over 120 top universities.

At Coursera, Amazon Redshift is used as the primary data warehouse because it provides a standard SQL interface and fast, reliable performance. AWS Data Pipeline is used to extract, transform, and load (ETL) data into the warehouse. Data Pipeline provides fault tolerance, scheduling, resource management, and an easy-to-extend API for ETL processing.

Dataduct is a Python-based framework built on top of Data Pipeline that lets users create custom reusable components and patterns to be shared across multiple pipelines. This boosts developer productivity and simplifies ETL management.

At Coursera, more than 150 pipelines pull data from 15 data sources such as Amazon RDS, Cassandra, log streams, and third-party APIs. Over 300 tables are loaded into Amazon Redshift every day, processing several terabytes of data. Subsequent pipelines push data back into Cassandra to power Coursera's recommendations, search, and other data products.
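A warehouse load step of the kind these pipelines automate can be sketched as follows. This is a minimal illustration, not Coursera's actual code: the table name, S3 path, IAM role, and file format below are all hypothetical, and the exact COPY options used in production are not stated in the post.

```python
# Minimal sketch of a Redshift bulk-load step like those Dataduct
# orchestrates: staged files in S3 are loaded via a COPY command.
# All identifiers here (table, bucket, role) are made up for illustration.
def build_copy_statement(table, s3_path, iam_role):
    """Compose a Redshift COPY command that bulk-loads staged S3 data."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    )

sql = build_copy_statement(
    "warehouse.course_enrollments",
    "s3://etl-staging/enrollments/2015-09-30/",
    "arn:aws:iam::123456789012:role/redshift-loader",
)
```

In a real pipeline this statement would be executed against the cluster with a Postgres-compatible client such as psycopg2; composing it separately keeps the step easy to test and reuse across pipelines.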

The attached image below illustrates the data flow at Coursera.

Monday, September 28, 2015


ScyllaDB bills itself as the world's fastest NoSQL column-store database. Written in C++, it is fully compatible with Apache Cassandra while claiming 10x the throughput and jaw-droppingly low latency.

Scylla will work with existing Cassandra command line CQL clients. However, mixed clusters of Scylla and Cassandra nodes are not supported. A Scylla node cannot join a Cassandra cluster, and a Cassandra node cannot join a Scylla cluster.
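Because Scylla implements the Cassandra wire protocol, the same CQL and the same client code can target either database unchanged. A minimal sketch using the Python cassandra-driver; the contact points, keyspace, and table names are hypothetical:

```python
# Sketch: identical CQL and driver code work against Cassandra or Scylla,
# since Scylla speaks the Cassandra wire protocol.
# Keyspace, table, and contact points below are illustrative only.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS demo.users (
    user_id uuid PRIMARY KEY,
    name text
)
"""

def connect(contact_points):
    """Open a session against either a Cassandra or a Scylla cluster."""
    # Requires: pip install cassandra-driver
    from cassandra.cluster import Cluster
    cluster = Cluster(contact_points)  # same call for either database
    return cluster.connect()
```

Swapping the cluster behind `contact_points` from Cassandra nodes to Scylla nodes requires no client-side changes, which is what makes the drop-in claim testable.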

To compare Scylla and Cassandra, the throughput of each was evaluated on a single multi-core server with the following hardware specification:
2x Xeon E5-2695v3: 2.3 GHz base, 35 MB cache,
14 cores (28 threads with Hyper-Threading)
2x 400 GB Intel NVMe P3700 SSD
Intel Ethernet CNA XL710-QDA1

On the software side, Scylla 0.8 and Cassandra 3.0 formed the test bed.

In the attached image, average throughput for the test is presented as lines, latency as bars.

Scylla’s measured latency of less than 1 ms for the 99th percentile is significantly lower than Cassandra’s, while providing significantly higher throughput (the single client machine could not fully load the server).  The lack of garbage collection means that there are no externally imposed latency sources, so Scylla latency can be brought even lower.
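For readers unfamiliar with the "99th percentile" figure quoted above, it is computed from the raw per-request latencies: 99% of requests complete at or below that value. A hedged sketch of the nearest-rank method (the sample values are made up, and the benchmark's exact percentile method is not stated):

```python
# Nearest-rank percentile over raw request latencies; this is one common
# way a p99 figure like the one above is derived. Sample data is invented.
def percentile(samples, p):
    """Return the smallest sample value >= p percent of all samples."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
    return s[k]

latencies_ms = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 0.3, 0.5, 0.6]
p99 = percentile(latencies_ms, 99)  # dominated by the slowest requests
```

The tail percentiles matter more than the mean here: garbage-collection pauses inflate exactly these high percentiles, which is why a GC-free design shows up most clearly at p99.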

Scylla’s order of magnitude improvement in performance opens a wide range of possibilities.  Instead of designing a complex data model to achieve adequate performance, use a straightforward data model, eliminate complexity, and finish your NoSQL project in less time with fewer bugs.

Thursday, September 10, 2015

MapReduce Authors

We know that MapReduce was the ice-breaker for the traditional computing model, introducing scale-out technology in an easy way. Google maintains a dedicated research page for MapReduce. The authors are Sanjay Ghemawat and Jeff Dean of Google Inc.

As a research scholar, I liked the motivation of their paper: large-scale data processing. In earlier days this was achievable only with supercomputing; the key difference is parallel execution across hundreds or thousands of CPUs, on commodity boxes and in an easy mode. Moreover, MapReduce provides:
  1. Automatic parallelization and distribution
  2. Fault-tolerance
  3. I/O scheduling
  4. Status and monitoring
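To make the programming model concrete, here is the classic word-count example as a single-process sketch. Only the user-visible map and reduce functions from the paper are modeled; the distributed runtime, scheduling, and fault tolerance listed above are not.

```python
from collections import defaultdict

# Single-process illustration of the MapReduce programming model
# (word count). The framework's distribution and fault tolerance
# are deliberately not modeled here.
def map_phase(documents):
    """Emit (word, 1) pairs, like a user-supplied map function."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts per word, like a user-supplied reduce function."""
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The user writes only `map_phase` and `reduce_phase`; everything else (parallelization, shuffling, retries) is the framework's job, which is the whole appeal of the model.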

Fault-tolerance is handled via re-execution.  On worker failure:
  • Detect failure via periodic heartbeats
  • Re-execute completed and in-progress map tasks
  • Re-execute in-progress reduce tasks
  • Task completion committed through master
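The heartbeat-and-re-execute loop above can be sketched as follows. This is an illustrative simplification, not the paper's implementation; the function names, data shapes, and timeout value are all invented for the example.

```python
# Hedged sketch of the master's failure handling: detect dead workers
# via stale heartbeats, then re-queue their tasks for re-execution.
# Names and the 10-second timeout are illustrative, not from the paper.
def detect_failed_workers(last_heartbeat, now, timeout=10.0):
    """Return workers whose last heartbeat is older than `timeout` seconds."""
    return {w for w, t in last_heartbeat.items() if now - t > timeout}

def requeue_tasks(assignments, failed, pending):
    """Move every task assigned to a failed worker back onto the pending queue."""
    for worker in failed:
        pending.extend(assignments.pop(worker, []))
    return pending

heartbeats = {"w1": 0.0, "w2": 9.5}          # seconds since job start
failed = detect_failed_workers(heartbeats, now=12.0)
pending = requeue_tasks({"w1": ["map-3", "map-7"], "w2": ["map-4"]},
                        failed, ["map-9"])
```

Because map output lives on the failed worker's local disk, even *completed* map tasks must be re-queued this way, whereas completed reduce output is already in the global file system.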

Data-locality optimization, skipping bad records, and compression of intermediate data are a few of their refinement techniques to boost performance on large-scale data.

Their research paper lists production usage for August 2004 with the following metrics:
  • Number of jobs: 29,423
  • Average job completion time: 634 secs
  • Machine days used: 79,186 days
  • Input data read: 3,288 TB
  • Intermediate data produced: 758 TB
  • Output data written: 193 TB
  • Average worker machines per job: 157
  • Average worker deaths per job: 1.2
  • Average map tasks per job: 3,351
  • Average reduce tasks per job: 55
  • Unique map implementations: 395
  • Unique reduce implementations: 269
  • Unique map/reduce combinations:

An amazing and game-changing methodology, delivered with ease, as the result of great research minds at Google. Here's an opportunity for me to highlight the authors of MapReduce: Sanjay Ghemawat and Jeff Dean.