Wednesday, September 30, 2015

Coursera Architecture


Coursera is a venture-backed, for-profit educational technology company that offers massive open online courses (MOOCs).

It works with top universities and organizations to make some of their courses available online, and offers courses in physics, engineering, humanities, medicine, biology, social sciences, mathematics, business, computer science, digital marketing, data science, and other subjects. It is an online education startup with over 14 million learners across the globe, offering more than 1,000 courses from over 120 top universities.

At Coursera, Amazon Redshift is used as the primary data warehouse because it provides a standard SQL interface and delivers fast, reliable performance. AWS Data Pipeline is used to extract, transform, and load (ETL) data into the warehouse. Data Pipeline provides fault tolerance, scheduling, resource management, and an easy-to-extend API for ETL processing.

Dataduct is a Python-based framework built on top of Data Pipeline that lets users create custom reusable components and patterns to be shared across multiple pipelines. This boosts developer productivity and simplifies ETL management.
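
To make the idea of reusable components concrete, here is a minimal sketch in plain Python. It does not use Dataduct's actual API; the names `Pipeline`, `drop_nulls`, and `rename`, along with the sample rows, are hypothetical and exist only to show how one step definition can be shared across several pipelines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration; not part of Dataduct or AWS Data Pipeline.
Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def drop_nulls(column: str) -> Step:
    """Reusable step: drop rows where `column` is missing or None."""
    def step(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r.get(column) is not None]
    return step


def rename(old: str, new: str) -> Step:
    """Reusable step: rename the key `old` to `new` in every row."""
    def step(rows: List[Row]) -> List[Row]:
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    return step


@dataclass
class Pipeline:
    """An ordered list of steps applied one after another."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, rows: List[Row]) -> List[Row]:
        for step in self.steps:
            rows = step(rows)
        return rows


# Two pipelines share the same drop_nulls component.
enrollments = Pipeline("enrollments", [drop_nulls("user_id"), rename("ts", "enrolled_at")])
clicks = Pipeline("clicks", [drop_nulls("user_id")])

sample = [
    {"user_id": 1, "ts": "2015-09-30"},
    {"user_id": None, "ts": "2015-09-29"},
]
result = enrollments.run(sample)
```

Because each step is an ordinary function, a fix or improvement to `drop_nulls` reaches every pipeline that uses it, which is the kind of productivity gain the paragraph above describes.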

At Coursera, 150+ pipelines pull data from 15 data sources such as Amazon RDS, Cassandra, log streams, and third-party APIs. Every day, 300+ tables are loaded into Amazon Redshift, processing several terabytes of data. Subsequent pipelines push data back into Cassandra to power Coursera's recommendations, search, and other data products.

The image below illustrates the data flow at Coursera.
