Ganesan Senthilvel: Redshift ETL

In ETL world, Amazon Redshift is revolutionary to make the developer life simple. It is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

The proposed ETL process has 4 key steps to execute:

Step 1: Extract from the RDBMS source to a S3 bucketIn this ETL process, the data extract job fetches change data every 1 hour and it is staged into multiple hourly files.

Step 2: Stage data to the Amazon Redshift table for cleansing
Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to deduce any files if needed.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables
Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables.

Step 4: Unload the daily dataset to populate the S3 data lake bucketThe transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.

Saturday, January 27, 2018

Redshift ETL

25 comments:

Blog Archive

Followers