Anonymous
Not applicable

Data Pipeline from S3 to Amazon Redshift

Hello Everyone,

I am in the process of building a data pipeline that loads data from S3 into Amazon Redshift. I have an S3 bucket with a layered folder structure (e.g. Amazon S3 > bucket-name/10849813427/2.0/2018/08/16/10958160321), and my files are placed in the last directory (10958160321).
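
For context, the 2018/08/16 segment of the key appears to encode the upload date, so a prefix for any given day can be derived from the layout. A minimal Python sketch (the account and version segments are placeholders copied from my example path):

```python
from datetime import date, timedelta

# Placeholders from my example path: bucket-name/<account>/<version>/YYYY/MM/DD/<file-id>
ACCOUNT = "10849813427"
VERSION = "2.0"

def prefix_for(day: date) -> str:
    """Build the S3 key prefix for a single day's files."""
    return f"{ACCOUNT}/{VERSION}/{day:%Y/%m/%d}/"

# Prefixes covering the last two days
recent_prefixes = [prefix_for(date.today() - timedelta(days=n)) for n in range(2)]
print(recent_prefixes)
```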

I have thousands of these files. I want to do a one-time upload of them into Redshift, and then do incremental updates to my Redshift cluster for the data I receive daily in the form of these files. The data also has to be cleansed and transformed before loading into Redshift.

I found a way to copy the files from all the folders inside this S3 bucket to Redshift using "tDBBulkExec", but I cannot load the complete data set into Redshift staging every day because the data volume is very high.
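
As I understand it, tDBBulkExec ultimately issues a Redshift COPY against a key prefix, so the full load looks roughly like the sketch below (connection details, table name, and IAM role are placeholders; the actual job uses the Talend component rather than psycopg2):

```python
import psycopg2  # assuming a plain Postgres-protocol connection to Redshift

# Everything below is a placeholder: table, prefix, IAM role, and credentials.
copy_sql = """
    COPY staging.incoming_files
    FROM 's3://bucket-name/10849813427/2.0/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV;
"""

with psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                      port=5439, dbname="dev",
                      user="awsuser", password="***") as conn:
    with conn.cursor() as cur:
        # COPY pulls in every object under the given prefix in one pass,
        # which is why repeating a full-bucket load daily is too heavy.
        cur.execute(copy_sql)
```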

Has anyone tried and found a way to load this data into Redshift incrementally on a daily basis (i.e. only the files uploaded in the last two days)?

There are two options I can think of:

Populating the key name in the "tDBBulkExec" component dynamically, limited to files no older than two days, which I am not sure how to do.

Or finding a way to copy only the files uploaded in the last two days and putting those in another S3 bucket.

Then I can copy those files into a Redshift staging table and do the transformation and cleansing after that (both options are sketched below).
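
Roughly, here is that sketch with boto3 (bucket names and the manifest key are placeholders, not my real setup); a COPY with the MANIFEST option could then load exactly the listed files:

```python
import json
from datetime import datetime, timedelta, timezone

import boto3  # assumed available; bucket names below are placeholders

s3 = boto3.client("s3")
BUCKET = "bucket-name"
cutoff = datetime.now(timezone.utc) - timedelta(days=2)

# Collect the keys of all objects modified in the last two days.
recent = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            recent.append(obj["Key"])

# Option 1: write a Redshift COPY manifest listing exactly these files,
# then point the COPY (or tDBBulkExec) at it with the MANIFEST option.
manifest = {"entries": [{"url": f"s3://{BUCKET}/{k}", "mandatory": True}
                        for k in recent]}
s3.put_object(Bucket=BUCKET,
              Key="manifests/last_two_days.manifest",
              Body=json.dumps(manifest).encode("utf-8"))

# Option 2: server-side copy of the same files into a separate staging bucket.
STAGING_BUCKET = "bucket-name-staging"  # placeholder
for key in recent:
    s3.copy_object(Bucket=STAGING_BUCKET, Key=key,
                   CopySource={"Bucket": BUCKET, "Key": key})
```

Either way, the staging table would then only ever see the files from the last two days.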

Please let me know if anyone has tried doing anything similar; I am open to other approaches as well.

Thanks!

1 Reply
Anonymous
Not applicable
Author

Hello,

Are you using the Talend Data Streams product?

Best regards

Sabrina