Hi,
I googled and found some information on this topic, but could not find an answer that matched my question, so I finally thought of posting it here.
Problem: Our ETL is going to be split into multiple jobs, so we want to understand the best way to pass data from job1 to job2 and so on.
Considerations:
1. In some cases the data volume will be tens of GBs
2. In other cases the data volume will be hundreds of MBs
3. We could have jobs running in parallel and then, once they are all finished, a single job that consolidates their output and loads it into Amazon Redshift (a rough sketch of the handoff we have in mind follows this list)
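
To make that consolidation step concrete, here is a minimal sketch of what we are imagining, assuming each parallel job stages its output under a shared S3 prefix and the final job issues a single Redshift COPY over that prefix. All of the names here (bucket, cluster, database, IAM role) are made up for illustration; we have not committed to any of this.

```python
import boto3

s3 = boto3.client("s3")
rsd = boto3.client("redshift-data")

STAGE_BUCKET = "my-etl-stage"              # hypothetical bucket

def stage_job_output(job_id: str, local_path: str) -> None:
    """Each parallel job uploads its output under its own prefix."""
    key = f"run-001/{job_id}/part-0000.csv.gz"
    s3.upload_file(local_path, STAGE_BUCKET, key)

def consolidate_into_redshift() -> None:
    """The final job issues one COPY over the shared prefix; Redshift
    pulls all the staged files from S3 in parallel."""
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",    # hypothetical cluster
        Database="analytics",
        DbUser="etl_user",
        Sql=(
            "COPY target_table "
            f"FROM 's3://{STAGE_BUCKET}/run-001/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/etl-copy-role' "
            "CSV GZIP;"
        ),
    )
```

The appeal is that COPY loads all the staged files in parallel, but we do not know whether this beats staging in Redshift itself, which is what the first question below is about.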
Questions:
1. Would it be better to stage intermediate output in Amazon Redshift, or would that drag performance down?
2. Is there anything available with which data can be made available to the next job without landing on disk? (First sketch after this list.)
3. Also, what would the restartability options be with these solutions? For example, in a design of, say, 10 jobs, if the first 5 succeed and job 6 aborts, can we rerun only the aborted job and the remaining jobs next time, skipping the ones that already completed? (Second sketch after this list.)
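
For question 2, one pattern we have come across (though we are unsure it scales to tens of GBs) is chaining the jobs as processes through an OS pipe, so the intermediate rows stay in memory and never land on local disk. The script names below are invented:

```python
import subprocess

# job1's stdout is piped straight into job2's stdin; the intermediate
# data is never written to disk. Both scripts are hypothetical.
job1 = subprocess.Popen(["python", "job1.py"], stdout=subprocess.PIPE)
job2 = subprocess.Popen(["python", "job2.py"], stdin=job1.stdout)
job1.stdout.close()   # let job2 own the pipe so job1 sees SIGPIPE if job2 dies
job2.wait()
job1.wait()
```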
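
For question 3, this is roughly the restart behaviour we are hoping for, sketched in plain Python: each job records its completion in a manifest, and a rerun skips anything already marked complete and resumes at the aborted job. The manifest file name and dummy job list are invented:

```python
import json
from pathlib import Path

MANIFEST = Path("run-001.manifest.json")   # hypothetical checkpoint file

def load_done() -> set:
    """Jobs already completed in a previous (possibly aborted) run."""
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def mark_done(job_id: str, done: set) -> None:
    done.add(job_id)
    MANIFEST.write_text(json.dumps(sorted(done)))

def run_pipeline(jobs: dict) -> None:
    """Run jobs in order, skipping completed ones, so a rerun after a
    failure starts at the aborted job rather than from scratch."""
    done = load_done()
    for job_id, fn in jobs.items():
        if job_id in done:
            print(f"skipping {job_id} (already complete)")
            continue
        fn()              # raises on failure; earlier successes stay recorded
        mark_done(job_id, done)

if __name__ == "__main__":
    # 10 dummy jobs standing in for the real ETL steps
    run_pipeline({f"job{i}": (lambda i=i: print(f"running job{i}"))
                  for i in range(1, 11)})
```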
Any input will be appreciated.
- Kirti