Hello Team,
I have a daily job which loads ~6 million rows daily (truncate and reload). My target database is Amazon Redshift. I have used tRedshiftOutput as the output component. Presently the job takes somewhere between 4 and 5 hours to complete. Is there a way to improve the performance of this job? Can I use any other component to improve the performance? My source is a slightly complex SQL query (4 full outer join statements).
Have you read this error? The reason is quite obvious: there is no such date as 2018-53-06. Please fix your source data, or change the data type to string in order to load this data.
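As an illustration of the string workaround (the staging table and column names here are made up for the example, not taken from the thread), you can land the problem column as a varchar first and cast only values that look like valid dates afterwards:

```sql
-- Hypothetical sketch: load the malformed date column as text first.
CREATE TABLE staging_orders (
    order_id   INT,
    order_date VARCHAR(10)   -- holds raw values like '2018-53-06' without failing
);

-- Then cast only values whose month part is 01-12; everything else becomes NULL.
-- (A fuller check would also validate the day-of-month.)
INSERT INTO target_orders (order_id, order_date)
SELECT order_id,
       CASE WHEN order_date ~ '^[0-9]{4}-(0[1-9]|1[0-2])-[0-9]{2}$'
            THEN order_date::DATE
            ELSE NULL
       END
FROM staging_orders;
```

This keeps the load from aborting on a single bad row, at the cost of a second pass to clean the values.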
@Kamil Kowalczyk: you are right. I have fixed the source data; the issue was with the date fields. I changed them to string and then the job worked fine. Also, the secret key was changed, and the job now runs successfully. The only remaining issue/query I have is that it takes ~65 minutes to commit 5 million records (it reads the data in 12 to 13 minutes) but takes a very long time to commit the transaction and insert into my target table. Is this normal, acceptable behavior, or can I tune it further? If yes, then how? Can you please suggest?
I think this article can help: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
tRedshiftOutputBulkExec performs several steps internally:
* building a local file from the input flow
* putting this file in S3
* loading the data from S3 into Redshift.
Maybe if you use separate components for each of the above steps, you can locate which step has poor performance and focus on it.
What is your source? Redshift, or another DB? If Redshift, you can use UNLOAD with PARALLEL and GZIP to improve loading data.
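As a sketch of what that looks like (the query, bucket path, and IAM role below are placeholders, not values from this thread):

```sql
-- Hypothetical example: replace the query, bucket, and IAM role with your own.
UNLOAD ('SELECT col1, col2 FROM source_table')
TO 's3://my-bucket/staging/export_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
PARALLEL ON   -- write one file per node slice, in parallel (the default)
GZIP;         -- compress each output file to cut S3 transfer time
```

Parallel, compressed output produces many small gzipped files, which is also the shape Redshift's COPY loads fastest.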
@Kamil Kowalczyk: thanks for your valuable feedback. Yes, the source is Redshift as well. I will try the other components and see if that helps improve the timings. I know where the bottleneck is: it is while writing the data from S3 to the Redshift table.
Hi sushantk19,
that's great :) So try to do the UNLOAD with PARALLEL and GZIP, then do the COPY with the GZIP parameter, and let me know if it helped 🙂
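The matching COPY side of that pipeline would look roughly like this (again a sketch; the table name, bucket path, and IAM role are placeholders):

```sql
-- Hypothetical example: COPY reads back the gzipped files written by UNLOAD.
-- The prefix matches the TO clause of the UNLOAD, so all part files are loaded.
COPY target_table
FROM 's3://my-bucket/staging/export_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
GZIP;   -- must match the compression option used by the UNLOAD
```

Because COPY splits the work across node slices, loading many compressed part files is typically much faster than loading via single-row INSERTs.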
@Kamil Kowalczyk: I used the below 2 components instead of BulkExec:
1. Bulk output load
2. tRedshiftUnload
The time reduced a bit, but it is still more than 60 minutes for 6 million records. Can I improve this further by any other means? I see an option for parallel execution that is disabled. Will it help to enable it?
@Kamil Kowalczyk: My UNLOAD process is the slow part (not the COPY from S3 into the Redshift DB). It takes a very long time to write the data to S3, which is the output of a complex query (4 full outer joins). Can you suggest how I can make it faster? Presently this step takes around 70 minutes.