sushantk19
Creator

Improving the data load speed to Amazon redshift database

Hello Team,

I have a daily job which loads ~6 million rows daily (truncate and reload). My target database is Amazon Redshift, and I have used tRedshiftOutput as the output component. Presently the job takes somewhere between 4 and 5 hours to complete. Is there a way to improve the performance of this job? Can I use any other component to improve it? My source is a fairly complex SQL query (4 full outer joins).

20 Replies
pakapi_
Creator

Have you read this error? The reason is quite obvious: there is no such date as 2018-53-06. Please fix your source data, or change the data type to string in order to load it.

sushantk19
Creator
Author

@Kamil Kowalczyk: you are right. I have fixed the source data; the issue was with the date fields. I changed them to string and then the job worked fine. Also, the Secret key was changed and now the job runs fine. The only issue/query I have now is that it takes ~65 mins to commit 5 million records (it reads the data in 12 to 13 mins) but takes a very long time to commit this transaction and insert into my target table. Is this normal, acceptable behavior, or can I tune it further? If yes, then how? Can you please suggest?

pakapi_
Creator

I think this article can help: https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

tRedshiftOutputBulkExec has several steps inside:

* building a local file from the input flow

* putting this file in S3

* loading the data from S3 into Redshift.

Maybe if you use different components to perform the above steps, you will locate which step has poor performance and can focus on it.
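For context, that last step is essentially a Redshift COPY from S3. A rough sketch of what that command looks like (the table name, bucket path and IAM role below are just placeholders, not taken from your job; you might also use key-based CREDENTIALS instead of IAM_ROLE):

-- load the staged S3 files into the target table
COPY my_target_table
FROM 's3://my-bucket/staging/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '|';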

 

What is your source? Redshift, or another DB? If Redshift, you can use UNLOAD with PARALLEL and GZIP to improve data loading.
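A minimal sketch of such an UNLOAD, assuming a placeholder query, bucket and IAM role (adjust to your own setup; PARALLEL ON makes Redshift write one compressed part file per node slice):

-- unload the result of the source query to S3 as gzipped, parallel part files
UNLOAD ('SELECT * FROM my_source_view')
TO 's3://my-bucket/staging/unload_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '|'
GZIP
PARALLEL ON
ALLOWOVERWRITE;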


sushantk19
Creator
Author

@Kamil Kowalczyk: thanks for your valuable feedback. Yes, the source is Redshift only. I will try the other components as well and see if that helps to improve the timings. I know where the bottleneck is: it is while writing the data from S3 to the Redshift table.

pakapi_
Creator

Hi sushantk19,

 

That's great :) So try to do the UNLOAD with PARALLEL and GZIP, then do the COPY with the GZIP parameter, and let me know if it helped 🙂
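To illustrate the COPY side of that recipe (again with placeholder table, path and role names), the GZIP option simply has to match the compressed files produced by the UNLOAD:

-- load the gzipped part files written by the UNLOAD into the target table
COPY my_target_table
FROM 's3://my-bucket/staging/unload_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '|'
GZIP;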


sushantk19
Creator
Author

@Kamil Kowalczyk: I used the below 2 components instead of the bulk exec:

1. Bulk output load

2. tRedshiftUnload.

The time reduced a bit, but it is still more than 60 mins for 6 million records. Can I improve this further by any other means? I see an option for parallel execution that is disabled. Will it help to enable that?


sushantk19
Creator
Author

@Kamil Kowalczyk: My unload process is the slow part (not the copy from S3 to the Redshift DB). It takes a very long time to write the data to S3, since it is the output of a complex query (4 full outer joins). Can you suggest how I can make it faster? Presently it takes around 70 mins for this process.

pakapi_
Creator

Hi sushantk19,

Do you have PARALLEL unload ENABLED, with the GZIP option? It improves unload performance a lot.