Hello,
Is anyone facing slowness in data loads via Talend Studio or TMC since yesterday morning, 5 am CET? I checked with my IT team and they rebooted all the possible servers, but the issue still persists. Our target DB is Amazon Redshift.
The output for some of the SCD1/SCD2 jobs is around 2,500 rows in 2.5 hours; it usually takes just 30 to 40 minutes to process this much data. If anyone is facing the same issue, please let me know, or suggest a quick resolution, as our prod DW loads are stuck.
Regards,
Sushant
Hello Sushant,
It's probably a network performance issue or a job design issue; another customer ran into this before.
Please refer to the post below:
https://community.talend.com/s/feed/0D53p00007vCmXFCA0
Best regards
Aiming
@achen: These jobs have been running successfully in prod for the last 3 years, and no new jobs were deployed. All jobs are performance-tested in a lower environment before deployment. I tried a few things on my side, like running the job from Studio and increasing the memory size, but it did not work. Our IT team also cleaned up log/tmp files and restarted the RE, but it did not help. We also checked the disk space and found it to be fine.
If it's a network issue, how and from where do we find this out? Is it the AWS Redshift network or the Talend network? Can you please suggest?
I don't think the network issue comes from AWS Redshift, as it's a public database.
You can log in to the Talend RE VM and check the network with the command: speedtest-cli --simple
see https://www.omglinux.com/test-internet-speed-from-the-command-line
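If speedtest-cli is not available on the RE VM, a rough alternative is to time a TCP connect to the Redshift endpoint from Python. This is a minimal sketch; the host and port below are placeholders, not values from this thread:

```python
import socket
import time

def tcp_connect_ms(host, port, timeout=5.0):
    """Time a TCP connection to host:port in milliseconds.

    A high or growing connect time from the Remote Engine to the
    Redshift endpoint would point at a network problem rather than
    a job-design problem.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

# Placeholder endpoint -- replace with your actual Redshift host and port (usually 5439).
# print(tcp_connect_ms("example-cluster.abc123.eu-west-1.redshift.amazonaws.com", 5439))
```

Running this a few times during a slow load, and comparing against a run at a quiet hour, gives a quick first signal on whether latency from the RE VM is the bottleneck.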
Also, increasing the 'Number of rows per insert' parameter in tRedshiftOutput's advanced settings will improve the performance.
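To see why 'Number of rows per insert' matters, here is a toy cost model (the overhead constants are made up, purely for illustration): each INSERT statement pays a fixed round-trip cost, so fewer, larger batches cut the total load time.

```python
import math

def load_time_s(total_rows, rows_per_insert,
                per_stmt_overhead_s=0.05, per_row_s=0.0001):
    """Estimated load time: a fixed per-statement cost (network round
    trip, parse/plan) plus a small per-row cost. The constants here
    are illustrative assumptions, not measured Redshift numbers."""
    statements = math.ceil(total_rows / rows_per_insert)
    return statements * per_stmt_overhead_s + total_rows * per_row_s

# 2,500 rows inserted one at a time vs. in a single batch:
slow = load_time_s(2500, 1)      # 2,500 round trips
fast = load_time_s(2500, 10000)  # 1 round trip
```

With per-row inserts the fixed per-statement cost dominates; batching amortizes it. The effect is strongest when round-trip latency to the database is high, which ties this setting back to the network question above.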
Is it possible to share screenshots of your job for further investigation?
Thanks for the suggestion, @Aiming Chen. For point 1, my IT team will check. For point 2, let me point out that these are SCD1/SCD2 jobs which are suddenly running very slow, not insert or truncate-and-reload (S3) jobs. I have attached screenshots of the source-side and target-side transformations for your reference. Please check.
@Aiming Chen: Target-side screenshot. Let me know if you need any more details. Note: these jobs have been running fine in prod for almost 3 years.
@Aiming Chen: Please see the network performance report from our IT team. They say it's good.
@Sushant Kapoor, can you confirm the performance issue comes from the tRedshiftInput component? If yes,
the problem is that as the data size grows, the query performance becomes slower. To improve the performance,
please try the following:
@Aiming Chen: Surely, I will try this. But what confuses me the most is that most of these slow-running jobs are SCD1/SCD2 jobs (tPostgresqlOutput component), and their throughput seems to have dropped to 2 rows/s; earlier it used to be much higher. I checked and we already have a sort key defined for them, but the cursor size is 100,000 (set to improve performance). I will reduce it to 1,000 and see if that makes a difference.
Also, how can some SCD2 jobs that ran fine 5 days back suddenly drop in performance? If it were one job running slow, we could obviously fix it; but the point is, how can all the jobs start running slowly from one day to the next? This is very confusing. Any thoughts on this?
@Sushant Kapoor, as all the jobs started running slowly at the same time, I guess the issue comes from the DB side.
You said the tPostgresqlOutput throughput dropped to 2 rows/s, so please try tuning the backend PostgreSQL DB,
e.g. increase the parameters below:
...
see https://www.revsys.com/writings/postgresql-performance.html
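Which parameters the post above had in mind is not shown. As an assumption, the usual suspects from tuning guides like the one linked are shared_buffers, work_mem, and effective_cache_size. This sketch only inspects those settings in a postgresql.conf-style text; it does not change anything on the server:

```python
import re

# Assumed parameter list -- the elided parameters above are unknown;
# these are the memory settings tuning guides most often discuss.
TUNING_KEYS = {"shared_buffers", "work_mem", "effective_cache_size"}

def read_settings(conf_text):
    """Return {parameter: value} for the tuning keys found in a
    postgresql.conf-style text, ignoring comments and blank lines."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        m = re.match(r"(\w+)\s*=\s*(\S+)", line)
        if m and m.group(1) in TUNING_KEYS:
            settings[m.group(1)] = m.group(2).strip("'")
    return settings

sample = """
# memory settings
shared_buffers = 128MB   # default is often too low
work_mem = '4MB'
"""
# read_settings(sample) -> {'shared_buffers': '128MB', 'work_mem': '4MB'}
```

Comparing the current values against what the linked article recommends for the machine's RAM is a quick way to check whether the DB was ever tuned at all, before looking for a regression elsewhere.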