tDenormalizing taking too long and too much memory to run in Talend Studio

tDenormalizing taking too long and too much memory to run

Anonymous — Sat, 16 Nov 2024 03:21:17 GMT

Hi,

I am using the tDenormalizing component to denormalize two columns across 1.3 million rows. The job takes more than 2 hours to run and needs 12 GB of RAM. What is the complexity of the algorithm, and is there a way to improve performance for high volumes of data?

Thanks!

Re: tDenormalizing taking too long and too much memory to run

Anonymous — Wed, 05 Feb 2020 13:27:15 GMT

Denormalising needs to keep all of the data in memory while it scans your 1.3 million records to see whether any links exist between them. That is not going to be easy or efficient. Could you group the data and split it into chunks before denormalising each chunk? I am sure that would speed this up.
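
To make the chunking idea concrete, here is a minimal sketch in plain Java (the language Talend jobs are generated in). It is not how the Talend component works internally; the class name, method names, and the {key, value} row shape are all assumptions made for illustration. Rows are split into buckets by a hash of the grouping key, so every row with the same key lands in the same bucket and each bucket can be denormalised on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Talend's implementation.
class ChunkedDenormalizeSketch {

    // Split rows into chunkCount buckets by key hash, so all rows sharing a key
    // end up in the same bucket. Each row is an assumed {key, value} pair.
    static List<List<String[]>> chunkByKey(List<String[]> rows, int chunkCount) {
        List<List<String[]>> chunks = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            chunks.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            int index = Math.floorMod(row[0].hashCode(), chunkCount);
            chunks.get(index).add(row);
        }
        return chunks;
    }

    // Denormalise one chunk: concatenate the values of each key, in input order.
    static Map<String, String> denormalizeChunk(List<String[]> chunk, String delimiter) {
        Map<String, StringBuilder> grouped = new LinkedHashMap<>();
        for (String[] row : chunk) {
            StringBuilder joined = grouped.get(row[0]);
            if (joined == null) {
                grouped.put(row[0], new StringBuilder(row[1]));
            } else {
                joined.append(delimiter).append(row[1]);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : grouped.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toString());
        }
        return result;
    }

    // Denormalise chunk by chunk and merge the per-chunk results.
    static Map<String, String> denormalize(List<String[]> rows, int chunkCount, String delimiter) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<String[]> chunk : chunkByKey(rows, chunkCount)) {
            merged.putAll(denormalizeChunk(chunk, delimiter));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"a", "1"},
                new String[] {"b", "2"},
                new String[] {"a", "3"});
        // Key "a" collects both of its values; key "b" keeps its single value.
        System.out.println(denormalize(rows, 4, ","));
    }
}
```

In this sketch everything still sits in one JVM, so it only shows the shape of the approach. In a real job each chunk would presumably be filtered at the source or written to its own file and processed in a separate pass, so that only one chunk is held in memory at a time.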

Re: tDenormalizing taking too long and too much memory to run

Anonymous — Thu, 06 Feb 2020 11:53:06 GMT

I ended up separating out the portion of the data that needed to be denormalized, and performance was better. The algorithm seems to have very high complexity, which I think could be improved.

Thanks!

Re: tDenormalizing taking too long and too much memory to run

Anonymous — Thu, 06 Feb 2020 13:39:39 GMT

Unfortunately, the problem requires that every row be potentially linked to every other row, or to no rows at all. That means that everything has to go into memory. With your dataset of 1,300,000 records, you are essentially dealing with 1,690,000,000,000 comparisons (1,300,000 × 1,300,000). I'm not sure that you can avoid that number of comparisons unless you build heuristics into the algorithm, and those heuristics depend on knowing the dataset. It's the job of the developer to build them in.
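
For reference, the comparison count quoted above can be reproduced with a few lines of Java. This is only the arithmetic, under the assumption that every row is compared against every row; the class and method names are made up for the example. The second method shows the smaller figure you would get if each unordered pair of distinct rows were compared only once.

```java
// Hypothetical helper for checking the arithmetic; names are illustrative.
class ComparisonCount {

    // Every row compared against every row (including itself): n * n.
    static long allPairs(long n) {
        return n * n;
    }

    // Each unordered pair of distinct rows compared once: n * (n - 1) / 2.
    static long distinctPairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long n = 1_300_000L;
        System.out.println(allPairs(n));      // 1690000000000
        System.out.println(distinctPairs(n)); // 844999350000
    }
}
```

Either way the count grows quadratically with the number of rows, which is why splitting the data into smaller groups reduces the work so sharply.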