Hi,
I am using the tDenormalize component to denormalize two columns across 1.3 million (1.3kk) rows, and the job takes more than 2 hours and needs 12 GB of RAM. I'd like to know the complexity of the algorithm and whether there is a way to improve performance for high data volumes.
Thanks!
Denormalising needs to keep all of the data in memory while it scans your 1.3 million records to see whether any links exist between them, so it is never going to be easy or efficient. Is there a way you could group the data into chunks first and then denormalise each chunk on its own? I am sure that would speed this up; there is a rough sketch of the idea below.
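Just to make the idea concrete, here is a rough sketch in plain Java, outside of Talend. The Row record and the key/value field names are made up for illustration; this is not the internal logic of tDenormalize, only the "group first, then denormalise each group" pattern:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ChunkedDenormalize {

    // Minimal stand-in for a row: a grouping key plus the value to collect.
    record Row(String key, String value) {}

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("A", "1"), new Row("B", "2"),
                new Row("A", "3"), new Row("B", "4"));

        // 1) Group the rows by key first, so each group stays small.
        Map<String, List<Row>> groups = new LinkedHashMap<>();
        for (Row r : rows) {
            groups.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }

        // 2) Denormalise each group on its own: join the values with a delimiter,
        //    the same effect as denormalising, but only within one group at a time.
        for (Map.Entry<String, List<Row>> e : groups.entrySet()) {
            String denormalised = e.getValue().stream()
                    .map(Row::value)
                    .collect(Collectors.joining(","));
            System.out.println(e.getKey() + " -> " + denormalised);
        }
    }
}
```

Each group only ever has to be compared against itself, and only one group needs to be held and merged at a time, which is where the memory and time savings come from.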
I ended up separating out just the portion of the data that needed to be denormalized, and that worked much better. The algorithm still seems to have very high complexity, which I think could be improved.
Thanks!
Unfortunately the problem requires that every row be treated as potentially linked to every other row, or to none at all, which means everything has to sit in memory. With your dataset of 1,300,000 records you are essentially looking at 1,300,000 × 1,300,000 = 1,690,000,000,000 comparisons. I'm not sure you can avoid that number of comparisons unless you build heuristics into the algorithm that you could only know about by knowing the dataset, and it's the developer's job to build in those heuristics.
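To put numbers on why chunking (or any key-based heuristic) pays off so much: if the comparison count really grows as n², then splitting n rows into k independent groups of roughly n/k rows cuts it to about k × (n/k)² = n²/k. A quick back-of-the-envelope sketch, with a made-up split into 1,000 equal groups:

```java
public class ComparisonCount {
    public static void main(String[] args) {
        long n = 1_300_000L;

        // All-pairs comparisons over the whole dataset: n * n.
        long allPairs = n * n;
        System.out.println("Whole dataset: " + allPairs);        // 1,690,000,000,000

        // Hypothetical split into 1,000 equal groups of 1,300 rows each:
        // each group needs (n/k)^2 comparisons, repeated k times.
        long k = 1_000L;
        long perGroup = (n / k) * (n / k);
        System.out.println("Chunked (k=1000): " + (k * perGroup)); // 1,690,000,000
    }
}
```

That hypothetical split is a thousand-fold reduction, which is why knowing how your data can be partitioned matters more than anything the component itself can do.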