Anonymous
Not applicable

unique on large/huge file

Hi 

I have a file with 100M records and need to take unique rows across all columns. What is the best way to do this in terms of performance? I have memory set up at around 30 GB-50 GB, but it still takes too much time.

Thanks!!

2 Replies
Anonymous
Not applicable
Author

Hi,

 

    Considering the data volume, you will have to allocate temporary disk space so the interim data can be staged for comparison.

 

    Please refer to the Advanced settings tab to set up this configuration.

[screenshot: Advanced settings tab]
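For readers curious what the disk option does conceptually: intermediate data is spilled to temp files instead of keeping every row on the heap. Below is a minimal plain-Java sketch of the same idea (hash-partition rows into temp buckets, then deduplicate each bucket in memory). The file names and bucket count are illustrative assumptions, not Talend's actual internals.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class DiskBackedDedup {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.csv");   // illustrative file names
        Path output = Paths.get("unique.csv");
        int buckets = 64;                      // more buckets => smaller in-memory sets

        // Pass 1: hash-partition rows into temp files so duplicates land in the same bucket.
        Path tmpDir = Files.createTempDirectory("dedup");
        BufferedWriter[] writers = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            writers[i] = Files.newBufferedWriter(tmpDir.resolve("bucket-" + i), StandardCharsets.UTF_8);
        }
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = Math.floorMod(line.hashCode(), buckets);
                writers[b].write(line);
                writers[b].newLine();
            }
        }
        for (BufferedWriter w : writers) {
            w.close();
        }

        // Pass 2: each bucket is now small enough to deduplicate in memory with a HashSet.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (int i = 0; i < buckets; i++) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = Files.newBufferedReader(tmpDir.resolve("bucket-" + i), StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (seen.add(line)) {  // add() returns false for a duplicate row
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}
```

With 100M rows, the bucket count would be tuned so that each bucket's distinct rows fit comfortably in the heap.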

 

 

If the answer has helped you, could you please mark the topic as resolved? Kudos are also welcome 🙂

 

Warm Regards,

 

Nikhil Thampi

vapukov
Master II

Using disk does not increase speed (and speed was the point of the question).

 

Generally, a data-volume problem like this can only be resolved by "brute force".

 

First of all, Talend (Java) utilizes the CPU well for sorting, and disk speed is not very critical as long as you do not use the disk to store temp data.
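To illustrate why the CPU dominates in this mode, here is a minimal sketch (assumed file names; requires a heap large enough to hold all rows, e.g. -Xmx50g) of a fully in-memory approach: a CPU-parallel sort followed by one pass that drops adjacent duplicates.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class InMemoryDedup {
    public static void main(String[] args) throws IOException {
        // Illustrative file names; the whole dataset must fit on the heap.
        List<String> rows = Files.readAllLines(Paths.get("input.csv"), StandardCharsets.UTF_8);
        String[] arr = rows.toArray(new String[0]);

        // parallelSort spreads the sort across all cores via the common ForkJoin pool,
        // which is why CPU cores and cache matter more than disk speed here.
        Arrays.parallelSort(arr);

        // After sorting, duplicates are adjacent: keep the first row of each run.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("unique.csv"), StandardCharsets.UTF_8)) {
            String prev = null;
            for (String row : arr) {
                if (!row.equals(prev)) {
                    out.write(row);
                    out.newLine();
                }
                prev = row;
            }
        }
    }
}
```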

 

So the solutions could be:

- When disk usage is enabled, use the fastest disk possible: a standard HDD gives about 150 MB/s, an SSD about 500 MB/s, and NVMe about 3,300 MB/s. For example, AWS provides NVMe disks; Azure does not.

- when all "in memory"- memory speed and cpu (speed, cache) is important. it is complicated, but not always 4.7Ghz cpu win over 2.7Ghz, many other parameters affected, like an on-chip cache size, memory bus wide, frequency, number of clocks and etc

 

In both cases, whenever possible, reduce the number of columns used for sorting (a uniqueness check is a kind of sorting); see the sketch below.
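As a rough illustration of that last point, this sketch deduplicates on a chosen subset of columns instead of comparing whole rows. The column indices, delimiter, and file names are assumptions for the example, and the CSV parsing is deliberately naive.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class KeyColumnDedup {
    public static void main(String[] args) throws IOException {
        // Illustrative: columns 0 and 2 are assumed to identify a row,
        // so only they are compared instead of every column.
        int[] keyColumns = {0, 2};
        Set<String> seen = new HashSet<>();

        try (BufferedReader in = Files.newBufferedReader(Paths.get("input.csv"), StandardCharsets.UTF_8);
             BufferedWriter out = Files.newBufferedWriter(Paths.get("unique.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",", -1);       // naive CSV split, for illustration only
                StringBuilder key = new StringBuilder();
                for (int c : keyColumns) {
                    key.append(cols[c]).append('\u0001');  // separator avoids "a,b"+"c" colliding with "a"+"b,c"
                }
                if (seen.add(key.toString())) {            // first occurrence of each key wins
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

Fewer key columns mean smaller keys to hash and compare, which shrinks both the memory footprint and the per-row CPU cost regardless of whether the job runs in memory or spills to disk.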