topic Re: Sorting before tuniq... in Talend Studio

Sorting before tuniq...

Anonymous — Sat, 16 Nov 2024 07:35:12 GMT

I have a MySQL table of 60m rows and need to dedupe it, keying on all 6 columns. Do I need to sort with tSortRow first and then follow with a tUniqRow, or can I go straight into a tUniqRow and let the component deal with it?

Any advice on whether this is the right approach or if there's a better way would be great!

Thanks

Re: Sorting before tuniq...

Anonymous — Tue, 25 Sep 2018 21:35:43 GMT

Why not dedupe and sort in your database? That is what a database is good at. If you have 60m rows and only a third of them are duplicates, that is 20m rows you send to Talend unnecessarily, only for them to be thrown away. Talend is a great ETL tool, but it runs on Java. Java is good at many things, but it generally isn't as quick as a database at sorting and filtering.

I'd recommend sorting and filtering your data in your database by writing a query to do that in your DB component. This way only the necessary data will enter your job and in the correct order. After that your job will have a lot less work to do.
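To make that concrete, here is a rough sketch of the kind of query I mean. It uses an in-memory SQLite database as a stand-in for MySQL so it runs anywhere, and the table name `src` and columns `c1` to `c6` are made up for illustration. The `SELECT DISTINCT ... ORDER BY` statement is the part you would adapt and paste into your DB input component.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL source (assumption for illustration).
# The table name `src` and columns c1..c6 are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (c1, c2, c3, c4, c5, c6)")
rows = [
    (2, "b", 1, 1, 1, 1),
    (1, "a", 1, 1, 1, 1),
    (2, "b", 1, 1, 1, 1),  # exact duplicate of the first row
    (1, "a", 1, 1, 1, 2),  # differs only in c6, so it is kept
]
conn.executemany("INSERT INTO src VALUES (?, ?, ?, ?, ?, ?)", rows)

# Dedupe on all six columns and sort, all inside the database.
DEDUPE_QUERY = """
    SELECT DISTINCT c1, c2, c3, c4, c5, c6
    FROM src
    ORDER BY c1, c2, c3, c4, c5, c6
"""
deduped = conn.execute(DEDUPE_QUERY).fetchall()
print(deduped)
```

Only the distinct rows leave the database, so the job never sees the duplicates. If you don't actually need the rows ordered downstream, you can drop the `ORDER BY` and save the database that work as well.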

Re: Sorting before tuniq...

manodwhb — Wed, 26 Sep 2018 08:59:41 GMT

@DaveG2008, since your source is a database, I would suggest removing the duplicates at the DB level.

If you feel your DB server will not be able to handle it, then go with tUniqRow.
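On the original question of whether a sort is needed first: a dedupe that remembers the keys it has already seen does not depend on input order. The sketch below shows that general technique in plain Python. It is not Talend's code, and whether tUniqRow works this way internally is an assumption here, so check the component documentation. The function name `unique_rows` is made up for illustration.

```python
def unique_rows(rows):
    """Yield each row the first time its full key is seen; input order is irrelevant."""
    seen = set()  # holds every distinct key, so memory grows with the number of unique rows
    for row in rows:
        key = tuple(row)  # keying on all columns, as in the question
        if key not in seen:
            seen.add(key)
            yield row


unsorted_rows = [
    (2, "b", 1, 1, 1, 1),
    (1, "a", 1, 1, 1, 1),
    (2, "b", 1, 1, 1, 1),  # duplicate, not adjacent to its first occurrence
]
result = list(unique_rows(unsorted_rows))
print(result)
```

The trade-off is memory: every distinct key is held at once, which matters at 60m rows and is another reason to let the database do the work if it can.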