topic Re: [resolved] Deduplication in Talend Studio

[resolved] Deduplication

peterko — Fri, 08 Aug 2014 09:15:10 GMT

Hello,
how can I do to cut duplication of data. For example, I have 4 columns. And I have 3 duplicates at the level of the first three columns. The fourth column is the date.
I want to keep the record that is older.
tUniq Row removes that record, the first or second? Or rather, what technique to use it?
Thanks

Re: [resolved] Deduplication

Anonymous — Fri, 08 Aug 2014 09:37:32 GMT

Hi,
How will you define older? Do you have a select query which could achieve your purpose, we can translate it using talend... Can you take sample scenario with some data and your business logic to be implemented...
Vaibhav

Re: [resolved] Deduplication

Anonymous — Fri, 08 Aug 2014 09:37:40 GMT

Hi pantolik,

Could you please elaborate your case with an example with input and expected output values?

Best regards
Sabrina

Re: [resolved] Deduplication

gorotman — Fri, 08 Aug 2014 09:55:58 GMT

Hi,
I think that's is possibile using an aggregation (tAggregateRow or tAggregateSortedRow): if you want the older you can use max.
So if record contains only these four columns, you have you result.
If you have more columns, you can first use that methods to calculate keys needed (based on the three key columns + older date) and then use this intermediate result to filter and extract complete records.
I hope this help you.

Re: [resolved] Deduplication

peterko — Fri, 08 Aug 2014 10:04:41 GMT

Hey. So I have sales data.
Data have columns id, name, number_of_items, change_date
But some data are duplicated at the level of id, name, number_of_items and they only differ in the change_date. I need to store in the database only unique information, but according to the rule that always saves the oldest duplicate record.
In this moment it does so, it retrieves the data, sort it by id, name, number_of_items , change_date. Then I make deduplication at level id, name, number_of_items . Problem is, I've never found a specification of how Talend selects for Unique Rows of dupicit. Selects a unique first found?

Re: [resolved] Deduplication

peterko — Fri, 08 Aug 2014 10:12:19 GMT

Then I still needed to take the duplicated data and set the attributes of the data quality in unique row of duplicate was found for a unique row.

Re: [resolved] Deduplication

Anonymous — Fri, 08 Aug 2014 19:15:02 GMT

I'm still not sure I fully understand, but I think gorotman has your solution. Use a tAggregateRow. Put columns id, name, number_of_items in the Group By section. In the Operations section, use a Max function on the change_date column.
If you need to know how many duplicates existed, you can add another column to the schema. Then add another row in Operations and do a Count on the fields.
Doing this would return unique rows for id, name, number_of_items, choose the most recent date for the change_date column, and give you a count of how many rows were grouped together. 1 = no duplicates. 2 or more = duplicates existed.