Hello,
how can I do to cut duplication of data. For example, I have 4 columns. And I have 3 duplicates at the level of the first three columns. The fourth column is the date.
I want to keep the record that is older.
tUniq Row removes that record, the first or second? Or rather, what technique to use it?
Thanks
Hi, How will you define older? Do you have a select query which could achieve your purpose, we can translate it using talend... Can you take sample scenario with some data and your business logic to be implemented... Vaibhav
Hi,
I think that's is possibile using an aggregation (tAggregateRow or tAggregateSortedRow): if you want the older you can use max.
So if record contains only these four columns, you have you result.
If you have more columns, you can first use that methods to calculate keys needed (based on the three key columns + older date) and then use this intermediate result to filter and extract complete records.
I hope this help you.
Hey. So I have sales data.
Data have columns
id, name, number_of_items, change_date But some data are duplicated at the level of
id, name, number_of_items and they only differ in the
change_date. I need to store in the database only unique information, but according to the rule that always saves the oldest duplicate record.
In this moment it does so, it retrieves the data, sort it by
id, name, number_of_items , change_date. Then I make deduplication at level
id, name, number_of_items . Problem is, I've never found a specification of how Talend selects for Unique Rows of dupicit. Selects a unique first found?
I'm still not sure I fully understand, but I think gorotman has your solution. Use a tAggregateRow. Put columns id, name, number_of_items in the Group By section. In the Operations section, use a Max function on the change_date column.
If you need to know how many duplicates existed, you can add another column to the schema. Then add another row in Operations and do a Count on the fields.
Doing this would return unique rows for id, name, number_of_items, choose the most recent date for the change_date column, and give you a count of how many rows were grouped together. 1 = no duplicates. 2 or more = duplicates existed.