Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Qlik Open Lakehouse is Now Generally Available! Discover the key highlights and partner resources here.
cancel
Showing results for 
Search instead for 
Did you mean: 
peterko
Contributor III
Contributor III

[resolved] Deduplication

Hello, 
how can I do to cut duplication of data. For example, I have 4 columns. And I have 3 duplicates at the level of the first three columns. The fourth column is the date. 
I want to keep the record that is older. 
tUniq Row removes that record, the first or second? Or rather, what technique to use it?
Thanks
Labels (2)
6 Replies
Anonymous
Not applicable

Hi,
How will you define older? Do you have a select query which could achieve your purpose, we can translate it using talend... Can you take sample scenario with some data and your business logic to be implemented...
Vaibhav
Anonymous
Not applicable

Hi pantolik,



Could you please elaborate your case with an example with input and expected output values?


Best regards
Sabrina
gorotman
Creator II
Creator II

Hi,
I think that's is possibile using an aggregation (tAggregateRow or tAggregateSortedRow): if you want the older you can use max.
So if record contains only these four columns, you have you result.
If you have more columns, you can first use that methods to calculate keys needed (based on the three key columns + older date) and then use this intermediate result to filter and extract complete records.
I hope this help you.
peterko
Contributor III
Contributor III
Author

Hey. So I have sales data. 
Data have columns id, name, number_of_items, change_date 
But some data are duplicated at the level of id, name, number_of_items and they only differ in the  change_date. I need to store in the database only unique information, but according to the rule that always saves the oldest duplicate record. 
In this moment it does so, it retrieves the data, sort it by id, name, number_of_items , change_date. Then I make deduplication  at level id, name, number_of_items . Problem is, I've never found a specification of how Talend selects for Unique Rows of dupicit. Selects a unique first found?
peterko
Contributor III
Contributor III
Author

Then I still needed to take the duplicated data and set the attributes of the data quality in unique row of duplicate was found for a unique row.
Anonymous
Not applicable

I'm still not sure I fully understand, but I think gorotman has your solution.  Use a tAggregateRow.  Put columns id, name, number_of_items in the Group By section.  In the Operations section, use a Max function on the change_date column.
If you need to know how many duplicates existed, you can add another column to the schema.  Then add another row in Operations and do a Count on the fields. 
Doing this would return unique rows for id, name, number_of_items, choose the most recent date for the change_date column, and give you a count of how many rows were grouped together.  1 = no duplicates.  2 or more = duplicates existed.