<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: [resolved] Deduplication in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300693#M72897</link>
    <description>Hi,&lt;BR /&gt;How do you define "older"? If you have a SELECT query that achieves what you want, we can translate it into Talend. Can you share a sample scenario with some data and the business logic to be implemented?&lt;BR /&gt;Vaibhav</description>
    <pubDate>Fri, 08 Aug 2014 09:37:32 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2014-08-08T09:37:32Z</dc:date>
    <item>
      <title>[resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300692#M72896</link>
      <description>Hello,
&lt;BR /&gt;How can I remove duplicate data? For example, I have 4 columns, and 3 records are duplicates on the first three columns. The fourth column is a date.
&lt;BR /&gt;I want to keep the oldest record.
&lt;BR /&gt;Which record does tUniqRow keep, the first or the second? Or is there a better technique to use?
&lt;BR /&gt;Thanks</description>
      <pubDate>Fri, 08 Aug 2014 09:15:10 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300692#M72896</guid>
      <dc:creator>peterko</dc:creator>
      <dc:date>2014-08-08T09:15:10Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300693#M72897</link>
      <description>Hi,&lt;BR /&gt;How do you define "older"? If you have a SELECT query that achieves what you want, we can translate it into Talend. Can you share a sample scenario with some data and the business logic to be implemented?&lt;BR /&gt;Vaibhav</description>
      <pubDate>Fri, 08 Aug 2014 09:37:32 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300693#M72897</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-08-08T09:37:32Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300694#M72898</link>
      <description>Hi &lt;B&gt;&lt;A href="http://www.talendforge.org/forum/profile.php?id=207884" target="_blank" rel="nofollow noopener noreferrer"&gt;pantolik&lt;/A&gt;&lt;/B&gt;,&lt;BR /&gt;&lt;BR /&gt;Could you please elaborate on your case with an example of input and expected output values?&lt;BR /&gt;&lt;BR /&gt;Best regards&lt;BR /&gt;Sabrina</description>
      <pubDate>Fri, 08 Aug 2014 09:37:40 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300694#M72898</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-08-08T09:37:40Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300695#M72899</link>
      <description>Hi,
&lt;BR /&gt;I think that is possible using an aggregation (tAggregateRow or tAggregateSortedRow): if you want the oldest record, use min on the date column (max would return the most recent).
&lt;BR /&gt;If the record contains only these four columns, that gives you your result.
&lt;BR /&gt;If you have more columns, you can first use this method to calculate the keys needed (the three key columns plus the oldest date) and then use this intermediate result to filter and extract the complete records.
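&lt;BR /&gt;The two-step idea above (aggregate first, then filter the full records) can be sketched outside Talend in plain Python. This is only an illustration of the logic, not Talend code. It assumes "older" means the earliest date, so it takes the minimum date per key. The column names id, name, number_of_items and change_date are the ones used elsewhere in this thread; the extra "note" column and the function name oldest_per_key are hypothetical.

```python
from datetime import date


def oldest_per_key(rows):
    """Keep one full record per (id, name, number_of_items): the one with the oldest change_date."""
    # Step 1: aggregate, like a group-by on the key columns with a min on the date.
    oldest = {}
    for row in rows:
        key = (row["id"], row["name"], row["number_of_items"])
        oldest[key] = min(oldest.get(key, row["change_date"]), row["change_date"])

    # Step 2: filter the complete records against the aggregated (key, date) pairs.
    kept, seen = [], set()
    for row in rows:
        key = (row["id"], row["name"], row["number_of_items"])
        if row["change_date"] == oldest[key] and key not in seen:
            seen.add(key)  # if two rows tie on the oldest date, keep only the first
            kept.append(row)
    return kept


# Hypothetical sample data: the first two rows share a key and differ only in date.
rows = [
    {"id": 1, "name": "pen", "number_of_items": 5, "change_date": date(2014, 8, 3), "note": "b"},
    {"id": 1, "name": "pen", "number_of_items": 5, "change_date": date(2014, 8, 1), "note": "a"},
    {"id": 2, "name": "ink", "number_of_items": 2, "change_date": date(2014, 8, 2), "note": "c"},
]
result = oldest_per_key(rows)
```

&lt;BR /&gt;In a Talend job the first step would roughly correspond to the aggregation component and the second to a join back onto the original flow; how ties on the same date should be handled is a business decision that this sketch settles arbitrarily by keeping the first row.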
&lt;BR /&gt;I hope this helps.</description>
      <pubDate>Fri, 08 Aug 2014 09:55:58 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300695#M72899</guid>
      <dc:creator>gorotman</dc:creator>
      <dc:date>2014-08-08T09:55:58Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300696#M72900</link>
      <description>Hey. So I have sales data.
&lt;BR /&gt;The data has the columns
&lt;I&gt;id, name, number_of_items, change_date&lt;/I&gt;.
&lt;BR /&gt;Some records are duplicated on
&lt;I&gt;id, name, number_of_items&lt;/I&gt; and differ only in
&lt;I&gt;change_date&lt;/I&gt;. I need to store only unique records in the database, following the rule that the oldest duplicate is always the one kept.
&lt;BR /&gt;Currently the job retrieves the data and sorts it by
&lt;I&gt;id, name, number_of_items, change_date&lt;/I&gt;. Then I deduplicate on
&lt;I&gt;id, name, number_of_items&lt;/I&gt;. The problem is that I have never found a specification of how tUniqRow picks which of the duplicates to keep. Does it keep the first one found?</description>
      <pubDate>Fri, 08 Aug 2014 10:04:41 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300696#M72900</guid>
      <dc:creator>peterko</dc:creator>
      <dc:date>2014-08-08T10:04:41Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300697#M72901</link>
      <description>I also need to take the duplicated data and set a data-quality attribute on each unique row indicating that duplicates were found for it.</description>
      <pubDate>Fri, 08 Aug 2014 10:12:19 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300697#M72901</guid>
      <dc:creator>peterko</dc:creator>
      <dc:date>2014-08-08T10:12:19Z</dc:date>
    </item>
    <item>
      <title>Re: [resolved] Deduplication</title>
      <link>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300698#M72902</link>
      <description>I'm still not sure I fully understand, but I think gorotman has your solution. Use a tAggregateRow. Put the columns id, name, number_of_items in the Group By section. In the Operations section, use a Min function on the change_date column to keep the oldest date (a Max function would keep the most recent one instead).
&lt;BR /&gt;If you need to know how many duplicates existed, you can add another column to the schema. Then add another row in Operations with a Count function that writes to that column.
&lt;BR /&gt;Doing this would return unique rows for id, name, number_of_items, choose the oldest date for the change_date column, and give you a count of how many rows were grouped together. 1 = no duplicates. 2 or more = duplicates existed.</description>
      <pubDate>Fri, 08 Aug 2014 19:15:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/resolved-Deduplication/m-p/2300698#M72902</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-08-08T19:15:02Z</dc:date>
    </item>
  </channel>
</rss>

