<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: unique on large/huge file in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201903#M3810</link>
    <description>&lt;P&gt;using disk - do not increase speed (as it was in question)&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Generally, volume problems can be solved only by "force".&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;First of all, Talend (Java) makes good use of the CPU for sorting, and disk speed is not very critical as long as you do not use the disk to store temp data.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;so solutions could be:&lt;/P&gt; 
&lt;P&gt;- When disk usage is enabled, use the fastest disk possible. Typical throughput is roughly 150 MB/s for a standard HDD, 500 MB/s for an SSD, and 3300 MB/s for NVMe. For example, AWS provides NVMe disks; Azure does not.&lt;/P&gt; 
&lt;P&gt;- When everything is "in memory", memory speed and the CPU (speed, cache) are important. It is complicated: a 4.7 GHz CPU does not always beat a 2.7 GHz one, because many other parameters matter, such as on-chip cache size, memory bus width, frequency, number of clock cycles, etc.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;In both cases, whenever possible, reduce the number of columns used for sorting (a "unique check" is a kind of sort).&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 03 Dec 2018 19:23:48 GMT</pubDate>
    <dc:creator>vapukov</dc:creator>
    <dc:date>2018-12-03T19:23:48Z</dc:date>
    <item>
      <title>unique on large/huge file</title>
      <link>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201901#M3808</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;I have a file with 100M records and have to keep only the rows that are unique across all columns. What is the best way to do this in terms of performance? I have allocated 30-50 GB of memory, but it still takes too much time.&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Capture.PNG" style="width: 484px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009M1BY.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/151780i5FAB07CE8418C4BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009M1BY.png" alt="0683p000009M1BY.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Thanks!!&lt;/P&gt;</description>
      <pubDate>Fri, 30 Nov 2018 19:19:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201901#M3808</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-11-30T19:19:01Z</dc:date>
    </item>
    <item>
      <title>Re: unique on large/huge file</title>
      <link>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201902#M3809</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp; &amp;nbsp; Considering the data volume, you will have to allocate temp disk space to hold the interim data for comparison.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp; &amp;nbsp; Please refer to the Advanced settings tab to set up this configuration.&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="image.png" style="width: 999px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009M1BZ.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/148940i08AF86824E81C04D/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009M1BZ.png" alt="0683p000009M1BZ.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;If the answer has helped you, could you please mark the topic as resolved? Kudos are also welcome &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Warm Regards,&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Nikhil Thampi&lt;/P&gt;</description>
      <pubDate>Mon, 03 Dec 2018 04:10:21 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201902#M3809</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2018-12-03T04:10:21Z</dc:date>
    </item>
    <item>
      <title>Re: unique on large/huge file</title>
      <link>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201903#M3810</link>
      <description>&lt;P&gt;using disk - do not increase speed (as it was in question)&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Generally, volume problems can be solved only by "force".&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;First of all, Talend (Java) makes good use of the CPU for sorting, and disk speed is not very critical as long as you do not use the disk to store temp data.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;so solutions could be:&lt;/P&gt; 
&lt;P&gt;- When disk usage is enabled, use the fastest disk possible. Typical throughput is roughly 150 MB/s for a standard HDD, 500 MB/s for an SSD, and 3300 MB/s for NVMe. For example, AWS provides NVMe disks; Azure does not.&lt;/P&gt; 
&lt;P&gt;- When everything is "in memory", memory speed and the CPU (speed, cache) are important. It is complicated: a 4.7 GHz CPU does not always beat a 2.7 GHz one, because many other parameters matter, such as on-chip cache size, memory bus width, frequency, number of clock cycles, etc.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;In both cases, whenever possible, reduce the number of columns used for sorting (a "unique check" is a kind of sort).&lt;/P&gt; 
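&lt;P&gt;To illustrate why fewer key columns help, here is a minimal sketch outside Talend, in Python. It is not how Talend implements its unique check; the function name unique_rows, the key_columns parameter, and the sample data are all hypothetical. It assumes the rows are lists of strings, that keeping the first occurrence is acceptable, and that a 128-bit hash collision is unlikely enough to ignore.&lt;/P&gt; 

```python
import csv
import hashlib
import io


def unique_rows(rows, key_columns=None):
    """Yield each row the first time its key is seen.

    key_columns: indexes of the columns that define uniqueness
    (None means all columns). Only a 16-byte digest per distinct key
    is kept in memory, not the row itself.
    """
    seen = set()
    for row in rows:
        key = row if key_columns is None else [row[i] for i in key_columns]
        # The unit separator keeps ["ab", "c"] distinct from ["a", "bc"].
        joined = "\x1f".join(key).encode("utf-8")
        digest = hashlib.blake2b(joined, digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield row


# Hypothetical sample data: the last row repeats columns 1 and 2 only.
sample = io.StringIO("1,alice,NY\n2,bob,LA\n1,alice,NY\n3,alice,NY\n")
all_columns = list(unique_rows(csv.reader(sample)))

sample.seek(0)
two_columns = list(unique_rows(csv.reader(sample), key_columns=[1, 2]))

print(len(all_columns), len(two_columns))
```

&lt;P&gt;With fewer key columns there is less data to hash per row, and the set never holds full rows, so the memory needed grows with the number of distinct keys rather than with the row width.&lt;/P&gt; 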
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Dec 2018 19:23:48 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/unique-on-large-huge-file/m-p/2201903#M3810</guid>
      <dc:creator>vapukov</dc:creator>
      <dc:date>2018-12-03T19:23:48Z</dc:date>
    </item>
  </channel>
</rss>

