<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing joins in Talend spark batch jobs in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Optimizing-joins-in-Talend-spark-batch-jobs/m-p/2215089#M11652</link>
    <description>Hi,
&lt;BR /&gt;Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in tMap?
&lt;BR /&gt;
Best regards
&lt;BR /&gt;
Sabrina</description>
    <pubDate>Mon, 27 Feb 2017 07:38:47 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2017-02-27T07:38:47Z</dc:date>
    <item>
      <title>Optimizing joins in Talend spark batch jobs</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Optimizing-joins-in-Talend-spark-batch-jobs/m-p/2215088#M11651</link>
      <description>Hi,
&lt;BR /&gt;I'm having an issue with a Spark batch job in Talend. The job is pretty simple: it reads a file from HDFS, performs a left outer join with another file on HDFS (using a tMap) on a single key, and finally writes the result to HDFS. What I have noticed is weird: the resulting Spark job performs a cogroup at one point and tries to gather the whole dataset in a single task before writing it to HDFS! Thus, if the dataset is big enough, it results in an OutOfMemoryError: Java heap space.
&lt;BR /&gt;Why does Talend handle joins that way? Is it possible to optimize it?
&lt;BR /&gt;Walid.</description>
      <pubDate>Sat, 16 Nov 2024 10:03:50 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Optimizing-joins-in-Talend-spark-batch-jobs/m-p/2215088#M11651</guid>
      <dc:creator>_AnonymousUser</dc:creator>
      <dc:date>2024-11-16T10:03:50Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing joins in Talend spark batch jobs</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Optimizing-joins-in-Talend-spark-batch-jobs/m-p/2215089#M11652</link>
      <description>Hi,
&lt;BR /&gt;Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in tMap?
&lt;BR /&gt;
Best regards
&lt;BR /&gt;
Sabrina</description>
      <pubDate>Mon, 27 Feb 2017 07:38:47 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Optimizing-joins-in-Talend-spark-batch-jobs/m-p/2215089#M11652</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2017-02-27T07:38:47Z</dc:date>
    </item>
  </channel>
</rss>

