Optimizing joins in Talend Spark batch jobs

_AnonymousUser — Sat, 16 Nov 2024 10:03:50 GMT

Hi,
I'm having an issue with a Spark batch job in Talend. The job is pretty simple: it reads a file from HDFS, performs a left outer join with another file on HDFS (using a tMap) on a single key, and finally writes the result to HDFS. What I have noticed is weird: the resulting Spark job performs a cogroup at one point and tries to gather the whole dataset in a single task before writing it to HDFS! Thus, if the dataset is big enough, it results in an OutOfMemory error: Java heap space.
Why does Talend handle joins that way? Is it possible to optimise it?
Walid.
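
For readers following the thread, the two join strategies at issue can be sketched in plain Python. This is only an illustration under stated assumptions: it is not the code Talend generates and it does not use the Spark API. The function names and sample rows are hypothetical. The first function mimics a cogroup-style join, where every row sharing a key is collected into one group (so a heavily repeated key can concentrate most of the data in one place). The second mimics a map-side lookup join, where only the smaller input is held in memory and the main input is streamed row by row.

```python
from collections import defaultdict

def cogroup_left_outer_join(left, right):
    """Cogroup-style join: collect all values of both inputs per key, then pair them."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    result = []
    for key, (left_values, right_values) in groups.items():
        for lv in left_values:
            # Left outer semantics: keep the left row even without a match.
            for rv in (right_values or [None]):
                result.append((key, lv, rv))
    return result

def lookup_left_outer_join(left, right):
    """Map-side join: index only the (assumed small) right input, stream the left one."""
    lookup = defaultdict(list)
    for key, value in right:
        lookup[key].append(value)
    for key, lv in left:
        for rv in (lookup.get(key) or [None]):
            yield (key, lv, rv)

# Hypothetical sample rows: (join key, payload)
main_rows = [(1, "a"), (2, "b"), (1, "c")]
lookup_rows = [(1, "x"), (3, "z")]

cogrouped = cogroup_left_outer_join(main_rows, lookup_rows)
streamed = list(lookup_left_outer_join(main_rows, lookup_rows))
```

Both functions return the same joined rows. The difference is in what has to be held in memory at once: every row of a key group in the first case, only the lookup input in the second. Spark itself offers broadcast joins that follow the second pattern. Whether a given Talend version exposes that choice for tMap is not established in this thread.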

Re: Optimizing joins in Talend Spark batch jobs

Anonymous — Mon, 27 Feb 2017 07:38:47 GMT

Hi,
Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the data on disk instead of in memory in the tMap?
Best regards
Sabrina
