Hi,
I'm having an issue with a Spark batch job in Talend. The job is pretty simple: it reads a file from HDFS, performs a left outer join on a single key with another HDFS file (using a tMap), and finally writes the result back to HDFS. What I noticed is weird: the resulting Spark job performs a cogroup at one point and tries to gather the whole dataset on a single task before writing it to HDFS! So if the dataset is big enough, it fails with an OutOfMemoryError: Java heap space.
Why does Talend handle joins that way? Is it possible to optimise it?
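To make it concrete, here is roughly what I believe the generated job boils down to in plain Spark (the paths, the ';' delimiter and the names are my own, not what Talend generates). On pair RDDs, leftOuterJoin is itself built on top of cogroup, so a cogroup stage is expected; the surprising part is everything landing on one task:

import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))

    // Key both flows on the join column, as the tMap would.
    val main   = sc.textFile("hdfs:///data/main").map(l => (l.split(';')(0), l))
    val lookup = sc.textFile("hdfs:///data/lookup").map(l => (l.split(';')(0), l))

    // leftOuterJoin shuffles both sides by key; all records sharing a key
    // value are cogrouped into the same task. If the key has very few
    // distinct values (or is null/empty for most rows), one task receives
    // almost the whole dataset.
    val joined = main.leftOuterJoin(lookup)

    joined.map { case (k, (m, lk)) => s"$k;$m;${lk.getOrElse("")}" }
          .saveAsTextFile("hdfs:///data/out")

    sc.stop()
  }
}

With a well-distributed key this shuffle spreads evenly across tasks, so the behaviour I'm seeing suggests the records are being grouped under very few key values.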
Walid.
Hi,
Have you tried allocating more memory to the Job execution by setting the -Xmx Java VM parameter, and storing the lookup data on disk instead of in memory in the tMap?
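For reference, outside of Talend those two suggestions correspond roughly to the sketch below (the values and paths are examples only). Note that the driver heap (-Xmx) has to be set when the JVM is launched, which is what the Job's JVM-argument setting does; only the executor memory can be set from the Spark configuration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object MemorySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("memory-sketch")
      // Heap for each executor; example value, tune to your cluster.
      // The driver's own heap must be raised via -Xmx at launch time.
      .set("spark.executor.memory", "4g")

    val sc = new SparkContext(conf)
    val lookup = sc.textFile("hdfs:///data/lookup").map(l => (l.split(';')(0), l))

    // MEMORY_AND_DISK keeps partitions that fit in memory and spills the
    // rest to disk, similar in spirit to tMap's option to store lookup
    // data on disk instead of in memory.
    lookup.persist(StorageLevel.MEMORY_AND_DISK)
    println(lookup.count())

    sc.stop()
  }
}

Spilling to disk trades some extra I/O for not blowing up the heap, which is usually the right trade-off for a large lookup.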
Best regards
Sabrina