topic Re: Iteration in Spark Job in Talend Studio

Iteration in Spark Job

ankushd — Sat, 16 Nov 2024 07:16:20 GMT

Hi All,

We have a requirement to read multiple HDFS files and convert them to Parquet. The input files are spread across different directories, including nested (recursive) paths.

We want to iterate over all the files and pass each one to the output file component. Is there a component that can iterate over files and hold the current file name in a global variable?
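
To make the requirement concrete, here is a minimal sketch of the iteration described above, written in plain Python rather than as a Talend component. It walks a directory tree recursively and pairs each input file with a Parquet target path that mirrors the source layout. The function name `plan_conversions`, its parameters, and the `.csv` filter are hypothetical, and it walks a local directory as a stand-in for HDFS; it only plans the conversions and does not read or write any data.

```python
import os
from pathlib import Path


def plan_conversions(input_root, output_root, suffix=".csv"):
    """Recursively list input files and pair each with a Parquet target path.

    Assumption: input_root is a local directory standing in for an HDFS path.
    """
    input_root = Path(input_root)
    output_root = Path(output_root)
    plan = []
    for dirpath, dirnames, filenames in os.walk(input_root):
        dirnames.sort()  # sort in place so traversal order is deterministic
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue  # skip files that are not conversion candidates
            src = Path(dirpath) / name
            # Mirror the relative directory layout under output_root.
            rel = src.relative_to(input_root)
            dst = (output_root / rel).with_suffix(".parquet")
            plan.append((src, dst))
    return plan
```

Each `(src, dst)` pair plays the role of the "current file name" that the question wants to hold in a global variable: the loop consuming the plan would hand `src` to the reader and `dst` to the output component.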

Re: Iteration in Spark Job

Anonymous — Mon, 19 Nov 2018 09:25:46 GMT

Hi,

You can handle all of the control logic in a DI (standard) job and trigger the BD (Big Data) job from it, with the independent child process option turned on.

Warm Regards,

Nikhil Thampi

Re: Iteration in Spark Job

ankushd — Mon, 19 Nov 2018 10:26:10 GMT

Thanks Nikhil. We designed our job with the same logic, but processing is slow when we use a standard job.

We perform the following operations in the master job:

1. Download the file from S3 to local and copy it to HDFS

2. Convert the CSV file to Parquet on HDFS

3. Copy the HDFS file to local and upload it to S3

Currently we are not able to run more than 10 parallel flows. The job server is an 8-CPU machine and accepts only 8 tRunJob flows. Is there any way to increase the number of parallel threads?

Because of this slowness, we decided to use pure Big Data jobs.
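
As a generic illustration of the parallel-flow limit described above (not Talend-specific), the sketch below runs the three-step flow for many files through a bounded thread pool and records the peak number of flows running at once. Everything in it is hypothetical: `run_flow` only sleeps in place of the real download, convert, and upload steps, and `ConcurrencyGauge` exists purely to show that concurrency is capped by the pool size.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyGauge:
    """Counts flows currently running and remembers the peak (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1
        return False


def run_flow(file_name, gauge, step_seconds=0.01):
    """Hypothetical stand-in for one flow: download, convert, upload."""
    with gauge:
        for _step in ("download", "convert", "upload"):
            time.sleep(step_seconds)  # placeholder for the real I/O work
    return f"{file_name}: done"


def run_all(file_names, max_workers):
    """Run every flow with at most max_workers flows in progress at once."""
    gauge = ConcurrencyGauge()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda f: run_flow(f, gauge), file_names))
    return results, gauge.peak
```

Calling `run_all(files, max_workers=4)` returns one result per file plus a peak that never exceeds 4. In this sketch the cap comes only from `max_workers`, not from the CPU count, because the simulated steps spend their time waiting; whether the job server allows a similar setting is a separate question that this example does not answer.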