<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic TFileInputDelimited for Big Data Spark cannot read multiple gz files in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344745#M112357</link>
    <description>Hi, I'm using the Talend Big Data Studio Enterprise edition and need to read (extract) multiple gz files and then apply transformations to them.
&lt;BR /&gt;
&lt;BR /&gt;In a normal DI job I used tFileUnarchive for this purpose, but it's not present in Spark Big Data jobs.
&lt;BR /&gt;
&lt;BR /&gt;I know that tFileInputDelimited for Big Data can read gz files by default, but I've yet to find a way to make it take multiple files as input.
&lt;BR /&gt;
&lt;BR /&gt;My files are in the format
&lt;BR /&gt;
&lt;BR /&gt;File1-00001.out.gz
&lt;BR /&gt;File1-00002.out.gz
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;File1-00075.out.gz
&lt;BR /&gt;File2-00001.out.gz
&lt;BR /&gt;File2-00002.out.gz
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;File2-00075.out.gz</description>
    <pubDate>Sat, 16 Nov 2024 05:41:47 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2024-11-16T05:41:47Z</dc:date>
    <item>
      <title>TFileInputDelimited for Big Data Spark cannot read multiple gz files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344745#M112357</link>
      <description>Hi, I'm using the Talend Big Data Studio Enterprise edition and need to read (extract) multiple gz files and then apply transformations to them.
&lt;BR /&gt;
&lt;BR /&gt;In a normal DI job I used tFileUnarchive for this purpose, but it's not present in Spark Big Data jobs.
&lt;BR /&gt;
&lt;BR /&gt;I know that tFileInputDelimited for Big Data can read gz files by default, but I've yet to find a way to make it take multiple files as input.
&lt;BR /&gt;
&lt;BR /&gt;My files are in the format
&lt;BR /&gt;
&lt;BR /&gt;File1-00001.out.gz
&lt;BR /&gt;File1-00002.out.gz
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;File1-00075.out.gz
&lt;BR /&gt;File2-00001.out.gz
&lt;BR /&gt;File2-00002.out.gz
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;.
&lt;BR /&gt;File2-00075.out.gz</description>
      <pubDate>Sat, 16 Nov 2024 05:41:47 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344745#M112357</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T05:41:47Z</dc:date>
    </item>
    <item>
      <title>Re: TFileInputDelimited for Big Data Spark cannot read multiple gz files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344746#M112358</link>
      <description>&lt;P&gt;Please, what do you mean by Spark Big Data?&lt;/P&gt; 
&lt;P&gt;&lt;SPAN class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="talend big data - tFileUnarchive.PNG" style="width: 980px;"&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009M5MT.png"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/130701i53E3D4CCDB8EF3D5/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009M5MT.png" alt="0683p000009M5MT.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 17:22:46 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344746#M112358</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-05-31T17:22:46Z</dc:date>
    </item>
    <item>
      <title>Re: TFileInputDelimited for Big Data Spark cannot read multiple gz files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344747#M112359</link>
      <description>Sorry for sounding so vague, I'm new to Talend.
&lt;BR /&gt;
&lt;BR /&gt;I'm talking about the Big Data Batch jobs, which run on the Spark framework. Standard jobs have components like tFileList and tFileUnarchive, but a Big Data Batch job doesn't.</description>
      <pubDate>Fri, 31 May 2019 17:35:17 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344747#M112359</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-05-31T17:35:17Z</dc:date>
    </item>
    <item>
      <title>Re: TFileInputDelimited for Big Data Spark cannot read multiple gz files</title>
      <link>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344748#M112360</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;&amp;nbsp; &amp;nbsp; You will have to orchestrate a DI job and a BD job together to solve this problem. Why don't you try passing each gz file as a parameter to a BD job, which will perform the remaining steps? The parent job, which sends one file at a time, can be a normal DI job.&lt;/P&gt; 
&lt;P&gt;&amp;nbsp;&lt;/P&gt; 
&lt;P&gt;Warm Regards,&lt;BR /&gt;Nikhil Thampi&lt;/P&gt; 
&lt;P&gt;Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2019 21:22:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/TFileInputDelimited-for-Big-Data-dpark-cannot-read-mutiple-gz/m-p/2344748#M112360</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2019-05-31T21:22:01Z</dc:date>
    </item>
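Outside of Talend components, the underlying task in this thread (reading many gzip files that share a naming pattern and treating their lines as one input) can be sketched in plain Python with only the standard library. This is a minimal illustration, not Talend's implementation; the directory and file names are hypothetical examples matching the pattern described above.

```python
import glob
import gzip

def read_gz_lines(pattern):
    """Yield decoded text lines from every gzip file matching a glob pattern.

    Files are visited in sorted order so that File1-00001 precedes
    File1-00002, and so on.
    """
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                yield line.rstrip("\n")

# Hypothetical usage: read every part of every file set in one pass.
# for line in read_gz_lines("/data/in/File*-*.out.gz"):
#     process(line)
```

In a Spark-based Big Data job the analogous idea is to point the input at a wildcard path such as `File*-*.out.gz`, since Spark's file readers generally accept Hadoop-style glob patterns and decompress gzip transparently.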
  </channel>
</rss>

