Re: tFileInputDelimited for Big Data Spark cannot read multiple gz files - Qlik Community

Anonymous · ‎2019-05-30

Hi, I'm using Talend Big Data Studio (Enterprise edition) and need to read (extract) multiple gz files and then apply transformations to them.

In a standard DI job I used tFileUnarchive for this purpose, but it isn't available in Spark Big Data jobs.

I know that tFileInputDelimited for Big Data can read gz files by default, but I have yet to find a way to make it take multiple files as input.

My files are named in this format:

File1-00001.out.gz
File1-00002.out.gz
...
File1-00075.out.gz
File2-00001.out.gz
File2-00002.out.gz
...
File2-00075.out.gz
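
For readers who want to see the task itself spelled out, here is a minimal sketch in plain Python (not Talend) of what "read multiple gz files" amounts to: list every file matching the naming scheme above, then read each one as delimited text without unarchiving it first. The function names, the `File*-*.out.gz` wildcard, and the `;` field separator are assumptions made for illustration; they are not taken from the job in question.

```python
import csv
import glob
import gzip
import os


def list_gz_files(directory, pattern="File*-*.out.gz"):
    """Return the sorted paths in `directory` whose names match `pattern`.

    The default pattern is an assumption based on the file names in the post.
    """
    return sorted(glob.glob(os.path.join(directory, pattern)))


def read_delimited_gz(path, delimiter=";"):
    """Yield each row of a gzip-compressed delimited file as a list of fields.

    The file is decompressed on the fly, so no separate unarchive step is needed.
    The ";" separator is an assumed default, adjust it to match the real data.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle, delimiter=delimiter)
```

Calling `list_gz_files("/some/input/dir")` and looping over the result with `read_delimited_gz` would process the File1 and File2 series in name order; the directory path here is only a placeholder.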

Anonymous · ‎2019-05-31

What do you mean by "Spark Big Data"?

Anonymous · ‎2019-05-31

Sorry for sounding so vague, I'm new to Talend.

I'm talking about Big Data Batch jobs, which run on the Spark framework. Standard jobs have components like tFileList and tFileUnarchive, but Big Data Batch jobs don't.

Anonymous · ‎2019-05-31

Hi,

You will have to orchestrate this with a DI job and a BD job. Why don't you try passing each gz file as a parameter to a BD job, which then performs the remaining steps? The parent job, which sends one file at a time, can be a normal DI job.
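
The control flow of that suggestion can be sketched in a few lines of Python. This is only an illustration of the parent/child pattern, not Talend code: `run_parent_job` and `child_job` are hypothetical names, and `child_job` stands in for whatever launches the Big Data Batch job with one file path as its parameter. In Talend the same shape would typically be built with a file-listing component in the DI job feeding a child-job call, which is an assumption about the design rather than something spelled out above.

```python
from typing import Callable, Iterable, List, Tuple


def run_parent_job(
    files: Iterable[str],
    child_job: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Call `child_job` once per file and record each (path, exit code) pair.

    `child_job` is a hypothetical stand-in for the Big Data job; it receives
    one file path and returns 0 on success. Stopping at the first non-zero
    exit code is a design choice of this sketch, not a Talend requirement.
    """
    results = []
    for path in files:
        exit_code = child_job(path)
        results.append((path, exit_code))
        if exit_code != 0:
            break  # do not feed further files after a failure
    return results
```

Processing one file per child run keeps the Big Data job simple, at the cost of starting a separate run for every file.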

Warm Regards,
Nikhil Thampi

Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂
