cdhemant
Contributor

Merge multiple Parquet files into a single file or several files

Hello,

I have several thousand Parquet files of, say, 1 MB each, and I want to merge them into a single file or into several larger files:

  1. the first 200 files into file1.parquet,
  2. the next 200 files into file2.parquet,

and so on. I was looking for a component that does this, but I haven't found one.

Is there a way to do this? Custom Java libraries and Python scripts are available, but I was looking for a Talend component.

Thanks

cdhemant

3 Replies
Anonymous
Not applicable

Hi

You can use a tFileList to iterate over the files. Here is a demo job:

tJava: define a dynamic output file name.

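// Number of files tFileList_1 has iterated over so far.
int i=((Integer)globalMap.get("tFileList_1_NB_FILE"));

// Integer division: with a flush every 200 files this yields out20.parquet, out40.parquet, ...
// Use i/200 instead if you want out1.parquet, out2.parquet, ...
context.filename="out"+i/10+".parquet";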

tFileInputParquet_1: read the current parquet file, set the file path as:

((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))

tDBOutput_1: store the data into DB table.

Set the condition of the Run if connector to:

((Integer)globalMap.get("tFileList_1_NB_FILE"))%200==0

// Whenever 200 files have been read, read all the data from the DB table and write it to a new Parquet file.

 

tFileOutputParquet_1: set the file path as:

"D:/files/temp/output/"+context.filename

 

tDBRow: truncate the table.
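
To see how the file-name expression and the Run if condition interact, here is a small stand-alone Python sketch that only simulates the control flow of the job above. It does not touch Talend or Parquet. The function name simulate_job and its parameters are made up for illustration, and it assumes tFileList_1_NB_FILE counts the iterations starting from 1.

```python
def simulate_job(total_files, batch_size=200):
    """Simulate the demo job: return the flushed outputs and the leftover count."""
    flushed = []   # (output file name, number of source files it contains)
    pending = 0    # files loaded into the staging table since the last flush
    for i in range(1, total_files + 1):
        # Same expression as in tJava: "out" + i/10 + ".parquet" (integer division).
        filename = "out" + str(i // 10) + ".parquet"
        pending += 1
        # Same condition as on the Run if connector: NB_FILE % 200 == 0.
        if i % batch_size == 0:
            flushed.append((filename, pending))
            pending = 0  # corresponds to truncating the table
    return flushed, pending


if __name__ == "__main__":
    outputs, leftover = simulate_job(450)
    print(outputs)   # [('out20.parquet', 200), ('out40.parquet', 200)]
    print(leftover)  # 50
```

With 450 input files the last 50 never reach an output file, because the condition only fires on exact multiples of 200. If your file count is not a multiple of the batch size, the job would need one extra flush after the loop ends.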

 

Can you try it and let me know if you have any questions?

 

Regards

Shong

cdhemant
Contributor
Author

Thanks Shong.

 

This is definitely a solution to the problem; however, it adds a new infrastructure component, a database, which involves cost and maintenance.

 

I am trying to write a Python script that will create a single file.
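
A rough sketch of what such a script might look like, in case it helps anyone else. It assumes the third-party pyarrow package is installed and that all input files share the same schema. The function names, the "file" prefix and the directory arguments are placeholders of mine, not anything provided by Talend.

```python
from pathlib import Path


def plan_batches(paths, batch_size=200, prefix="file"):
    """Group the sorted paths into consecutive batches and name each output."""
    ordered = sorted(str(p) for p in paths)
    plan = []
    for start in range(0, len(ordered), batch_size):
        out_name = f"{prefix}{start // batch_size + 1}.parquet"
        plan.append((out_name, ordered[start:start + batch_size]))
    return plan


def merge_batch(input_paths, output_path):
    """Append every input file to one output file, one table at a time."""
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    # Assumption: every file has the same schema as the first one.
    schema = pq.ParquetFile(input_paths[0]).schema_arrow
    with pq.ParquetWriter(output_path, schema) as writer:
        for path in input_paths:
            writer.write_table(pq.read_table(path))


def merge_all(input_dir, output_dir, batch_size=200):
    """Merge all *.parquet files in input_dir into batched files in output_dir."""
    in_dir, out_dir = Path(input_dir), Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for out_name, batch in plan_batches(in_dir.glob("*.parquet"), batch_size):
        merge_batch(batch, str(out_dir / out_name))
```

Calling merge_all("input", "output") would produce file1.parquet, file2.parquet and so on, each holding up to 200 source files, with the final file taking whatever remains. Passing a batch_size larger than the file count gives a single output file. Only one input table is held in memory at a time, so no staging database is needed.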

 

Thanks

 

Anonymous
Not applicable

Yes, you can also store the data in a local file instead of a DB, but since you have a lot of files to process and the files are big, I'm afraid the performance will be poor.