
merge multiple parquet files to single or multiple files
Hello,
I have several thousand Parquet files of about 1 MB each and want to merge them into a single file or into a few larger files:
- Say, the first 200 files into file1.parquet,
- the next 200 files into file2.parquet,
and so on. I was looking for a component that does this but haven't found one.
Is there a way to do this? Custom Java libraries and Python scripts are available, but I am looking for a Talend component.
Thanks
cdhemant

Hi
You can use a tFileList to iterate over the files. Here is a demo job; the settings for each component are listed below.
tJava: define a dynamic output file name:
int i=((Integer)globalMap.get("tFileList_1_NB_FILE"));
context.filename="out"+i/10+".parquet";
tFileInputParquet_1: read the current Parquet file. Set the file path to:
((String)globalMap.get("tFileList_1_CURRENT_FILEPATH"))
tDBOutput_1: store the data in a DB table.
Set the condition of the RunIf connector to:
((Integer)globalMap.get("tFileList_1_NB_FILE"))%200==0
// After every 200 files, read all data from the DB table and write it to a new Parquet file.
tFileOutputParquet_1: set the file path to:
"D:/files/temp/output/"+context.filename
tDBRow: truncate the table so that the next batch starts empty.
Can you try it and let me know if you have any questions?
Regards
Shong

Thanks Shong.
This is definitely a solution to the problem, but it adds a new infrastructure component, a database, which brings cost and maintenance with it.
I am trying to write a Python script that will create a single file instead.
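A minimal sketch of such a script, assuming the pyarrow library is installed and that all input files share the same schema. The function names, the directory arguments, and the file1.parquet, file2.parquet naming are only illustrative; a batch size of 200 matches the grouping asked for above, and a batch size larger than the file count produces a single output file.

```python
from pathlib import Path


def batch_paths(paths, batch_size=200):
    """Sort the paths and split them into consecutive groups of batch_size.

    The last group may be smaller if the count is not a multiple of batch_size.
    """
    paths = sorted(paths)
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def merge_parquet_batches(input_dir, output_dir, batch_size=200):
    """Write file1.parquet, file2.parquet, ... each holding one batch of inputs.

    Assumption: input_dir and output_dir are different directories, so the
    merged files are never picked up as inputs on a later run.
    """
    # Imported here so batch_paths can be used without pyarrow installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    inputs = Path(input_dir).glob("*.parquet")

    written = []
    for n, batch in enumerate(batch_paths(inputs, batch_size), start=1):
        # Each batch is loaded fully into memory; fine for ~200 files of ~1 MB.
        tables = [pq.read_table(str(p)) for p in batch]
        target = output_dir / f"file{n}.parquet"
        # concat_tables expects matching schemas across the batch.
        pq.write_table(pa.concat_tables(tables), str(target))
        written.append(target)
    return written
```

Calling merge_parquet_batches with an input folder and an output folder returns the list of files it wrote. No database or intermediate storage is involved, since each batch is concatenated in memory before being written.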
Thanks

Yes, you can also store the data in a local file instead of a DB, but you have a lot of files to process and the files are big, so I'm afraid the performance would be poor.
