varadharaj
Contributor II

Limit Number of parquet files when writing to HDFS

I need to limit the number of parquet files when writing to HDFS.

Currently, the Compose task in the provisioning area is writing a large number of small files.

Kindly advise whether there is any way to write the parquet output as a small number of larger files instead.

Is there any configuration available to limit the number of files as above?

1 Solution

Accepted Solutions
TimGarrod
Employee

Hi - you can set this on a provisioning task by using the Spark settings. This can be applied to storage or provisioning tasks. The setting is a RUN-TIME setting (i.e. it doesn't require regeneration of code for storage tasks).

Highlight the task, click Settings, navigate to the Spark tab and enter:

attunity.compose.coalesce=N;

where N is a number (not the literal "N" 🙂 ) that specifies the number of files.

(In the example below, I've set the provisioning task to create 1 file per entity.)

[Screenshot: TimGarrod_0-1587141019700.png]
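In case the screenshot doesn't come through, the Spark-tab entry for that 1-file-per-entity example is simply:

attunity.compose.coalesce=1;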

By default, Spark typically produces 200 files for each process because execution is parallelised across partitions. When this setting is present, Compose applies the Spark DataFrame coalesce setting to all the tables in the task.
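To illustrate what that does under the hood, here is a rough PySpark sketch (illustrative only - Compose generates its own code; the table name and output path below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A shuffle stage defaults to spark.sql.shuffle.partitions (200) output partitions,
# which is why a single write can end up as ~200 small parquet files.
df = spark.table("my_entity")  # hypothetical source table

# coalesce(N) merges the output down to N partitions (and therefore N files)
# without a full shuffle; with attunity.compose.coalesce=1 the effect is roughly:
df.coalesce(1).write.mode("overwrite").parquet("/data/provisioning/my_entity")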

If you have multiple tables that require different file counts, you can define multiple provisioning tasks with subsets of tables (going to the same data lake storage bucket/filesystem/folder etc. and the same metastore database).

Hope that helps.
