varadharaj
Contributor II

How to include partition/bucket parameter when writing file to HDFS

We currently use Compose to transfer data from the storage layer to the provision layer (HDFS, in Parquet format).

By default, it uses the following command to write the files to HDFS:

<D_F>.write
  .mode("Overwrite")
  .format("PARQUET")
  .save("hdfs:///....")

We need to add a partition or bucket parameter so that the Parquet files are written to HDFS partitioned or bucketed by a chosen column.
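For reference, this is roughly what the requested write looks like in plain Spark, outside of Compose. It is a minimal sketch and not the script Compose generates: the sample data, the column names `region` and `id`, the output path, and the table name are all hypothetical. Note that Spark's `bucketBy` only works together with `saveAsTable`; a path-based `save` rejects it.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]") // local run for illustration only
      .appName("partitioned-write-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; "region" stands in for the partition column.
    val df = Seq((1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 30.0))
      .toDF("id", "region", "amount")

    // Partitioning: writes one sub-directory per distinct value (region=EU, region=US).
    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("region")
      .format("parquet")
      .save("/tmp/partitioned_sketch") // hypothetical path; an hdfs:// URI works the same way

    // Bucketing: requires saveAsTable, because the bucket spec is stored in the table metadata.
    df.write
      .mode(SaveMode.Overwrite)
      .bucketBy(4, "id") // 4 buckets, hashed on the hypothetical "id" column
      .sortBy("id")
      .format("parquet")
      .saveAsTable("bucketed_sketch") // hypothetical table name

    spark.stop()
  }
}
```

Partitioning suits low-cardinality columns that queries filter on, while bucketing suits high-cardinality join keys.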

Kindly advise if there is any way to do this.

1 Solution

Accepted Solutions
TimGarrod
Employee

Currently, Compose does not support specifying bucketing or partitioning for Spark projects. 

This is supported natively in Hive projects, and can be applied to Databricks projects by altering the DDL for Databricks.

If this is a feature you'd like to see in the product, I suggest creating an "Idea" in the Qlik Product Insight & Ideas section of the community. (The left menu of this page should show an icon for it, where you can submit requests.)

6 Replies
John_Park
Employee

I think you can modify the scripts generated by Compose to add additional parameters.

Which version of Compose4DL and which version of Hadoop are you using?

john.park | john.park@qlik.com
John_Park
Employee

Correction: you cannot modify the generated scripts.

Partitioning is not supported in Spark-based projects on HWX/EMR.

john.park | john.park@qlik.com
varadharaj
Contributor II
Author

So partitioning is not supported with Spark (Hortonworks).

Is there any way to bucket the HDFS files with the Spark option in Compose?

varadharaj
Contributor II
Author

Can someone confirm whether bucketing is supported in Spark-based projects?

jacobfrey121
Contributor

You can run a mock S3 server (several projects provide one, so pick whichever you like) and then point Spark at it by setting the fs.s3a.endpoint property.

The fs.s3a... properties are Hadoop properties, so you can set them directly in core-site.xml. To set them dynamically in your Spark context instead, prefix each property with spark.hadoop.

So to set the new endpoint in your test code:

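import org.apache.spark.sql.SparkSession

// Hadoop properties are passed through the session by prefixing them with "spark.hadoop."
val spark = SparkSession.builder
  .master("local")
  .appName("test suite")
  // Address the mock S3 server listens on
  .config("spark.hadoop.fs.s3a.endpoint", "localhost:9090")
  .getOrCreate()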
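Mock S3 servers often need a few more settings than the endpoint alone. The sketch below is one possible test setup, not a verified configuration for any particular mock server. It assumes the `hadoop-aws` module is on the classpath, that the server speaks plain HTTP on port 9090 with path-style URLs, that it accepts dummy credentials, and that a bucket named `test-bucket` (hypothetical) already exists on it.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object MockS3WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("test suite")
      // Assumed mock server address; the explicit http:// scheme avoids defaulting to HTTPS.
      .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9090")
      // Mock servers usually expect http://host:port/bucket/key rather than bucket.host.
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
      // Dummy credentials, assuming the mock server does not validate them.
      .config("spark.hadoop.fs.s3a.access.key", "dummy")
      .config("spark.hadoop.fs.s3a.secret.key", "dummy")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data and bucket; writes Parquet files partitioned by "region".
    val df = Seq((1, "EU"), (2, "US")).toDF("id", "region")
    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("region")
      .parquet("s3a://test-bucket/out")

    spark.stop()
  }
}
```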