How to include partition/bucket parameter when writing file to HDFS
Currently we are using Compose to transfer data from the storage layer to the provisioning layer (HDFS, in Parquet format).
By default it uses the following command to write the file to HDFS:
<D_F>.write
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///....")
We need to include a partition/bucket parameter based on some column, so that the Parquet files are written to HDFS partitioned or bucketed by that key.
Kindly advise if there is any way to do this.
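For context, outside of Compose the Spark DataFrameWriter itself supports both options, via partitionBy and bucketBy. A minimal sketch of what the generated write would need to look like — the column names, paths, and table name here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "region", "customer_id", the paths, and the table name are
// made-up examples, not names from the actual Compose project.
val spark = SparkSession.builder.appName("partition-example").getOrCreate()
val df = spark.read.format("parquet").load("hdfs:///source/path")

// Partitioned output: one subdirectory per distinct value of "region".
df.write
  .mode("overwrite")
  .partitionBy("region")
  .format("parquet")
  .save("hdfs:///target/partitioned")

// Bucketed output: note that bucketBy requires saveAsTable (a metastore
// table); it cannot be combined with a plain save() to a path.
df.write
  .mode("overwrite")
  .bucketBy(8, "customer_id")
  .sortBy("customer_id")
  .format("parquet")
  .saveAsTable("provision.bucketed_table")
```

The question, then, is whether Compose can be made to emit partitionBy/bucketBy in the script it generates.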
- Tags:
- compose
Accepted Solutions
Currently, Compose does not support specifying bucketing or partitioning for Spark projects.
This is supported natively in Hive projects, and can be applied to Databricks projects by simply altering the DDL for Databricks.
If this is a feature you'd like to see in the product, I suggest creating an "Idea" in the Qlik Product Insight & Ideas section of the community (you should see the icon in the left menu of this page, where you can put in requests).
I think you can modify the scripts generated by Compose to add additional parameters.
What version of Compose4DL and what version of Hadoop are you using?
Correction: you cannot modify the generated scripts.
Partitioning is not supported in Spark-based projects on HWX/EMR.
So partitioning is not supported with Spark (Hortonworks).
Is there any way to bucket the HDFS files with the Spark option in Compose?
Can someone confirm whether bucketing is supported with Spark-based projects?
You can run a mock S3 server (there are many projects that can do this; have a google and choose one you like) and then point Spark at it by setting the fs.s3a.endpoint property.
The fs.s3a.* properties are Hadoop properties, so you can set them directly in core-site.xml. If you want to set them dynamically in your Spark context, prefix each property with spark.hadoop.
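As a config fragment, the core-site.xml route would look something like this — the localhost:9090 endpoint is just a placeholder for wherever your mock server listens:

```xml
<!-- core-site.xml: point the S3A connector at a local mock S3 server.
     The endpoint value is an example; use your mock server's address. -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>localhost:9090</value>
</property>
```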
So to set the new endpoint in your test code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("test suite")
  .config("spark.hadoop.fs.s3a.endpoint", "localhost:9090")
  .getOrCreate()