Currently we are using Compose to transfer data from the storage layer to the provision layer (HDFS, in Parquet format).
By default it uses the command below to write the file to HDFS.
<D_F>.write
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///....")
We need to include a partition/bucket parameter based on some column, so that the Parquet files are written to HDFS according to the partition/bucket key.
Kindly advise if there is any way to do this.
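For reference, this is roughly the Spark DataFrameWriter usage we are hoping for; a sketch only, assuming <D_F> is an existing DataFrame and an active SparkSession, with hypothetical column and table names:

// Partitioned layout: one subdirectory per partition_col value.
<D_F>.write
  .mode("overwrite")
  .partitionBy("partition_col")
  .format("parquet")
  .save("hdfs:///....")

// Bucketing in Spark requires saveAsTable (it records bucket metadata in the metastore);
// a plain .save() to a path does not support .bucketBy.
<D_F>.write
  .mode("overwrite")
  .bucketBy(8, "bucket_col")
  .sortBy("bucket_col")
  .format("parquet")
  .saveAsTable("db.bucketed_table")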
Currently, Compose does not support specifying bucketing or partitioning for Spark projects.
This is supported natively in Hive projects, and it can be applied to Databricks projects by simply altering the generated DDL, as sketched below.
If this is a feature you'd like to see in the product, I suggest creating an "Idea" in the Qlik Product Insight & Ideas section of the community, where you can put in requests.
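For a Databricks project, the DDL edit mentioned above would look something like the following; this is a hedged sketch issued through spark.sql, not the DDL Compose actually generates, and the table, columns, and location are hypothetical:

spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id   BIGINT,
    amount     DOUBLE,
    order_date DATE
  )
  USING PARQUET
  PARTITIONED BY (order_date)
  LOCATION '/mnt/datalake/analytics/orders'
""")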
I think you can modify the scripts generated by Compose to add additional parameters.
What version of Compose4DL and what version of Hadoop are you using?
Correction: you cannot modify the generated scripts.
Partitioning is not supported in Spark-based projects with HWX/EMR.
So partitioning is not supported with Spark (Hortonworks).
Is there any way to bucket the HDFS files with the Spark option in Compose?
Can someone confirm whether bucketing is supported in Spark-based projects?
As noted above, Compose does not currently support specifying bucketing or partitioning for Spark projects.
This is supported natively in Hive projects, and it can be applied to Databricks projects by altering the generated DDL.
If this is a feature you'd like to see in the product, I suggest creating an "Idea" in the Qlik Product Insight & Ideas section of the community, where you can put in requests.
You can run a mock S3 server (there are many projects that do this; have a Google and choose one you like) and then point Spark at that server by setting the fs.s3a.endpoint property.
The fs.s3a.* properties are Hadoop properties, so you can set them directly in core-site.xml. If you want to set them dynamically in your Spark context, all of these properties are prefixed with spark.hadoop.
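For the static route, the corresponding core-site.xml entry would look something like this (the endpoint value is just an example for a locally running mock server):

<property>
  <name>fs.s3a.endpoint</name>
  <value>localhost:9090</value>
</property>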
So to set the new endpoint in your test code:
import org.apache.spark.sql.SparkSession

// Local SparkSession whose S3A client talks to the mock S3 endpoint.
val spark = SparkSession.builder
  .master("local")
  .appName("test suite")
  .config("spark.hadoop.fs.s3a.endpoint", "localhost:9090")
  .getOrCreate()