How to include partition/bucket parameter when writing file to HDFS
Currently we are using Compose to transfer data from the storage layer to the provisioning layer (HDFS, in Parquet format).
By default it uses the following command to write the file to HDFS:
<D_F>.write
  .mode("overwrite")
  .format("parquet")
  .save("hdfs:///....")
We need to include a partition/bucket parameter based on some column, so that the Parquet files are written to HDFS partitioned or bucketed by that key.
Kindly advise if there is any way to do this.
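For context, outside of Compose the Spark DataFrameWriter itself supports both options, via partitionBy and bucketBy. A minimal sketch of what the generated write would need to look like — the column names, paths, and table name here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "region", "customer_id", the paths, and the table name are
// made-up examples, not names from the actual Compose project.
val spark = SparkSession.builder.appName("partition-example").getOrCreate()
val df = spark.read.format("parquet").load("hdfs:///source/path")

// Partitioned output: one subdirectory per distinct value of "region".
df.write
  .mode("overwrite")
  .partitionBy("region")
  .format("parquet")
  .save("hdfs:///target/partitioned")

// Bucketed output: note that bucketBy requires saveAsTable (a metastore
// table); it cannot be combined with a plain save() to a path.
df.write
  .mode("overwrite")
  .bucketBy(8, "customer_id")
  .sortBy("customer_id")
  .format("parquet")
  .saveAsTable("provision.bucketed_table")
```

The question, then, is whether Compose can be made to emit partitionBy/bucketBy in the script it generates.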
- Tags:
- compose
Accepted Solutions
Currently, Compose does not support specifying bucketing or partitioning for Spark projects.
This is supported natively in Hive projects, and can be applied to Databricks projects by simply altering the DDL for Databricks.
If this is a feature you'd like to see in the product, I suggest creating an "Idea" in the Qlik Product Insight & Ideas section of the community (you should see the icon in the left menu of this page, where you can put in requests).
I think you can modify the scripts generated by Compose to add additional parameters.
What version of Compose4DL and what version of Hadoop are you using?
Correction: you cannot modify the generated scripts.
Partitioning is not supported in Spark-based projects on HWX/EMR.
So partitioning is not supported with Spark (Hortonworks).
Is there any way to bucket the HDFS files with the Spark option in Compose?
Can someone confirm whether bucketing is supported with Spark-based projects?
You can run a mock S3 server (there are many projects that can do this; have a google and choose one you like) and then point Spark at it by setting the fs.s3a.endpoint property.
The fs.s3a.* properties are Hadoop properties, so you can set them directly in core-site.xml. If you want to set them dynamically in your Spark context, prefix each property with spark.hadoop.
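As a config fragment, the core-site.xml route would look something like this — the localhost:9090 endpoint is just a placeholder for wherever your mock server listens:

```xml
<!-- core-site.xml: point the S3A connector at a local mock S3 server.
     The endpoint value is an example; use your mock server's address. -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>localhost:9090</value>
</property>
```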
So to set the new endpoint in your test code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("test suite")
  .config("spark.hadoop.fs.s3a.endpoint", "localhost:9090")
  .getOrCreate()