Qlik Replicate parquet output format for S3 endpoint
Our Data Analytics, Data Science, and AI Factory teams all work with Parquet files as their preferred output format. We are a new Data Integrations team trying to automate unified data pipelines for these teams while also building a persistence layer. Without the ability to deliver data to these teams in their preferred, industry-standard format directly from Qlik Replicate, we will not be able to use the product and will be forced to find alternate technology. JSON and CSV are fine for some things, but the lack of direct Parquet output is a blocker for delivering data to our most important customers. Is Parquet output for S3 endpoints (and others) on the near-term roadmap?
This is actually something we've been discussing with Qlik as well. The feedback we got, though, was that the process flow they've settled on is to push data into a write-optimised format and then process the deltas into Parquet using Compose.
From your side, what are the issues with Replicate ---> Compose ---> S3-Parquet as opposed to being able to do Replicate ---> S3-Parquet directly? Is it a performance-related concern?
Thank you for the suggestion. We would like to get feedback from others as well and will consider this for a future release. We will also need to consider the performance aspects of having Replicate generate Parquet files. Have you considered Compose as a solution here?
@Nathan1, the orchestration components needed to use Compose for this are a little outside the scope of what our team does as strictly an Integration and Automation team. We don't manage what consumers do downstream, so we don't have components like Databricks or other EMR solutions in our workspaces, and at this point we have no other need to manage them. I completely understand the perspective of going to a write-optimized format, though.