Ray0801
Creator

Project using Apache Spark / Hive / Databricks

Can someone explain what role Spark / Hive / Databricks plays in moving my data from the landing zone to a storage zone such as HDFS or Google Cloud Storage, and why it is necessary?


6 Replies
TimGarrod
Employee

Hi again @Ray0801 :) As mentioned in my reply to your other question on Hadoop with C4DL (Compose for Data Lakes), Compose requires a processing engine to process data at scale and to produce "big data formatted" data lake assets such as Parquet files.

In the Qlik Data Integration architecture for data lakes, Replicate captures changes and delivers the change information to your data lake (S3, ADLS, Google Storage, or HDFS). Since these cloud object stores and HDFS are essentially append-only file systems that cannot update data in place (at least not at any reasonable speed), Replicate provides the change information instead: INSERT / UPDATE / DELETE transactions are written to the object store but not applied. (This is the Store Changes setting in Replicate; see the docs for more details.)

The exception to the above is Databricks, where Replicate 7.0 can apply changes directly to Databricks Delta tables (in the same fashion it applies them to Snowflake).

We could certainly leave the data in that state, but then you would have to work out how to apply the changes yourself, which often means writing Spark code to manage transactions in the lake. This is what Compose for Data Lakes is for. C4DL provides different project types - Hive, Spark, or Databricks - and these project types give you options for data processing and data architecture (e.g. Hive ACID to update data in place vs. Spark to deliver Parquet files with overwrite characteristics). Since Compose for Data Lakes leverages Hive or Spark for processing, it requires cluster resources to process the data.
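To make the "you'd have to apply the changes yourself" point concrete, here is a minimal, hand-written Spark sketch of that kind of merge logic. This is not Compose-generated code; the paths and the id / op_type / change_seq columns are hypothetical stand-ins for whatever your change tables actually contain.

# Minimal sketch, assuming hypothetical paths and columns (id, op_type, change_seq).
# Hand-written illustration of change-apply logic, not Compose-generated code.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("apply-changes-sketch").getOrCreate()

current = spark.read.parquet("s3://lake/storage/customers/")          # current state
changes = spark.read.parquet("s3://lake/landing/customers_changes/")  # stored I/U/D rows

# Keep only the latest change per business key.
latest = (changes
          .withColumn("rn", F.row_number().over(
              Window.partitionBy("id").orderBy(F.col("change_seq").desc())))
          .filter("rn = 1")
          .drop("rn"))

# Drop every keyed row that changed, then add back the latest non-deleted image.
merged = (current.join(latest.select("id"), on="id", how="left_anti")
          .unionByName(latest.filter("op_type != 'D'").drop("op_type", "change_seq")))

# Object storage has no in-place update, so the result is rewritten as a new data set.
merged.write.mode("overwrite").parquet("s3://lake/storage/customers_v2/")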

Hope that helps - but if more clarity is needed let me know.

 

Ray0801
Creator
Author

Hi Tim,

Thank You @TimGarrod  

Ray0801
Creator
Author

Hi @TimGarrod, can you explain in which scenarios each project type should be used?

TimGarrod
Employee

 

Hi @Ray0801 - absolutely.

Let's start with the easy one:

  • Databricks projects
    • These are supported for either Azure- or AWS-based Databricks deployments.
    • Replicate delivers data to the storage layer (S3/ADLS), and Compose then applies the data to Delta tables within Databricks.
    • While Replicate also supports Delta (on Azure for now) as a direct endpoint, if you don't need to continuously update the data (and therefore don't want to keep a cluster running for that compute with Replicate), Compose lets you store those transactions and apply them in a more batch-oriented fashion.
    • There are additional features coming for Compose around Databricks that I can't talk about here, but you will see more from us as we continue to invest in that partnership.
    • These projects use a two-tier architecture, as seen below (Landing -> Provisioning/Consumption).
    • Since Compose leverages Delta Lake features in Databricks, the data is updated in place (a rough illustration follows the diagram below).

[Image: TimGarrod_0-1606922859822.png - two-tier architecture (Landing -> Provisioning/Consumption)]
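Purely for illustration, this is roughly the kind of statement used when stored changes are applied to a Delta table in batch with Spark SQL. The table and column names (customers, latest_changes, id, op_type) are hypothetical, and Compose generates and manages its own apply logic.

# Rough illustration only -- applying stored changes to a Delta table with a Spark SQL
# MERGE (requires a Databricks / Delta Lake-enabled Spark session). Table and column
# names are hypothetical.
latest.createOrReplaceTempView("latest_changes")   # "latest" from the sketch above

spark.sql("""
    MERGE INTO customers AS t
    USING latest_changes AS s
      ON t.id = s.id
    WHEN MATCHED AND s.op_type = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op_type <> 'D' THEN INSERT *
""")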

 

  • Hive projects
    • As you would expect, these leverage Hive as the processing engine on supported Hadoop / cloud Hadoop environments (EMR, HDInsight, Dataproc, etc.).
    • Hive projects support ODS and HDS data sets (Type 2 history) and are useful when you want the ability to update data in place throughout the day.
    • To do this, Compose leverages Hive ACID transaction features (which in turn require ORC data sets / files in the lake storage layer); see the sketch after this list.
    • These projects also use a two-tier data architecture (Landing -> Provisioning/Consumption).
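As a hedged illustration of why Hive projects depend on ORC and ACID, the sketch below creates a transactional ORC table and updates a row in place. Driving Hive through the PyHive client is purely an assumption for this example (Compose issues its own HiveQL); connection details, table and column names are made up, and older Hive versions may additionally require the table to be bucketed.

# Illustrative HiveQL run through PyHive (an assumption for this example): in-place
# UPDATE is only possible because the table is ORC and transactional (Hive ACID).
# Connection details, table and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS customers_ods (
        id   INT,
        name STRING,
        city STRING
    )
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")

# The in-place update that a Hive project relies on throughout the day.
cur.execute("UPDATE customers_ods SET city = 'Austin' WHERE id = 42")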

 

 

  • Spark projects
    • These leverage Spark for data processing (and can also use Hive ACID for an in-place ODS update).
    • This project type supports Hadoop-type environments (not Databricks) and requires a lightweight orchestration agent running on the cluster to submit the Spark jobs.
    • Spark projects deploy a three-tier architecture: Landing -> Storage -> Provisioning/Consumption.
    • This is because, when using only Spark, there is no standard "in-place update" functionality.
    • Therefore, when generating an HDS or a current or point-in-time snapshot, Spark needs to perform an overwrite. To ensure this can be handled at any point in time, the Storage layer keeps the history of your source transactions as Parquet files (see the sketch after the diagram below).
    • The provisioned data sets can be Parquet, Avro, or ORC, and either HDS, ODS, or snapshots of data.

                 

[Image: TimGarrod_1-1606922960894.png - three-tier architecture (Landing -> Storage -> Provisioning/Consumption)]
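To show why the Storage layer keeps full history as Parquet, here is a minimal sketch of rebuilding a current snapshot from an HDS with an overwrite. The effective_from and is_deleted columns are hypothetical HDS-style fields, not Compose's actual column names.

# Minimal sketch of the Spark-project pattern: history lives in the Storage layer as
# Parquet, and the provisioned snapshot is rebuilt with an overwrite. The
# effective_from / is_deleted columns are hypothetical HDS-style fields.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("snapshot-sketch").getOrCreate()

hds = spark.read.parquet("hdfs:///lake/storage/customers_hds/")   # full Type 2 history

# Latest non-deleted version of each key = the current snapshot.
w = Window.partitionBy("id").orderBy(F.col("effective_from").desc())
current = (hds.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1 AND is_deleted = false")
              .drop("rn"))

# Plain Spark has no in-place update, so the provisioned data set is overwritten.
current.write.mode("overwrite").parquet("hdfs:///lake/provision/customers_current/")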

 

 

 

From a use case perspective, the general / quick thought process should be:

  • Is Databricks a core component of my architecture? If yes, use the Databricks projects.
  • If not, and you are using a different type of "Hadoop" distro:
    • Consider data refresh requirements: in-place updates (Hive) versus complete overwrites (Spark).
    • Consider data format requirements: Parquet, ORC.


Hope this helps!

Ray0801
Creator
Author

@TimGarrod  Thank You so much!

Donald147
Contributor





Access to Hive UDFs, and the ability to read data from Hive tables, are also available; you do not need an existing Hive setup to use these features.