Can someone explain the role Spark/Hive/Databricks plays in moving my data from the landing zone to a storage zone such as HDFS or Google Storage, and why it is necessary?
Hi again @Ray0801 :)... As mentioned in my reply to your other question on Hadoop with C4DL (Compose for Data Lakes), Compose requires a processing engine in order to process data at scale and to produce "big data formatted" data lake assets such as parquet files.
In the Qlik Data Integration architecture for data lakes, Replicate captures changes and delivers the change information to your data lake (S3 / ADLS / Google Storage / HDFS). Since these cloud object stores and HDFS are essentially append-only file systems that don't provide the ability to update data in place (at least not at any reasonable speed), Replicate provides the change info: INSERT / UPDATE / DELETE transactions are written to the object store, but not applied. (This is the Store Changes setting in Replicate - see the docs for more details.)
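To make the "store changes" pattern concrete, here is a minimal PySpark sketch of inspecting the change records Replicate has stored but not applied. The path and the header__change_oper / header__change_seq column names are illustrative assumptions, not necessarily your exact change-table schema:

```python
# Minimal sketch: read the change records appended to the landing zone.
# Path and header column names are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-changes").getOrCreate()

# Change records are appended as files; parquet format is assumed here.
changes = spark.read.parquet("s3://my-lake/landing/orders__ct/")

# Each row carries the DML operation type alongside the data columns,
# e.g. I = insert, U = update, D = delete.
changes.groupBy("header__change_oper").count().show()
```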
The exception to the above is Databricks, where Replicate 7.0 can apply changes directly to Databricks Delta tables (in the same fashion it applies changes to Snowflake).
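For context, what makes that direct apply possible is Delta Lake's support for in-place merges. Here is a hedged Spark SQL sketch of the kind of MERGE that applies change records to a Delta table - the table and column names are hypothetical, and this illustrates the Delta capability, not Replicate's internal statement:

```python
# Illustrative Delta Lake MERGE (hypothetical table/column names).
# Applies the stored change records to the target table in place.
spark.sql("""
    MERGE INTO lake.orders AS tgt
    USING staged_changes AS src
        ON tgt.order_id = src.order_id
    WHEN MATCHED AND src.header__change_oper = 'D' THEN DELETE
    WHEN MATCHED AND src.header__change_oper = 'U' THEN UPDATE SET *
    WHEN NOT MATCHED AND src.header__change_oper = 'I' THEN INSERT *
""")
```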
We could certainly leave the data in that state, but then you'd have to determine how to apply the changes yourself - which often means writing Spark code to manage transactions in the lake (a sketch of what that looks like follows below). That is what Compose for Data Lakes is for. C4DL provides different project types - Hive, Spark, or Databricks - and these project types offer different options for data processing and data architecture (e.g. Hive ACID to update data in place vs. Spark to deliver parquet files with overwrite semantics). Since Compose for Data Lakes leverages Hive or Spark for processing, it requires cluster resources to process the data.
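To give a feel for that hand-rolled Spark code, here is a minimal sketch under the same assumptions as above (hypothetical paths, order_id as the business key, header__ columns on the change records): deduplicate the changes to the latest image per key, fold them into the base parquet data, and rewrite the result.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("apply-changes").getOrCreate()

base = spark.read.parquet("s3://my-lake/storage/orders/")        # current state
changes = spark.read.parquet("s3://my-lake/landing/orders__ct/") # stored changes

# Keep only the latest change record per business key (hypothetical names).
latest = (changes
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("order_id")
              .orderBy(F.col("header__change_seq").desc())))
    .filter("rn = 1"))

# Keys whose final image is an insert or update survive; deletes drop out.
upserts = latest.filter("header__change_oper IN ('I', 'U')")

# Fold into the base: remove every changed key, then append the new images.
merged = (base
    .join(latest.select("order_id"), "order_id", "left_anti")
    .unionByName(upserts.select(base.columns)))

# Object stores can't update in place, so the result is rewritten wholesale
# (to a new path - rewriting the path you read from is unsafe in Spark).
merged.write.mode("overwrite").parquet("s3://my-lake/storage/orders_v2/")
```

This is exactly the kind of bookkeeping (key handling, ordering, delete logic, full rewrites) that C4DL generates and manages for you.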
Hope that helps - but if more clarity is needed let me know.
Hi Tim,
Thank you, @TimGarrod.
Hi @TimGarrod, can you explain in which scenarios each project type should be used?
Hi @Ray0801 - absolutely.
Let's start with the easy one -
From a use case perspective - the general / quick thought process should be -
Hope this helps!
@TimGarrod Thank You so much!