Data Lake and Primary key constraints

Balax · ‎2022-12-08

Data Catalyst lets you import or specify key constraints in the properties of an entity. When it is installed on a Hadoop edge node in AWS, what purpose or benefit, other than data lineage, would those constraints serve? There are, after all, no PKs in Hadoop as such.

Is there any performance benefit when preparing data?

Will this enforce referential integrity?

I couldn't find a straight answer elsewhere, so I thought I'd bother the forum with this.

bagga3 · ‎2023-09-07

One reason such constraints are usually not enforced in the warehouse is that they are already supposed to be enforced in the OLTP system, which is the source of OLAP data most of the time, so adding the check again in the warehouse would introduce unnecessary overhead.

Another reason I can think of is that, to reduce pipeline execution time, data is not always loaded into warehouse tables synchronously. A check that the parent record exists for the FK column in the table currently being loaded could therefore fail simply because the parent table has not been loaded yet (again, assuming the OLTP source already has the correct data).
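To illustrate the load-order point, here is a minimal sketch in plain Python. It is not Qlik Catalog functionality; the helper `find_orphans` and the `customers` / `orders` sample rows are hypothetical names made up for the example. It shows why an FK check run while the parent table is still empty flags every child row, while the same check run as a separate step after both loads only flags genuine orphans.

```python
def find_orphans(child_rows, parent_rows, fk, pk):
    """Return child rows whose fk value has no matching parent pk value."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


# Hypothetical sample data: order 11 points at a customer that does not exist.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

# Check run while the parent table has not been loaded yet:
# every child row looks like a violation.
early = find_orphans(orders, [], fk="customer_id", pk="customer_id")

# Same check run after both loads have finished:
# only the genuinely orphaned row is reported.
late = find_orphans(orders, customers, fk="customer_id", pk="customer_id")

print(len(early), "rows flagged before the parent load")
print(len(late), "rows flagged after both loads")
```

If you do want some assurance in the lake, a deferred check like this, run once the pipeline has finished, avoids both the per-row overhead and the false failures caused by load order.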


Qlik Catalog