Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi,
I used the SCD component for Postgres to implement a slowly changing dimension and though it works, it only seems to run efficiently without any Java memory heap errors on small data sets. My dimension is roughly 30M records and when I try to run the job on a 10M size dimension with 50 incoming type 2 changes, the job will take hours to run only to eventually give a Java memory heap error.
So, my question is, is this SCD component only meant for small tables and just a quick convenient way to implement it for this specific purpose or is this supposed to work for big data sets as well? If not, is the way to go a custom made solution consisting of lookup and conditional splits for type 1 and 2 fields with more efficient bulk updates instead of row by row which the SCD component perhaps does?
Roland
Thanks. I had initially look at the code where the update and insert statements strings were created, but this is definitely better. Apparently for this simple example, it created the following insert statements for the type 2 change (see below).
The first query I can relate to more as the product code is being compared between the two tables, but then there is the condition
d."valid_to_dt" = '2020-06-24 09:45:06.601' which is applied to the existing record in the dimension table, which is odd since for an existing record I've put the valid_to_dt by default to 2199 or so and therefore this query does not result in anything. The second query is also weird since there is a where clause of the source key being null.
INSERT INTO "test"."dim_acnt_scd"("acnt_id", "open_dt", "prod_code", "valid_from_dt", "valid_to_dt")
SELECT t."acnt_id", t."open_dt", t."prod_code", '2020-06-24 09:45:06.601', NULL
FROM "test"."stg_acnt" t, "test"."dim_acnt_scd" d
WHERE d."acnt_id" = t."acnt_id" AND
( COALESCE(d."prod_code"<> t."prod_code", (d."prod_code" is null and not t."prod_code" is null) or (not d."prod_code" is null and t."prod_code" is null)))
AND d."valid_to_dt" = '2020-06-24 09:45:06.601'
INSERT INTO "test"."dim_acnt_scd"("acnt_id", "open_dt", "prod_code", "valid_from_dt", "valid_to_dt")
SELECT t."acnt_id", t."open_dt", t."prod_code", '2020-06-24 09:45:06.601', NULL
FROM "test"."stg_acnt" t
LEFT JOIN "test"."dim_acnt_scd" d
ON d."acnt_id" = t."acnt_id"
WHERE ( d."acnt_id" IS NULL)
Okay, I now that the insert statement is preceded by an update statement that sets the valid_to_date.
UPDATE "test"."dim_acnt_scd" d
SET "valid_to_dt" = '2020-06-24 09:45:06.601'
FROM "test"."stg_acnt" t
WHERE d."acnt_id" = t."acnt_id"
AND ( COALESCE(d."prod_code"<> t."prod_code", (d."prod_code" is null and not t."prod_code" is null) or (not d."prod_code" is null and t."prod_code" is NULL)))
AND d."valid_to_dt" IS NULL
It also requires the existing valid_to_dt to be null instead of 2199.
The simple example is working now after I updated the valid_to_dt field to null. Great...now the other one