Hi,
I used the SCD component for Postgres to implement a slowly changing dimension, and although it works, it only seems to run without Java heap space errors on small data sets. My full dimension is roughly 30M records; when I try to run the job against a 10M-record dimension with 50 incoming type 2 changes, the job takes hours and eventually fails with a Java heap space error.
So my question is: is this SCD component only meant for small tables, as a quick and convenient way to implement the pattern, or is it supposed to work on large data sets as well? If not, is the way to go a custom-built solution consisting of a lookup and conditional splits for the type 1 and type 2 fields, with more efficient bulk updates instead of the row-by-row processing the SCD component presumably does?
Roland
In my personal opinion, yes - the SCD components are really only usable for relatively small tables with a small number of changes.
The SCDELT component, however, does all the work on the target database with just four generated queries. The one issue with the SCDELT component is that it requires both tables (staging and target) to be in the same schema, which is not always convenient, but that can be managed by creating and dropping the staging table as part of the job.
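Just to give a rough idea of what those four queries mean in practice, a set-based SCD load usually boils down to something like the statements below. Table and column names (stg_dim, dim, business_key, type2_col, type1_col, valid_from_dt, valid_to_dt) are made up for illustration here - they are not what the component actually generates:
-- 1. Close the current version of rows whose type 2 attribute changed
UPDATE dim d
SET    valid_to_dt = now()
FROM   stg_dim s
WHERE  d.business_key = s.business_key
AND    d.valid_to_dt  = '2199-12-31 23:59:00'
AND    d.type2_col   <> s.type2_col;

-- 2. Insert a new current version for changed rows (and for brand-new keys)
INSERT INTO dim (business_key, type2_col, type1_col, valid_from_dt, valid_to_dt)
SELECT s.business_key, s.type2_col, s.type1_col, now(), '2199-12-31 23:59:00'
FROM   stg_dim s
LEFT JOIN dim d
       ON d.business_key = s.business_key
      AND d.valid_to_dt  = '2199-12-31 23:59:00'
WHERE  d.business_key IS NULL;

-- 3. Overwrite type 1 attributes in place on the current version
UPDATE dim d
SET    type1_col = s.type1_col
FROM   stg_dim s
WHERE  d.business_key = s.business_key
AND    d.valid_to_dt  = '2199-12-31 23:59:00'
AND    d.type1_col   <> s.type1_col;
Because each of these works on whole sets of rows inside the database, the size of the dimension matters much less than with row-by-row processing in the job's JVM.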
Hi,
Thanks for the reply, I'll look into that. The requirement to have both the staging and target tables in one schema is a strange one; I'll have to accommodate it, as I'm currently using a persistent staging area in a different schema.
Roland
Possibly it is a bug - when I try to use a schema-qualified table name, the job returns an error.
You can test it yourself.
Okay, I'll try something out.
Roland
Another condition I noticed is that all column names of the source table need to be the same as those of the target dimension. I initially had a tMap component between the source and target that renamed the columns and changed some of the data types, but I suppose I'll first have to create a separate intermediate staging table for that.
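Something along these lines might work as that intermediate step (purely illustrative names; the point is that the staging table takes over the renaming and casting the tMap used to do):
DROP TABLE IF EXISTS dwh.stg_dim_acnt;
CREATE TABLE dwh.stg_dim_acnt AS
SELECT a.account_number::varchar(50)              AS acnt_id,        -- renamed to match the dimension
       a.product::varchar(10)                     AS prod_code,
       to_char(a.opening_date, 'YYYYMMDD')::integer AS open_dt,      -- cast to the dimension's integer format
       CURRENT_TIMESTAMP                          AS valid_from_dt,
       TIMESTAMP '2199-12-31 23:59:00'            AS valid_to_dt
FROM   source_schema.accounts a;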
Hi, I've looked into the ELT solution, but I don't understand how it works.
I created a separate simplified job to get a better understanding of the mechanics.
It consists of an incoming stg_acnt table (see screenshot) with 1 record and a dim_acnt_scd table with 2 records (see screenshot).
Below are the create and insert statements as well.
CREATE TABLE test.stg_acnt
(acnt_id VARCHAR (50)
, prod_code VARCHAR (10)
, open_dt integer
, valid_from_dt TIMESTAMP
, valid_to_dt TIMESTAMP
);
INSERT INTO test.stg_acnt VALUES ('A511', 'BBB', 20200623, CURRENT_DATE, '2199-12-31 23:59:00');
COMMIT;
CREATE TABLE test.dim_acnt_scd
( dim_acnt_id SERIAL
, acnt_id VARCHAR (50)
, prod_code VARCHAR (10)
, open_dt integer
, valid_from_dt TIMESTAMP
, valid_to_dt TIMESTAMP
);
-- let SERIAL assign the ids so the sequence stays in sync with the data
INSERT INTO test.dim_acnt_scd (acnt_id, prod_code, open_dt, valid_from_dt, valid_to_dt) VALUES ('A511', 'AAA', 20200602, '2020-06-02 00:00:00', '2199-12-31 23:59:00');
INSERT INTO test.dim_acnt_scd (acnt_id, prod_code, open_dt, valid_from_dt, valid_to_dt) VALUES ('B777', 'AAA', 20170515, '2017-05-15 00:00:00', '2199-12-31 23:59:00');
COMMIT;
I created the job as follows (see screenshots): a Postgres DB input component reading the staging table, connected to the SCD ELT component with these settings:
source table: "stg_acnt"
table: "dim_acnt_scd"
surrogate key: dim_acnt_id
creation: auto increment
Source key: acnt_id
SCD type 2 field: prod_code
SCD type 1 field: open_dt
start date: valid_from_dt
end date: valid_to_dt
Based on account id 'A511' and the change in product code, which I classified as SCD type 2, I expected one record to be added to the dim_acnt_scd table, but that did not happen.
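What I expected after the run was roughly the following (this is my expectation, not actual output; I'm also not sure whether the type 1 field open_dt would be rewritten on the closed historical row as well, so I've only shown it on the new version):
SELECT * FROM test.dim_acnt_scd ORDER BY acnt_id, valid_from_dt;

-- dim_acnt_id | acnt_id | prod_code | open_dt  | valid_from_dt       | valid_to_dt
-- 1           | A511    | AAA       | 20200602 | 2020-06-02 00:00:00 | <load date>           (old version, closed)
-- 3           | A511    | BBB       | 20200623 | <load date>         | 2199-12-31 23:59:00   (new current version)
-- 2           | B777    | AAA       | 20170515 | 2017-05-15 00:00:00 | 2199-12-31 23:59:00   (unchanged)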
Roland
The world is not ideal 🙂
To better understand how it works the Talend way, you can enable statement logging on PostgreSQL and check the actual queries Talend generates for you. You will see that there really is an inconsistency and incompatibility of approaches and technologies between the ETL and ELT components. This is why people often prefer to write their own SQL code: load everything into a staging table, then just run 3-4 queries.
Regards, Vlad
It certainly isn't!
Where do I enable statement logging? I don't see those options under the Basic or Debug Run tabs.
It is a PostgreSQL option - you need to enable it on the PostgreSQL server, in the logging section of postgresql.conf:
logging_collector = on
log_statement = 'all'    # none, ddl, mod, all
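If editing postgresql.conf isn't convenient, a superuser can usually switch statement logging on from a SQL session instead (logging_collector still needs a server restart, but log_statement takes effect on a config reload):
ALTER SYSTEM SET log_statement = 'all';   -- none, ddl, mod, all
SELECT pg_reload_conf();                  -- re-read the configuration without a restart
-- then tail the server log while the Talend job runs to see the generated SQL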