Anonymous
Not applicable

Slowly changing dimension - PostgreSQL - scalability issue

Hi,

I used the SCD component for Postgres to implement a slowly changing dimension, and while it works, it only runs efficiently, without Java heap space errors, on small data sets. My dimension is roughly 30M records, and when I try to run the job on a 10M-record dimension with 50 incoming type 2 changes, the job takes hours only to eventually fail with a Java heap space error.

So my question is: is this SCD component only meant for small tables, as a quick and convenient way to implement an SCD, or is it supposed to work on big data sets as well? If not, is the way to go a custom-built solution consisting of a lookup and conditional splits for the type 1 and type 2 fields, with more efficient bulk updates instead of the row-by-row processing the SCD component presumably does?

 

Roland


13 Replies
vapukov
Master II

In my personal opinion, yes: the SCD components can only be used for relatively small tables with a small number of changes.

 

 

But SCDELT does all the work on the target database with just 4 generated queries. The only issue with the SCDELT component is that it requires both tables (staging and target) to be in the same schema. That is not always convenient, but it can be managed by creating and dropping the staging table, as sketched below.
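For example, a minimal sketch of that workaround (the schema and table names here are made up, adapt them to your own):

-- create a throw-away copy of the staging data in the target schema
CREATE TABLE dw.stg_dim_acnt AS
SELECT * FROM staging.stg_dim_acnt;

-- ... run the SCDELT job against dw.stg_dim_acnt and dw.dim_acnt ...

-- drop the temporary copy afterwards
DROP TABLE dw.stg_dim_acnt;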

Anonymous
Not applicable
Author

Hi,

Thanks for the reply. I'll look into that. The requirement of having both the staging and target tables in one schema is a strange one. I'll have to work around it, as I'm currently using a persistent staging area in a different schema.

 

Roland

vapukov
Master II

Possibly it is a bug; when I try to use a table name with a schema, the job returns an error.

You can test it.

Anonymous
Not applicable
Author

Okay, I'll try something out.

Roland

Anonymous
Not applicable
Author

I guess another condition I noticed is that all column names of the source table need to be the same as in the target dimension. I initially had a tMap component between the source and target that renamed the columns and changed some of the data types, but I suppose I'll first have to create a separate intermediate staging table for that.

Anonymous
Not applicable
Author

Hi, I've looked into the ELT solution, but I don't understand how it works.

I created a separate simplified job to get a better understanding of the mechanics.


It consists of an incoming stg_acnt table (see screenshot) with 1 record and a dim_acnt_scd table with 2 records (screenshot).
Below are the create and insert statements as well.

 

CREATE TABLE test.stg_acnt
(acnt_id VARCHAR (50)
, prod_code VARCHAR (10)
, open_dt integer
, valid_from_dt TIMESTAMP
, valid_to_dt TIMESTAMP
);

INSERT INTO test.stg_acnt VALUES ('A511', 'BBB', 20200623, CURRENT_DATE, '2199-12-31 23:59:00');
COMMIT;

CREATE TABLE test.dim_acnt_scd
( dim_acnt_id SERIAL
, acnt_id VARCHAR (50)
, prod_code VARCHAR (10)
, open_dt integer
, valid_from_dt TIMESTAMP
, valid_to_dt TIMESTAMP
);


INSERT INTO test.dim_acnt_scd VALUES (1, 'A511', 'AAA', 20200602, '2020-06-02 00:00:00', '2199-12-31 23:59:00');
INSERT INTO test.dim_acnt_scd VALUES (2, 'B777', 'AAA', 20170515, '2017-05-15 00:00:00', '2199-12-31 23:59:00');
COMMIT;


I created the job as follows (see screenshots): a Postgres DB input component that reads the staging table, connected to the SCD ELT component.


source table: "stg_acnt"
table: "dim_acnt_scd"
surrogate key: dim_acnt_id
creation: auto increment
Source key: acnt_id
SCD type 2 field: prod_code
SCD type 1 field: open_dt
start date: valid_from_dt
end date: valid_to_dt

 

Based on account id 'A511' and a change in product code, which I classified as SCD type 2, I expected one record to be added to the dim_acnt_scd table, which did not happen.
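For reference, this is roughly the state I expected in dim_acnt_scd afterwards (a hand-written check query based on my example data, not anything the component generates):

-- what I expected for account 'A511' after the run:
-- the old 'AAA' row closed off and a new current 'BBB' row added
SELECT acnt_id, prod_code, valid_from_dt, valid_to_dt
FROM test.dim_acnt_scd
WHERE acnt_id = 'A511'
ORDER BY valid_from_dt;

-- expected result:
-- 'A511' | 'AAA' | 2020-06-02 00:00:00 | <run date>
-- 'A511' | 'BBB' | <run date>          | 2199-12-31 23:59:00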

 

Roland


Attachments: tDBInput.png, tDBSCDELT_1 part1.png, tDBSCDELT_1 part2.png, test_dim_acnt_scd.png, test_stg_acnt.png
vapukov
Master II

The world is not ideal 🙂

 

To better understand how it works the Talend way, you can enable statement logging on PostgreSQL and check the actual queries Talend generates for you. You will see that:

  • the valid_from and valid_to columns are located and managed in the target table
  • Talend uses the current timestamp as valid_from and NULL as valid_to

There really is an inconsistency and incompatibility of approaches and technologies between the ETL and ELT components. This is why people often prefer to write their own SQL code: load everything into a staging table and then just run 3-4 queries.
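As a rough illustration only (not the exact statements Talend generates), the hand-written variant usually boils down to something like this, reusing the table names from your example and assuming valid_to_dt = '2199-12-31 23:59:00' marks the current row:

-- 1) insert a new current version for every member whose SCD type 2 field changed
INSERT INTO test.dim_acnt_scd (acnt_id, prod_code, open_dt, valid_from_dt, valid_to_dt)
SELECT s.acnt_id, s.prod_code, s.open_dt, s.valid_from_dt, '2199-12-31 23:59:00'
FROM test.stg_acnt s
JOIN test.dim_acnt_scd d
  ON d.acnt_id = s.acnt_id
 AND d.valid_to_dt = '2199-12-31 23:59:00'
 AND d.prod_code <> s.prod_code;

-- 2) close the previous current version of those members
UPDATE test.dim_acnt_scd d
SET valid_to_dt = s.valid_from_dt
FROM test.stg_acnt s
WHERE d.acnt_id = s.acnt_id
  AND d.valid_to_dt = '2199-12-31 23:59:00'
  AND d.prod_code <> s.prod_code;

-- 3) overwrite the SCD type 1 field in place
UPDATE test.dim_acnt_scd d
SET open_dt = s.open_dt
FROM test.stg_acnt s
WHERE d.acnt_id = s.acnt_id
  AND d.open_dt <> s.open_dt;

-- 4) insert brand-new members that are not in the dimension yet
INSERT INTO test.dim_acnt_scd (acnt_id, prod_code, open_dt, valid_from_dt, valid_to_dt)
SELECT s.acnt_id, s.prod_code, s.open_dt, s.valid_from_dt, '2199-12-31 23:59:00'
FROM test.stg_acnt s
WHERE NOT EXISTS (SELECT 1 FROM test.dim_acnt_scd d WHERE d.acnt_id = s.acnt_id);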

 

regards, Vlad

 

Anonymous
Not applicable
Author

It certainly isn't! 🙂

Where do I enable statement logging? I don't see those options under the basic or debug run.

vapukov
Master II

This is a PostgreSQL option; you need to enable it on the PostgreSQL server, in the logging section of postgresql.conf:

logging_collector = on
log_statement = 'all'                   # none, ddl, mod, all
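Note that logging_collector only takes effect after a full PostgreSQL restart; log_statement alone can be picked up with a reload, for example:

-- reload the configuration without a restart (enough for log_statement;
-- logging_collector still needs a full server restart)
SELECT pg_reload_conf();

-- verify the current setting
SHOW log_statement;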