I need to create a process that incrementally imports data from a relational database into Hive/HDFS. The catch is that on Hive we need to maintain a history of transactions for each primary key, i.e. a Type 2 SCD. In other words, if a primary key (PK) is new, we simply insert a row into Hive, but if it already exists we need to mark the old row as inactive, update its 'End_Timestamp', and then insert a new row with its 'Active_Flag' set to 'true'.
Currently we do this with a MERGE query in Hive, which requires compaction and ANALYZE runs; otherwise we hit OutOfMemory errors. We are thinking of replacing that with a Spark program, but we're wondering whether this can be done in Talend without writing code, or with minimal code.
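For illustration, this is roughly the kind of Spark (Scala) job we have in mind. It is only a sketch of the SCD2 logic described above: the table names (dw.customer_dim, staging.customer_delta), the key column customer_id, and the flag/timestamp columns are placeholders, not our actual schema, and a real job would still need to handle deduplication and schema alignment.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession, functions => F}

object Scd2IncrementalLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scd2-incremental-load")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder names -- substitute the real Hive tables and key/flag columns.
    val dim   = spark.table("dw.customer_dim")        // existing SCD2 table on Hive
    val batch = spark.table("staging.customer_delta") // incremental extract from the RDBMS

    val loadTs = F.current_timestamp()

    // Keys arriving in this batch; their currently active rows must be end-dated.
    val batchKeys = batch.select("customer_id").distinct()
      .withColumn("in_batch", F.lit(true))

    // Close (end-date) active rows whose key reappears in the batch; leave the rest as-is.
    val isReplaced = F.col("in_batch").isNotNull && F.col("active_flag")
    val updatedDim: DataFrame = dim
      .join(batchKeys, Seq("customer_id"), "left")
      .withColumn("end_ts", F.when(isReplaced, loadTs).otherwise(F.col("end_ts")))
      .withColumn("active_flag", F.when(isReplaced, F.lit(false)).otherwise(F.col("active_flag")))
      .drop("in_batch")

    // Every incoming row becomes the new active version of its key
    // (assumes the batch carries the same business columns as the dimension).
    val newVersions = batch
      .withColumn("active_flag", F.lit(true))
      .withColumn("start_ts", loadTs)
      .withColumn("end_ts", F.lit(null).cast("timestamp"))

    // Spark will not overwrite a table it is reading from in the same job, so the
    // result lands in a staging table and is swapped in afterwards.
    updatedDim.unionByName(newVersions)
      .write.mode("overwrite").saveAsTable("dw.customer_dim_next")

    spark.stop()
  }
}
```

Even this way, some write to a staging table seems hard to avoid, since the rebuilt dimension has to be materialized somewhere before replacing the current one.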
Please let me know. Thanks.
If the destination environment is a relational DB, the tDBSCD components will meet this need. Since your destination environment is HDFS, you can still do this using the designated components. I found a post that is pretty close to your requirement; please read through this post.
Thanks for the reply. There are a few issues with your post:
1) The image is too small; I can't see its contents, even after downloading it.
2) The zip file is no longer available.
By the way, can you please share some performance numbers for your solution, say for 10 million rows, 100 million rows, 1 billion rows, etc.? Also, your answer is over 5 years old. Is this still the best way to do it?
In any case, thanks for the reply. Will look into your solution.
One more concern: your solution requires creating a 'Delta' table, right? That's extra writing to disk. If we do it in Spark, it will use memory for this, at least to an extent. I would really like to know how well your solution performs. Thanks.
I guess my response was not clear enough: I have not worked on SCD/CDC to HDFS; I have only used SCD when the target environment was an RDBMS. That was neither my solution nor my recommendation; I found the URL by googling and shared it in the hope that it might help you.