gt0731
Contributor III

tMap - Single transaction delete before insert

We are trying to figure out how to do the following within a tMap mapping.
Scenario:
For each record read
Start transaction
     Identify existing records
     Delete the existing records
     Insert new records from source
End transaction
Job flow:
                                                     DeleteFlow
                                                        |
read from SQL Server (source) --> write into HDFS --> tMap --> Insert flow --> tGreenplumOutput
                                                        |
                                                Lookup: target file

Source-to-target lookup model screenshot:
Source table left join target table --> output flows --> Delete flow: delete the complete record; otherwise the Insert flow sends all the data again (when the matched ID is null).
0683p000009MDEv.png  
Tested:
When I pull 6 records from the source, the job inserts 6 records as expected. But if I then add 1 more record in the source, the job goes into an infinite loop and does not even throw an error. It should first delete the 6 already-inserted records and then insert all 7 again, but it is not working as expected.
Any help on this would be much appreciated.
Job flow screenshot:
0683p000009MDES.png

If we did not have the requirement of a single transaction, I would build two separate jobs, one to delete and the other to insert. But the requirement for a single transaction is making this hard. We also need to guarantee that we delete before we insert.
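For reference, the behaviour we are after looks roughly like this in plain JDBC. This is only a sketch; the table name, key column and row layout are placeholders, not our real schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class DeleteThenInsert {
        // One batch: delete the existing rows by key, then insert every source row,
        // all inside a single transaction so nothing is ever half-applied.
        public static void load(Connection conn, List<Object[]> sourceRows) throws SQLException {
            conn.setAutoCommit(false); // start transaction
            try (PreparedStatement del = conn.prepareStatement(
                     "DELETE FROM target_table WHERE id = ?");
                 PreparedStatement ins = conn.prepareStatement(
                     "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
                for (Object[] row : sourceRows) {      // identify and delete existing records
                    del.setObject(1, row[0]);
                    del.executeUpdate();
                }
                for (Object[] row : sourceRows) {      // then insert the new records from source
                    ins.setObject(1, row[0]);
                    ins.setObject(2, row[1]);
                    ins.executeUpdate();
                }
                conn.commit();                         // end transaction
            } catch (SQLException e) {
                conn.rollback();                       // undo everything on any failure
                throw e;
            }
        }
    }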
9 Replies
Anonymous
Not applicable

Hi,
Do an inner join to get the existing records and the new records, and cache the new records (or the old records) in memory for use in the next subjob, e.g.:
t...Connection
   |
source data --...--> tMap --- existing records --> tGreenplumOutput (for deleting)
                         \--- new records --> tHashOutput
   |
   OnSubjobOk
   |
tHashInput --main--> tGreenplumOutput (for inserting)
   --OnSubjobOk--> t...Commit
   --OnSubjobError--> t...Rollback
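Conceptually, the split that the tMap does in that layout looks like the sketch below (plain Java; the key position, row type and variable names are just examples, not taken from the real job):

    import java.util.*;

    class SplitSketch {
        // Rows whose key already exists in the target go to the delete flow;
        // every source row is also cached for the insert flow, like tHashOutput -> tHashInput.
        static void split(List<Object[]> sourceRows, Set<Object> targetIds,
                          List<Object[]> deleteFlow, List<Object[]> insertFlow) {
            for (Object[] row : sourceRows) {
                if (targetIds.contains(row[0])) {   // inner join on the key column
                    deleteFlow.add(row);            // existing record -> tGreenplumOutput (delete)
                }
                insertFlow.add(row);                // all rows -> tHashOutput (insert in the next subjob)
            }
        }
    }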

Regards
Shong
gt0731
Contributor III
Author

Hi Shong,
First of all, thanks for the data flow. It worked for me.
However, when I executed it exactly as you described for ~90k records, the job execution time was 18 minutes.
Goal:
How do I get an incremental load into the Greenplum data warehouse? Please guide me, because the current two-flow approach is not improving performance as expected.
It is our protocol to write the source data into Hadoop first: source data ---> HDFS (for archive purposes) --> tGreenplumOutput.
Any help on this would be much appreciated.

 
Anonymous
Not applicable

Do you really need to check whether the same records already exist in the target table, or can you just delete all the data in the target table and insert the source data? Cache small lookup data in memory; if the lookup data set is a little large, set tMap to cache the data on disk instead of in memory.
gt0731
Contributor III
Author

Thanks in advance, Shong!
I would like to just delete everything from the target and then insert it all again. However, the lookup contains a huge amount of data. When I turned on "Store data on disk" = TRUE, I still got the same execution time, because the job visuals clearly show that the batched output into tGreenplumOutput is slow; it uses the default batch size of 10000.
Anonymous
Not applicable

If you want to delete all the data in the target table before inserting, you can use a tGreenplumRow to execute a delete statement that removes all the data.

t...Connection
   |
tGreenplumRow (delete statement) --OnComponentOk--> source data --...--> write into HDFS --OnComponentOk--> tGreenplumOutput (for inserting)
   --OnSubjobOk--> t...Commit
   --OnSubjobError--> t...Rollback
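The query in the tGreenplumRow is just a Java string, for example (the schema and table names below are placeholders):

    // Query field of the tGreenplumRow: wipe the target before the insert subjob runs.
    "DELETE FROM my_schema.target_table"

    // If permissions allow it, TRUNCATE is usually faster for emptying a whole table.
    "TRUNCATE TABLE my_schema.target_table"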
gt0731
Contributor III
Author

Thanks, Shong!
I have tried this flow. It looks better, but again it is a full load.
I am also trying for an incremental load. For that, I am not able to identify changed/new rows, and I cannot perform CDC on the source system, as I only have read access for extracting the data. Is it possible to perform an incremental load with the given pipeline? It is difficult for me to track new/updated rows and output only those into Greenplum instead of doing a full load.
I'm hoping you can guide me with this.
Anonymous
Not applicable

Without CDC, you have to compare the source data and the target data to work out which rows need to be updated and which need to be inserted; that is, do a join between the source data and the target data like you did. Obviously, the performance of this approach is much lower than CDC if the target data volume is big.
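One way to sketch that comparison in plain Java is to cache a key-to-checksum map of the target rows and classify each source row against it. The key, the checksum function and the row layout below are only illustrative assumptions, not a fixed recipe:

    import java.util.*;

    class DiffSketch {
        // Without CDC: classify each source row as insert, update or unchanged by
        // comparing it against a key -> checksum map built from the target table.
        static void classify(List<Object[]> sourceRows, Map<Object, Integer> targetChecksums,
                             List<Object[]> inserts, List<Object[]> updates) {
            for (Object[] row : sourceRows) {
                Object key = row[0];
                int checksum = Arrays.deepHashCode(row);   // cheap stand-in for a real CRC/MD5
                Integer existing = targetChecksums.get(key);
                if (existing == null) {
                    inserts.add(row);                       // key not in target -> insert
                } else if (existing.intValue() != checksum) {
                    updates.add(row);                       // key present but data changed -> update
                }                                           // equal checksum -> row unchanged, skip it
            }
        }
    }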
Anonymous
Not applicable

Hi Shong,
I was not able to find how to start a new topic, so I have to ask this way. I am currently working on a CDC job that should run in one of two sequences, depending on the timestamp of the changes or logs. The sequences are jobs that run either in the order insert-update-delete or in the order delete-update-insert, chosen on two conditions. This is to avoid constraint errors, because a row being deleted and reinserted with the same primary key would otherwise throw an error or lose records. I am also passing many context variables to those jobs. I am not able to get a conditional run working using tJavaRow or tJava with if statements on a string. Is there a way to do this with an if/else condition on tRunJob and also pass values to the tMap from those conditions?
Thanks,
kmv
Soumya_M
Contributor II

In the delete scenario, firstly I need to compare the date of the source and the target (by extracting the date from the timestamp) and delete the matching records; then I have to insert the new records (for the deleted date) into the target. How can I achieve this?

Source: .csv file; target: PostgreSQL database.