Scenario based Question in Talend???

Anonymous · ‎2015-01-12

I have one scenario based question to all the Talend forum members. It come like this as given below :-
I)Suppose you are having source text file named "A.txt" which contains billons of records.
The task is to extract all the data from source file and loads into target table named " Table D".
When millions of records are inserted into your target table, in the middle of your job for the particular record/data you get error.
After resolving the error, when you execute the job again I want my job to do the following :-
The already inserted records should not be inserted again in the "target table D" keeping the performance of your job in mind.
How can we do?
Consider the cases : 1) Source file may contain unique records
and 2) It may contains duplicate records
B) If you the use of tmemorize components, how we can achieve/do this task without using the components?
Thanks for your help in advance.
Regards
SWABHOSA

Anonymous · ‎2015-01-12

Hi,
Have you checked "insert or update" option in t<DB>output. On schema of t<DB>Output, check the 'key' box of the columns which are the primary key in database, if you select 'insert or update' option, the data will be inserted if it does not exist yet, otherwise, it is updated.
Best regards
Sabrina

Anonymous · ‎2015-01-12

Hello Sabrina,
Thanks for your reply. I am aware of that "insert and update" option in t<db> component.
What if you have to insert only their will be no use of the "Update" options.
When I re-run my job, it should insert the new records only, not the already inserted records.
Best Regards:-swabhosa

Anonymous · ‎2015-01-13

If it has a key column.
You can simply look up the data in "Target D" and join it with your input file using a tmap. Records that match can be filtered out.

Anonymous · ‎2015-01-14

Hello Egoetsch,
Thanks for reply!!!
It does not have an key column.
Regards
Swabhosa

Viswa560 · ‎2017-02-02

Any answer for this post????

_AnonymousUser · ‎2017-03-23

can tell me what could be the error ? if it is data mismatch error, you can always do the schema level checks then pull out the rejected records into the file and only correct data can taken further for inserting into database

vaibs_parab · ‎2018-05-21

1. In case of Source file may contain unique records:

==> In this case, you can import all the files to the Destination table. And before inserting into Destination table use tMap to lookup with the source file unique column. Map the primary key from the inserting main flow to the primary unique column of lookup source file. And keep the 'Inner Join Reject' setting to True in tMap at the output side.

This will insert only those records which are not in the destination table if few of the records are already there is destination table, then it will reject based on an inner join.

1. In case of Source file may contain duplicate records:

==> I am assuming here the whole row does not have duplicated values, only a few columns have duplicated values.

2 John 25 USA

5 John 28 USA

8 Jack 32 UAE

For above case use tAddCRCRow component while inserting data into dest table. And apply/check CRC for all columns, so each record will get a unique CRC value. After that before inserting data into Dest. table join the main flow with source file CRC value as lookup and keep the 'Inner Join Reject' to True at the output side in tMap.

This will insert only those records which are not in the destination table if few of the records are already there is destination table, then it will reject based on an inner join.

Don't forget to give kudos if this reply is worked for you. Also, suggestions/ improvements for this solutions are appreciated.

Anonymous · ‎2020-06-26

Hi @swapneel9218,

In this scenario we can create hashkeys for each record not for key column using MDM Hash Key Routine.

after creating hash keys for already inserted table and newly inserting table lookup those keys in tMap then we will get only the filtered new columns.

Example:

Data1: Data2:

Id Name Id Name

100 krane 101 Lunarg (updated)

101 Lunar 103 Harry (duplicate)

103 Harry 104 potter (new record)

105 Henry (new record)

so lets think there is no primary key then we create hash key for whole record,

then the hashkeys for example

Data1: | Data2:

100 krane 1005867 | 101 Lunarg (updated) 5456878

101 Lunar 1239878 | 103 Harry (duplicate) 4354587

103 Harry 4354587 | 104 potter (new record) 5650112

| 105 Henry (new record) 1008867

so in the above example you can see the the hashkeys for 103 record it is same.

for updated record 101 it is different hash key. so based on this we lookup the hashkeys in tMap get the rejected columns in output.i.e, in target Table D.

Don't forget to give kudos if this reply is worked for you. Also, suggestions/ improvements for this solutions are appreciated.

Talend Data Integration

Related Topics