I would like to insert only the new data into my database. So I used incremental loading, comparing my source (a set of files) with my target (a SQL Server table) via an inner join, but since the number of rows inserted into the database is huge, this solution is not feasible.
So I thought of doing CDC by date comparison (comparing the last run date with the current date).
Unfortunately I don't know how to do it.
Can someone help me, please?
It could be because tMSSQLInput does not run before tMap.
In your case those two processes are independent... running somewhere in a parallel world.
You can do something like the sketch below.
It is just an example, but be careful about the order: first read the value, then use it.
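To make the ordering concrete, here is a minimal plain-JDBC sketch of the same pattern outside Talend (in a job this would be a tMSSqlInput -> tSetGlobalVar subjob chained with OnSubjobOk before the main flow). The table and column names (target_table, source_table, updated_at) and the connection string are assumptions for illustration:

```java
// Minimal sketch of "first read the value, then use it".
import java.sql.*;

public class ReadThenUse {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret";
        try (Connection con = DriverManager.getConnection(url)) {
            // Step 1: read the watermark value BEFORE the main flow runs.
            Timestamp lastRun;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT MAX(updated_at) FROM target_table")) {
                rs.next();
                lastRun = rs.getTimestamp(1); // may be null on the first run
            }

            // Step 2: only now use the value in the main query.
            String sql = "SELECT id, payload, updated_at FROM source_table"
                       + (lastRun != null ? " WHERE updated_at > ?" : "");
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                if (lastRun != null) ps.setTimestamp(1, lastRun);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process only the new rows here
                    }
                }
            }
        }
    }
}
```

The point is simply that step 2 must not start until step 1 has produced its value; if the two run in parallel, the filter is empty or stale.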
Hi,
I think (to be confirmed) that CDC is available only in the Enterprise version, not the Community one.
However, what do you mean by "huge", billions of rows?
These links may help:
- https://www.talend.com/blog/2017/01/05/talend-job-design-patterns-best-practices-part-4/
- http://bekwam.blogspot.fr/2011/07/complex-joins-out-of-tmap-in-talend.html
Also see this topic: https://community.talend.com/t5/Design-and-Development/tMap-compare-integer-before-UPDATE-in-express...
Thanks for your reply.
I have 80,000,000 rows in my table, so when I do an inner join in tMsSqlInput to capture the new rows I run out of memory (it consumes almost 90% of physical memory) and my job crashes!
The solution described here in the section "tMap Lookups" may help; a sketch of the idea follows.
It gives you the key to avoid loading all 80,000,000 records at once.
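If I understand that section correctly, the idea is a keyed lookup per incoming row ("reload at each row" in tMap) instead of caching the whole target table. A rough JDBC equivalent, with illustrative table and column names (target_table, id):

```java
// Per-row keyed lookup: memory stays flat because the 80M-row target
// table is never loaded; each file row triggers its own small query.
import java.sql.*;
import java.util.List;

public class PerRowLookup {
    // Returns true if the key already exists in the target table.
    static boolean existsInTarget(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT 1 FROM target_table WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    static void loadNewRows(Connection con, List<Long> fileIds) throws SQLException {
        for (long id : fileIds) {
            if (!existsInTarget(con, id)) {
                // insert the row here; note this costs one database
                // round trip per file row, which is the trade-off
            }
        }
    }
}
```

The trade-off is visible in the loop: memory use stays low, but every file row pays a database round trip.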
I applied this solution but it takes a lot of time, because it looks up and compares each file row against the result of my query in the tMsSQLInput component!
For this reason I want to change this solution and use a comparison by timestamp!
Can you share your job?
Hi,
Here is an example of my job.
Usually tMySqlInput lets me read the rows of the database for the lookup, but since there are so many rows (80,000,000), the job crashes. So I need another solution for incremental loading!
You should refer to this discussion.
The idea is to split the large table into smaller pieces and then, using a master job, run the child job as many times as necessary, depending on the number of files generated when the table content is split.
For example, if each file contains 10,000,000 records, you need to process 8 times.
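A rough sketch of that master/child loop in plain Java/JDBC, assuming an integer key column id and a chunk size of 10,000,000 (the table name and connection string are illustrative):

```java
// Master/child pattern: split the 80M-row table into fixed-size key
// ranges and process each range in its own pass (in Talend, a master
// job iterating over a child job).
import java.sql.*;

public class ChunkedMaster {
    static final long CHUNK_SIZE = 10_000_000L; // 8 passes for 80M rows

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret";
        try (Connection con = DriverManager.getConnection(url)) {
            long maxId;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT MAX(id) FROM target_table")) {
                rs.next();
                maxId = rs.getLong(1);
            }
            // Master loop: one iteration per chunk = one child-job run.
            for (long lo = 0; lo <= maxId; lo += CHUNK_SIZE) {
                processChunk(con, lo, lo + CHUNK_SIZE); // the "child job"
            }
        }
    }

    static void processChunk(Connection con, long lo, long hi) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, payload FROM target_table WHERE id >= ? AND id < ?")) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // lookup/compare against this manageable slice only
                }
            }
        }
    }
}
```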
The solution you recommended makes it possible to run the job in threads, but I need a different approach from the inner join and lookup, one that lets me insert only the new rows into my database!
@INESBK wrote:
So I thought of doing CDC by date comparison (comparing the last run date with the current date).
Unfortunately I don't know how to do it.
Can someone help me, please?
Why not, in this case, read the MAX date from the database and then filter for data with a later date?
It could also be an ID or another incremental property.
If you don't store a timestamp in the database but use it for the lookup, you can store it in an additional table (last_insert), then read it back and use it.
When the lookup is more complicated, try insert ignore or insert/update.
It would be good if you explained your job logic in a little more detail.
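For illustration, here is a hedged JDBC sketch of that watermark approach against SQL Server, assuming the new file data has first been bulk-loaded into a staging table. SQL Server has no INSERT IGNORE (that is MySQL), so MERGE plays the insert/update role here. All table and column names (last_insert, staging_table, target_table, id, payload, updated_at) are assumptions:

```java
// Watermark-based incremental load: read the last run time from a
// small side table, upsert only newer rows, then advance the watermark.
import java.sql.*;

public class WatermarkLoad {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret";
        try (Connection con = DriverManager.getConnection(url)) {
            // 1. Read the watermark left by the previous run.
            Timestamp lastRun;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT last_run FROM last_insert")) {
                lastRun = rs.next() ? rs.getTimestamp(1) : new Timestamp(0);
            }

            // 2. Upsert only rows newer than the watermark (MERGE stands
            //    in for MySQL's INSERT IGNORE / ON DUPLICATE KEY UPDATE).
            String merge =
                "MERGE target_table AS t " +
                "USING (SELECT id, payload, updated_at FROM staging_table " +
                "       WHERE updated_at > ?) AS s " +
                "ON t.id = s.id " +
                "WHEN MATCHED THEN UPDATE SET t.payload = s.payload " +
                "WHEN NOT MATCHED THEN INSERT (id, payload, updated_at) " +
                "    VALUES (s.id, s.payload, s.updated_at);";
            try (PreparedStatement ps = con.prepareStatement(merge)) {
                ps.setTimestamp(1, lastRun);
                ps.executeUpdate();
            }

            // 3. Advance the watermark so the next run starts from here.
            try (Statement st = con.createStatement()) {
                st.executeUpdate("UPDATE last_insert SET last_run = SYSDATETIME()");
            }
        }
    }
}
```

Advancing the watermark only after the MERGE succeeds keeps the load restartable: a failed run simply reprocesses the same window on the next attempt.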