Hi team,
When I run a subjob, the main table has more than 200,000 records. I use the "Reload at each row" lookup model to filter the lookup table in tMap, and I have also added filter conditions on the main table. However, the whole job still takes about 20 minutes.
I apply some filters to the lookup table query through globalMap.
In tMap, I use a "Left Outer Join" with the "Reload at each row" lookup model.
Running the whole job still takes a long time (about 20 minutes).
Do you have any ideas for performance tuning?
By the way, are there any good ways to use a connection pool in Talend?
Thank you in advance!
For now, we have resolved the performance issue by adding more filters to the lookup table, which was the simplest option. After I modified it, the job sped up by about 50%. It's much better than before.
Many thanks for your help.
First of all, are you getting the data from the same database? If so, you should consider joining and filtering in the DB. There is little point in bringing in more data than you need, only to throw it away.
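To make that concrete, here is roughly what a pushed-down join and filter could look like in the Query field of a tDBInput component. This is only a sketch; the table and column names are made up:

    "SELECT m.order_id, m.customer_id, l.customer_name " +
    "FROM orders m " +
    "JOIN customers l ON l.customer_id = m.customer_id " +
    "WHERE m.status = 'ACTIVE'"

That way only the rows and columns you actually need ever leave the database.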
The data is from different databases.
I think I have already added enough conditions to filter in the DB, and at the same time done some processing in the tMap, so I have no idea how to improve the performance in other ways!
OK, it sounds like you are already doing the sort of things I would try to speed this up. Another thing to test is where the bottleneck is. Can you disconnect your DB write/update components and just run the part of the job where the join is happening? Does that massively speed things up? If so (as I have seen before), this could very well be an insert/update issue. Also, are you inserting or updating... or both?
I just tested the joining part with the update/insert disconnected. I think the bottleneck is in the joining part.
OK. One place you might get a slight improvement is by removing the joins on the columns you are filtering your lookup on. They are not needed: you will only ever be returning rows that hold the same values as the row coming in on your main flow, so you can remove those columns from the query and from the join. This won't resolve the issue on its own, but it will be more efficient.
However, before doing that, are you sure that reloading the lookup on every row is the most efficient way of doing this? Sometimes it is, without question, but if your main flow is 200,000 rows, that is 200,000 queries fired at the DB. I am assuming you have probably looked into this, but it is worth considering if you haven't... and it is easier to test before removing the joins I suggested above.
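For reference, with "Reload at each row" the lookup query is typically keyed on a globalMap value that the tMap lookup settings populate from the main flow, something like this in the lookup tDBInput's Query field (all names here are placeholders):

    "SELECT l.item_id, l.item_description " +
    "FROM big_lookup l " +
    "WHERE l.item_id = '" + (String) globalMap.get("main_item_id") + "'"

Because the query already restricts the lookup to the current key, joining on that same column again inside tMap adds nothing, which is why that join can be dropped; but the query itself is still fired once per main-flow row.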
I tried removing the joins; however, it does not seem to be any more efficient.
And I think I have to reload the lookup on every row because the lookup table has 428,634,831 rows. Otherwise it would be even slower.
OK, do you know how many rows are returned each time the lookup is fired? In theory, you should only return the rows you actually need if you are going to fire 200,000 queries. If you are returning more rows and having to filter them in Java, maybe you can tighten the filtering in your query?
Another solution will require some playing around, but it may just work. I suspect that you can preemptively find some values to filter your lookup query on before you carry out your lookup. To do this you will need to change the job to dump your main flow into a tHash component (store it in memory). You have some columns that you need to join on or filter by. If you load the main flow first, you *may* be able to find some data to use to filter the lookup with. For example, let's say that your lookup data has a column called "alphabet" which holds every letter from a to z, but your main flow only ever contains "a", "g" and "y". If you know this before running the lookup query, you can add an "IN" filter to the query and pass in your comma-separated list of "a", "g" and "y". This might remove three quarters of your lookup data. If you are only loading the lookup once and into memory, you lose the latency of firing the query for every row, and the number of rows to be checked each time is also massively reduced.
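Here is a rough sketch of that idea, sticking with the hypothetical "alphabet" column (the component placement and names are assumptions, not a tested job): a tJavaRow on the main flow, before the tHashOutput, collects the distinct values, and a tJava after that subjob builds the IN list for the lookup query.

    // tJavaRow on the main flow: remember every distinct "alphabet" value seen
    // (the usual output_row assignments are omitted here)
    java.util.Set<String> letters =
        (java.util.Set<String>) globalMap.get("alphabetValues");
    if (letters == null) {
        letters = new java.util.HashSet<String>();
        globalMap.put("alphabetValues", letters);
    }
    letters.add(input_row.alphabet);

    // tJava after the subjob: turn the set into a quoted, comma-separated IN list
    java.util.Set<String> seen =
        (java.util.Set<String>) globalMap.get("alphabetValues");
    StringBuilder inList = new StringBuilder();
    for (String value : seen) {
        if (inList.length() > 0) inList.append(", ");
        inList.append("'").append(value).append("'");
    }
    globalMap.put("alphabetInList", inList.toString());   // e.g. 'a', 'g', 'y'

The lookup tDBInput could then use a query along the lines of "SELECT ... FROM big_lookup WHERE alphabet IN (" + (String) globalMap.get("alphabetInList") + ")", loaded once into memory instead of being fired once per row.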
Another thing I have just noticed: do you need to match ALL ROWS in the match model? If you don't, then switch this off, because it means all of the lookup data has to be checked for every incoming row.