Hello,
I have a job that takes data from multiple ODS tables, joins them with multiple tMap components, and inserts the result into a table. I have around 80 GB of data, and the main flow has around 85,000,000 rows (around 15 GB).
All the lookup tables are stored in temp files, and 25 GB of RAM is available for this job.
The inserts are batched with manual commits.
Even with this, the job is quite slow and has been running for several days without finishing.
Is there another kind of optimization I can do besides converting the Talend tMaps to SQL code?
The problem is clearly not coming from the SQL engine.
What do you think is the average time for Talend to process 80 to 100 GB of data?
Thanks in advance
Regards,
Sofiane
Hi,
I'd look at:
Hi,
I would break the flow up in a slightly different way. First, merge all the initial data into a temporary table; this is stage 1 (if the processes inside the merge are taking time, you can run them in parallel using tParallelize and merge them afterwards).
Now this temp table becomes the source, and each of the lookup stages is handled by joining with the lookup tables within the DB itself, pushing the result set to a new temp table. This means you are not extracting the full lookup table contents; the filtering happens at the source itself.
This approach means there will be a lot of writes to temp tables, and these can be made faster by using Bulk components instead of the normal ones (given the high input volume).
In a way, we are bypassing the full lookup table download to disk within Talend (which is time consuming) and all the interim table writes are also made faster using Bulk components.
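The staged approach above can be sketched as follows. This is only an illustrative sketch: sqlite3 stands in for the real DWH engine, and all table and column names are hypothetical. The point is that the lookup join runs inside the database, so only the joined result set ever leaves the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stage 1: the merged source data lands in a temp table.
cur.execute("CREATE TABLE stage1 (id INTEGER, cust_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO stage1 VALUES (?, ?, ?)",
                [(1, 10, 5.0), (2, 20, 7.5), (3, 10, 2.0)])

# The lookup table stays in the DB; no full download into Talend.
cur.execute("CREATE TABLE lkp_customer (cust_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO lkp_customer VALUES (?, ?)",
                [(10, "Alice"), (20, "Bob")])

# Stage 2: join within the DB and push the result to a new temp table.
cur.execute("""
    CREATE TABLE stage2 AS
    SELECT s.id, s.amount, c.name
    FROM stage1 s
    JOIN lkp_customer c ON c.cust_id = s.cust_id
""")
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM stage2").fetchone()[0])  # 3
```

Each subsequent lookup stage would repeat the same pattern, reading from the previous stage's temp table.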
Could you please create a duplicate of the current job flow using this approach and let us know the results?
Warm Regards,
Nikhil Thampi
Please appreciate our Talend community members by giving Kudos for sharing their time for your query. If your query is answered, please mark the topic as resolved 🙂
Hi,
Thanks for your replies. I've tried to split the job and follow your suggestion, but I still get the memory error. The main job is too big even with 25 GB of RAM.
Any other ideas for managing the memory?
Thanks in advance.
Sofiane.
Hi
All of the tMap components should have “Store on disk” enabled, along with a directory path and a sensible buffer size, say 1,000,000 rows.
Hello,
Thanks for replying.
The flow is already set up this way. After 2 days running, I got this error: java.lang.RuntimeException: java.io.IOException
Let's retry.
Hi,
So the issue is probably now the connection staying open for an extended duration.
You could try outputting the data locally to a file, finding some kind of blocking key in the data (say, year and month), and iterating through the file multiple times: on each pass, read from the file, filter on the blocking key, write to the destination table, and connect and disconnect the DB connection.
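A minimal sketch of that pass-per-blocking-key pattern, assuming a year-month key: stage the rows in a local file, then open a fresh connection for each key, load only that slice, and disconnect. sqlite3 stands in for SQL Server here, and the file and table names are hypothetical.

```python
import csv, os, sqlite3, tempfile

# 1) Dump the extracted rows to a local staging file.
staging = os.path.join(tempfile.mkdtemp(), "staging.csv")
rows = [("2023-01", 1, 5.0), ("2023-01", 2, 7.5), ("2023-02", 3, 2.0)]
with open(staging, "w", newline="") as f:
    csv.writer(f).writerows(rows)

db = os.path.join(tempfile.mkdtemp(), "dwh.db")
keys = sorted({r[0] for r in rows})  # the blocking keys, e.g. year-month

for key in keys:
    # 2) Fresh connection per pass, so no single connection stays open for days.
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE IF NOT EXISTS target (ym TEXT, id INTEGER, amount REAL)")
    with open(staging, newline="") as f:
        batch = [r for r in csv.reader(f) if r[0] == key]  # filter on the key
    conn.executemany("INSERT INTO target VALUES (?, ?, ?)", batch)
    conn.commit()
    conn.close()  # disconnect before the next pass

check = sqlite3.connect(db)
print(check.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 3
```

Each pass holds the connection only for the duration of one key's load, which sidesteps server-side limits on long-lived connections.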
Hello everybody,
I've split the tMaps into multiple sub-jobs and reduced the batch and commit size from 100,000 to 50,000. This way, at least, the job runs without the connection closing.
It has now been running for two days.
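For reference, the batch-and-commit pattern described above looks roughly like this. sqlite3 stands in for SQL Server, the 50,000 commit size matches the value mentioned, and the table and column names are hypothetical.

```python
import sqlite3

COMMIT_EVERY = 50_000  # rows per commit, as in the post above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, val REAL)")

buffer = []
for i in range(120_000):  # small stand-in for the real 110M-row flow
    buffer.append((i, i * 0.5))
    if len(buffer) == COMMIT_EVERY:
        conn.executemany("INSERT INTO target VALUES (?, ?)", buffer)
        conn.commit()  # shorter transactions keep the server-side state small
        buffer.clear()
if buffer:  # flush the final partial batch
    conn.executemany("INSERT INTO target VALUES (?, ?)", buffer)
    conn.commit()

print(conn.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 120000
```

Smaller commit intervals trade a little throughput for shorter transactions, which is why lowering the size from 100,000 to 50,000 can keep the connection alive.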
Have a good day.
Sofiane.
Hello everybody,
I'm again facing the java.sql.SQLException: Invalid state, the Connection object is closed. issue, while everything is fine on the SQL Server side.
Is there a parameter or a timeout set somewhere in Talend?
The flow is super simple: reading from a temporary table with 2 tMaps and writing into another table. There are around 110,000,000 rows, which is nothing for a DWH.
Thanks for your help.
Sofiane
Hi Sofiane,
Did you check the maximum connection open time allowed on your SQL Server? DBAs usually set a value to stop connections from remaining open for a very long time, and that could be what is triggering the error.
Considering your data size, could you please use bulk components to load the data instead of the normal tDBOutput? That could change the whole equation of the processing.
Warm Regards,
Nikhil Thampi