<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Processing of great data volume in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270873#M48646</link>
    <description>Hello,
&lt;BR /&gt;Thanks for your quick answer.
&lt;BR /&gt;I was aware of this new functionality in v2.4, but it is only available for lookups. My concern, and the point I was underlining in my previous message, is the main output flow: in previous TOS versions, not only data for the lookup flows but also data from the main flow is stored in memory.
&lt;BR /&gt;Thanks
&lt;BR /&gt;Evagelos</description>
    <pubDate>Thu, 29 May 2008 14:29:32 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2008-05-29T14:29:32Z</dc:date>
    <item>
      <title>Processing of great data volume</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270871#M48644</link>
      <description>Hello,
&lt;BR /&gt;I am wondering if it would be possible to avoid loading all the data from the main input flow into memory. Another solution would be to process only a limited number of records from the main input at a time (this could be a parameter in the database output components), to avoid out-of-memory issues.
&lt;BR /&gt;Let me explain a little:
&lt;BR /&gt;When a large amount of data has to be processed (several million records) in ETL flows, Talend needs a server with a lot of memory, because all source data records are loaded into server memory; otherwise we get an out-of-memory error.
&lt;BR /&gt;In addition, if data volumes increase, it is not guaranteed that the memory allotted to Talend on the server will be sufficient... which is not really safe for a daily enterprise night batch.
&lt;BR /&gt;From a global point of view, this limits the number of records a job can process, because of the server's memory limitations.
&lt;BR /&gt;Is such an enhancement possible in Talend?
&lt;BR /&gt;BR
&lt;BR /&gt;Evagelos</description>
      <pubDate>Sat, 16 Nov 2024 14:21:07 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270871#M48644</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T14:21:07Z</dc:date>
    </item>
    <item>
      <title>Re: Processing of great data volume</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270872#M48645</link>
      <description>Since TOS 2.4 RC1, Talend has provided the "Stored on disk" option, visible on each lookup table in tMap.
&lt;BR /&gt;This option allows you to load as many rows as you want into a lookup, with no memory limit; the only limit is the disk space available for temporary data.
&lt;BR /&gt;Don't forget to set a valid path for temporary files in the "Properties view" of tMap.
&lt;BR /&gt;You are welcome to test this functionality as soon as possible &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;</description>
      <pubDate>Thu, 29 May 2008 13:48:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270872#M48645</guid>
      <dc:creator>amaumont</dc:creator>
      <dc:date>2008-05-29T13:48:01Z</dc:date>
    </item>
    <item>
      <title>Re: Processing of great data volume</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270873#M48646</link>
      <description>Hello,
&lt;BR /&gt;Thanks for your quick answer.
&lt;BR /&gt;I was aware of this new functionality in v2.4, but it is only available for lookups. My concern, and the point I was underlining in my previous message, is the main output flow: in previous TOS versions, not only data for the lookup flows but also data from the main flow is stored in memory.
&lt;BR /&gt;Thanks
&lt;BR /&gt;Evagelos</description>
      <pubDate>Thu, 29 May 2008 14:29:32 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270873#M48646</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2008-05-29T14:29:32Z</dc:date>
    </item>
    <item>
      <title>Re: Processing of great data volume</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270874#M48647</link>
      <description>Hello,
&lt;BR /&gt;In general, main flows are not kept in memory. There are only two or three exceptions, with specific components like tSortRow or tAggregateRow.
&lt;BR /&gt;tSortRow already has the "Sort on disk" option in 2.3 (see its advanced settings).
&lt;BR /&gt;tAggregateRow only puts the aggregate output in memory. In most cases this is not a strong limitation. We can add the same "Sort on disk" option to the component if required.
&lt;BR /&gt;Regards,</description>
      <pubDate>Fri, 30 May 2008 00:48:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270874#M48647</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2008-05-30T00:48:31Z</dc:date>
    </item>
    <item>
      <title>Re: Processing of great data volume</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270875#M48648</link>
      <description>Hi,
&lt;BR /&gt;Sorry that I have to reactivate this thread again. I think Evagelos is right. Even though the "Store on disk" option can prevent the out-of-memory error, it can slow down the process. Would it not be better to load only a limited number of records into memory and process them? That would be faster and would make TOS more scalable.</description>
      <pubDate>Thu, 07 Jan 2010 12:56:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Processing-of-great-data-volume/m-p/2270875#M48648</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2010-01-07T12:56:01Z</dc:date>
    </item>
  </channel>
</rss>

