<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Big files (tFileInputPositional) in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304209#M76003</link>
    <description>Are you sure splitting it into two separate jobs makes a difference? In the end, the second job still has to process the 4 million rows handed over by the first one.</description>
    <pubDate>Thu, 25 Sep 2014 13:06:36 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2014-09-25T13:06:36Z</dc:date>
    <item>
      <title>Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304205#M75999</link>
      <description>Dear Talend Support Team, 
&lt;BR /&gt;We have a huge input file with more than 4 million rows. This file is read by tFileInputPositional, and its data flow is then linked 
&lt;BR /&gt;to tMap. There are additional lookups on database tables, but these tables don't contain many rows. The problem is the 
&lt;BR /&gt;enormous memory consumption. We need a way to keep memory usage moderate. Is there a way to read the huge input file in parts, 
&lt;BR /&gt;process them, and then read the rest? 
&lt;BR /&gt; 
&lt;span class="lia-inline-image-display-wrapper" image-alt="0683p000009MAvP.jpg"&gt;&lt;img src="https://community.qlik.com/t5/image/serverpage/image-id/147579i95BAF98EAD62A539/image-size/large?v=v2&amp;amp;px=999" role="button" title="0683p000009MAvP.jpg" alt="0683p000009MAvP.jpg" /&gt;&lt;/span&gt; 
&lt;BR /&gt;Kind regards, 
&lt;BR /&gt;Hilderich</description>
      <pubDate>Thu, 25 Sep 2014 11:23:37 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304205#M75999</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T11:23:37Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304206#M76000</link>
      <description>Hi Hilderich,
&lt;BR /&gt;To solve the memory problem, you can have tMap store records on the file system. In any case, when the tFileInput component reads the file, it cannot read all rows at once: it reads records in chunks and passes them on to tMap. It is the tMap component that collects all records in memory (or on the file system), performs the join, and passes the result to the next component after processing. Storing intermediate records on the file system will help solve the memory problem.
&lt;BR /&gt;This option is available in the property settings of the input section of tMap (third icon from the left at the top of the input side).
&lt;BR /&gt;Thanks
&lt;BR /&gt;Vaibhav</description>
      <pubDate>Thu, 25 Sep 2014 11:39:25 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304206#M76000</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T11:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304207#M76001</link>
      <description>Hello Vaibhav, 
&lt;BR /&gt;Thanks for your answer. I forgot to mention that this option (store temp data on disk) is already in use. Unfortunately, memory consumption has not improved. 
&lt;BR /&gt;While the job is running I can observe the temp files being written to disk, but consumption is still at its maximum. The problem might be the last tMap component before 
&lt;BR /&gt;the data are stored in the database. But this final tMap has no lookup, and therefore I cannot save the flow temporarily to disk there. Any other ideas? 
&lt;BR /&gt;Kind regards, 
&lt;BR /&gt;Hilderich</description>
      <pubDate>Thu, 25 Sep 2014 12:01:00 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304207#M76001</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T12:01:00Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304208#M76002</link>
      <description>Hi, 
&lt;BR /&gt;You can try disabling parts of the job to find out which component or section is consuming the memory. You could also break the job into small subjobs and pass data from parent to child, or use files between processing steps. Performing all tasks in a single job is not an optimal way to deal with large amounts of data and joins; if possible, you can even distribute the join processing across several stages. 
&lt;BR /&gt;Vaibhav</description>
      <pubDate>Thu, 25 Sep 2014 12:52:33 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304208#M76002</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T12:52:33Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304209#M76003</link>
      <description>Are you sure splitting it into two separate jobs makes a difference? In the end, the second job still has to process the 4 million rows handed over by the first one.</description>
      <pubDate>Thu, 25 Sep 2014 13:06:36 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304209#M76003</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T13:06:36Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304210#M76004</link>
      <description>The bottleneck is the tDenormalize component. Without it, memory consumption does not reach its limit. Any suggestions on how to replace it with a more efficient approach? 
&lt;BR /&gt;By the way: your image attachment function here is broken - I cannot attach any images anymore.</description>
      <pubDate>Thu, 25 Sep 2014 14:56:04 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304210#M76004</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T14:56:04Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304211#M76005</link>
      <description>Yes. What are you trying to do with tDenormalize?</description>
      <pubDate>Thu, 25 Sep 2014 15:07:34 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304211#M76005</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T15:07:34Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304212#M76006</link>
      <description>We need to group the data, but we exclude the field "LKZ" from the grouping. This way we get the "LKZ" values comma-separated, which is exactly what we want. &lt;BR /&gt;All of this is already realized by tDenormalize in the job above.</description>
      <pubDate>Thu, 25 Sep 2014 15:15:13 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304212#M76006</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T15:15:13Z</dc:date>
    </item>
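For readers following the thread: the grouping described in the post above can be sketched outside Talend. The following is a minimal Python sketch (not Talend-generated code) of what tDenormalize effectively does here: group rows on every field except "LKZ" and join the "LKZ" values with commas. The field names other than "LKZ" are hypothetical. Note that on unsorted input such an implementation must hold one buffered entry per distinct group in memory at once, which is consistent with the consumption reported in this thread.

```python
from collections import OrderedDict

def denormalize(rows, skip_field="LKZ", sep=","):
    """Group rows on all fields except skip_field and join the
    skipped field's values with sep. Buffers one entry per distinct
    group key, mirroring a denormalize over unsorted input."""
    groups = OrderedDict()
    for row in rows:
        # Group key: every (field, value) pair except the skipped field.
        key = tuple(sorted((f, v) for f, v in row.items() if f != skip_field))
        groups.setdefault(key, []).append(row[skip_field])
    for key, values in groups.items():
        out = dict(key)
        out[skip_field] = sep.join(values)
        yield out

rows = [
    {"id": "1", "name": "A", "LKZ": "DE"},
    {"id": "1", "name": "A", "LKZ": "FR"},
    {"id": "2", "name": "B", "LKZ": "IT"},
]
result = list(denormalize(rows))
# result[0]["LKZ"] is "DE,FR"; result[1]["LKZ"] is "IT"
```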
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304213#M76007</link>
      <description>Just an idea...
&lt;BR /&gt;You can put a tFilterRow component before tDenormalize and distribute the rows based on a particular key value that does not conflict with the grouping required by tDenormalize. You can then have two tDenormalize components, one on the main flow and one on the reject flow, thereby dividing the memory usage across two components. You could also use a sort component before tDenormalize to feed it sorted data, so that it can process more quickly.</description>
      <pubDate>Thu, 25 Sep 2014 15:21:01 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304213#M76007</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T15:21:01Z</dc:date>
    </item>
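The split suggested above can be illustrated with a small Python sketch (hypothetical, not Talend code): a filter partitions the rows on the grouping key, and two independent denormalize passes each see only their own partition, so each buffers only part of the groups. The hash-based predicate stands in for the "particular key value" mentioned in the post, and the field names are assumptions.

```python
from itertools import chain

def partition(rows, keep):
    """Split rows into (main, reject) flows, as a filter component would."""
    main, reject = [], []
    for row in rows:
        (main if keep(row) else reject).append(row)
    return main, reject

def denormalize(rows, skip_field="LKZ", sep=","):
    """Group on all fields except skip_field; join the skipped values."""
    groups = {}
    for row in rows:
        key = tuple(sorted((f, v) for f, v in row.items() if f != skip_field))
        groups.setdefault(key, []).append(row[skip_field])
    return [dict(k, **{skip_field: sep.join(v)}) for k, v in groups.items()]

rows = [
    {"id": "1", "LKZ": "DE"}, {"id": "1", "LKZ": "FR"},
    {"id": "2", "LKZ": "IT"}, {"id": "3", "LKZ": "ES"},
]
# Partition on the grouping key itself, so the rows of one group
# never end up split across the two flows.
main, reject = partition(rows, keep=lambda r: hash(r["id"]) % 2 == 0)
combined = list(chain(denormalize(main), denormalize(reject)))
```

Each pass holds only its own partition's groups in memory, which is the point of the two-component layout described in the post.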
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304214#M76008</link>
      <description>Thank you for your help and your suggestions. As far as I know, tSortRow is also a memory killer. I could imagine that tSortRow in combination with tDenormalize would blow up the memory. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;</description>
      <pubDate>Thu, 25 Sep 2014 15:45:14 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304214#M76008</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T15:45:14Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304215#M76009</link>
      <description>tSortRow can sort on disk via its advanced settings. You can then use tAggregateSortedRow with the list function to denormalize the data and reduce memory consumption.</description>
      <pubDate>Thu, 25 Sep 2014 16:05:55 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304215#M76009</guid>
      <dc:creator>rbaldwin</dc:creator>
      <dc:date>2014-09-25T16:05:55Z</dc:date>
    </item>
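The sorted-input approach suggested above is what makes streaming aggregation possible: once the rows arrive sorted on the grouping key, only the current group needs to be held in memory at a time, instead of every group at once. A minimal Python sketch of that idea follows (the external disk sort itself is left to tSortRow's advanced settings; field names other than "LKZ" are hypothetical):

```python
from itertools import groupby
from operator import itemgetter

def denormalize_sorted(rows, group_key, skip_field="LKZ", sep=","):
    """Streaming denormalize over input already sorted on group_key.
    Holds only one group in memory at a time, unlike an unsorted
    group-by that must buffer every group simultaneously."""
    for _, group in groupby(rows, key=itemgetter(*group_key)):
        buf = list(group)  # only the current group is buffered
        out = {f: buf[0][f] for f in group_key}
        out[skip_field] = sep.join(r[skip_field] for r in buf)
        yield out

rows = [  # already sorted on "id"
    {"id": "1", "LKZ": "DE"}, {"id": "1", "LKZ": "FR"},
    {"id": "2", "LKZ": "IT"},
]
result = list(denormalize_sorted(rows, group_key=["id"]))
# result == [{"id": "1", "LKZ": "DE,FR"}, {"id": "2", "LKZ": "IT"}]
```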
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304216#M76010</link>
      <description>Hello rbaldwin,
&lt;BR /&gt;That sounds good. I am going to try it tomorrow and give you feedback right here. I am going home now.
&lt;BR /&gt;Kind regards,
&lt;BR /&gt;Hilderich</description>
      <pubDate>Thu, 25 Sep 2014 16:18:40 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304216#M76010</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-09-25T16:18:40Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304217#M76011</link>
      <description>Hi &lt;A href="http://www.talendforge.org/forum/profile.php?id=142236" target="_blank" rel="nofollow noopener noreferrer"&gt;hilderich&lt;/A&gt;,&lt;BR /&gt;&lt;BR /&gt;Is there any feedback on your issue?&lt;BR /&gt;Best regards&lt;BR /&gt;Sabrina</description>
      <pubDate>Mon, 27 Oct 2014 02:49:51 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304217#M76011</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2014-10-27T02:49:51Z</dc:date>
    </item>
    <item>
      <title>Re: Big files (tFileInputPositional)</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304218#M76012</link>
      <description>This approach was helpful and is now in use.</description>
      <pubDate>Wed, 28 Jan 2015 11:55:31 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Big-files-tFileInputPositional/m-p/2304218#M76012</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2015-01-28T11:55:31Z</dc:date>
    </item>
  </channel>
</rss>

