tStandardizeRow Usage?

_AnonymousUser — Wed, 19 Feb 2014 15:49:56 GMT

Hello, I have a non-delimited file that I would like to parse.
Would it be possible to split my file (by fixed field lengths) using a regex?
For example, I want to say:
the 1st field is characters 1 to 7, the 2nd is characters 8 to 12, ...
Is it possible? Where can I configure it?
Thank you in advance.

Re: tStandardizeRow Usage?

Anonymous — Thu, 20 Feb 2014 02:33:01 GMT

Hi,
Regarding your previous post https://community.talend.com/t5/Design-and-Development/Big-Data-Positional-File/td-p/85416, it seems you have to use a MapReduce job.
If so, note that TalendHelpCenter:tFileInputRegex is not yet supported in MapReduce jobs.
Here is a solution for your use case: first put your file into Hadoop, then use tHDFSInput ---> tMap(tHDFSInput---> tJavaMR).
Best regards
Sabrina
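
For reference, the fixed-width split asked about in this thread can be written as a regular expression with one capturing group per field. The following is only a minimal sketch in plain Java, independent of any Talend component; the class name `FixedWidthRegex`, the method `parse`, and the two-field layout (7 characters, then 5) are assumptions taken from the example in the question, not a Talend API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthRegex {
    // Assumed layout from the question: field 1 = chars 1-7, field 2 = chars 8-12.
    // Anything after character 12 is ignored by the trailing ".*".
    static final Pattern LAYOUT = Pattern.compile("^(.{7})(.{5}).*$");

    // Returns the two fields, or null when the line is shorter than 12 characters.
    static String[] parse(String line) {
        Matcher m = LAYOUT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] fields = parse("ABCDEFG12345rest");
        System.out.println(fields[0] + " | " + fields[1]);
    }
}
```

Each additional field would add one more `(.{n})` group to the pattern.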

Re: tStandardizeRow Usage?

_AnonymousUser — Thu, 20 Feb 2014 09:16:25 GMT

Hi Sabrina,
Thank you for your attention.
So I will use tHDFSInput (with a single-column schema, a raw string) -> tJavaMR (with my real CSV columns) -> tLogRow.

Do you see anything wrong with that?

Re: tStandardizeRow Usage?

_AnonymousUser — Thu, 20 Feb 2014 10:34:59 GMT

In the end, I used a tHDFSInput followed by a tMap.
The tMap takes substrings of each input row.
Do you think this is a good solution?
I am working with a very big file (90 GB).
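
The substring approach described here can be sketched in plain Java as below. This is an illustration under assumptions, not the actual job: the class name `FixedWidthSplitter`, the `FIELDS` table, and the two-field layout (characters 1 to 7 and 8 to 12, from the first post) are hypothetical. In a tMap, the equivalent would be one output column per field with an expression along the lines of `row1.line.substring(0, 7)`, where `row1` and `line` are assumed names for the input flow and its single column.

```java
import java.util.ArrayList;
import java.util.List;

public class FixedWidthSplitter {
    // Assumed layout: {start, end} offsets, 0-based, end exclusive.
    // Characters 1-7 become [0, 7) and characters 8-12 become [7, 12).
    static final int[][] FIELDS = { {0, 7}, {7, 12} };

    // Cuts one raw line into its fields. Offsets are clamped to the line
    // length so a short line yields empty fields instead of an exception.
    static List<String> split(String line) {
        List<String> out = new ArrayList<>();
        for (int[] f : FIELDS) {
            int start = Math.min(f[0], line.length());
            int end = Math.min(f[1], line.length());
            out.add(line.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("ABCDEFG12345rest"));
    }
}
```

Because each line is cut independently, this kind of split needs no state across rows, whatever the file size.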

Best regards

Re: tStandardizeRow Usage?

Anonymous — Fri, 21 Feb 2014 04:34:05 GMT

Hi,
In case the big file causes a memory issue in your job, could you please take a look at the online KB article
TalendHelpCenter:ExceptionoutOfMemory.
Best regards
Sabrina

Re: tStandardizeRow Usage?

_AnonymousUser — Fri, 21 Feb 2014 09:21:42 GMT

Thank you, Sabrina.
Can you confirm one last thing for me?
MapReduce jobs run on my cluster, don't they?

So the memory exception would be caused by the tLogRow? If I insert the data directly into a database, it shouldn't happen, should it?

Thank you very much for your help, Sabrina.

Re: tStandardizeRow Usage?

Anonymous — Fri, 21 Feb 2014 09:39:39 GMT

Hi,
The tMap component is a caching component and can consume too much memory. You'd better store its temp data on disk.

"If I insert the data directly into a database, it shouldn't happen, should it?"

It depends on your input data and your design.
There are several possible reasons for a Java OutOfMemory exception. The most common include:
1. Running a Job that contains a number of buffering components, for example tSortRow, tFilterRow, tMap, tAggregateRow, or tHashOutput.
2. Running a Job that processes a very large amount of data.

Best regards
Sabrina