Hello, I have a file that is not delimited and I would like to parse it. Would it be possible to split my file (according to fixed field lengths) by using a RegEx? For example, I want to say: the 1st field is from character 1 to 7, the 2nd is from 8 to 12, and so on. Is it possible? Where can I configure it? Thank you in advance.
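For reference, a regular expression with fixed-length capture groups can do this kind of split. Here is a minimal plain-Java sketch, assuming the example widths from the question (7 characters, then 5 characters); the class and field names are only illustrative:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthSplit {
    public static void main(String[] args) {
        // Example record: chars 1-7 are the first field, chars 8-12 the second
        String line = "ABCDEFG12345 rest of the record";

        // One capture group per fixed-width field
        Pattern p = Pattern.compile("^(.{7})(.{5})");
        Matcher m = p.matcher(line);
        if (m.find()) {
            String field1 = m.group(1); // "ABCDEFG"
            String field2 = m.group(2); // "12345"
            System.out.println(field1 + " | " + field2);
        }
    }
}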
Hi Sabrina, thank you for your attention. So, I will use tHDFSInput (with a single-column schema, raw string) -> tJavaMR (with my real CSV columns) -> tLogRow.
Finally, I have used a tHDFSInput followed by a tMap. The tMap does a substring on the input rows. Do you think it is a good solution? I am working with a very big file (90 GB).
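In plain Java, the substring logic one might put in the tMap output expressions would look roughly like the sketch below (column and field names are made up, and the widths are the example values from the question). The length guard is worth keeping: on a 90 GB file, a single short line would otherwise fail the whole Job with a StringIndexOutOfBoundsException.

public class SubstringSplit {
    public static void main(String[] args) {
        String line = "ABCDEFG12345 rest of the record";

        // Guard against lines shorter than the expected fixed width
        String field1 = line.length() >= 7  ? line.substring(0, 7)  : null; // chars 1-7
        String field2 = line.length() >= 12 ? line.substring(7, 12) : null; // chars 8-12

        System.out.println(field1 + " | " + field2);
    }
}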
Hi,
In case there is any memory issue caused by the big file in your Job, could you please take a look at the online KB article
TalendHelpCenter:ExceptionoutOfMemory.
Best regards
Sabrina
Hi,
The tMap component is a cache component that consumes a lot of memory. You'd better store the temp data on disk.
If I insert the data directly into a database, it shouldn't happen, should it?
It depends on your input data and your design.
There are several possible reasons for an OutOfMemory Java exception to occur. The most common ones include:
1. Running a Job which contains a number of buffer components, such as tSortRow, tFilterRow, tMap, tAggregateRow, or tHashOutput.
2. Running a Job which processes a very large amount of data.
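In either case, a common first remedy, if the Job is simply running out of heap, is to give its JVM more memory (in Talend Studio this is typically set in the Run view's Advanced settings as JVM arguments). The values below are only example figures, not a recommendation for your specific Job:

-Xms1024m
-Xmx4096m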