Solved: Re: [resolved] Split a large XML file into small f... - Qlik Community

Anonymous · ‎2014-06-02

Hi,
I am trying to integrate data from a large XML file (300 MB). Is there a way to do it with Talend?

Anonymous · ‎2014-06-02

What is the problem that you are facing in doing this?
Vaibhav

Anonymous · ‎2014-06-02

The problem is that I can't load the XML file (300 MB) into the XML metadata.
Every time I try to do this, Talend crashes.

Anonymous · ‎2014-06-02

I have done this using the Perl library XML::Twig: I just used a tSystem to call Perl/Twig and split the XML.

Anonymous · ‎2014-06-02

Jholman, could you give me more details about this, please?

willm1 · ‎2014-06-02

Hi Seiif - Before suggesting alternatives (below): have you changed your XML parser to SAX in tFileInput, increased the heap size for the job, and tried it again? A DOM parser is very memory intensive because it loads the whole document into memory, whereas SAX streams through the file.
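
To illustrate the DOM versus SAX point, here is a minimal Python sketch of event-driven parsing. It is not Talend code, and the names `RecordCounter`, `count_records` and the tag `row` are made up for the example; only the standard `xml.sax` module is real. The handler sees one tag at a time and never builds a tree, so memory use does not grow with file size.

```python
import xml.sax


class RecordCounter(xml.sax.ContentHandler):
    """Counts elements with a given tag name without building a tree."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag  # hypothetical tag name chosen by the caller
        self.count = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        if name == self.record_tag:
            self.count += 1


def count_records(source, record_tag):
    """source may be a file path or an open binary file object."""
    handler = RecordCounter(record_tag)
    xml.sax.parse(source, handler)
    return handler.count
```

For a file on disk this would be called as `count_records("big.xml", "row")`, where both the path and the tag name are placeholders.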

Like jholman, I've done this outside Talend: I used the sed utility in a shell script (.sh) on the filesystem, called from a tSystem. Using sed, I looked for a particular tag (the open tag in the XML), and wherever I found it, I extracted the text between the open and close tags.
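
The original sed script is not shown in this thread, so the following is only a rough Python sketch of the same idea: scan the text for an opening marker and pull out everything up to the matching closing marker. The function name `extract_blocks` and the `<msg>` tag are hypothetical. It assumes the markers appear literally (no attributes on the open tag) and that the element is never nested inside itself.

```python
def extract_blocks(lines, open_tag, close_tag):
    """Yield each run of text from open_tag through close_tag, inclusive.

    `lines` is any iterable of strings, for example an open text file,
    so only the block currently being collected is held in memory.
    """
    buffer = []
    inside = False
    for line in lines:
        pos = 0
        while True:
            if not inside:
                start = line.find(open_tag, pos)
                if start == -1:
                    break  # no further block starts on this line
                inside = True
                pos = start
            end = line.find(close_tag, pos)
            if end == -1:
                buffer.append(line[pos:])  # block continues on the next line
                break
            end += len(close_tag)
            buffer.append(line[pos:end])
            yield "".join(buffer)
            buffer = []
            inside = False
            pos = end
```

With a real file this might be used as `for block in extract_blocks(open("big.xml", encoding="utf-8"), "<msg>", "</msg>"): ...`, where the path and tag are placeholders.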
Another, cruder method I used recently was reading the file as plain text (tFileInputFullRow), looking for these markers in the XML, numbering them with an incrementing counter (a sequence), and then splitting the file using tMap. This was for queue data that needed to be processed for each 'row'.
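
The sequence-counter approach can be sketched outside Talend as follows. This is a hedged Python illustration, not the actual job: `number_rows` and `split_by_sequence` are hypothetical names, and the sketch assumes each record's opening marker sits on its own line.

```python
from itertools import groupby
from operator import itemgetter


def number_rows(lines, marker):
    """Pair each line with a sequence number that increases at every marker."""
    seq = 0
    for line in lines:
        if marker in line:
            seq += 1
        yield seq, line


def split_by_sequence(lines, marker):
    """Yield (sequence number, lines) groups, one group at a time.

    Group 0 holds anything before the first marker; the last group also
    picks up any trailing lines such as the closing root tag.
    """
    numbered = number_rows(lines, marker)
    for seq, group in groupby(numbered, key=itemgetter(0)):
        yield seq, [line for _, line in group]
```

Each yielded group could then be written to its own small file or processed as one 'row'. Because only one group is held at a time, the full file never needs to fit in memory.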

Anonymous · ‎2014-06-02

Hi Willm, I have changed my XML parser to SAX in tFileInput and increased the heap size for the job, but I still have the same problem.
Thanks for your valuable suggestion.

willm1 · ‎2014-06-02

Seiif - Can you build a simple job that uses a tFileInputFullRow to read the XML file and writes the rows out to a tLogRow? If that works, which would mean your job can run against the file, you can parse it using the 'cruder' Talend-specific solution that I mentioned above.
Let me know if you can do this...

Anonymous · ‎2014-06-02

It works with tFileInputFullRow. I will try the cruder method and let you know the results. Thanks, Willm.

Anonymous · ‎2014-06-02

Please see the relevant documentation for XML::Twig's xml_split tool here: http://search.cpan.org/dist/XML-Twig/tools/xml_split/xml_split
It also provides a mechanism for merging them back together again.
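
For readers without Perl available, the general idea of splitting a large XML file into smaller documents can be sketched in Python. This is not xml_split and does not reproduce its options or output format; it is a rough analogue using the standard `xml.etree.ElementTree.iterparse`. The names `split_records` and `wrap` are hypothetical, and the sketch assumes the records are direct children of the root element and that the root has no attributes or namespaces worth preserving.

```python
import xml.etree.ElementTree as ET


def wrap(root_tag, records):
    """Wrap serialized records in a bare copy of the root element."""
    return b"<" + root_tag.encode() + b">" + b"".join(records) + b"</" + root_tag.encode() + b">"


def split_records(source, record_tag, per_file):
    """Yield XML documents (as bytes), each holding up to per_file records.

    source may be a file path or an open binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    batch = []
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            batch.append(ET.tostring(elem))
            root.clear()  # detach finished children from the root to free memory
            if len(batch) == per_file:
                yield wrap(root.tag, batch)
                batch = []
    if batch:
        yield wrap(root.tag, batch)
```

Each yielded chunk could be written to its own numbered file. Merging the pieces back together is not covered by this sketch.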

[resolved] Split a large XML file into small files with talend

Talend Data Integration

v5.x

XML