<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Read huge xml in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231486#M21767</link>
    <description>Hi Vapukov, 
  &lt;BR /&gt;My main issue is reading the huge xml itself. Even if I want to split it, Talend still has to read it first, and that step is the bottleneck. I have tried changing the .ini file to increase the Java arguments to -Xms1024m and -Xmx9208m, and I have also increased the JVM settings of the job runner using the same specific JVM arguments (-Xms1024m and -Xmx9208m). I have tried with Talend Open Studio 5.6.2 MDM edition and 6.3.0 BigData edition.&amp;nbsp;The computer I use has an SSD hard disk and 16 GB of RAM in total. After 6 hours of running, the job is still in the "Starting" status. CPU usage is 100% and memory usage is 14.6 GB. 
  &lt;BR /&gt;It is important to mention that I use the generation mode "fast, with low memory consumption (SAX)". 
  &lt;BR /&gt;This is the xml structure that I used to create the structure in the metadata: 
  &lt;BR /&gt;&amp;lt;?xml version='1.0' encoding='UTF-8'?&amp;gt; &amp;lt;m:GenericData xmlns:footer="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message/footer"
 &lt;BR /&gt;&lt;BR /&gt;To see the whole post, download it &lt;A href="https://community.qlik.com/legacyfs/online/tlnd_dw_files/0683p000009Md8B"&gt;here&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://community.qlik.com/legacyfs/online/tlnd_dw_files/0683p000009Md8B"&gt;OriginalPost.pdf&lt;/A&gt;</description>
    <pubDate>Sun, 11 Dec 2016 13:00:03 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2016-12-11T13:00:03Z</dc:date>
    <item>
      <title>Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231480#M21761</link>
      <description>Hi,
&lt;BR /&gt;I have a huge xml file that I want to read. As it is an SDMX file, I wanted to import it as is, because I don't know how else to specify it in the metadata. Obviously, that didn't work very well: the file is more than 4 GB, and it crashes TOS. What would you have done in this case? Is there any example of how to specify SDMX files in the metadata (xml files)?
&lt;BR /&gt;Thanks in advance</description>
      <pubDate>Sat, 16 Nov 2024 10:13:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231480#M21761</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2024-11-16T10:13:02Z</dc:date>
    </item>
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231481#M21762</link>
      <description>What do you mean by "I wanted to import it as is"? Import to where, and how?</description>
      <pubDate>Thu, 08 Dec 2016 21:31:52 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231481#M21762</guid>
      <dc:creator>vapukov</dc:creator>
      <dc:date>2016-12-08T21:31:52Z</dc:date>
    </item>
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231482#M21763</link>
      <description>You could try tFileInputXML and select SAX parsing in the advanced settings. SAX is much quicker than DOM and doesn't need to load the whole document into memory, but you will not be able to use look-ahead or look-back XPath functions.</description>
      <pubDate>Fri, 09 Dec 2016 01:08:24 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231482#M21763</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2016-12-09T01:08:24Z</dc:date>
    </item>
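The SAX-style streaming approach suggested in the reply above can also be prototyped outside Talend to sanity-check a huge file. A minimal Python sketch using the standard library's incremental parser; the file path and the record element name "Obs" are placeholder assumptions, not details from the post:

```python
# Stream-parse a large XML file instead of building the whole DOM; memory
# stays roughly flat because each record element is cleared after use.
import xml.etree.ElementTree as ET

def stream_records(path, record_tag):
    """Yield one dict of attributes per record element, then discard it."""
    for event, elem in ET.iterparse(path, events=("end",)):
        # Strip any namespace prefix, e.g. "{uri}Obs" becomes "Obs"
        tag = elem.tag.split("}")[-1]
        if tag == record_tag:
            yield dict(elem.attrib)
            elem.clear()  # release the subtree so memory does not grow
```

Because this never holds more than one record at a time, it can chew through multi-gigabyte files that crash a DOM-based load.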
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231483#M21764</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;What do you mean by "I wanted to import it as is"? Import to where, and how?&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;Hi Vapukov,&lt;BR /&gt;I wanted to create the xml metadata. I used a sample xml that contained only one row in the loop, at the end.&lt;BR /&gt;I have tried everything, but nothing works, not even SAX, so I don't know what approach I could use in this case...</description>
      <pubDate>Fri, 09 Dec 2016 17:22:56 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231483#M21764</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2016-12-09T17:22:56Z</dc:date>
    </item>
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231484#M21765</link>
      <description>&lt;BLOCKQUOTE&gt; 
 &lt;TABLE border="1"&gt; 
  &lt;TBODY&gt; 
   &lt;TR&gt; 
    &lt;TD&gt;You could try tFileInputXML and select SAX parsing in the advanced settings. SAX is much quicker than DOM and doesn't need to load the whole document into memory, but you will not be able to use look-ahead or look-back XPath functions.&lt;/TD&gt; 
   &lt;/TR&gt; 
  &lt;/TBODY&gt; 
 &lt;/TABLE&gt; 
&lt;/BLOCKQUOTE&gt; 
&lt;BR /&gt;Hi rhall, 
&lt;BR /&gt;I did try tFileInputXML, selecting SAX in the advanced settings. The output is a tFileOutputDelimited that I split every 1000 lines. 
&lt;BR /&gt;Nothing happens; it gets stuck in "Starting". 
&lt;BR /&gt;What would you recommend? 
&lt;BR /&gt;Thanks in advance</description>
      <pubDate>Fri, 09 Dec 2016 17:24:51 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231484#M21765</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2016-12-09T17:24:51Z</dc:date>
    </item>
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231485#M21766</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;TABLE border="1"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;What do you mean by "I wanted to import it as is"? Import to where, and how?&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/BLOCKQUOTE&gt;&lt;BR /&gt;Hi Vapukov,&lt;BR /&gt;I wanted to create the xml metadata. I used a sample xml that contained only one row in the loop, at the end.&lt;BR /&gt;I have tried everything, but nothing works, not even SAX, so I don't know what approach I could use in this case...&lt;BR /&gt;Sorry, this is hard to understand. What are you trying to achieve?&lt;BR /&gt;In one post you say you want to &lt;B&gt;write&lt;/B&gt; an XML file; in the next, you &lt;B&gt;write a csv file&lt;/B&gt; from XML.&lt;BR /&gt;So what is the overall task? What are the steps? Maybe include some screenshots from Studio, etc.&lt;BR /&gt;What is the structure of your XML file? Since it is huge, why not try splitting it into several files?</description>
      <pubDate>Sun, 11 Dec 2016 00:45:39 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231485#M21766</guid>
      <dc:creator>vapukov</dc:creator>
      <dc:date>2016-12-11T00:45:39Z</dc:date>
    </item>
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231486#M21767</link>
      <description>Hi Vapukov, 
  &lt;BR /&gt;My main issue is reading the huge xml itself. Even if I want to split it, Talend still has to read it first, and that step is the bottleneck. I have tried changing the .ini file to increase the Java arguments to -Xms1024m and -Xmx9208m, and I have also increased the JVM settings of the job runner using the same specific JVM arguments (-Xms1024m and -Xmx9208m). I have tried with Talend Open Studio 5.6.2 MDM edition and 6.3.0 BigData edition.&amp;nbsp;The computer I use has an SSD hard disk and 16 GB of RAM in total. After 6 hours of running, the job is still in the "Starting" status. CPU usage is 100% and memory usage is 14.6 GB. 
  &lt;BR /&gt;It is important to mention that I use the generation mode "fast, with low memory consumption (SAX)". 
  &lt;BR /&gt;This is the xml structure that I used to create the structure in the metadata: 
  &lt;BR /&gt;&amp;lt;?xml version='1.0' encoding='UTF-8'?&amp;gt; &amp;lt;m:GenericData xmlns:footer="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message/footer"
 &lt;BR /&gt;&lt;BR /&gt;To see the whole post, download it &lt;A href="https://community.qlik.com/legacyfs/online/tlnd_dw_files/0683p000009Md8B"&gt;here&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://community.qlik.com/legacyfs/online/tlnd_dw_files/0683p000009Md8B"&gt;OriginalPost.pdf&lt;/A&gt;</description>
      <pubDate>Sun, 11 Dec 2016 13:00:03 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231486#M21767</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2016-12-11T13:00:03Z</dc:date>
    </item>
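For reference, in an Eclipse-based Studio the heap flags mentioned in the post go in the Studio .ini file after the -vmargs line, one flag per line (the file name varies by edition and platform, so this fragment is a sketch, not the complete file):

```ini
-vmargs
-Xms1024m
-Xmx9208m
```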
    <item>
      <title>Re: Read huge xml</title>
      <link>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231487#M21768</link>
      <description>Hi!&lt;BR /&gt;When I wrote "split", I meant a real split using one of the command-line utilities, such as:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="http://xponentsoftware.com/xmlSplit.aspx" target="_blank" rel="nofollow noopener noreferrer"&gt;http://xponentsoftware.com/xmlSplit.aspx&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://github.com/acfr/comma/wiki/XML-Utilities" target="_blank" rel="nofollow noopener noreferrer"&gt;https://github.com/acfr/comma/wiki/XML-Utilities&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://gist.github.com/benallard/8042835" rel="nofollow noopener noreferrer"&gt;https://gist.github.com/benallard/8042835&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Then process the folder with all the XML files one by one. Talend is an excellent tool, but that doesn't mean we must rely on a single tool for everything; no tool will ever do everything every user wants.</description>
      <pubDate>Sun, 11 Dec 2016 18:41:02 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/Read-huge-xml/m-p/2231487#M21768</guid>
      <dc:creator>vapukov</dc:creator>
      <dc:date>2016-12-11T18:41:02Z</dc:date>
    </item>
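The split-then-process approach suggested above can also be sketched in Python, assuming a flat run of repeating record elements; the record tag "Obs", the synthetic "part" wrapper element, and the batch size are placeholders, not details from the thread:

```python
# Split one huge XML file into numbered part files, streaming the input so
# the whole document is never held in memory at once.
import xml.etree.ElementTree as ET

def split_xml(path, record_tag, per_file, out_prefix):
    """Write runs of record elements into numbered part files; return file count."""
    batch, parts = [], 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag.split("}")[-1] == record_tag:
            batch.append(elem)
            if len(batch) == per_file:
                parts += 1
                _write_part(out_prefix, parts, batch)
                batch = []
    if batch:  # flush the final, possibly short, batch
        parts += 1
        _write_part(out_prefix, parts, batch)
    return parts

def _write_part(prefix, part, records):
    wrapper = ET.Element("part")  # synthetic root for each output file
    wrapper.extend(records)
    ET.ElementTree(wrapper).write(f"{prefix}-{part:04d}.xml", encoding="utf-8")
    for rec in records:
        rec.clear()  # free the subtree once it has been written out
```

Each resulting part file is small enough for a normal Talend job to read, so the folder can then be processed one file at a time.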
  </channel>
</rss>