I am extracting fields from huge XML files (1.5 GB) using the tFileInputXML component. To parse a file this large, we had to set the generation mode to SAX. Dom4j mode allows for a maximum heap size of 1500 MB, and our job crashes with insufficient heap memory.
The problem with SAX mode is that it does not seem to recognize our XPath queries. Example XML input segment:
<analysis_result analysis="peptideprophet">
<peptideprophet_result probability="0.3920" all_ntt_prob="(0.0000,0.0000,0.3920)">
<search_score_summary>
<parameter name="fval" value="0.1900"/>
<parameter name="ntt" value="2"/>
<parameter name="nmc" value="0"/>
<parameter name="massd" value="-0.242"/>
</search_score_summary>
</peptideprophet_result>
</analysis_result>
To extract the nmc parameter value, we were previously using the XPath query:
search_score_summary/parameter/@value
This works in DOM4J mode. SAX mode returns null values.
QUESTIONS:
1. Is there another method of extracting data from huge xml files in Talend other than tFileInputXML in SAX mode?
2. How can we get the values for each of such separate parameters?
Any suggestions pointing me in the right direction are very much welcome. Thank You.
Hi
I have passed your issue along to the Dev team. However, the documentation says there is a limitation on SAX generation mode with the "Get Nodes" option, as this mode doesn't support namespaces. I am not sure whether this is related or not.
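In the meantime, one possible workaround is to read the file with a streaming parser in a custom routine and call it from a tJava. The sketch below is only an illustration, not how the component works internally: it uses the JDK's StAX API, and the class name ParameterExtractor and its extract method are hypothetical names. It assumes every parameter element of interest sits inside a search_score_summary element, as in the sample above, and it hands each group of name/value pairs to a callback so the whole file never has to fit in memory.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams the XML and emits one name -> value map
// per <search_score_summary> element.
final class ParameterExtractor {

    static void extract(InputStream xml, Consumer<Map<String, String>> sink)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Map<String, String> current = null; // non-null only while inside a summary
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("search_score_summary".equals(name)) {
                        current = new LinkedHashMap<>();
                    } else if ("parameter".equals(name) && current != null) {
                        // e.g. name="nmc" value="0"
                        current.put(reader.getAttributeValue(null, "name"),
                                    reader.getAttributeValue(null, "value"));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "search_score_summary".equals(reader.getLocalName())
                        && current != null) {
                    sink.accept(current); // one row per summary block
                    current = null;
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

Each map then gives direct access to the individual parameters, for example map.get("nmc") or map.get("massd"), which also covers the question of getting each parameter separately.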
Thanks for your feedback. If required, I can send you some data to help the developers troubleshoot this issue (screenshot, XML file, XPath queries, ...). For info:
- I use the tFileInputMSXML component, and "Enable XPATH in column 'Schema XPATH loop'" is not ticked. To be honest, I don't see any difference whether it's ticked or not. Strange, because I do use the "Schema XPATH loop" column.
- If tFileInputMSXML cannot be used to stream huge XML files using the "Schema XPATH loop" column, what's the alternative for such files? I don't want to read the file several times.
I also find the same thing. I think that xpath expressions aren't allowed in "Xpath Query". This is documented for tFileInputMSXML, but not for tFileInputXML.
On the other hand, I saw this:
https://jira.talendforge.org/browse/TDI-547 That sounded like this feature had been added in 2007.
I have a job that runs fine and gets the expected results with an xpath like "/a/b" when I use DOM mode, but if I change to SAX, it doesn't find any results and doesn't give any error.
Is the component supposed to work if you set it to use SAX mode and have an xpath query? If not, I'd suggest that Talend clarify that in the UI.
Levin
One way to solve this is to split the large file into smaller ones (there will be up to 200~290 files), each not exceeding 6 MB, using a tJava that calls a routine you create (I found that this 6 MB size gives the fastest parsing time). Then use a tFileList to iterate over them in Dom4j mode, which is faster than SAX and works better with XPath queries.
see
https://community.talend.com/t5/Design-and-Development/OutOfMemoryError-GC-overhead-limit-exceeded-o...
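To give an idea of what such a split routine could look like, here is a rough sketch. It is not the routine from the linked thread: the class name XmlSplitter, the wrapper element chunk and the output file names are made up for illustration. It also cuts by number of records rather than by size in MB, and it assumes the record elements do not rely on namespace declarations made on their ancestors (those declarations would be lost in the chunks).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical routine: copies every <recordElement> subtree into chunk files,
// recordsPerFile records each, wrapped in a synthetic <chunk> root.
final class XmlSplitter {
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    static List<Path> split(Path input, Path outDir, String recordElement, int recordsPerFile)
            throws IOException, XMLStreamException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
            OutputStream out = null;
            XMLEventWriter writer = null;
            int depth = 0;   // nesting depth inside the current record, 0 = outside
            int inChunk = 0; // complete records written to the current chunk
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean opensRecord = depth == 0 && event.isStartElement()
                        && recordElement.equals(event.asStartElement().getName().getLocalPart());
                if (depth == 0 && !opensRecord) {
                    continue; // skip everything outside the records
                }
                if (writer == null) { // start a new chunk file
                    Path target = outDir.resolve("chunk-" + written.size() + ".xml");
                    out = Files.newOutputStream(target);
                    writer = outputs.createXMLEventWriter(out, "UTF-8");
                    writer.add(EVENTS.createStartDocument("UTF-8", "1.0"));
                    writer.add(EVENTS.createStartElement("", "", "chunk"));
                    written.add(target);
                }
                writer.add(event);
                if (event.isStartElement()) depth++;
                if (event.isEndElement()) depth--;
                if (depth == 0 && ++inChunk == recordsPerFile) {
                    finish(writer, out);
                    writer = null;
                    inChunk = 0;
                }
            }
            if (writer != null) finish(writer, out); // last, partially filled chunk
            reader.close();
        }
        return written;
    }

    private static void finish(XMLEventWriter writer, OutputStream out)
            throws IOException, XMLStreamException {
        writer.add(EVENTS.createEndElement("", "", "chunk"));
        writer.add(EVENTS.createEndDocument());
        writer.close(); // does not close the underlying stream
        out.close();
    }
}
```

From a tJava you would call something like XmlSplitter.split(inputPath, outputDir, "your_record_element", 1000), then point the tFileList at the output directory. The record element name and the count per file have to be tuned to your own data to land near the file size you want.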
Thanks asouini, I will consider that alternative. I wonder if an xslt transform could be used instead of the java code.
Can anyone confirm that Talend's designers don't intend to support Xpath expressions when using the SAX model?
Thanks