Anonymous
Not applicable

Sax and Xpath Expressions

I am extracting fields from huge XML files (1.5 GB) using the tFileInputXML component. To parse a file this large, we had to set the generation mode to SAX. Dom4j mode allows a maximum heap size of 1500 MB, and the job crashes because of insufficient heap memory.
The problem with SAX mode is that it does not seem to recognize our XPath queries. Example XML input segment:
<analysis_result analysis="peptideprophet">
<peptideprophet_result probability="0.3920" all_ntt_prob="(0.0000,0.0000,0.3920)">
<search_score_summary>
<parameter name="fval" value="0.1900"/>
<parameter name="ntt" value="2"/>
<parameter name="nmc" value="0"/>
<parameter name="massd" value="-0.242"/>
</search_score_summary>
</peptideprophet_result>
</analysis_result>
To extract the nmc parameter value, we were previously using the XPath query:
search_score_summary/parameter/@value
This works in DOM4J mode. SAX mode returns null values.
QUESTIONS:
1. Is there another method of extracting data from huge xml files in Talend other than tFileInputXML in SAX mode?
2. How can we get the values for each of such separate parameters?
Any suggestions pointing me in the right direction are very much welcome. Thank You.
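Regarding question 2, one option outside tFileInputXML is a plain Java SAX handler (for example in a custom routine) that reads the name and value attributes of each parameter element while streaming, so no DOM is built. The following is only a minimal sketch under assumptions, not Talend's own implementation: the class name ParameterSaxExtractor and the method extract are hypothetical, and it assumes the element and attribute names from the sample segment above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams the file and hands one name->value map
// per search_score_summary element to a callback.
public class ParameterSaxExtractor {

    static class Handler extends DefaultHandler {
        private final Consumer<Map<String, String>> sink;
        private Map<String, String> current; // non-null only inside search_score_summary

        Handler(Consumer<Map<String, String>> sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("search_score_summary".equals(qName)) {
                current = new LinkedHashMap<>();
            } else if ("parameter".equals(qName) && current != null) {
                // Key each value by its name attribute, e.g. "nmc" -> "0"
                current.put(atts.getValue("name"), atts.getValue("value"));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("search_score_summary".equals(qName)) {
                sink.accept(current); // emit the finished row, keep nothing in memory
                current = null;
            }
        }
    }

    public static void extract(InputStream in, Consumer<Map<String, String>> sink) throws Exception {
        // The default factory is not namespace-aware, so qName is the tag as written.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new Handler(sink));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<search_score_summary>"
                + "<parameter name=\"fval\" value=\"0.1900\"/>"
                + "<parameter name=\"nmc\" value=\"0\"/>"
                + "</search_score_summary>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        // Print the nmc value of each summary found in the stream
        extract(in, row -> System.out.println(row.get("nmc")));
    }
}
```

Looking up a value by its name attribute (row.get("nmc"), row.get("massd"), and so on) avoids depending on the order of the parameter elements, which a positional query would rely on.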
7 Replies
Anonymous
Not applicable
Author

This is an excellent question; we are having similar issues. Can someone from Talend please shed some light on this?
Thanks
Anonymous
Not applicable
Author

I also have the same problem. Has any feedback been received? I'm stuck in my development.
Anonymous
Not applicable
Author

Hi
I have passed your issue along to the Dev team. The documentation does say there is a limitation in SAX generation mode with the "Get Nodes" option, as this mode doesn't support namespaces. I'm not sure whether this is related or not.
Anonymous
Not applicable
Author

Thanks for your feedback.
If required, I can send you some data to help the developers troubleshoot this issue (screenshot, XML file, XPath queries, ...).
For info:
- I use the tFileInputMSXML component, and "Enable XPath in column 'Schema XPath loop'" is not ticked. To be honest, I don't see any difference whether it's ticked or not. Strange, because I do use the "Schema XPath loop" column.
- If tFileInputMSXML cannot be used to stream huge XML files using the "Schema XPath loop" column, what's the alternative for such files? I don't want to read the file several times.
Anonymous
Not applicable
Author

I also find the same thing. I think that xpath expressions aren't allowed in "Xpath Query". This is documented for tFileInputMSXML, but not for tFileInputXML.
On the other hand, I saw this: https://jira.talendforge.org/browse/TDI-547
That sounded like this feature had been added in 2007.
I have a job that runs fine and gets the expected results with an XPath like "/a/b" when I use DOM mode, but if I change to SAX, it doesn't find any results and doesn't give any error.
Is the component supposed to work if you set it to use SAX mode and have an xpath query? If not, I'd suggest that Talend clarify that in the UI.
Levin
Anonymous
Not applicable
Author

One way to solve this is to split the large file into smaller ones (there will be up to 200~290 files), each not exceeding 6 MB, using tJava to call a routine you create (I found that this 6 MB size gives the fastest parsing time). Then use tFileList to iterate over them in Dom4j mode, which is faster than SAX and works better with XPath queries.
see https://community.talend.com/t5/Design-and-Development/OutOfMemoryError-GC-overhead-limit-exceeded-o...
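The splitting routine itself is not shown in this thread. As a rough sketch of what such a routine could look like, the following uses the JDK's StAX API to stream the big file and write groups of record elements into small files. Everything named here is an assumption for illustration: the class XmlRecordSplitter, the method split, the wrapper element chunk, and the output file naming are hypothetical. It also splits by record count rather than by the 6 MB byte size mentioned above, and it assumes no namespaces are declared outside the record elements.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical routine: copies every <recordName> subtree into chunk files
// holding at most recordsPerFile records each, without loading the input into memory.
public class XmlRecordSplitter {

    public static List<Path> split(InputStream in, String recordName,
                                   int recordsPerFile, Path outDir) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventFactory events = XMLEventFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();

        List<Path> files = new ArrayList<>();
        OutputStream out = null;
        XMLEventWriter writer = null;
        int inChunk = 0; // records written to the current file
        int depth = 0;   // element nesting inside the current record, 0 = outside any record

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (depth == 0) {
                boolean recordStart = e.isStartElement()
                        && e.asStartElement().getName().getLocalPart().equals(recordName);
                if (!recordStart) {
                    continue; // skip everything between records (root tags, whitespace)
                }
                if (writer == null) {
                    // Open the next chunk file and wrap its records in a <chunk> root
                    Path file = outDir.resolve("chunk_" + files.size() + ".xml");
                    files.add(file);
                    out = Files.newOutputStream(file);
                    writer = outFactory.createXMLEventWriter(out, "UTF-8");
                    writer.add(events.createStartDocument());
                    writer.add(events.createStartElement("", "", "chunk"));
                }
            }
            if (e.isStartElement()) {
                depth++;
            }
            writer.add(e); // copy the event belonging to the current record
            if (e.isEndElement()) {
                depth--;
                if (depth == 0 && ++inChunk == recordsPerFile) {
                    close(writer, out, events);
                    writer = null;
                    inChunk = 0;
                }
            }
        }
        if (writer != null) {
            close(writer, out, events); // last, partially filled chunk
        }
        reader.close();
        return files;
    }

    private static void close(XMLEventWriter writer, OutputStream out,
                              XMLEventFactory events) throws Exception {
        writer.add(events.createEndElement("", "", "chunk"));
        writer.add(events.createEndDocument());
        writer.close(); // does not close the underlying stream
        out.close();
    }
}
```

For the sample earlier in the thread, calling split with recordName set to "analysis_result" would be one plausible choice; a tFileList could then iterate over the returned files with the original XPath queries in Dom4j mode. Note that the loop XPath would need to account for the chunk wrapper element.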
Anonymous
Not applicable
Author

Thanks asouini, I will consider that alternative. I wonder if an xslt transform could be used instead of the java code.
Can anyone confirm that Talend's designers don't intend to support Xpath expressions when using the SAX model?
Thanks