Anonymous
Not applicable

XPath query not working in tFileInputXML and other XML components

Hello Talend Team,
First, thanks for your efforts on Talend Studio development!
The concept and most components are great and well thought out.
As with any software product, there is of course room for improvement, and I'm sure you are aiming for it.
So, if you want, I can share my impressions and the issues I've encountered during my work with Talend.
1. XPath
The biggest issue I've encountered is the lack of XPath support in the XML-related components (tFileInputXML, tXMLMap, etc.).
I know tFileInputXML has an "XPath query" where you can enter XPath, but it does not work when the SAX parser is chosen
(which will be the case in real-world usage, where you have big documents and loading/parsing them in memory is just not possible).
I also know XPath does not work natively with SAX, but there is an easy and elegant solution to that (please read below).
Here is a basic example illustrating the problem.
Imagine you have a very simple XML:
<?xml version="1.0" encoding="UTF-8"?>
<products>
  <product>
    <ID>1</ID>
    <name>product 1</name>
    <attribute id="color">red</attribute>
    <attribute id="size">S</attribute>
  </product>
  <product>
    <ID>2</ID>
    <name>product 2</name>
    <attribute id="color">green</attribute>
    <attribute id="size">M</attribute>
  </product>
  <product>
    <ID>3</ID>
    <name>product 3</name>
    <attribute id="color">blue</attribute>
    <attribute id="size">L</attribute>
  </product>
</products>

You'd likely want to extract the data, using the following simple job:
[Screenshot of the job: 0683p000009MDqm.png]
And expect something like:
|=-+-----+---=|
|ID|color|size|
|=-+-----+---=|
|1 |red  |S   |
|2 |green|M   |
|3 |blue |L   |
'--+-----+----'
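
To make the expected result concrete, here is a minimal plain-Java sketch (standard JDK DOM and XPath APIs, not Talend code) that produces these rows from the sample XML when the whole document is loaded in memory. The job screenshot is not reproduced here, so the relative queries attribute[@id='color'] and attribute[@id='size'], as well as the class and method names, are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: in-memory (DOM) extraction of ID, color and size.
class ProductXPathDemo {

    // The sample document from the post, built inline so the sketch is self-contained.
    static final String SAMPLE_XML =
        "<products>"
            + product("1", "red", "S")
            + product("2", "green", "M")
            + product("3", "blue", "L")
            + "</products>";

    static String product(String id, String color, String size) {
        return "<product><ID>" + id + "</ID><name>product " + id + "</name>"
            + "<attribute id=\"color\">" + color + "</attribute>"
            + "<attribute id=\"size\">" + size + "</attribute></product>";
    }

    // Loops over /products/product and evaluates the (assumed) relative queries per product.
    static List<String> extractRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList products =
            (NodeList) xpath.evaluate("/products/product", doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < products.getLength(); i++) {
            Object product = products.item(i);
            rows.add(xpath.evaluate("ID", product)
                + "|" + xpath.evaluate("attribute[@id='color']", product)
                + "|" + xpath.evaluate("attribute[@id='size']", product));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Prints 1|red|S, 2|green|M and 3|blue|L, one row per line.
        extractRows(SAMPLE_XML).forEach(System.out::println);
    }
}
```

The predicate [@id='color'] is the part that matters: it selects one of several sibling elements with the same name, which is exactly what comes back as null in the SAX case.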

Well, unfortunately this simple task is not possible!
If the XML file is big and you switch to the SAX parser, you get the following
(with Dom4J you get the expected result):
|=-+-----+---=|
|ID|color|size|
|=-+-----+---=|
|1 |null |null|
|2 |null |null|
|3 |null |null|
'--+-----+----'

The job:
Here I'm attaching the job for convenience:  TestProduct.zip

The solution: 
(There could be a way by fetching all attributes as separate records and then applying some filter and aggregate functions, but this would be bad for both convenience and performance, so I'm skipping this option.)
An elegant solution, which I would implement as Talend's native processing algorithm, would be to:
Parse with SAX (I'm not sure that supporting other parsers is even worthwhile) and,
when you get the document matched by the "Loop XPath query", perform the user's "XPath query" expressions against it.

This way you get the benefits of both SAX and XPath while keeping perfect performance (the difference is negligible).
That's also what I currently do, but with custom code (i.e. I configure the component to fetch just the whole loop document, then parse it and perform the needed XPath queries in a subsequent tJavaFlex component).
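
Outside of Talend, the same idea can be sketched with the standard JDK APIs only. This is a minimal illustration under stated assumptions, not Talend's implementation: the class SaxLoopXPath and its methods are hypothetical names, and the loop is matched by element name ("product") rather than by a full loop path such as /products/product. SAX streams the file; each loop element is rebuilt as a small DOM fragment, the XPath queries are evaluated against that fragment, and the fragment is then discarded, so only one loop element is held in memory at a time.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: SAX for streaming, DOM + XPath per loop element.
class SaxLoopXPath {

    static class LoopHandler extends DefaultHandler {
        private final String loopElement;        // element name that starts a fragment
        private final String[] queries;          // XPath queries relative to the loop element
        private final Consumer<String[]> sink;   // receives one row per loop element
        private final XPath xpath = XPathFactory.newInstance().newXPath();
        private final DocumentBuilder builder;
        private Document fragment;               // null while outside a loop element
        private Node current;                    // node currently being filled

        LoopHandler(String loopElement, String[] queries, Consumer<String[]> sink)
                throws ParserConfigurationException {
            this.loopElement = loopElement;
            this.queries = queries;
            this.sink = sink;
            this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (fragment == null) {
                if (!qName.equals(loopElement)) {
                    return;                      // still outside the loop: ignore
                }
                fragment = builder.newDocument();
                current = fragment;
            }
            Element element = fragment.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            current.appendChild(element);
            current = element;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (fragment != null) {
                current.appendChild(fragment.createTextNode(new String(ch, start, length)));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (fragment == null) {
                return;
            }
            current = current.getParentNode();
            if (current == fragment) {           // the loop element itself just closed
                String[] row = new String[queries.length];
                try {
                    for (int i = 0; i < queries.length; i++) {
                        row[i] = xpath.evaluate(queries[i], fragment.getDocumentElement());
                    }
                } catch (XPathExpressionException e) {
                    throw new SAXException(e);
                }
                sink.accept(row);
                fragment = null;                 // drop the fragment before the next one
                current = null;
            }
        }
    }

    static List<String[]> extract(InputStream in, String loopElement, String... queries)
            throws Exception {
        List<String[]> rows = new ArrayList<>(); // collected here only for the demo
        SAXParserFactory.newInstance().newSAXParser()
            .parse(in, new LoopHandler(loopElement, queries, rows::add));
        return rows;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String[] row : extract(in, "product",
                    "ID", "attribute[@id='color']", "attribute[@id='size']")) {
                System.out.println(String.join("|", row));
            }
        }
    }
}
```

In a real job the rows would be passed downstream through the Consumer instead of being collected into a list. Memory use is bounded by the size of a single loop element, which also means this approach does not help if one loop element is itself huge.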

I'd be glad to hear your thoughts.
Best Regards,
Mirko
13 Replies
Anonymous
Not applicable
Author

Hi Mirko,
Could you please indicate the build version you are using? Have you tried to use file xml metadata to read your input xml file to see if it works?
Best regards
Sabrina
Anonymous
Not applicable
Author

Hello Xdshi,
I'm using Talend Open Studio for Data Integration v6.1.1
Unfortunately using XML metadata made no difference.
(i.e. if you select SAX parser, you get null values, as shown in the result tables above)
Best Regards,
Mirko
Anonymous
Not applicable
Author

I have just tried a simple test in v562 using a tFileInputXML and a tLogRow. I have set it to use SAX and have tested a simple xpath iterating through loops and it works. However, it doesn't work when you need to lookup data that is outside of the current loop. This is perfectly reasonable given the limitations of the SAX parser. What you can do to avoid this issue is to return XML sub documents a level (in the looping structure) at a time, then parse those with tExtractXMLField components.
Basically, if you use the SAX parser to break down your large XML into meaningful sub documents (using the technique above), then you can use an in memory parser (DOM) to get the details without running into memory issues. Divide and Conquer.
Anonymous
Not applicable
Author

Hi rhall_2.0,
I'm also not sure which Talend "v562" you are referring to. Mine is the latest version that can currently be downloaded from the official site:
https://www.talend.com/download/talend-open-studio#t4 - TOS_DI-20151214_1327-V6.1.1.zip.
I also couldn't tell: have you tried the example given in my initial post ("attribute")?
If not, please use it, as it is simple and clear, so we know we are talking about the same thing.
I'm attaching the job to make it easier (you just need to get the example XML from the first post and update the path).
TestProduct.zip
Regarding your thoughts about SAX + DOM: yes, that is the approach I suggested at the end of my initial post ("The solution").
Best Regards,
Mirko
Anonymous
Not applicable
Author

I have tested your attribute filtering example and you are right, it doesn't work. But my point was that this is not a flaw in Talend, this is just how XPath works (or doesn't) when using a SAX parser. It's a bit like expecting a car to drive on water as it does on the road when you decide to go from London to the Isle of Wight without using a boat. 
Your suggested solution is fine, but it would be very hard for Talend to engineer so that it would "just work" for every possible XML structure (impossible, maybe?). But the tools that are currently available allow you to create that solution for yourself for any XML structure, no matter how simple or convoluted it is. You just need the skills to build it.
Anonymous
Not applicable
Author

Hello,
Actually, it's very simple, it is universal, and there is no problem "engineering" it (I just checked the Talend Java code).
It is also what you were writing about (in the "Divide and Conquer" post):
1. Parse with SAX only until you get the looped element
(get it as one whole document; do not try to fetch the attributes or use expressions with SAX!)
2. After you get the whole loop element, create a DOM Document from it!
That is where you execute the XPath queries to get the attributes!
As a patch on top of the current Talend code, it could be something like this
(of course, a proper solution would place the code in the related classes):
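// Pseudo-code patch; it should be moved to the proper classes.
// The looper is asked for a single query, ".", returned as XML: the whole loop element as one string.
org.talend.xml.sax.SAXLooper looper_tFileInputXML_1 = new org.talend.xml.sax.SAXLooper("/my/loop/element", new String[]{"."}, new boolean[]{true});
...
java.util.Iterator<java.util.Map<String, String>> it_tFileInputXML_1 = looper_tFileInputXML_1.iterator();
while (it_tFileInputXML_1.hasNext()) {
    ...
    row1 = new row1Struct();
    try {
        // XML text of the current loop element
        String nodeXML = row_tFileInputXML_1.get(".");
        // Build a small in-memory document for this loop element only
        org.dom4j.Document nodeDOM = org.dom4j.DocumentHelper.parseText(nodeXML);
        for (String path : queryPaths_tFileInputXML_1) {
            // Evaluate the query relative to the loop element; valueOf() returns the string value of the match
            String value = nodeDOM.getRootElement().valueOf(path);
            row1.put(path, value); // pseudo-code: assign the value to the matching column of row1
        }
        ...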

Best Regards,
Mirko
Anonymous
Not applicable
Author

This is basing a solution on top of an already poor (but just about workable) aspect of Talend. Talend assumes that there will only be a single loop in XML. If you want to extract data from XML which is contained in multiple loops (nested and non nested), you cannot do this adequately with a single XML component (you can do it in Java, but we are talking about XML components here). It is a pain, but if you know how to work with the tools, you can get around this.
Building your solution on top of this, would only cater for a small number of XML structures. For the majority (which don't fall into this simple structure) people would have to make use of combining components. You would also potentially have issues with memory using your suggested patch. What if the initial loop segment that is specified in the component is required to get reference data, but the bulk of the data is in nested loops within that segment? These could be humongous. 
I can see how your patch idea would solve your problem, but I am saying the reason that it hasn't been (and likely won't be) implemented is that it is too niche and can be built using the existing components by a skilled Talend developer. There is also the Talend Data Mapper (that comes with the Enterprise Edition) which they are positioning as "the way" to deal with complex XML structures. 
Anonymous
Not applicable
Author

Hello rhall_2.0, 
This was only a quick snippet to demonstrate the idea, as I had no time to dig into Talend's "SimpleSAXLooper", the place where this logic should be applied (then you won't have the memory drawbacks you are referring to).
I still don't understand how having full XPath support, rather than the current (very) limited one, is worse for the user.
Speaking of the cases it would cover:
I don't have a problem; I made this simple example to illustrate a Talend problem!
And the logic would be the opposite: if it can't cope with simple structures, then with complex ones things can only get worse...
My wish was to help, but I feel some strong resistance to acknowledging obvious problems, and you are basically trying to say that
Talend should not fix bugs and problems as long as there is any workaround (even a cumbersome and/or poorly performing one).
Then there is probably no point in telling you about the other basic problems, like:

Repository XML schema: attributes of type "Document" lead to errors thrown by Talend
(you can work around it by switching to a "built-in" schema).

XMLMap: due to a GUI issue, you simply cannot scroll to reach all attributes of an XML schema on the right side
(you can open the generated Talend file in a text editor and change the raw code).

Not to mention the architectural issues and limitations, such as the one that components with multiple inputs cannot work if the flows originate from a common source. For example, the multiple outputs of tFileInputMSXML/Map can't go into a join/map/xmlmap, etc. (yes, you can save them to a temporary location and then read them again). And there are many, many others...


If this position represents Talend's views, then it's my mistake, and I can only be sorry for the time and money our company has invested in Talend (which we obviously have to rethink).

Best Wishes,
Mirko
Anonymous
Not applicable
Author

Wow. For your reference I have over 10 years in all aspects of data and application integration using pure SQL, Java + SQL, Informatica, many other tools and Talend. I KNOW this domain. I was simply pointing out the flaws in your argument in relation to the domain and Talend.
Talend is a component based tool set that is meant to aid in speed of development, reusability and metadata handling (amongst other things). Your initial complaint about XPath was partially right, but since you said that it just didn't work, was not entirely correct. You seem to understand Java. As such I would expect you to be able to investigate the Java tools you are complaining about. SAX and XPath do not play nicely. That is not a Talend issue. Your suggestion to fix that was OK, but only catered to a small variety of problems. However (and here is the important thing), these problems can be fixed with the components that are already there...IF you know what you are doing. To give you an analogy, your solution was the equivalent of a brick factory supplying fully formed houses to builders who don't know how to put the bricks together. 
I am sure your initial post was 8 parts trying to offer a solution and 2 parts trying to show how you could do better. However, when I challenged your assertions (from A LOT of experience with the product and the domain), you went into full on "Mine is bigger than yours" because you are a Java developer. Great. But you clearly do not understand the domain, which can be seen from some of your "complaints". I suggest that you get someone who is experienced in data integration to explain the domain to you.
By the way, these are NOT the opinions of Talend, just the opinions of someone who knows the domain and the product very well.