Re: Extract entire XML element (with subelements) - Qlik Community

Anonymous · ‎2011-10-26

Hello,
I'm new to Talend (just a week) and I have a situation that I cannot solve. I'm working with XML files, trying to read and write stuff.
Description of the problem:
I have and XML that can have 3 types of elements (not at the same time) at the same location in the tree.
For example:

<bill>
<product>
<price>22</price>
<book>Bla bla1</book>
</product>
<product>
<price>23</price>
<wine>Bla bla2<wine>
</product>
<product>
<price>2</price>
<cheese>Bla bla3<cheese>
</product>
</bill>

What I want to do is to cycle through the <product>, filter the results and output the entire XML without the products that are <book>, for example.
The idea is that I cannot output the info normally using a schema because at the same location in <product> I can have <book> or <wine> or <cheese>.
The output XML would look like this:

<bill>
<product>
<price>23</price>
<wine>Bla bla2<wine>
</product>
<product>
<price>2</price>
<cheese>Bla bla3<cheese>
</product>
</bill>

I already cycle through products and I've already put in the condition.
I've also already tried to extract the node <product> as a 'Document' and output it to the other XML, but it doesn't work and throws this exception:

Error on line 2 of document  : Content is not allowed in prolog. Nested exception: Content is not allowed in prolog.

Can I extract the <product> element with all its subelements and output it to another XML?
I'm kinda out of ideas. The only thing I can think of is to copy the entire XML file and somehow delete the nodes that I don't want, but that would only be a temporary solution. I still want to "compose" the output XML.
So, do you have any suggestions?
P.S.: Any idea is appreciated.

Anonymous · ‎2011-10-26

Hi,
When reading the source xml, make sure you click the "Get node" checkbox in the tFileInputXML (see screenshot)
This makes is possible to read through all products and filter out any product that hasn't got a <book> element.
Regards,
Arno

Anonymous · ‎2011-10-26

Will try that right now.
I was always wondering what's with that checkbox.
Thanks for the suggestion and the quick reply!

Anonymous · ‎2011-10-26

You're welcome.
Hope it helps.... if not: I'd be glad to help you out further.
Best regards,
Arno

Anonymous · ‎2011-10-26

Ok, I've tried it and it doesn't quite work. I've tried to import the node as a Document and write it back as a Document, but the problem is that I can only put it as the element value (between the tags).
More precisely it outputs the XML node as text into the output XML. And the "<" and ">" are transformed into "<" and ">".
So now I'm back to square one, I guess.
A wild try would be to actually do a one-on-one association between the input file and the output file (using all the tags) and then select the option to exclude non existant tags in the output file. This would be a "brute force" attempt.
Any other ideas? Maybe a Talend component that I haven't used?
P.S.: Talend is a great product and it has a great community. Thank you.

Anonymous · ‎2011-10-26

I have the same issue as mentioned in this post:
https://community.talend.com/t5/Design-and-Development/XML-Parsing-data-with-XML-tags/td-p/57514

Anonymous · ‎2011-10-27

Hi,
I still think the solution with get nodes should work.
Screen shot 1 is an overview of the complete extraction job. On screen 2 you'll see the output of the tLogRow, you could replace the tLogRow with a tMap to transform (and filter out rows with a test in the product column)
Screen 3 shows the settings for the tInputFileXML while the last screenshot shows the settings for the tExtractXMLField component.
Hope this helps.
Regards,
Arno

Anonymous · ‎2011-10-27

Hi andreibranescu,
Use tFileInputXML:
Create you columns:
Price, Book, Wine, Cheese.
Loop XPath Query = "/bill/product"
Column | XPath query
Price | "Price"
Book | "Book"
Wine | "Wine"
Cheese | "Cheese"
Join this by a main row into tFilterRow:
In TFilterRow add the column Book
and the condition will be:
InputColumn = Book
Function = Empty
Operator = Equals
Value = null
Then attach the filter row to what you want for testing purposes add it to a tLogRow.
you do not need to bring in the nodes (As this will look if there is a value in the Book field it will not bring it through).
Regards,
Brandon

Anonymous · ‎2011-10-27

Hello,
First of all, thank you for your suggestions. They are very helpful.
I have some questions for you, and I will show you my ideas.

I still think the solution with get nodes should work.

What is the type of your 'product' variable (the one that you get as node)? Is it a Document, or a String, or something else?

Join this by a main row into tFilterRow:

I don't know exactly what you mean by this. Did you mean that I should use tJoin, or just connect the XMLinput to the filterRow?
I have two solutions, although none of them is perfect.
Solution 1 - simple
This solution is illustrated in the images 4 through 6. What I do is try to read all the tags (like brandon_stubbs93 suggested) from the XML and then filter the books. I then output all the tags with the advancedXML module and I uncheck the "Create empty element if needed" option, so the fields that are empty won't be shown. This gives me the right result, but if I do want some fields that are empty to appear, this solution will not work anymore.

Solution 2 - more complex
In this solution (images 1 through 3) I also read all the tags, I filter the rows and than I filter the columns. So, I end up with tables where only the wanted elements appear, and not the empty ones. I can then put these tables into the XML using advancedXML in append mode. The only problem is that in append mode, I am forced to select a Group element, and if the loop element is directly in the root node, I cannot do that, because the root node cannot be selected as a Group element. To test this, I simply nested the <product> into another element and I got the result I wanted. But this was for the test file. In the real file, the loop element is directly into the root, so I can't use this solution.
What do you think about these solutions?
In the complex solution, can I replace advancedXML with writeXMLField? If I could, how would I do that? I didn't yet understand how writeXMLField works.
In the first one, can I limit the elements that will not be written prior to passing them to the advancedXML? Can I accomplish this with tMap?
Thanks again!

Anonymous · ‎2011-10-27

Hi Sorry I meant just connect the XMLinput to the filterRow

I uncheck the "Create empty element if needed" option, so the fields that are empty won't be shown. This gives me the right result, but if I do want some fields that are empty to appear, this solution will not work anymore.

All you do for this is add a tMap before your tAdvancedFileOutputXML.
In the tMap the fields you want to display even if they have null values just add the following:
row2.book==null?"":row2.book
You will need to still keep "Create empty element if needed" UNCHECKED or else all the nulls will create the elements.
I hope this helps.
Regards
Brandon

Extract entire XML element (with subelements)

Other

Talend Data Integration

XML