Hello Talend Team,
First, thanks for your efforts on Talend Studio development!
The concept and most of the components are great and well thought out.
As with any software product, there is of course room for improvement, and I'm sure you are aiming for it.
So, if you like, I can share my impressions and the issues I've run into while working with Talend.
1. XPath
The biggest issue I've met is the lack of XPath support in the XML-related components (tFileInputXML, tXMLMap, etc.).
I know tFileInputXML has an "XPath query" option where you can enter XPath, but it does not work when the SAX parser is chosen
(which will be the case in real-world usage, where the documents are big and loading and parsing them in memory is just not possible).
I also know that XPath does not fit natively with SAX, but there is an easy and elegant solution to that (please read below).
Here is a basic example illustrating the problem.
Imagine you have a very simple XML file:
<?xml version="1.0" encoding="UTF-8"?>
<products>
<product>
<ID>1</ID>
<name>product 1</name>
<attribute id="color">red</attribute>
<attribute id="size">S</attribute>
</product>
<product>
<ID>2</ID>
<name>product 2</name>
<attribute id="color">green</attribute>
<attribute id="size">M</attribute>
</product>
<product>
<ID>3</ID>
<name>product 3</name>
<attribute id="color">blue</attribute>
<attribute id="size">L</attribute>
</product>
</products>
You'd likely want to extract the data, using the following simple job:
And expect something like:
|=-+-----+---=|
|ID|color|size|
|=-+-----+---=|
|1 |red  |S   |
|2 |green|M   |
|3 |blue |L   |
'--+-----+----'
Well, unfortunately this simple task is not possible!
If the XML file is big and you switch to the SAX parser, you get the following
(with Dom4J you get the expected result):
|=-+-----+---=|
|ID|color|size|
|=-+-----+---=|
|1 |null |null|
|2 |null |null|
|3 |null |null|
'--+-----+----'
The job:
Here I'm attaching the job for convenience:
TestProduct.zip
The solution:
(There might be a way by fetching all attributes as separate records and then applying some filter and aggregate functions, but this would be bad for both convenience and performance, so I'm skipping this option.)
An elegant solution, which I'd implement as Talend's native processing algorithm, would be to:
Parse with SAX (I'm not even sure that supporting the other parsers is worthwhile) and,
when you have the complete "Loop XPath query" document, perform the user's "XPath query" expressions against it.
This way you get the benefits of both SAX and XPath while keeping excellent performance (the difference is negligible).
That's also what I currently do, but with custom code (i.e. I configure the component to fetch just the whole loop document, then parse it and perform the needed XPath queries in a subsequent tJavaFlex component).
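To make the idea concrete, here is a minimal sketch in plain Java (standard JAXP only, no Talend classes). It streams the file with SAX, builds a small DOM only for each loop element, evaluates the XPath queries against that fragment, and then discards it. The class name SaxXPathSketch and the method extract are hypothetical names chosen for illustration; this is not how Talend implements anything internally, and it ignores namespaces for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only (not a Talend API).
class SaxXPathSketch {

    /** Streams the input with SAX and runs the XPath queries once per loop element. */
    static List<String[]> extract(InputSource in, String loopElement, String... queries)
            throws Exception {
        List<String[]> rows = new ArrayList<>();
        XPath xpath = XPathFactory.newInstance().newXPath();
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        DefaultHandler handler = new DefaultHandler() {
            Document doc;   // small DOM holding only the current loop element
            Node current;   // null while we are outside a loop element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (current == null) {
                    if (!qName.equals(loopElement)) return; // not inside a loop element yet
                    doc = builder.newDocument();
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    e.setAttribute(atts.getQName(i), atts.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (current != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (current == null) return;
                current = current.getParentNode();
                if (current == doc) { // the loop element is complete: query it, then drop it
                    String[] row = new String[queries.length];
                    try {
                        for (int i = 0; i < queries.length; i++) {
                            row[i] = xpath.evaluate(queries[i], doc.getDocumentElement());
                        }
                    } catch (XPathExpressionException ex) {
                        throw new SAXException(ex);
                    }
                    rows.add(row);
                    doc = null;
                    current = null;
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<products>"
                + "<product><ID>1</ID><attribute id=\"color\">red</attribute>"
                + "<attribute id=\"size\">S</attribute></product>"
                + "<product><ID>2</ID><attribute id=\"color\">green</attribute>"
                + "<attribute id=\"size\">M</attribute></product>"
                + "</products>";
        List<String[]> rows = extract(new InputSource(new StringReader(xml)), "product",
                "ID", "attribute[@id='color']", "attribute[@id='size']");
        for (String[] row : rows) {
            System.out.println(String.join("|", row));
        }
    }
}
```

Memory use is bounded by the size of a single loop element rather than the whole file, which is why the overhead compared to pure SAX stays small. The queries are evaluated relative to the loop element, so predicates such as attribute[@id='color'] work as they would with Dom4J.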
I'd be glad to hear your thoughts.
Best Regards,
Mirko