Infinite recursive xml nodes: how to read them?

spr654 · ‎2019-06-06

Hi,

I have an xml that contains a self-repeating structure (dataGroup):

<dataGroup>
  <dataItem/>
  <dataGroup>
    <dataItem/>
    <dataItem/>
    <dataGroup>
      <dataItem/>
      ...
    </dataGroup>
  </dataGroup>
</dataGroup>

I would like to extract this structure dataGroup by dataGroup, providing an id for each dataGroup, but I am not getting there.

with this xml content:


fdenis · ‎2019-06-06

Think about how you want to use them.
Then create a job that uses a Document as its input context (it's one node).
Read and split your node, and have the job call itself.
Add a top-level job that reads your file and runs your job.

spr654 · ‎2019-06-06

Hi fdenis,

The issue about splitting an infinitely recursive node is that you need to consider that exact case. Your solution, in this case, won't work.

As I use Open Studio, perhaps I am going the wrong way. Perhaps I should use pure Java.

Any help is welcome.

Anonymous · ‎2019-06-06

For reading the dataGroups and creating an id for each one, you can use Talend Data Mapper. It has functions to set a sequence number every time an element is processed. Do you have the full structure or the XSD associated with the XML? Secondly, do you have to convert your dataGroup into JSON format?

spr654 · ‎2019-06-06

Hi subhadip13,

Thanks for your answer. Unfortunately, I am using Open Studio, so I cannot count on Talend Data Mapper.

The xsd for the datagroup is as follows:

<xs:complexType name="DataGroupRDEntryRepresentationType">
  <xs:annotation>
    <xs:documentation>A data group</xs:documentation>
  </xs:annotation>
  <xs:sequence>
    <xs:element name="dataItem" type="dataitemrdentryrepresentationtype:DataItemRDEntryRepresentationType" minOccurs="0" maxOccurs="unbounded"/>
    <xs:element name="DataGroup" type="DataGroupRDEntryRepresentationType" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:attribute name="name" type="tagnametypecontenttype:TagNameTypeContentType" use="optional"/>
</xs:complexType>

It is not necessary to transform the structure into JSON.

Thanks for your help and support.

Anonymous · ‎2019-06-06

If you are proficient in Java, it might just be the best solution for this problem. However, there is another way you could try without using Java. Since you do not know how deep this loop might go, you might be able to use a tLoop component with a tExtractXMLField to solve this problem. I don't really have the time to try this out now (but will at some point, since I am interested in how this might work), but what I would recommend is to extract your outer loop first as an XML node (Document type) and store it in the globalMap. Then, using a tLoop, drill into the globalMap document exhaustively (using a tExtractXMLField) until there is nothing more to retrieve. You can keep the tLoop iterating using a While or For loop. You will need to figure out the logic for that based on your goal.

Sorry for the high-level description, but as I said I haven't tried this. In the past I have tended to use a bit of Java to deal with freakishly complicated XML... but I am a Java guy at heart :-). The Data Mapper could also make this easier, but as you said you do not have access to that. If you find a solution (before I get round to this) please do publish it here. This will not be a unique issue and I am sure it will benefit other users of Talend.
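
For anyone who does go the pure Java route, a minimal sketch of the recursive approach could look like the code below. It has not been run inside a Talend job. The class name DataGroupWalker, the GroupRow fields and the id scheme (1-based, assigned in document order, parent id 0 for a top-level group) are assumptions made for illustration, and the element names follow the sample XML in the question ("dataGroup" and "dataItem").

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DataGroupWalker {

    /** One flattened row per dataGroup (hypothetical layout). */
    public static final class GroupRow {
        public final int id;        // generated id, 1-based, document order
        public final int parentId;  // id of the enclosing dataGroup, 0 if none
        public final int depth;     // 0 for a top-level dataGroup
        public final int itemCount; // number of direct dataItem children

        GroupRow(int id, int parentId, int depth, int itemCount) {
            this.id = id;
            this.parentId = parentId;
            this.depth = depth;
            this.itemCount = itemCount;
        }
    }

    /** Parses the XML string and returns one row per dataGroup, at any depth. */
    public static List<GroupRow> flatten(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<GroupRow> rows = new ArrayList<>();
        visit(doc.getDocumentElement(), 0, 0, rows);
        return rows;
    }

    // Recursion mirrors the recursive schema: a dataGroup may contain dataGroups.
    private static void visit(Element element, int parentId, int depth, List<GroupRow> rows) {
        int currentId = parentId;
        int childDepth = depth;
        if ("dataGroup".equals(element.getNodeName())) {
            currentId = rows.size() + 1;
            int items = 0;
            for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && "dataItem".equals(n.getNodeName())) {
                    items++;
                }
            }
            rows.add(new GroupRow(currentId, parentId, depth, items));
            childDepth = depth + 1;
        }
        // Descend into every child element; nested dataGroups get their own row.
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                visit((Element) n, currentId, childDepth, rows);
            }
        }
    }
}
```

The recursion depth equals the nesting depth of the XML, so no depth limit needs to be known up front. If the real file uses namespace prefixes, the name comparison would need adjusting.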

spr654 · ‎2019-06-07

Hi,

I really don't see how to do this with tLoop and tExtractXMLField. Each dataGroup can contain one or several dataItems and one or several dataGroups. If you can provide an example, that would be great.

Thanks in advance.

Anonymous · ‎2019-06-07

I decided to give this a go and found an easier way of doing this without the tLoop. It is still a little complicated. The job I created can be seen below.....

All I am doing here is printing out the "dataItems" with their parent "dataGroup" names. For the "dataItems" without a parent "dataGroup" I am hard coding "Root" as the "dataGroup" name.

The first component (tFileInputXML) is configured as below...

It is simply getting the whole XML document.

The tExtractXMLField_1 is configured as below. I have also set "Ignore the namespaces" on the Advanced Settings for this one (and all other tExtractXMLField components).

The above returns every element within "RDEntry" and its element name ("dataItem" and "dataGroup" for example). The Loop XPath query is set to a wildcard. That is how this works.
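
To illustrate what a wildcard loop does outside of Talend, here is a small plain-Java sketch using the standard javax.xml.xpath API. The expression "/RDEntry/*" is an assumption (the exact Loop XPath query from the component settings is not reproduced in this post), and the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class WildcardLoop {

    /** Evaluates the loop XPath and returns the element name of every match, in document order. */
    public static List<String> childNames(String xml, String loopXPath) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                loopXPath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // The "*" step matches any element, so the name has to be read per match.
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }
}
```

Called with "/RDEntry/*", this only returns the direct children of RDEntry. Nested dataGroups come back as single elements, which is why they have to be processed again in a later step.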

The next component I use is the tMap. tMap_2 is configured as below ....

Here I am simply splitting the flow. If the ElementName is "dataItem" it goes to one output, if the ElementName is "dataGroup" it takes the other route.

The tHashOutput_2 is basically configured as a repository for all dataItems. This will get added to throughout the process. It is configured as below (the schema is simply what the tMap passes to it)...

The next component is the tJavaFlex_2. The code I am using here can be seen below....

Start Code

java.util.ArrayList<Document> dataGroupArray = new java.util.ArrayList<Document>();

Main Code

dataGroupArray.add(DataGroupStore.Element);

End Code

globalMap.put("dgs", dataGroupArray);

Here I am creating an ArrayList to hold the dataGroups that are found. This ArrayList will be added to in this section.

In the next section it will be read from and added to. The next SubJob is essentially the same type of flow as above, but this time we are extracting the dataGroups and adding them to the end of the list we are reading from.

So the next SubJob starts with tJavaFlex_4. This reads from the ArrayList we just added dataGroups to. The code can be seen below...

Start Code

java.util.ArrayList<Document> dataGroupArray = (java.util.ArrayList<Document>)globalMap.get("dgs");

for(int i = 0; i<dataGroupArray.size(); i++){

Main Code

row3.Element = dataGroupArray.get(i);

End Code

}

This will keep returning dataGroups until there are no more.

The next component needs no configuration. It is a default tFlowToIterate. It simply adds each Element (dataGroup) to the globalMap where it will be sent to the tFixedFlowInput. The configuration of this can be seen below....

The next component is the tExtractXMLField_2. This component is doing something very similar to the first tExtractXMLField component described above. The XPaths are slightly adjusted and I have added a column to return the dataGroup attribute "name". I figured you would need this.

The tMap_1 that follows simply splits the data path depending on whether the Element found is a dataItem or a dataGroup. This can be seen below....

The tHashOutput that follows this is linked to the tHashOutput in the first Subjob that collects the dataItems. It simply appends to that list.

The tJavaFlex_3 that follows this simply has one line of code to add new dataGroups to the end of the ArrayList being iterated over at the beginning of the SubJob. The code for that is solely in the Main Code section. It can be seen below.

dataGroupArray.add(copyOfDataGroupStore.Element);
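
The pattern behind the two SubJobs (read from a list by index while appending newly found dataGroups to its end) can be sketched in standalone Java as follows. This is not the generated Talend code. The class name, the method names and the use of DOM Elements instead of Talend Document objects are assumptions for illustration, and each dataItem is represented only by the name of its parent dataGroup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class DataGroupWorklist {

    /** Returns one entry per dataItem: the "name" of its parent dataGroup, or "Root". */
    public static List<String> itemParents(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();

        List<String> itemParents = new ArrayList<>(); // role of the tHashOutput
        List<Element> dataGroups = new ArrayList<>(); // role of the "dgs" ArrayList

        // First SubJob: direct children of the root element.
        scanChildren(root, "Root", itemParents, dataGroups);

        // Second SubJob: the index-based loop re-checks size() on every pass,
        // so dataGroups appended during the loop are visited as well.
        for (int i = 0; i < dataGroups.size(); i++) {
            Element group = dataGroups.get(i);
            scanChildren(group, group.getAttribute("name"), itemParents, dataGroups);
        }
        return itemParents;
    }

    // Splits the direct children the way the tMap does: items one way, groups the other.
    private static void scanChildren(Element parent, String parentName,
                                     List<String> itemParents, List<Element> dataGroups) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) {
                continue;
            }
            if ("dataItem".equals(n.getNodeName())) {
                itemParents.add(parentName);
            } else if ("dataGroup".equals(n.getNodeName())) {
                dataGroups.add((Element) n);
            }
        }
    }
}
```

The index-based for loop matters here: a for-each loop over the same ArrayList would throw a ConcurrentModificationException as soon as an element is added during iteration. The traversal is breadth-first, so items of deeper groups appear later in the result.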

Once you have got this far, your job is practically done. The tHashInput_2 is linked to the tHashOutput_2. This holds all of the dataItems and their dataGroup parent names.

Hopefully this solves your problem 🙂
